SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins – extended Database

SCOPe (Structural Classification of Proteins – extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 PDB entries, double the 38,221 structures classified in SCOP.


Background
Nearly all proteins have structural similarities with other proteins and, in many of these cases, share a common evolutionary origin. The Structural Classification of Proteins (SCOP) database [1][2][3][4] is a manually curated hierarchy of domains from proteins of known structure, organized according to their structural and evolutionary relationships. Work on the SCOP version 1 series concluded in 2009 with the release of SCOP 1.75. To continue its development, we created the SCOPe (SCOP-extended) database, which provides ongoing updates and classification of new protein structures [5]. The initial version of SCOPe imported the SCOP 1.75 classification to build upon. We use a combination of manual curation and a rigorously validated software pipeline [5] to add new structures from the Protein Data Bank (PDB) [6,7], and we have also developed software to identify errors in SCOP, which are then corrected in new releases of SCOPe. SCOPe is backward compatible with SCOP, providing the same parseable files and a history of changes between all stable SCOP and SCOPe releases.
The SCOPe hierarchy, inherited from SCOP, classifies domains from experimentally determined protein structures. The hierarchy comprises the following levels: Species, representing a distinct protein sequence and its naturally occurring or artificially created variants; Protein, grouping together similar sequences of essentially the same functions that either originate from different biological species or represent different isoforms within the same species; Family, containing proteins with similar sequences but often distinct functions; and Superfamily, bridging together protein families with common functional and structural features inferred to be from a common evolutionary ancestor. Near the root, the basis of classification is purely structural: structurally similar superfamilies are grouped into Folds, which are further arranged into Classes based mainly on their secondary structure content and organization.
SCOPe incorporates and updates the ASTRAL compendium [8][9][10]. ASTRAL is a collection of software and databases used to aid in the analysis of the protein structures classified in SCOP and SCOPe, particularly through the use of their sequences. ASTRAL provides sequences and coordinate files for all SCOPe domains, as well as sequences for all PDB chains that are classified in SCOPe. Chemically modified amino acids are translated back to the original sequence. Because the vast majority of sequences in the PDB are very similar to others, ASTRAL provides representative subsets of proteins that span the set of classified protein structures or domains in order to alleviate bias towards well-studied proteins. The highest quality representative in each subset is chosen using AEROSPACI scores [10], which provide a numeric estimate of the quality and precision of crystallographically determined structures.
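As a rough illustration of how one representative might be chosen per group of similar sequences, the sketch below takes precomputed clusters and keeps the member with the highest quality score. The identifiers, score values, and tie-breaking rule are hypothetical; the clustering step and the actual AEROSPACI formula (described in [10]) are not reproduced here.

```python
from typing import Dict, List


def pick_representatives(
    clusters: List[List[str]], quality: Dict[str, float]
) -> List[str]:
    """Return one identifier per cluster: the member with the highest score.

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch, not a documented ASTRAL rule).
    """
    representatives = []
    for members in clusters:
        # Sort key: highest score first, then identifier in ascending order.
        best = min(members, key=lambda m: (-quality[m], m))
        representatives.append(best)
    return representatives
```

Choosing the representative by score rather than arbitrarily means that each subset is built from the best-determined structure available in its group.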
Since our initial publication describing SCOPe [5] [11], focusing on a classification consistent with that developed over the past 22 years, while maintaining outstanding classification accuracy, and being as comprehensive as possible.

Manual Curation
Manual curation of superfamilies is a key feature of SCOPe, in which proteins with similar three-dimensional (3D) structure and no recognizable sequence similarity are divided into homologs and possible analogs at the superfamily level on the basis of the expert biological insight of human curators. Like SCOP, SCOPe is unique among current structural classification databases in that the hierarchy above the Species level is completely defined by expert curators, with automation used only to identify newly structurally characterized members of existing groups. Based on a study of 571 recent articles that cited SCOP [11], we found that our largest category of users is biologists who use SCOP or SCOPe as a "gold standard" for benchmarking computational algorithms, or to create training sets to aid in setting algorithmic parameters. For these users, manual curation consistent with SCOP standards is necessary in order to include newly structurally characterized protein families in the classification without compromising the utility of SCOPe-derived benchmark datasets. To date, computational methods alone have not been able to classify structures with sufficient accuracy: even specialized methods designed to classify new structures into SCOP, such as SCOPmap [12], proCC [13] and SUPERFAMILY [14], report that they fail to classify between 5% and 12% of domains in the correct SCOP superfamily [12,13,15]. Manually curated structures are also used as a basis for further classification by our automated tools, and the resulting increased classification of PDB structures, together with rapid synchronization with PDB releases, benefits all our users [11].
Several other resources also classify a large fraction of protein structures using partial manual curation. CATH [16] and ECOD [17] are similar to SCOPe, but rely more heavily on automated classification tools to assign protein domains and place them in the hierarchy. Classifications in CATH have been compared to SCOP [18][19][20], with the conclusion that while the majority of assignments are consistent, there are significant inconsistencies caused by deliberate design differences. The ECOD authors also note that their domain partition strategy is different from SCOPe, resulting in alternative domain assignments for some structures common to both databases. We compared ECOD (version develop146, 12

have consistent domain partitioning (which we define as the same number of domains, with no N- or C-terminal domain boundary changed by more than 10 residues), perhaps because early versions of ECOD were partially derived from SCOP [17]. However, 25,606 protein chains have inconsistent partitioning between SCOPe and ECOD, with ECOD defining an average of 2.6 domains for these chains vs. 1.3 domains per chain in SCOPe. One example of a manually curated SCOPe superfamily that is not consistent with ECOD (bulge domains from archaeal A-type ATP synthase) is discussed below. We expect that a more thorough comparison of annotations from independently curated databases may be valuable for identifying highly confident annotations (e.g., [20]), and for distinguishing philosophical design differences from errors.
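The consistency criterion stated above (same number of domains, no boundary shifted by more than 10 residues) can be sketched as a small predicate. This is a minimal illustration, not the comparison code used for the study: it assumes each domain is a single contiguous segment given as a hypothetical (start, end) pair, whereas real domains may be discontinuous.

```python
from typing import List, Tuple

Domain = Tuple[int, int]  # (first residue, last residue), inclusive


def consistent_partitioning(
    a: List[Domain], b: List[Domain], tolerance: int = 10
) -> bool:
    """True if two partitions of one chain agree within `tolerance` residues."""
    if len(a) != len(b):
        return False
    # Pair up domains in sequence order and compare both boundaries.
    for (start_a, end_a), (start_b, end_b) in zip(sorted(a), sorted(b)):
        if abs(start_a - start_b) > tolerance or abs(end_a - end_b) > tolerance:
            return False
    return True
```

For instance, a chain split as 1-100 and 101-250 in one database and as 5-98 and 99-250 in the other would count as consistent, while a one-domain versus two-domain assignment would not.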
In addition to adding new superfamilies, manual curation can also involve other changes to the SCOPe hierarchy. If two distinct superfamilies are later discovered to be related, for example on the basis of a newly discovered structure of an evolutionary intermediate, our curator would merge the two superfamilies into one.
Manual curation is also used to make changes to domain boundaries, including splitting a single domain into multiple domains. This is because SCOPe defines a domain as an evolutionarily conserved unit (as opposed to other common definitions of a domain, e.g., based on structural compactness), so a superfamily composed of large domains may be split into multiple superfamilies of smaller domains if these domains are discovered in other evolutionary contexts. Examples of merging superfamilies and splitting domains are discussed below.
We prioritized manual curation of new structures by focusing on Pfam [21] families with the most structures not classified in SCOP or SCOPe. To identify such families, we used HMMER 3 [22] to identify Pfam (version 28.0) families in all protein chains in the PDB. We considered only matches that scored at or above the trusted cutoff for each Pfam family, for which the alignment comprised at least 75% of the Pfam model. We found 2,433 Pfam families had been structurally characterized but not yet classified in

FoF1-ATPases function as ATP synthases in mitochondria, chloroplasts, and bacteria, by coupling proton gradients to ATP hydrolysis or synthesis through a rotary catalytic mechanism. The alpha and beta subunits of the water-soluble F1 part of ATP synthase have been classified in SCOP since the earliest version with stable identifiers; each contains three domains. A structure of the V1 subunit of vacuolar-type ATPase, which regulates the acidic environments of cells and compartments in a variety of organisms, was recently solved [27]. Top and side views of V1 and F1 complexes are shown in Fig. 2A. The V1 ATPase A and B subunits are clearly homologous to the alpha and beta subunits of F1, except for the insertion of a "bulge" domain between the first two conserved (with F1) domains in the catalytic A subunits. The bulge domain is structurally similar to other structures in the "Barrel/sandwich hybrid" fold that contain 8 β-strands (the unique SCOPe concise classification string identifier, or sccs, for this fold is b.84; the b indicates this fold is in the all-β class). However, there is no evidence of homology with members of the four other superfamilies in that fold. Therefore, the V1 bulge domain was classified as a new superfamily in SCOPe 2.06. We also classified structures of A1 subunits of archaeal A-type ATP synthase, which have domains homologous to the "bulge" domain, but lack domains homologous to the N-terminal domain of F1 ATPase subunits [28]. We note that ECOD does not classify the "bulge" domains consistently: in A1 structures, these domains are merged into the ATP-binding central domain, while in V1 structures they are split from the central domain.
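Because sccs identifiers such as b.84 encode a position in the hierarchy (class letter, then fold, superfamily, and family numbers), they are easy to take apart programmatically. The following is a minimal sketch; the type and function names are hypothetical and are not part of any SCOPe software.

```python
from typing import List, NamedTuple, Optional


class Sccs(NamedTuple):
    class_id: str               # e.g. "b" for the all-beta class
    fold: Optional[int]         # None when the identifier stops above this level
    superfamily: Optional[int]
    family: Optional[int]


def parse_sccs(sccs: str) -> Sccs:
    """Split an sccs string such as 'b.84' or 'b.18.1' into its levels."""
    parts = sccs.split(".")
    if not 1 <= len(parts) <= 4 or len(parts[0]) != 1 or not parts[0].isalpha():
        raise ValueError(f"not a valid sccs: {sccs!r}")
    numbers: List[Optional[int]] = [int(p) for p in parts[1:]]
    numbers += [None] * (3 - len(numbers))  # pad the levels that are absent
    return Sccs(parts[0], *numbers)


def same_fold(a: str, b: str) -> bool:
    """True if two sccs strings lie in the same class and fold."""
    pa, pb = parse_sccs(a), parse_sccs(b)
    return pa.fold is not None and (pa.class_id, pa.fold) == (pb.class_id, pb.fold)
```

With this representation, `parse_sccs("b.84")` yields class `b` and fold 84 with no superfamily or family component.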

Example of superfamily merging
Van Itallie and colleagues solved the first structure of the C-terminal domain of Clostridium perfringens enterotoxin (CPE), a common cause of food poisoning [29]. They reported that their CPE

Example of domain splitting
The Anthrax Protective Antigen (APA, PDB code 1acc) is a multi-domain protein that has been classified in its own fold (f.11, "Anthrax protective antigen") in SCOP since the earliest version with stable identifiers. Although described in the SCOP curator's comments as having four domains, no homologs of the other domains were ever classified in SCOP, since no structures of these domains in other contexts were available prior to the last release of SCOP. However, the N-terminal domain of APA, called Protective Antigen 14 (PA14), has been observed in a wide variety of bacterial toxins, enzymes, adhesins, and signaling molecules [30]; some of these have recently been structurally characterized. In building SCOPe, but would be split in the future should that occur.

Artifact Removal
We moved cloning artifacts (e.g., expression tags) that we could identify to a new class (l: Artifacts) in order to separate them from the homology-based curations in the rest of the SCOPe hierarchy. Including such artifacts can result in spurious similarity between non-homologous protein sequences. We identified 21,876 tags that were experimentally observed in protein structures, with lengths ranging from 1 to 28 residues, and separated them from the SCOPe domains to which they had originally been attached. Where possible, we kept the same stable identifiers for the trimmed domains.
N-terminal and C-terminal tags were primarily identified using PDB metadata (SEQADV records) referring to cloning or expression tags at the beginning or end of each chain; a full list of these tags is available on the help page of our website. We annotated additional tags using exact sequence matches to these tagged chains, and to terminal tag sequences at least 5 residues long that were not otherwise annotated in the PDB metadata (DBREF records) as belonging to the reference protein sequence associated with the PDB chain.
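The first step above, reading tag annotations from SEQADV records, might look roughly like the sketch below. It is not the SCOPe pipeline: it assumes the fixed-column layout of legacy PDB-format files (chain ID in column 17, residue number in columns 19-22, annotation text in columns 50-70) and assumes that the comments "EXPRESSION TAG" and "CLONING ARTIFACT" mark the residues of interest. The sequence-matching step for unannotated tags is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Annotation strings treated as tags in this sketch (an assumption).
TAG_COMMENTS = ("EXPRESSION TAG", "CLONING ARTIFACT")


def tagged_residues(pdb_lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each chain ID to the residue numbers annotated as tag residues."""
    tagged: Dict[str, List[int]] = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("SEQADV"):
            continue
        comment = line[49:70].strip().upper()   # columns 50-70
        if any(c in comment for c in TAG_COMMENTS):
            chain_id = line[16]                 # column 17
            seq_num = int(line[18:22])          # columns 19-22
            tagged[chain_id].append(seq_num)
    return dict(tagged)
```

Residues reported this way would still need to be checked for lying at the beginning or end of the chain before being treated as N- or C-terminal tags.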
We also generated a new set of full-length ASTRAL chain sequences based on PDB SEQRES records, with tags removed, as well as nonredundant subsets of this set. The removal of tags also resulted in changes to all nonredundant sets that were built using fixed E-value or % identity thresholds: in some cases, removal of a tag caused pairwise sequence similarity to fall below the threshold, while in other cases, removal of dissimilar tags caused similarity of the "natural" parts of the proteins to increase. For example, among the sets of PDB chain sequences we created for ASTRAL 2.06 with a 95% sequence identity threshold (see [8] for details), the tagless set contains

Automated Classification Protocol
Our automated classification algorithm and benchmarking protocol are described in detail in a previous manuscript [5]. We previously found that the error rate for manually classified entries in SCOP was as low as 0.08%, mostly as the result of typos in entering domain boundaries [5]. We have undertaken studies to demonstrate that we can liberalize some parameters of our automated protocol while retaining the same accuracy. We introduced these changes starting with SCOPe 2.04 in order to classify more PDB entries. As with our prior algorithm, we have validated our current automated classification method against all manually curated versions of SCOP, finding no cases in which the superfamily was predicted incorrectly, or any predicted domain boundary differed from the correct boundary by more than 10 residues. While our current pipeline is very accurate, these strict requirements still limit its application to about 50% of newly solved structures. Changes made since our previous publication include:
• Removing prohibitions against automatically classifying low-resolution, NMR, and ribosomal structures (low-resolution and ribosomal structures are still limited to being classified in the applicable sections of the SCOPe hierarchy).
• Allowing PDB chains with any number of domains to be classified (was previously limited to two domains).
• Increasing the number of residues by which we extend BLAST annotations to chain ends or gaps, from 10 to 15 residues.
• Removing the requirement that multiple BLAST hits from a query PDB chain being automatically classified must be to different target SCOPe domains.
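To make the effect of the extension parameter concrete, the sketch below shows one plausible reading of "extending a BLAST annotation to the chain end": if the unaligned stretch between the hit and a chain terminus is no longer than the allowed extension, the domain boundary is moved to the terminus. The function name and the exact rule are assumptions for illustration; the actual protocol is described in [5], and extension to internal gaps is not modeled here.

```python
from typing import Tuple


def extend_to_chain_ends(
    start: int, end: int, chain_length: int, max_extension: int = 15
) -> Tuple[int, int]:
    """Extend a 1-based, inclusive hit (start, end) to nearby chain termini."""
    if not 1 <= start <= end <= chain_length:
        raise ValueError("hit must lie within the chain")
    if start - 1 <= max_extension:            # short unaligned N-terminal stretch
        start = 1
    if chain_length - end <= max_extension:   # short unaligned C-terminal stretch
        end = chain_length
    return start, end
```

Under this reading, a hit covering residues 12-190 of a 200-residue chain would be extended to 1-200 with the 15-residue limit, but only to 12-200 with the earlier 10-residue limit.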

In order to improve the usability of the SCOPe website on tablet and mobile phone browsers, we rebuilt the front end using Bootstrap (http://getbootstrap.com). The new site is "responsive," meaning that the layout and navigation controls automatically adjust based on the browser size, making the site convenient to use on a wide range of devices, from desktop computers to tablets and phones. The website also supports SSL (i.e., encrypted connections using the HTTPS protocol). Like our previous website, the new SCOPe website can display data from all versions of SCOPe, SCOP, and ASTRAL since release 1.55.
All data are stored in a relational (MySQL) database back end, which is also available for download.

The angle of the blue baseline between releases reflects the degree of divergence from comprehensive and fully manually curated releases. SCOP2 [31] is a major redesign of SCOP that enables curators to annotate a richer set of evolutionary relationships between proteins, providing a more precise and accurate characterization of protein relationships. SCOP2 is currently available as a prototype that classifies 995 proteins. A dashed line indicates that the SCOP2 prototype is partially based on SCOP 1.75.

structure, a 9-stranded β-sandwich (PDB code 2quo), revealed unexpected structural similarity to several other bacterial toxins: ColG collagenase from Clostridium histolyticum (PDB code 1nqd) and Cry4Ba toxin from Bacillus thuringiensis (PDB code 1w99). Of these structures, only 1nqd had been classified in SCOP, starting with version 1.65, in the "Collagen-binding domain superfamily," sccs b.23.2 under the "CUB-like" fold, b.23. Although Cry4Ba had not been classified in SCOPe, other toxins from the Cry family are: for example, Cry3A from B. thuringiensis (PDB code 1dlc) has been classified in SCOP since the earliest version with stable identifiers, with its C-terminal β-sandwich domain in the "Galactose-binding domain superfamily," sccs b.18.1, under the "Galactose-binding domain-like" fold, b.18. All four structures are shown in Fig. 2B. Three pieces of evidence from the Van Itallie study convinced us that the superfamilies b.23.2 and b.18.1 were in fact homologous, despite having originally been classified in separate SCOP folds. First, the authors of the CPE study showed that when the CPE, ColG, and Cry4Ba structures are aligned, analogous positions in the core β-strands have similar sequences (albeit insufficiently significant to be identified without the benefit of structural alignment). Second, all four structures have identical β-sheet topologies, and are more similar to other structures in the b.18 fold (where most structures have 9 β-strands) than to structures in the b.23 fold (where most structures have 10 β-strands, with several of the strands typically being longer, or distorted). Third, the proteins have similar functional roles, as bacterial toxins. We therefore merged the Collagen-binding domain superfamily b.23.2 into the Galactose-binding domain superfamily b.18.1 in SCOPe 2.05, making it into a new family under the existing superfamily. CPE C-terminal domain-like proteins were classified as another new family within the same superfamily.

SCOPe 2.04, we classified several members of the GLEYA domain family (Pfam PF10528), which are homologous to PA14. We therefore split the APA entries into two parts: the N-terminal domain, and the remaining C-terminal domains. The N-terminal fold contains the PA14 superfamily, which in turn contains PA14 and GLEYA families. Structures of several PA14 representatives are shown in Fig. 2C. The remaining C-terminal domains of APA still do not have structurally characterized homologs classified in

Fig. 2: Examples of manual curation in SCOPe. A) Top and side views of F1 and V1 ATP synthase subunits; the "top" view is oriented towards the membrane. Conserved N-terminal, middle, and C-terminal domains of the alpha and beta subunits of F1 (A and B subunits of V1) are shown in blue, orange, and green, respectively. The "bulge" domain of V1, which represents a new SCOPe superfamily in 2.06, is shown in red. Other subunits of F1 and V1 are shown in light grey. B) Four homologous domains from bacterial toxins are colored in a spectrum ranging from blue at the N-terminal end to red at the C-terminal end. Other domains from the structures are shown in grey. C) Two homologous domains from the PA14 superfamily are colored in a spectrum ranging from blue at the N-terminal end to red at the C-terminal end. Other domains from the structures are shown in grey. PA14 was part of the Anthrax Protective Antigen fold (the entire structure shown on the left) in versions of SCOP and SCOPe prior to SCOPe 2.04.
SCOPe. This large backlog is a consequence of the fact that SCOP has not comprehensively classified every protein in the PDB since SCOP 1.71, which classifies all proteins released by the PDB prior to 18 January 2005. Although some new families were manually classified in SCOP versions 1.73 and 1.75, none were classified between the release of SCOP 1.75 in June 2009 and the release of SCOPe 2.04 in June 2014. Recent advances in high-resolution cryo-electron microscopy have contributed to this backlog; for example, the structure of the yeast spliceosome [23] represented the first structural characterization of 15 different Pfam families. We prioritized manual classification of the largest unclassified Pfam families for two reasons: first, because having at least one manually classified structure from a Pfam family allows our automated tools to work on many other members of that family, and second, because a large number of solved structures is a crude proxy for scientific impact, and therefore we expect larger families to be of potentially greater interest to SCOPe users.
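The prioritization described in this section (keep only Pfam matches at or above the trusted cutoff that cover at least 75% of the model, then rank unclassified families by how many structures they contain) can be sketched as follows. The record layout, field names, and the use of distinct chain identifiers as a stand-in for "structures" are assumptions made for illustration; the sketch does not parse HMMER output.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class PfamHit:
    chain_id: str          # hypothetical identifier, e.g. "1abc_A"
    family: str            # Pfam accession
    score: float           # bit score of the match
    trusted_cutoff: float  # trusted cutoff of that family
    model_start: int       # 1-based, inclusive position on the Pfam model
    model_end: int
    model_length: int


def passes_filters(hit: PfamHit, min_coverage: float = 0.75) -> bool:
    """Apply the trusted-cutoff and model-coverage criteria."""
    covered = hit.model_end - hit.model_start + 1
    return hit.score >= hit.trusted_cutoff and covered / hit.model_length >= min_coverage


def rank_unclassified(
    hits: Iterable[PfamHit], classified_families: Set[str]
) -> List[Tuple[str, int]]:
    """Return (family, chain count) pairs, largest unclassified family first."""
    chains: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        if passes_filters(hit) and hit.family not in classified_families:
            chains[hit.family].add(hit.chain_id)
    return sorted(((f, len(c)) for f, c in chains.items()), key=lambda x: (-x[1], x[0]))
```

The head of the returned list corresponds to the families a curator would examine first under this scheme.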
In producing SCOPe versions 2.04-2.06, we curated structures from the 126 largest Pfam families not classified in SCOP, using the same principles previously employed by the SCOP curator to identify domains and classify them in the hierarchy [24]. As we expected, the relationship between Pfam families and SCOPe families (or superfamilies) is not 1:1. Among the classified structures from 103 nonribosomal protein families, 28 (27%) had at least one domain classified into a new SCOPe fold, 24 (23%) into a new superfamily in an existing fold, 29 (28%) into a new family within an existing superfamily, and 22 (21%) as new proteins within an existing family. These results are similar to the novelty of newly classified structures in SCOP ten to twenty years ago, for structures that did not have significant sequence similarity with previously classified structures [25,26]. Since ~50% of newly classified Pfam families correspond to a new SCOPe fold or superfamily, we project that over 1,000 new folds and superfamilies are harbored in the more than 2,000 Pfam families that are structurally characterized but still unclassified in SCOPe.
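The percentages and the projection above follow from simple arithmetic on the reported counts, reproduced here as a short check. The figure of 2,000 is used as a lower bound for "more than 2,000" unclassified families, and the projection assumes, as the text does, that the remaining families will show the same rate of novelty.

```python
# Counts reported for the 103 nonribosomal Pfam families.
counts = {
    "new fold": 28,
    "new superfamily in existing fold": 24,
    "new family in existing superfamily": 29,
    "new protein in existing family": 22,
}

total = sum(counts.values())
percent = {label: round(100 * n / total) for label, n in counts.items()}

# Fraction of families that yielded a new fold or a new superfamily.
novel_fraction = (
    counts["new fold"] + counts["new superfamily in existing fold"]
) / total

# Projection onto a lower bound of 2,000 still-unclassified families.
projected_new_folds_and_superfamilies = novel_fraction * 2000
```

The fraction of families yielding a new fold or superfamily is 52/103, which is why the projection comes out slightly above 1,000.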