UniProt: a worldwide hub of protein knowledge

Abstract The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life. Detailed annotations extracted from the literature by expert curators have been collected for over half a million of these proteins. These annotations are supplemented by annotations provided by rule based automated systems, and those imported from other resources. In this article we describe significant updates that we have made over the last 2 years to the resource. We have greatly expanded the number of Reference Proteomes that we provide and in particular we have focussed on improving the number of viral Reference Proteomes. The UniProt website has been augmented with new data visualizations for the subcellular localization of proteins as well as their structure and interactions. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.


INTRODUCTION
The proteins expressed in a cell at any moment of time determine its function, its topology, how it reacts to changes in environment and ultimately its longevity and well-being. Improvements in experimental techniques are providing ever deeper information on the structure and function of individual proteins, whilst large-scale sequencing efforts are driving increased coverage of the complete proteomes of the breadth of organisms that populate the tree of life. Our challenge is to capture the growing depth and breadth of information and make it easily available and interpretable to our users. The UniProt Knowledgebase (UniProtKB) combines reviewed UniProtKB/Swiss-Prot entries, to which data have been added by our expert biocuration team, with the unreviewed UniProtKB/TrEMBL entries which are annotated by automated systems including our rulebased systems. The output from large-scale sequencing projects form the vast majority of the ∼120 million entries in UniProtKB/TrEMBL. Improved metagenomic assembly and binning tools are resulting in an increasing number of high-quality metagenomic assembled genomes (MAGs) being represented in the database. Additionally, we provide the UniRef databases that cluster sequence sets at various levels of sequence identity and the UniProt Archive (UniParc) that delivers a complete set of known sequences, including historical obsolete sequence.
We describe the major developments that we have made since our last update published in this journal in 2017 (1) with a focus on how we are positioning the UniProt database to address the challenges of the increased volume of sequence data entering the database.

Growth of sequences in UniProt
The UniProtKB Proteomes portal (https://www.uniprot. org/proteomes/) provides access to proteomes for over 84 thousand (84 387, release 2018 07) species with completely sequenced genomes. The majority of these proteomes are based on the translation of genome sequence submissions to the INSDC source databases--ENA, GenBank and the DDBJ (2). To ensure comprehensiveness, complementary pipelines have been developed to supplement these with genomes sequenced and/or annotated by groups such as Ensembl (3), Vectorbase (4) and WormBase ParaSite (5). This has been extended to RefSeq genomes, allowing us to import key genomes of special interest annotated by NCBI's Eukaryotic Genome Annotation Pipeline. Release 2018 03, for example, saw the inclusion of a set of primate proteomes and a dozen genomes for marine mammals were imported from RefSeq for release 2018 08. Taxa may be specifically targeted by curators to fill gaps in the taxonomic space and additional proteome import can be requested by individual users.
The continuing growth of sequenced genomes is a challenge for databases as much of this growth is being driven by sequencing of very similar and almost identical strains (90% of proteins have >90% identity) of the same species. We continue to see exponential growth in many of our datasets, see Figure 1. We are managing this growth by a number of processes. Most critically, a redundancy removal process was first introduced in 2015. This process identifies and removes almost identical proteomes of the same species before their inclusion in UniProtKB (https://www.uniprot. org/help/proteome redundancy) and places their sequences in UniParc. Currently this process has removed ∼38% all complete proteomes (∼241 million proteins) from UniPro-tKB. As can be seen in Figure 1, the redundancy reduction both greatly reduced the size of UniProtKB as well as made its growth more scalable. This approach has now been extended from prokaryotes to fungi, resulting in the deprecation of ∼1 million fungal protein records in release 2016 08.

Reference proteomes
For the remaining proteomes we provide a Reference Proteome set (∼9% of total proteomes) selected by the research community and supplemented with selected proteomes from a computational clustering (6) to provide the best annotated proteome in their cluster. Recently, we have added virus Reference Proteomes (described below) to this list. The growth of Reference Proteome sets is shown in Figure  2. Programmatic access to the non-redundant proteomes included in UniProtKB is provided via the Proteins API (https://www.ebi.ac.uk/proteins/api/doc/) (7) whilst all proteomes reference, non-reference and redundant in UniPro-tKB and UniParc are accessible via the Proteome section of the UniProt website (https://www.uniprot.org/proteomes/). UniProt reference proteomes are used by user communities such as the Quest for Orthologs (QfO) group as the 'gold standard' reference proteome dataset for orthologue benchmarking (8). The QfO reference proteomes datasets are a compiled subset of the UniProt reference proteomes, comprising well-annotated model organisms and organisms of interest for biomedical research and phylogeny, with the intention to provide broad coverage of the tree of life whilst maintaining a low number of proteomes for the benchmark. Such datasets have been generated annually from the UniProt Knowledgebase (UniPro-tKB) database for the past seven years. To this end, a genecentric pipeline has been developed and enhanced over the past year by UniProt. These complete, non-redundant reference proteomes are publicly available at ftp://ftp.ebi.ac.uk/ pub/databases/reference proteomes/QfO. The datasets are provided either in SeqXML format or as a collection of FASTA files and include genomics coordinates for most proteins. In the last release for QfO in April, the number of species increased from 66 to 78 comprising 1,311,679 non-redundant protein sequences (48 Eukaryotes and 30 bacteria-archaea).
At the time of our previous publication, the number of virus Reference Proteomes in UniProtKB stood at 428--practically an order of magnitude less than the num-ber of known viral species (which stands at 4405 according to the International Committee on Taxonomy of Viruses, or ICTV). To improve Reference Proteome Coverage of viruses in UniProt we have undertaken a concerted effort to curate Complete Proteomes and to use these as input for the computational selection of Reference Proteomes. We compiled and curated complete viral genome (proteome) data from INSDC (which feeds into UniProt) as well as a range of other specialized viral genome resources including ViPr (9), IRD (10), the HIV database (https://www.hiv.lanl. gov/), Papillomavirus Episteme (11) and the HCV database (https://hcv.lanl.gov/) and HBV database (12). This curation effort included the identification and removal of segmented viruses lacking one or more segments, and increased the number of the Complete Proteomes of viral origin in UniProt to 86 951 (UniProt release 2018 07). Computational clustering of this enhanced viral Complete Proteome set (13) produced 5,887 viral Reference Proteomes. Note that redundancy removal procedures are not currently applied to viruses, due to the challenges posed by the high number of variants in the small genomes of these species.
Technological advances have enabled the sequencing of the genetic material from all the microorganisms in a particular environment without the cultivation of any of the community members. Concurrent advances in bioinformatics have enabled the rapid assembly of genomes from metagenomes (MAGs) and a corresponding input of sequence data into the UniProt database with >4800 assembled proteomes in the database (4839, release 2018 07). The existing data input pipeline is currently based on those submissions to the INSDC which fulfil certain threshold criteria but future plans are to move to using the EBI Metagenomics resource, MGnify (14), as the main source of metagenome derived assemblies. We will include only those MAGs that show a high level of completeness and a low level of contamination.

Expert curation progress
Expert curation of the literature is critically important to the UniProt databases. The information extracted from scientific publications is stored in the UniProtKB/Swiss-Prot section of the UniProt Knowledgebase and describes functional information both in the form of human readable freetext/controlled syntax summaries and via structured vocabularies such as the Gene Ontology (GO) (15) or ChEBI (16). Expert curation is labour intensive, with curators assimilating and evaluating multiple lines of evidence from the text and figures of relevant publications, but this has repeatedly proven to be the most efficient method of extracting all relevant data from a paper. We have previously demonstrated that UniProtKB expert biocuration captures between 35 and 45% of the curatable literature for any given species, rising to 50% of the curatable literature for Homo sapiens (17). UniProtKB/Swiss-Prot entries serve as a source of functional data for the development and enhancement of bioinformatics prediction tools, so we prioritize the capture of functional data that cannot currently be correctly predicted by computational methods, such as proteins which predictive protein signatures would describe as being part of an enzyme family but are in fact non-functional due to D508 Nucleic Acids Research, 2019, Vol. 47, Database issue  the loss of specific amino acid residues. An example of this is PLC1 (UniProtKB:Q15111), a member of the phosphoinositide phospholipase C family (InterPro:IPR001192) in which the existence of an asparagine residue in the active site instead of the conserved histidine residue suggests a non-catalytic role for this protein. This has broader applications than the annotation of just one protein as infor-mation on protein function is computationally transferred from UniProtKB/Swiss-Prot to sequence-related, but less well studied, proteins in UniProtKB/TrEMBL. Therefore, it is important to ensure we have a broad enough collection of annotated proteins in UniProtKB/Swiss-Prot to add value to new entries as the taxonomic range of fully sequenced proteomes continues to expand. Once a record has been moved into UniProtKB/Swiss-Prot from UniProtKB/TrEMBL, there is a concerted effort to ensure that it is regularly updated to ensure the current understanding of a protein's activity reflects that in the scientific literature. This task can be difficult, as our knowledge of protein function continues to evolve and finer grained experimental techniques provide new knowledge that may appear to contradict previous observations. An example of this is provided by the proteins involved in N 6 -methyladenosine (m 6 A) modification which takes place on both coding and non-coding RNAs containing a Pu[G>A]m 6 AC[A/C/U] (Pu = purine) sequence, the nucleotide A3 of which becomes N6-methylated. The modification acts as a key regulator of mRNA stability: methylation is completed upon the release of mRNA into the nucleoplasm and affects various processes, such as mRNA stability, processing, translation efficiency and editing. The enzymes that catalyse this process were believed to have been fully characterized some years ago, but more recent data has changed our understanding of how these molecules work. m 6 A methylation is performed by a heteromeric WMM N 6 -adenosine-methyltransferase complex, originally thought to be a heterotrimer of METTL3 (UniPro-tKB:Q86U44), METTL14 (UniProtKB:Q9HCE5) and WTAP (UniProtKB:Q15007), but additional component subunits have recently been identified (18). Initial work indicated that METTL3 and METTL14 form a tight heterodimer in order to perform catalysis on the preferred motif sequence. Both proteins are members of the MT-A70 family and are classified as S-adenosyl-L-methioninedependent methyltransferases by sequence prediction resources. Methyltransferase activity has been reported for both (19). Subsequent structural studies have now shown that that only one protein, METTL3, constitutes the catalytic subunit (20)(21)(22). The other subunit, METTL14, has a degenerate active site that is unable to accommodate donor and acceptor substrates and plays a non-catalytic role in maintaining complex integrity and substrate RNA binding (22). WTAP appears to serve as a regulatory subunit. A newly identified subunit of the WMM methyltransferase complex, VIRMA (UniProtKB:Q69YN4) plays a role in mediating mRNA m 6 A methylation in 3 UTRs and near stop codons and in recruiting the methyltransferase core components.
Expert biocuration has been essential in keeping the relevant UniProtKB/Swiss-Prot entries up to date with each step in the pathway to reflect our understanding of the function of these proteins and the molecules they associate with. Every piece of knowledge we capture is associated with evidences (see https://www.uniprot.org/help/evidences for further details) to indicate the source of information and we have added a CAUTION comment to the METTL14 entry to both provide some background and inform users that the initially reported methyltransferase activity is an unsafe observation (Figure 3). The correct annotation on the UniProtKB/Swiss-Prot entry will ensure that misleading data is not computationally added to orthologous proteins in UniProtKB/TrEMBL on the basis of an over-predictive protein signature, again highlighting the importance of expert biocuration. Future plans for the manual curation activities in UniProtKB include the development of mecha-nisms to identify and highlight contradictory information in existing protein entries in order to improve rigor and reproducibility. This example also illustrates the collaborative nature of the UniProt Consortium in that the molecular interactions involved have also been added to the IMEx Consortium (www.imexconsortium.org) dataset by UniProt curators (23), the protein complexes involved curated into the Complex Portal (https://www.ebi.ac.uk/complexportal/) (24) and both the proteins and complexes have been further annotated using GO (http://www.geneontology.org). New terms were added to the GO to enable this and others were updated. All of these data are subsequently reimported back to enrich the appropriate entries in UniProtKB/Swiss-Prot but are stored and maintained by these domain-specific resources.

Automatic annotation progress
UniProt's automatic annotation pipelines enrich the unreviewed records in UniProtKB/TrEMBL with classification and functional annotations. InterPro is used to classify sequences at superfamily, family and subfamily levels and to predict the occurrence of functional domains and important sites. InterPro integrates predictive models of protein function, so-called 'signatures', from a number of member databases. In UniProtKB/TrEMBL entries, domains from the InterPro member databases PROSITE, SMART or Pfam are predicted and used to automatically provide domain annotations. All automatic annotations are labelled with their evidence/source. UniProt has developed two complementary rule-based prediction systems, UniRule and the Statistical Automatic Annotation System (SAAS) to automatically annotate UniProtKB/TrEMBL in an efficient and scalable manner with a high degree of accuracy. These prediction systems can annotate protein properties such as protein names, function, catalytic activity, pathway membership and subcellular location, along with sequence-specific information, such as the positions of posttranslational modifications and active sites. We continue to increase the number of Rules used for annotation and this has now grown to over 6000 in total as shown in Figure 4.

Ongoing developments in automatic annotation
We are evaluating new methods for computational annotation of function based on based specifically on their domain architecture (DAAC, or Domain Architecture Alignment and Classification) (22,25) or any combination of sequence properties and features (ARBA, or Association-Rule-Based Annotator).
The Domain Architecture Alignment and Classification system (DAAC, (25)) is based on the assumption that not only the presence of domains is important for the protein to perform its function but also their architecture i.e. their arrangement and order. It performs the prediction in four major steps. Firstly, it generates the different ordered combinations of InterPro domains, these combinations are termed domain architectures (DA). Secondly, it calculates the pairwise similarity between DAs using alignment. The third step is that of training. In this phase, clusters of proteins are created, where each cluster represents a single annotation. A threshold of belonging is determined for each cluster while maximising the classification performance, and then annotations corresponding to low performance clusters are discarded. In the final step, the clusters along with their corresponding thresholds serve as models to annotate other query proteins. The fact that DAAC takes into account the arrangement of domains within proteins allows it to make more accurate predictions for specific cases where proteins can have the same set of domains but with different arrangements and hence could perform different functions. Association rule mining and selection techniques can be used effectively as computational methods for functional prediction. ARBA (Association-Rule-Based Annotator) is based on these techniques and can be used to enhance the quality of automatically generated annotations as well as annotating proteins with unknown functions. It learns on data from UniProtKB/Swiss-Prot and uses InterPro signatures and organism taxonomy as attributes to predict most of the protein functional annotations including GO terms, metabolic pathways, EC numbers, etc. With respect to certain quality measures, ARBA finds all rules which would define significant relationships between attributes and functional annotations in UniProtKB/Swiss-Prot entries. The set of extracted rules represent the comprehensive knowledge which could explain protein functions. However, these rules comprise redundant information and their high number makes it infeasible to apply them on large sets of data such as UniProtKB/TrEMBL. To address this issue, ARBA puts these rules into a fast competition process called SkyRule based on two concepts, namely dominance and comparability. Rules are then elegantly and considerably reduced in number and aggregated to form concise prediction models that assign functional annotations to UniProtKB entries.
In order to share our knowledge in computational annotation and our rule-based systems, we are working on developing standard formats for rule annotation and tools to apply them. UniFire (the UNIprot Functional annotation Inference Rule Engine) is a standalone tool to apply the UniProt annotation rules on third party data. This engine is based on a Business Rules Management System (BRMS) named Drools and is developed in conjunction with an exchangeable format for UniProt rules and the protein data to be annotated. we termed it URML (Uniprot Rule Markup Language). URML respects the definition of business rules, and is therefore executable. When applied on third party proteins, UniFire will transform the data according to the URML format and execute the UniProt rules on them to generate functional predictions. We would like to work with the scientific community in this development and encourage users to register their interest in the links provide  A new change in UniRef to provide GO annotation was based on user feedback. The UniRef databases (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Our study indicated functional annotations are generally preserved in UniRef clusters due to the intra-cluster homogeneity (26).
The GO terms of member protein sequences are highly consistent at UniRef90/50 levels in each of the three GO domains: Molecular Function, Biological Process and Cellular Component. We have started to compute GO annotations last year for UniRef90 and UniRef50 clusters: A GO term is assigned to a cluster when it is found in every UniProtKB member that has GO annotation in this cluster, or when it is a common ancestor of at least one GO term from each such member. We evaluated the GO term prediction by comparing the protein sequences that newly acquired GO annotation with the ones that GO terms were assigned by UniRef in previous releases. The overall rate of prediction precision is higher than 97%, in all three GO domains, indicating the confidence to propagate UniRef GO annotation to member sequences. Thus, UniRef provides an annotation source to our users. They can search against UniRef and then obtain GO annotation from their results.
GO annotations of UniRef can be found in the UniRef XML files from the UniProt ftp site: ftp://ftp.uniprot.org/ pub/databases/uniprot/uniref/uniref50/.

UniProt bibliography
UniProt compiles additional bibliography from external sources to complement the curated literature set in UniProtKB/Swiss-Prot with additional publications and to add relevant literature to UniProtKB/TrEMBL entries not yet curated. The sources of literature are of two types: biological databases (currently 22 sources, including model organism, structure, post-translational modification, function, disease and interaction databases) where the litera- ture mapped to UniProt entries are added in a collaborative manner, and secondly, text mining (currently PubTator Tmvar for literature about human variants in disease (27) and pGenN (28) for literature related to plant proteins). Altogether these sources add 902 956 unique papers covering 363 591 entries, with a total of 36 147 401 UniProt accession-PubMed identifier (accession/PMID) pairs (Release 2018 08).
In the entry publication section, publications providing evidence for a specific annotation type are organized across different categories, such as function, interaction and expression, based on the type of data they contain. This categorization facilitates navigation through the protein entry literature by offering users quick access to literature of interest related to a given protein-topic. The classification of the UniProt references is automatically done based on the topics found in the flat file RP lines that are linked to the references, whereas the computationally mapped bibliography has to be categorized in a different way.
As a first approach, we rely on the information provided by the underlying database sources for this categorization (e.g. the literature provided by iPTMnet (29) will fall into PTM/Processing category and literature provided by Intact will be classified as Interaction). However, this approach is not optimal for all literature, thus we are now systematically classifying the articles into the UniProt main categories present in the entry using UPCLASS, a neural network based pipeline to classify publications for UniPro-tKB protein entries, developed in SIB Text Mining group (https://www.sib.swiss/ruch-patrick/patrick-ruch-sub). UP-CLASS has been trained and evaluated with the curated lit-erature and categories in UniProt entries (with precision, recall and F-score of 0.80, 0.62 and 0.70, respectively). The model used for UniProt classification considers the protein names corresponding to the accession-PMID pair, resulting in a more accurate protein-centric classification. Implementation of UPCLASS has enabled the classification of 32 196 740 UniProt accession-PMID pairs that have been previously displayed as unclassified, representing 254 038 protein entries and 768 505 PMIDs (Release 2018 08).

Outreach and training
The training of our users to make the best use of our data, whether accessed via the web, the API or downloaded from the FTP site, is a key mission of the UniProt Consortium. We have played an active role in the provision of handson training workshops across the globe, supporting both early-career researchers and domain specialists such as clinicians and proteomic scientists. Whilst face-to-face training will remain an important route for disseminating very detailed information, it is not always the most efficient method of reaching out to large numbers of users or of penetrating new user groups. We have therefore scheduled a series of webinars, with basic modules on searching protein sequence and function repeated regularly and more specialist units, for example on protein structure, machine learning protein function and linking protein and genome annotation delivered less often. The webinars are recorded, and are subsequently made available online (https://www. ebi.ac.uk/training/online/) where they are supported by related online training materials and YouTube videos (https: //www.youtube.com/user/uniprotvideos/). In order to reach new communities, the webinars are widely advertised, making pro-active use of social media forums such as Twitter (@uniprot) and FaceBook (https://www.facebook.com/ uniprot.org/) as well as established mailing lists.

Website developments
During the last year we have added three new visualizations to the UniProt website. Firstly, we have added a method for viewing molecular interactions, secondly a method for viewing the subcellular localization of the proteins and finally we have added a molecular structure viewer. Together these enable users to rapidly understand the molecular context of UniProt entries.
For UniProtKB entries that include an Interaction section, we show details of the protein's binary interactions with other proteins, using a high-quality dataset supplied by the IMEx Consortium. The binary interactions of a protein are now shown as a matrix that shows the interaction partners of your protein and also shows which of those partners interact with each other. For example, the interaction matrix for the human E3 ubiquitin-protein ligase parkin protein (UniProtKB:O60260) is shown in Figure 5.
One of the sections on the protein entry pages is Subcellular Location. This section provides information on the location and the topology of the mature protein in the cell. We now allow users to visually explore the subcellular location in UniProtKB entries. The visualization presents image templates from COMPARTMENTS (https: //compartments.jensenlab.org/) (28,30) combined with protein location data from UniProt (expert annotation, rulebased automatic annotation) and imported from GO annotation. Figure 6 below shows the subcellular location view from the Human Copper-transporting ATPase 2 protein (UniProtKB:P35670).
Structural information is important in understanding the molecular mechanisms that allow proteins to perform their specific functions. UniProt now provides a protein structure viewer in the 'Structure' section of the entry view of the website as well as in the ProtVista protein viewer (see Figure  7). The structures are rendered using the Litemol viewer. This innovation helps users to connect protein information in UniProt with structural data.

CONCLUSION
UniProt continually develops its processes and procedures to efficiently provide a global collection of protein sequences and annotations. Over the past two years we have seen significant growth in numbers of genomes and proteins D514 Nucleic Acids Research, 2019, Vol. 47, Database issue sequences. We have continued to organize that data and provide it to our users in a variety of user-friendly ways. We have recently also updated the terms of our license to a CC-BY (4.0) making it easier for UniProt data to be reused by others.
A critical component for UniProt is to connect papers to the relevant entries. In this paper we have described how this is carried out by our expert curators as well as how we supplement these curated papers by connecting papers from other databases and text mining tools. You can help enable curators and text mining tools to connect your scientific papers to UniProt and other molecular biology databases through tagging mentions of proteins with UniProt identifiers. We request that authors use the following format (UniProtKB:P68369) to describe a protein within the text of a paper. This formatting follows the compact identifier representation that has been recently proposed to enable uniform resolution of biomedical resource identifiers (31). Using this format can also provide a simple mechanism to refer to other data resources. Some journals already have specific formatting requirements for such citations to accessions and these should always be given precedence.
We greatly value the feedback and annotation updates from our user community. Please send your feedback and suggestions to the e-mail address help@uniprot.org or via the contact link on the UniProt website.