OGP: A Repository of Experimentally Characterized O-glycoproteins to Facilitate Studies on O-glycosylation

Numerous studies on cancers, biopharmaceuticals, and clinical trials have necessitated comprehensive and precise analysis of protein O-glycosylation. However, the lack of updated and convenient databases deters the storage of and reference to emerging O-glycoprotein data. To resolve this issue, an O-glycoprotein repository named OGP was established in this work. It was constructed with a collection of O-glycoprotein data from different sources. OGP contains 9354 O-glycosylation sites and 11,633 site-specific O-glycans mapping to 2133 O-glycoproteins, and it is the largest O-glycoprotein repository thus far. Based on the recorded O-glycosylation sites, an O-glycosylation site prediction tool was developed. Moreover, an OGP-based website is already available (https://www.oglyp.org/). The website comprises four specially designed and user-friendly modules: statistical analysis, database search, site prediction, and data submission. The first version of the OGP repository and the website allow users to obtain various O-glycoprotein-related information, such as protein accession Nos., O-glycosylation sites, O-glycopeptide sequences, site-specific O-glycan structures, experimental methods, and potential O-glycosylation sites. O-glycosylation data mining can be performed efficiently on this website, which will greatly facilitate related studies. In addition, the database can be downloaded from the OGP website (https://www.oglyp.org/download.php).


Introduction
Comprehensive and precise analysis of O-glycoproteins would potentially further the current understanding of their roles in many physiological and pathological phenomena, such as intercellular communication [1], hereditary disorders, immune deficiencies, and cancers [2−4]. Great efforts have been made to analyze the complexity of O-glycosylation. Recent technological advancements in many fields, especially in mass spectrometry (MS), have led to impressive data on O-glycoproteins [5−14]. However, the lack of up-to-date and curated databases hinders the archiving, querying, and utilization of emerging O-glycoprotein data.
Numerous studies have attempted to develop glycosylation-related databases [15−28]. However, most of these databases are focused on N-glycoproteins. Only a few databases contain data on O-glycoproteins. The most extensively used repository, UniCarbKB [16], provides massive N-glycoprotein data and limited O-glycoprotein records. The dbPTM [18,19] is an integrated resource containing over 130 types of post-translational modifications (PTMs). However, it does not provide information regarding site-specific O-glycosylation. O-GLYCBASE [15] provides information regarding both glycans and glycosylation sites and is the most widely used database in O-glycosylation studies. Nevertheless, it has not been updated since 2002. Besides, it contains merely 189 O-glycoproteins and 2142 O-glycosylation sites, lagging behind current O-glycoproteomic data. In short, current O-glycoprotein databases are unsatisfactory, with notable issues including insufficient records, unknown data confidence, outdated data, and user-unfriendly interfaces (Table S1).

Construction of the OGP repository
The OGP knowledgebase was constructed by integrating experimentally verified O-glycoproteins reported between 1998 and 2018 and other existing O-glycoprotein databases [15] (Figure 1A). All proteins were manually curated, aligned with UniProt entries, and merged. Detailed methods of information extraction from the literature are described in File S1. In total, 9354 O-glycosylation sites and 11,633 site-specific O-glycans mapping to 2133 O-glycoproteins of different species have been recorded in the database (Figure 1B). The distribution of species in OGP shows that 69% (1476/2133) of the O-glycoproteins and 75% (7038/9354) of the O-glycosylation sites belong to Homo sapiens (Figure 1C), indicating that O-glycosylation studies have focused predominantly on Homo sapiens. The scale of the OGP repository is more than 20-fold larger than that of the existing O-GlycBase v6.0 (Figure 1D and E). This database will also be updated periodically with newly published data in the future.
The database records data such as proteins, peptide sequences, O-glycosylation sites, and site-specific O-glycans. For each site and site-specific O-glycan, detailed experimental information, such as sample sources, digestion enzymes, enrichment methods, and analytical methods, is integrated. Besides, all O-glycoproteins recorded in the database have been aligned with their UniProt entries. Thus, additional data, including protein sequence annotation, subcellular location, and other PTMs, can be conveniently obtained. To better capture topological information regarding O-glycans, a linear coding method (File S2) has been used in this database to record site-specific O-glycan structures. Furthermore, analytical strategies for each O-glycopeptide, such as immunoprecipitation, gel filtration, and MS methods, were manually extracted, verified, and recorded in the database. These data are easily retrievable from the OGP-based website.

Development of an O-glycosylation site prediction model
Since O-glycosylation is highly complex but important, it is important to better understand glycosylation patterns [29−32]. As a meaningful trial, an O-glycosylation site prediction model was developed using O-glycosylation sites that were meticulously selected from the OGP database. The selection rule was that each site must have been identified by at least one solid method, to ensure reliability and unambiguousness. The site prediction model was generated through three primary steps (Figure 2A; File S3): 1) construction of a dedicated training set; 2) optimization of parameters; 3) evaluation of site prediction performance. Through systematic optimization, a dedicated training set was established with a 1:1 ratio of positive to negative instances (1754 positive site-central sequences and 1754 negative site sequences) (Figure 2B; File S3). Sequences with 11 amino acid residues were considered preferable (Figure 2C; File S3). Thereafter, the performance of different algorithms on O-glycosylation site prediction was compared using Weka 3.8 as a data mining tool. The random forest (RF) algorithm displayed the best performance (Figure 2D and E; File S3) and was used to construct the prediction model. Ten-fold cross validation indicated

Construction of the OGP-based website
Based on the OGP database, a dedicated website was constructed using Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), JavaScript (JS), and PHP: Hypertext Preprocessor (PHP). The design of the website is shown in Figure 3A. It contains three repositories in the underlying database layer: OGP, prediction model, and data submission. The OGP repository is the core database that stores O-glycosylated protein sequences, sites, site-specific O-glycans, corresponding experimental data, and references. The prediction model contains a model file and an inherent training set. Data submission is designed to preserve user-uploaded information. By performing a set of actions including protein query, prediction model training, and data uploading in the operation layer, the website outputs four modules: statistical analysis, database search, site prediction, and data submission. The website is supported by most common web browsers such as Internet Explorer, Mozilla Firefox, Google Chrome, Safari, and Opera.

Utility and the interface of the OGP website
The OGP-based website, equipped with a user-friendly graphical interface, is already available at http://www.oglyp.org/ and comprises four main modules: statistical analysis, database search, site prediction, and data submission. Other functions, including database downloading, a display of the latest literature, and links to useful databases (UniProt, UniCarbKB, and O-GlycBase), are also provided. The homepage of this website is shown in Figure 3B. Furthermore, the website provides detailed instructions and frequently asked questions (FAQ) to assist users.
The "statistical analysis" module provides an overview of the OGP repository, including the scale of total O-glycoproteins, O-glycosylation sites, and site-specific O-glycans (Figure S1A), taxonomic distribution of O-glycoproteins and O-glycosylation sites (Figure S1B), database-scale comparison between OGP and O-GlycBase v6.0 (Figure S1C), and O-glycoprotein data-related analyses by ingenuity pathway analysis (IPA) (Figure S1D-F). Furthermore, extra information can be fetched from this module. For example, more than 95% of the reported O-glycosylation sites are present in mammals, 75% of which are present in Homo sapiens, indicating that O-glycosylation in other species warrants further analysis. All statistical information will be updated in real time with the expansion of the OGP database.
In the "database search" module, users can retrieve O-glycoproteins flexibly by specifying the gene name, protein name, UniProt accession No., or glycan structure (Figure S2). Figure 4 shows a webpage returned from a query of fibrinogen gamma chain (OGP database search accession No.: P02679). These results comprise well-structured data on protein O-glycosylation, including basic protein information (i.e., protein name, UniProt accession No., and species, Figure 4A), protein sequences and all recorded O-glycosylation sites highlighted in pink (Figure 4B), all experimentally verified O-glycopeptides and site-specific O-glycans (Figure 4C), and corresponding experimental methods, identifiers, and source references (Figure 4D and E).
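The multi-field lookup described above can be sketched as a simple filter over in-memory records. This is only an illustration under stated assumptions: the `GlycoproteinRecord` schema, the `search` function, and the toy records are hypothetical and do not reflect the actual OGP tables or website code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlycoproteinRecord:
    # Hypothetical schema; field names are not taken from OGP.
    accession: str
    gene: str
    protein_name: str
    glycans: List[str] = field(default_factory=list)


def search(
    records: List[GlycoproteinRecord],
    *,
    gene: Optional[str] = None,
    protein_name: Optional[str] = None,
    accession: Optional[str] = None,
    glycan: Optional[str] = None,
) -> List[GlycoproteinRecord]:
    """Return the records that satisfy every criterion that was supplied."""

    def matches(rec: GlycoproteinRecord) -> bool:
        # Accession and gene name: exact, case-insensitive match.
        if accession and rec.accession.upper() != accession.upper():
            return False
        if gene and rec.gene.upper() != gene.upper():
            return False
        # Protein name: case-insensitive substring match.
        if protein_name and protein_name.lower() not in rec.protein_name.lower():
            return False
        # Glycan: the exact structure string must be recorded for the protein.
        if glycan and glycan not in rec.glycans:
            return False
        return True

    return [rec for rec in records if matches(rec)]


# Toy records with made-up values, for illustration only.
records = [
    GlycoproteinRecord("TOY0001", "GENE1", "Toy protein alpha", ["glycanA"]),
    GlycoproteinRecord("TOY0002", "GENE2", "Toy protein beta", ["glycanA", "glycanB"]),
]
```

Calling `search(records, glycan="glycanA")` returns both toy records, whereas adding `gene="GENE1"` narrows the result to the first one, since all supplied criteria must hold.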
The site prediction model developed herein has also been incorporated into the website to enable O-glycosylation site prediction. As shown in Figure S3A, users can either fill out the template file with aligned site-central sequences as instructed or simply upload a typical protein FASTA-format file and click on "predict". The prediction results for each site are then displayed directly on the right side of the webpage (Figure S3B). Prediction scores range between 0 and 1; scores higher than 0.5 indicate positive sites, while scores less than or equal to 0.5 indicate probable non-O-glycosylation sites. The higher the score, the greater the probability of a site being O-glycosylated and vice versa. The results can also be downloaded, as shown in Figure S3B.
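The site-central input format and the score cutoff described above can be illustrated with a minimal sketch. It assumes 11-residue windows (the window length reported for the model) centered on each candidate residue, and applies the 0.5 cutoff to a score produced by any classifier. The function names, the `X` padding character for positions beyond the termini, and the restriction of candidates to serine and threonine are illustrative assumptions, not the OGP implementation.

```python
from typing import Iterator, Tuple

WINDOW = 11              # window length reported for the model
HALF = WINDOW // 2       # residues on each side of the central site
PAD = "X"                # assumed placeholder beyond the sequence termini
CANDIDATES = {"S", "T"}  # assumption: only Ser/Thr are treated as candidate sites


def site_windows(sequence: str) -> Iterator[Tuple[int, str]]:
    """Yield (1-based position, 11-residue window) for each candidate residue."""
    seq = sequence.upper()
    padded = PAD * HALF + seq + PAD * HALF
    for i, residue in enumerate(seq):
        if residue in CANDIDATES:
            # Original index i sits at i + HALF in `padded`, so the slice
            # starting at i is centered on the candidate residue.
            yield i + 1, padded[i:i + WINDOW]


def call_site(score: float, cutoff: float = 0.5) -> bool:
    """Scores above the cutoff are positive; scores <= cutoff are negative."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return score > cutoff
```

For instance, `list(site_windows("AAAAASAAAAA"))` yields a single window centered on position 6, and `call_site(0.5)` is `False` because a score equal to the cutoff is treated as negative.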
The "data submission" module enables users to upload new data into the OGP database or submit feedback. All newly submitted data and feedback are carefully recorded in a backend database and will be reviewed manually by experts at regular intervals. Both a template form and an online form are accepted for submission. In addition, when users upload data by file, real-time feedback is shown below the form to inform them of those O-glycoproteins that are already in the OGP database.
In addition, the database is accessible from the OGP website. Download pages can be found in the drop-down menu of tools on the OGP homepage (http://www.oglyp.org/download.php). The detailed top 500 entries can be downloaded directly. Besides, there is a basic version of the database, which provides all the O-glycoprotein accessions and the corresponding O-glycosylation sites for users to download freely. The whole database can also be provided upon E-mail request. The application procedure is described on the website (http://www.oglyp.org/download.php).
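The check that reports which uploaded O-glycoproteins are already recorded can be thought of as a set-membership test on accession numbers. The sketch below is a hypothetical illustration of that idea; the function name and the normalization rules (trimming whitespace, ignoring case, skipping repeated entries) are assumptions rather than the behavior of the OGP website.

```python
from typing import Iterable, List, Set, Tuple


def split_submission(
    submitted: Iterable[str], existing: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition submitted accessions into (already recorded, new), keeping order."""
    known = {acc.strip().upper() for acc in existing}
    already: List[str] = []
    new: List[str] = []
    seen: Set[str] = set()
    for raw in submitted:
        acc = raw.strip().upper()
        if not acc or acc in seen:
            continue  # skip blank lines and entries repeated within the upload
        seen.add(acc)
        (already if acc in known else new).append(acc)
    return already, new
```

With `existing = {"P02679"}`, an upload containing `P02679` and one unseen accession would be split into one already-recorded entry and one new entry, and only the former would be reported back to the user.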