GTDB: an integrated resource for glycosyltransferase sequences and annotations

Abstract Glycosyltransferases (GTs), a large class of carbohydrate-active enzymes, adds glycosyl moieties to various substrates to generate multiple bioactive compounds, including natural products with pharmaceutical or agrochemical values. Here, we first collected comprehensive information on GTs, including amino acid sequences, coding region sequences, available tertiary structures, protein classification families, catalytic reactions and metabolic pathways. Then, we developed sequence search and molecular docking processes for GTs, resulting in a GTs database (GTDB). In the present study, 520 179 GTs from approximately 21 647 species that involved in 394 kinds of different reactions were deposited in GTDB. GTDB has the following useful features: (i) text search is provided for retrieving the complete details of a query by combining multiple identifiers and data sources; (ii) a convenient browser allows users to browse data by different classifications and download data in batches; (iii) BLAST is offered for searching against pre-defined sequences, which can facilitate the annotation of the biological functions of query GTs; and lastly, (iv) GTdock using AutoDock Vina performs docking simulations of several GTs with the same single acceptor and displays the results based on 3Dmol.js allowing easy view of models.

with the same single acceptor and displays the results based on 3Dmol.js allowing easy view of models.
Database URL: https://www.biosino.org/gtdb/ Introduction Glycosyltransferases (GTs) are an important group of enzymes that can catalyze the transfer of activated sugar residues onto a wide range of carbohydrates or noncarbohydrate acceptors, generating a remarkable amount of structural diversity in biological systems (1). These enzymes commonly exist in different species from prokaryotes to eukaryotes. For prokaryotes, heptosyltransferases I-IV sequentially add sugar moieties to generate the core of lipopolysaccharides in the cell-surface components of Gram-negative bacteria (2). In Arabidopsis thaliana, AtGLCAT14A can modify beta-1,6-linked galactan and beta-1,3-linked galactan present in type II arabinogalactan (3); AtGAUT1 can catalyze the transfer of galacturonic acid from uridine 5 -diphosphogalacturonic acid onto the polysaccharide homogalacturonan in pectin biosynthesis (4); UGT78D1, UGT78D2 and UGT78D3 can catalyze the first 3-O-glycosylation in the biosynthesis of the flavonoids kaempferol, quercetin and isorhamnetin (5). In humans, GnT-V can catalyze the formation of β1,6-GlcNAc branching, GnT-III can transfer the GlcNAc to the β-mannose residue of N-glycans and Fut8 can transfer a fucose moiety from GDP-β-L-fucose to the innermost GlcNAc residue in an N-glycan, which all play key roles in cancer progression and treatment (6). Overall, GTs are involved in a variety of critical biological activities, including cell wall construction, natural product formation, cancer metastasis and suppression (7). At present, the number of GTs has increased rapidly over the past 10 years because of high-throughput transcriptomic sequencing technologies ( Figure 1A), but functional information about these novel proteins remains to be further annotated ( Figure 1B). It appears that the gap between protein sequences and identified functions often leads to much trouble for researchers studying GTs (8)(9).
To date, several comprehensive resources regarding GTs have been constructed and have facilitated the application of GTs to a variety of problems. Carbohydrate-active enzymes (CAZy) (http://www.cazy.org) depicts families of enzymes that degrade, modify or create glycosidic bonds, including GTs, glycoside hydrolases, polysaccharide lyases, carbohydrate esterases, auxiliary activities and carbohydrate-binding modules (9). The CAZy database focuses on the classification of GTs based on their sequence similarity and displays basic information about entries, including the protein names, organism, enzyme commission (EC) numbers and external database accessions. However, CAZy does not have a centralized sequence search function nor does it provide direct download access for CAZy sequences or annotations, limiting its utility. KEGG GLYCAN (https://www.kegg.jp/kegg/glycan/) (10), a resource for carbohydrate structures where entries are organized by functional ortholog groups rather than single specific GT, covers a subset of GTs classified by their synthesis of glycosyl bonds. Moreover, PlantCyc (https:// www.plantcyc.org) (11), dbCAN-seq (http://bcb.unl.edu/ dbCAN_seq/index.php) (12), the Rice GT Database (http:// ricephylogenomics.ucdavis.edu/cellwalls/gt/index.shtml) (13) and CSDB_GT (http://csdb.glycoscience.ru/gt.html) (14-15) concentrate on several specific species yet cannot be applied to all living domains. Therefore, an integrated GT resource for multiple species is needed and could help researchers access sequences and annotation data in a centralized location rather than using various scattered databases.
Considering the above points, we therefore constructed the GTs database (GTDB), which has combined a variety of contents from well-known public databases including CAZy, UniProt (16), KEGG (17) and MetaCyc (18), as well as pre-computed annotations derived using DIAMOND (double index alignment of next-generation sequencing data) (19), eggNOG-mapper (v2.0.0) (20), HMMER (v3.2.1) (21), interactive tools with BLAST (v2.7.1) (22) and docking modeling using AutoDock Vina (v1.1.2) (23). GTDB displays detailed information on each GT from multiple aspects, including sequences, structures, protein family classifications, protein functions, enzyme reactions and external links. If a protein annotation is predicted in silico, these results are labeled with corresponding bioinformatics methods. Thus, GTDB provides search methods via different data sources (third-party database, predictions in GTDB) to meet users' distinct demands. It also supplies batch download service for annotated data grouped by several characteristic sets. In addition, GTDB is free and available at https://www.biosino.org/gtdb/.

Data sources
In order to broad our understanding of GTs, GTDB merged diverse and valuable data from several well- defined databases listed in Table 1. Basic information on GTs, including their protein names, organism, protein Genbank accessions, UniProt accessions, EC numbers, PDB codes, GT families and mechanisms, was obtained from the CAZy (http://www.cazy.org/) database, and protein names were kept in the original style in this database that uses a combination of trivial names, gene identifications and locus tags. Gene information on GTs, including gene symbol and gene ID, was integrated using data from NCBI Gene (https://www.ncbi.nlm.nih.gov/ gene/) (24). The protein sequences and related CDS were from NCBI Protein (https://www.ncbi.nlm.nih.gov/protein/) and Nucleotide (https://www.ncbi.nlm.nih.gov/nuccore), respectively. The species information, namely taxonomy ID, came from NCBI Taxonomy (https://www.ncbi.nlm. nih.gov/taxonomy/) (25). The information on references, e.g. author lists, article titles and PMID, was gleaned from NCBI PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) (26). We then extracted protein characteristics of GTs, which included tissue specificity, subcellular locations and kinetics information from UniProt knowledgebase (https://www.uniprot.org/). The 3D structure information, including experimental methods used, resolution, ligand and specific references, was acquired from the RCSB Protein Data Bank (http://www.rcsb.org/) (27). Pathway, enzyme kinetic parameters (KM, KCAT, VMAX), optimum conditions (PH, temperature) and catalytic reactions were from MetaCyc (https://metacyc.org/). All protein sequences from MetaCyc act as a reference dataset when we applied DIAMOND. EC numbers of the GTs in GTDB were mostly from BRENDA (https://www.brenda-enzymes.org) (28), which is a comprehensive enzyme information system. Information on protein domains was extracted from the Pfam database (http://pfam.xfam.org) (29), which is a large collection information grouped by protein family. Also, the Pfam A set functioned as a target database for the use of HMMER3 ( Figure S1).

Function annotations
By means of the precomputed eggNOG database clusters and phylogenies analysis, eggNOG-mapper can annotate large sets of proteins via fast orthology assignments (20). Here, 554 892 GT sequences were annotated using eggNOG-mapper (v2.0.0) and the eggNOG database (v5.0.0) (30). Due to the large sequence sets and limited time schedule, DIAMOND mode was selected, using 0.001 e-value threshold, two threads and three alignments reported.
The high-throughput program DIAMOND can align DNA reads or protein sequences against a protein reference database with high speed and high sensitivity (19). It was implemented to annotated enzymology information and metabolic pathways of GTs in GTDB with an e-value cutoff of 0.001 and a maximum number of target sequences reported of 1. For EC number annotation, all protein  Figure S1).

Tools developed
We developed BLAST and GTdock tools based on known GTs in GTDB. BLAST was carried out via NCBI-BLAST 2.7.1, with default expect threshold of 10 and a maximum of 250 aligned sequences (31). The entire set of protein sequences in GTDB was used as a target database. Other optional reference databases are also available, including datasets grouped by organism or obtained from different identification means (UniProt Swiss-Prot or TrEMBL). To start a BLAST search, users can paste or upload no more than 10 query sequences at a time.
To predict the possible molecular interactions between GTs and a given glycosyl-acceptor, we developed GTdock tools. To use this function, users should first acquire structure files of macromolecular proteins and small molecular acceptors. For small molecular acceptors, users can select from any of 44 confirmed glycosyl-acceptors provided by UniProt or upload a custom acceptor in SDF format. For macromolecular proteins, the maximum number of structural files allowed is 10. Candidate proteins for docking should only be those with identities >40% and an expected value less than 0.01 in a BLAST comparison with all GTs in GTDB. Finally, Autodock Vina performs molecular docking, where the parameter of each protein center was calculated using the central position of the corresponding molecule and the box dimensions in x, y, z were all set to 40. Once the above analysis was finished, the docking results for each GT with highest interactive score are automatically e-mailed to users.

Website design and database backend
GTDB merges diverse and heterogeneous datasets coming from distinct communities and deposits them in MongoDB after Extract-Transform-Load processes. Compared with traditional relational database management systems, MongoDB provides more flexible ways for necessary future expansion. GTDB was developed using java SpringBoot and integrated a couple of enhanced utilities for providing visual presentations of GTs data. For instance, 3D structures of GTs are represented in 3Dmol.js (http://3dmol.csb.pitt. edu/) (32), and a set of statistical charts were developed based on ECharts (http://echarts.baidu.com/index.html). In addition, we encapsulated Blast and GTdock services as independent docker applications (https://www.docker. com/), which enabled easier deployment and quick updates.

Results
Overview and data summary GTDB was built to offer an integrated resource that incorporates GT sequences and annotations across multiple species, and a user-friendly web interface that allows data retrieval using different types of identifiers, searching against pre-organized protein sequences by BLAST and performing molecular docking of one acceptor with some GTs via GTdock (Figure 2). GTDB collects a large number of sets of GT data that in total consist of 520 179 GTs with 554 892 protein sequences, approximately 21 647 species, 33 720 genes, 394 full EC numbers, 4788 gene ontology (GO) terms and 10 271 references. It covers 105 GT families according to the CAZy database (version 17 April 2019), where GT0 indicates non-classified sequences and GT2 harbors the most abundant number of GTs (159521). In terms of annotated content, 8.9% of GTs have GO terms from thirdparty database and 35.3% GTs' GO terms were further predicted in GTDB. 'Transferase activity (GO:0016740),' which is one of the main characteristics of GTs, accounts for 44.1% and 89.5% in the two previously mentioned sources, respectively. For KEGG orthology (KO) annotations, 3.2% of GTs have KO from knowledge-based resources, and 68.3% GTs' KO were widely annotated in GTDB. Notably, there are five KOs-k02563, k05366, k00688, k07011 and k03814-that are all associated strongly with carbohydrate transport and metabolism, making up a large proportion in the total number of KOs from different methods. For domain prediction, 9.3% of GTs' Pfam data were extracted from UniProt and 79.6% of these domains were extended predictions. All relevant data can be viewed on the 'Statistics' page of GTDB.

Search
Keyword search and sequence search were both implemented in GTDB, where keyword search consists of three accesses. First, users can input a GT name, protein Genbank accession, or UniProt accession on the top query bar using pre-defined search criteria. Second, users can click on any GT name directly in a word cloud on 'Home' page to start a quick investigation of interested GTs. Lastly, an advanced search on the 'Search' page allows more sophisticated searching, where various query items are pre-classified into several categories, including protein information (GT name, Genbank accession, UniProt accession or PDB ID), gene information (gene ID and gene symbol), taxonomy (taxonomy ID or organism), tissue, functions (EC, GO, KO, Pfam accession or pathway) and reaction information (reaction, KCAT, KM, VMAX, PH-opt and temperatureopt). Queries of 'function' and 'reaction' provide the label of the data source (third-party database or predictions in GTDB), which all assist the users to address specific search concerns accurately. An example of advanced search result is shown in Figure 3A. Sequence searches using BLAST can be helpful for analyzing new or similar GTs in GTDB. For instance, GT (Ureaplasma urealyticum) (Genbank accession: QDI63889.1) is new for GTDB because it was newly created on 9 July 2019 in the NCBI Protein resource. While submitting QDI63889.1 as a query sequence and choosing 'All' as the target database, a sorted table of results with links of each aligned entry is generated. As the best alignment results from this search (NP_078340.1 identities = 93% expect = 0) (33) indicates GT (U. urealyticum) may belong to GT2 family and has a catalytic mechanism of inverting. The BLAST results are shown in Figure 3B.

Implementation of GTdock
Considering that GT structure can provide suitable catalytic sites and the special microenvironment for glycosylation reactions, we included binding mode prediction software in GTDB. AutoDock Vina has the advantages of both high speed and high accuracy in molecular docking, so we integrated Autodock Vina and BLAST to build the GTdock pipeline. GTdock can fulfill molecular docking with an acceptor and several GTs at one time and display those results based on 3Dmol.js that provides an interactive interface to view the binding modes.
Here, we used UDT51 as an example to illustrate the function of GTdock. In Saccharomyces cerevisiae S288c, UGT51 can catalyze its natural substrateergosterol-with very high activity. Researchers have found that UGT51 can also catalyze protopanaxadiol (PPD) with ∼13% of the conversion ratio as compared with ergosterol. They then reported mutant M7_1 (S801A/L802A/V804A/K812A/E816K/S849A/N892D), which had an 1800-fold activity improvement for the unnatural substrate PPD (34). A 3D structure of M7_1 was then obtained from Phyre2 with normal mode (35). The 3D structure of PPD (PubChem CID: 9920281) in SDF format and UGT51 (PDB code: 5GL5_B), M7_1 in PDB format were all submitted to GTdock. At last, each binding mode with its lowest energy score is displayed on the result page ( Figure 3C). In Autodock Vina, the more negative the score, the better the docking results. In this case, the molecular docking results indicated that a uniquely changed interaction network in this enzyme may have an effect on its substrate preference (36). Although GTdock can provide some help for studying the binding mode between GTs and acceptors, users should also pay attention to distinguish the relative correct conformation according to their own research purposes and knowledge for further analysis (More examples of GTdock running are shown in Figure S2).

Discussion
In summary, GTDB is an integrated repository with various information on GTs from multiple species. It includes information such as protein sequences, tertiary structures, catalytic activities and function annotations. GTDB not only harbors the common contents from third-party databases but also includes predictions in silico. Moreover, BLAST and GTdock as useful tools are incorporated into GTDB that can be used to explore the related characteristics of GTs from sequences and structures, respectively. GTDB is an easy-to-use website and is convenient for users wishing to search and download pre-defined datasets from the database. Nevertheless, GTDB has some shortcomings, for example, GTdock needs PDB files from extra protein structure modeling websites if there are no existing 3D structures, and bulk operations on more than 10 sequence and structure files are also required.
In the future, we will regularly update the GTDB datasets based on the latest versions of other well-known databases described in the 'Data source' section. Meanwhile, we will consider integrating other crucial data, including additional organism contents, transcriptome data and protein sequence characteristics of GTs. For catalytic reactions of GTs, we would like to sort and correct predicted catalytic reactions (239 417, 46%) manually based on publications. We will also improve the GTdock tool to achieve more accurate docking results by integrating an improved center finding algorithm. To conclude, GTDB will facilitate further identification of GTs and understanding of the vital roles that GTs play in glycobiology, synthetic biology, drug design and development.

Supplementary data
Supplementary data are available at Database online.

Author contributions
C.Z. performed the analyses, data collection and most web design and drafted the paper. Q.X. set up the database and participated in web design and S.H. and W.Y. constructed the GTdock tool pipeline and participated the website design. R.C. contributed to data collection. P.W. took part in the database building. Y.L., X.Y. and Q.W. edited the final manuscript. G.Z. conceived the study and edited the final manuscript. All authors read and approved the final manuscript.