Using Databases and Web Resources for Glycomics Research

Many databases of carbohydrate structures and related information can be found on the World Wide Web. This review covers the major carbohydrate databases that have potential utility for glycoscientists and researchers entering the glycosciences. The first half provides a brief overview of carbohydrate databases and web resources (including a history of carbohydrate databases and carbohydrate notations used in these databases), and the second half provides a guide that can be used as an index to determine which resources provide the data of most interest to the user.

It was not until the 1990s, when CarbBank was developed, that a carbohydrate structure database was made available to the public. When the CarbBank project ended, many researchers saw the need to continue the development of carbohydrate-related databases, and so GLYCOSCIENCES.de, KEGG GLYCAN, and the Consortium for Functional Glycomics databases emerged. Since the development of these databases, many more carbohydrate-related web resources have been developed, to the point where it is becoming difficult to keep track of them all. Moreover, the use of different carbohydrate structure notations to represent the data in each database makes it difficult for users to utilize them all effectively. Therefore, in addition to a description of various carbohydrate-related databases and web resources, we also briefly present in this review each carbohydrate structure representation, along with suggestions of tools that can be used to convert one to another.
In the main section of this review, we summarize many of the well-known carbohydrate-related databases and web resources that are now publicly available. Because many of these resources have already been published in detail elsewhere, only a summary is presented here. The final section presents carbohydrate-related data from the researcher's perspective; that is, various categories of data are listed (e.g. three-dimensional structures, taxonomy, experimental data, etc.), and the pertinent databases and web resources are listed within each category such that they can be assessed and compared with one another.
History: CarbBank-The very first carbohydrate database was called CCSD, for Complex Carbohydrate Structure Database (1,2). However, because of the tool used to access this database, called CarbBank, it is more commonly known as CarbBank. This database was initially developed by the Complex Carbohydrate Research Center at the University of Georgia. It was the first attempt to accumulate such data from the research community, and it thus became a valuable resource for current carbohydrate databases. Unfortunately, funding for CCSD ceased in 1996, but the data, consisting of almost 45,000 entries of glycan structures, annotation, and literature information, were still made available to the public.
There has been some criticism of the quality of the data in CarbBank (3). This can be attributed to the fact that there was no curation of the accumulated data, which resulted in many inconsistencies in terms of duplicated carbohydrate structures and inconsistent notations to represent carbohydrates (or, more specifically, monosaccharides) in the database. However, the same can be said of GenBank (4), and users should be aware of the potential errors in all publicly available databases.
History: Carbohydrate Structure Notation-Because many glycan structure databases were developed in the absence of a standard notation for representing glycan structures, many database providers developed their own notations. A list of these formats and corresponding examples for representing the N-linked glycan core structure are given in Table I. Detailed explanations of these formats have been presented elsewhere (5), but one can see from this table that there is a wide variety of carbohydrate structure formats now in use, each with its own advantages and disadvantages. Although many databases provide different ways to input glycan structures as queries using innovative user interfaces, not all databases return the results of these queries in multiple formats. That is, a result might be returned in a single format that must be converted to another format before it can be used as a query for another database.
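As a minimal sketch of what such a conversion involves, the following Python snippet parses a linear, IUPAC-condensed-style string for the N-linked glycan core into a tree and then re-serializes it as an indented listing. The input syntax handled here, the `Residue` class, and all function names are illustrative assumptions rather than the syntax or API of any database named in this review, and the indented output is an ad hoc format, not one of the formats in Table I.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Residue:
    name: str                      # monosaccharide symbol, e.g. "Man"
    linkage: str = ""              # linkage to the parent, e.g. "b1-4"; empty for the root
    children: list = field(default_factory=list)


def parse_condensed(text):
    """Parse a linear string such as 'A(a1-3)[B(a1-6)]C' into a Residue tree.

    The rightmost residue (the reducing end) becomes the root.
    """
    tokens = re.findall(r"\[|\]|\([^()]*\)|[A-Za-z0-9]+", text)
    if not tokens or tokens[-1] in ("[", "]") or tokens[-1].startswith("("):
        raise ValueError("string must end with a residue name")
    root = Residue(tokens.pop())
    _extend(tokens, root, in_branch=False)
    return root


def _extend(tokens, parent, in_branch):
    """Consume tokens from right to left, growing a chain from `parent`."""
    current = parent
    while tokens:
        tok = tokens.pop()
        if tok == "[":                      # left edge of the branch being read
            if not in_branch:
                raise ValueError("unmatched '['")
            return
        if tok == "]":                      # a side branch attached to `current`
            _extend(tokens, current, in_branch=True)
            continue
        if not tok.startswith("(") or not tokens:
            raise ValueError(f"expected a linkage followed by a residue, got {tok!r}")
        name = tokens.pop()
        if name in ("[", "]") or name.startswith("("):
            raise ValueError(f"expected a residue name, got {name!r}")
        child = Residue(name, tok[1:-1])
        current.children.append(child)
        current = child
    if in_branch:
        raise ValueError("unmatched ']'")


def to_indented(node, depth=0):
    """Serialize the tree as indented lines (an ad hoc, illustrative format)."""
    label = f"{node.linkage} {node.name}".strip()
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(to_indented(child, depth + 1))
    return lines


def composition(node):
    """Count how many times each monosaccharide symbol occurs in the tree."""
    counts = Counter({node.name: 1})
    for child in node.children:
        counts += composition(child)
    return counts


core = parse_condensed("Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc")
print("\n".join(to_indented(core)))
print(dict(composition(core)))
```

Going through an intermediate tree, rather than rewriting one string directly into another, keeps the parser for each input notation independent of the writer for each output notation.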

Summary of Currently Available Glycomic Databases and Web Resources-
GLYCOSCIENCES.de-GLYCOSCIENCES.de is one of the oldest web portals for glycomics research; it was originally developed at the German Cancer Research Center and now is maintained by Justus-Liebig University Giessen (10). It includes glycan structure references to CarbBank, automatically generated three-dimensional coordinates, 1H and 13C NMR spectra including 1H and 13C NMR-shift lists, masses of theoretically calculated glycan fragments, ligand data from the Protein Data Bank (PDB), and three-dimensional conformational maps for glycosidic linkages. Recently, a Glyco-CD database has also been incorporated that consists of cluster-of-differentiation antigens to aid in the classification of various cell surface macromolecules.
In addition to databases, GLYCOSCIENCES.de also provides a number of tools to aid in carbohydrate structure analysis. The available tools are categorized into three major groups: three-dimensional-structure-related tools, tools related to structure representation, and mass spectrometry tools. The three-dimensional-structure-related tools are further categorized into the following three groups: detection/validation of carbohydrates in PDB files, statistical analysis of carbohydrates in PDB, and modeling. Table II provides references and brief descriptions of each tool.
KEGG GLYCAN-KEGG GLYCAN is a part of the Kyoto Encyclopedia of Genes and Genomes (KEGG) resource developed by Kanehisa Laboratories in Japan. Details regarding each of the data sets available in KEGG GLYCAN are available elsewhere (11); a brief summary is described in this section. KEGG GLYCAN contains glycan structures extracted from CarbBank and manually curated glycan structures from the literature. The Composite Structure Map (12) was built from these structures to provide an overview of the structures in the database and the relationships among them. This map provides links to the glyco-gene and to glycan structure data in KEGG.
Consortium for Functional Glycomics (CFG)-The CFG's Glycan Binding Protein data resource provides detailed information about glycan-binding proteins (GBPs), including DNA and protein sequences, binding specificities, biological functions, etc. Experts in various fields of GBP research have provided the summary information and references for each GBP. The Glycosyltransferase resource serves as a portal to relevant CFG data and other glycan-related databases. Each enzyme is categorized according to the carbohydrate structure with which it is involved. A graphic display of a combined glycan structure is also provided so that users can click directly on the glycosidic linkage to which the enzyme of interest is related.
The glycan array data resource provides all the results that have been obtained by the CFG from screening samples for glycan binding specificity (17). A large variety of GBPs such as lectins, antibodies, pathogens, and cells have been screened, and each experimental result can be viewed graphically in a web browser. The glycan profiling resource provides all the results that have been obtained by the CFG from MALDI-MS and other analyses to identify and characterize the glycans in human and mouse tissues and cells (18,19). Each of the GBPs and glycans in the results from these experiments is also linked to other data resources and external databases.
The microarray data resource provides gene expression data from experiments using human and mouse samples on the glycogene microarray chip developed by the CFG (20), as well as some other commercial gene chips. Each experimental result has been processed to provide signals, present/absent calls, and p values for each gene. The mouse phenotyping resource (21) provides the data obtained from knockout gene experiments on a variety of mouse strains. The details regarding each experimental protocol, the raw data obtained, and processed data, including a summary of the experimental results, are all provided.
The CFG also provides Paradigm Pages, which are presented in a wiki format and describe in detail exemplary GBPs that are considered "paradigm GBPs." CFG investigators volunteer to contribute to these pages, which can be updated as needed.
JCGGDB-The Japan Consortium for Glycobiology and Glycotechnology DataBase (JCGGDB) is a database of glycoscience data accumulated in Japan. It includes an integrated search function for many glycoscience databases throughout Japan. Table III lists the representative databases that are accessible from JCGGDB. In addition to a keyword search function that works across all the integrated databases, JCGGDB also provides many data resources of interest to glycobiologists, such as glycan-related diseases and experimental protocols. Many data have also been accumulated from the large amount of experimental data obtained at the National Institute of Advanced Industrial Science and Technology, which houses JCGGDB. These include mass spectrometry data, lectin affinity data, glycoprotein data, and glyco-gene information, as described in Table III.
BCSDB-The Bacterial Carbohydrate Structure Data Base (BCSDB) (22) is a database of carbohydrate structures found in bacteria or obtained via the modification of those found in almost 5000 bacterial organisms, curated from the literature. In addition to structural information, each entry provides information about references, biological sources, keywords, methods used to elucidate the structure, bioactivity, NMR assignment tables, etc.
GlycomeDB-GlycomeDB (23) is a web portal for all of the major glycan structure databases described above, as well as some others. Specifically, GlycomeDB has integrated glycan structures from the following databases: CarbBank, GLYCOSCIENCES.de, KEGG GLYCAN, CFG, JCGGDB, BCSDB, GlycoBase (Dublin), GlycoBase (Lille), GlyAffinity, and EuroCarbDB. Much effort went into ensuring that a consistent namespace was used in mapping structures between different databases. GlycoCT is used as the central carbohydrate representation format for integrating the structures in these databases and storing them in GlycomeDB. In addition to the glycan structures and links to their original database sources, taxonomic information is also included in each record.
GlycoSuiteDB-GlycoSuiteDB (24) is an annotated and curated database of glycan structures taken from the scientific literature published between 1999 and 2005. Originally developed commercially, it has since been made available publicly through the ExPASy server. For each glycan structure, detailed information is provided on native and recombinant sources (i.e. tissue and/or cell type, cell line, strain, and disease state), and links to Swiss-Prot/TrEMBL are provided when available. Search functions for the following items are also provided:
• Mass or mass range.
• Attached protein by name, keyword, or Swiss-Prot/TrEMBL accession number.
• Taxonomy selected from a list or by keyword.
• Composition of monosaccharides and substituents.
• Tissue or cell type selected from a list or by keyword.
• Glycosidic linkage (i.e. N-linked, O-linked, or C-linked) and/or reducing-terminal sugar selected from a list.
• Disease selected from a list or by keyword.
• Literary reference by author name(s) or publication year(s).
• Structure using their own structure-drawing tool.

Table II. Tools provided by GLYCOSCIENCES.de
Statistical analysis of carbohydrates in PDB
• [tool name missing] (33): Plots of the frequency of appearance of amino acids found surrounding particular carbohydrate residues can be generated from this tool, based on the latest PDB data.
• GlyTorsion (33): Plots of the frequency of ranges of torsion angles of carbohydrate components, as found in the latest PDB data, can be generated.
• GlySeq (33): Plots of the frequency of amino acid compositions around glycosylation sites can be generated, based on the data derived from PDB or SwissProt.
Modeling
• Sweet-II: When the user inputs a glycan structure using the customized input form, three-dimensional structures of the glycan can be generated in a variety of formats, such as a PDB file, using JMol, VRML, Tinker, and Babel.
• GlyProt (34): Given a PDB ID or file, potential glycosylation sites, their accessibility, and an in silico generation of the glycosylated protein in three dimensions can be generated.
• GlycoMapsDB (35): This is a database of pre-calculated conformational maps of various oligosaccharides found in N- and O-linked glycans.
Carbohydrate-notation related
• LINUCS (36): Given a carbohydrate structure in CarbBank format, this tool generates the LINUCS code format.
• LiGraph: Given a carbohydrate structure in CarbBank format, this tool generates two-dimensional images of the structure.
• sumo: Given a carbohydrate structure in LINUCS or IUPAC format, this tool extracts known structural motifs, such as Lewis antigens or core structures.
Mass spectrometry
• Glycofragment: Given a carbohydrate structure in CarbBank format, this tool attempts to find the fragments that can be expected to occur in MS spectra for the structure.

UniCarbKB-UniCarbKB (25) is an initiative that extends from the EuroCarbDB project (9), which was a European Union-funded initiative for the development of a framework for storing carbohydrate structures and their experimental evidence, namely, mass spectrometry, HPLC, and NMR data. UniCarbKB aims to further integrate annotations and adopt common standards to provide a knowledgebase for glycomics data. UniCarbKB includes UniCarb-DB (26), a platform for structural and analytical storage and the retrieval of data from glycomics experiments. The glycan structures in UniCarbKB are generally derived from GlycoSuiteDB, and all structures are linked to experimental evidence, which includes annotations regarding experimental conditions, biological sources, etc.
MonosaccharideDB-MonosaccharideDB was developed alongside GlycomeDB and the EuroCarbDB project as a central resource for monosaccharide information. This database is crucial in that it allows scientists to obtain mappings between different representations of monosaccharides, which might be defined differently depending on the research field. For example, N-acetylglucosamine can be written as GlcNAc or b-D-GlcpNAc, or even as 2-acetamido-2-deoxy-beta-D-glucopyranose. This database ensures that all of these representations refer to the same monosaccharide. Moreover, this database provides web services and on-the-fly web interfaces to make the data available as efficiently and accurately as possible. In addition to the major glycan structure representations listed in Table I, MonosaccharideDB also returns graphic notations, including three-dimensional representations, of monosaccharides in the database.
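The idea of mapping several spellings onto one monosaccharide can be illustrated with a small lookup table. The following Python sketch is only an illustration under stated assumptions: the `SYNONYMS` table, the choice of "GlcNAc" as the canonical key, and the function names are hypothetical, and it does not use or reproduce the actual MonosaccharideDB web services.

```python
# Hypothetical synonym table; the grouping follows the example given in the text.
SYNONYMS = {
    "GlcNAc": [
        "GlcNAc",
        "b-D-GlcpNAc",
        "2-acetamido-2-deoxy-beta-D-glucopyranose",
    ],
}


def build_index(synonyms):
    """Map every synonym to its canonical key, rejecting ambiguous names."""
    index = {}
    for canonical, names in synonyms.items():
        for name in names:
            key = name.strip()
            if index.get(key, canonical) != canonical:
                raise ValueError(f"{name!r} maps to two different monosaccharides")
            index[key] = canonical
    return index


INDEX = build_index(SYNONYMS)


def normalize(name):
    """Return the canonical key for a monosaccharide name, or None if unknown."""
    return INDEX.get(name.strip())


for spelling in SYNONYMS["GlcNAc"]:
    print(spelling, "->", normalize(spelling))
```

Normalizing names before comparing records from different resources avoids treating the same monosaccharide as several distinct residues.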
PolySac3DB-PolySac3DB is a database of three-dimensional structures of polysaccharides (27) accumulated from the scientific literature. The data are provided in a hierarchical manner in categories representing polysaccharide families, including agaroses, amyloses, celluloses, chitins, glycosaminoglycans, pectins, xylans, etc. Each record is annotated regarding the way in which the structure was obtained, references, the three-dimensional representation of the repeating unit, diffraction diagrams, etc. Information regarding polysaccharide structure determination techniques (x-ray crystallography, electron and neutron diffraction, and molecular modeling) is also provided for beginners in the field.
Lectin3D-Lectin3D, previously known as Lectines, is a database of three-dimensional structures of lectins categorized into different lectin families, including plant lectins, animal lectins, and viral lectins. Each record is annotated with the PDB identification, resolution, species, references, etc. Detailed structural data can also be viewed as images using Jmol or downloaded as PDB files.
Reference Guide to Glycomics Databases and Web Resources-Because of the variety of information provided by the databases and web resources described in this review, we provide this section as a guide to help users find the most appropriate resources for their research focus of interest.
Table III. Representative databases accessible from JCGGDB
• GlycoGene: GlycoGene is a database that includes genes associated with glycan synthesis such as glycosyltransferases, sugar-nucleotide synthases, sugar-nucleotide transporters, and sulfotransferases. All of the nearly 200 human glycogenes have been identified, cloned, and characterized (28). It also includes substrate specificity information.
• LfDB (Lectin Frontier DataBase): LfDB provides basic information on lectins, as well as interaction data obtained via the frontal affinity chromatography-fluorescence detection system (37,38).
• GlycoProtDB: GlycoProtDB is a database of N-glycoproteins that have been experimentally identified in C. elegans N2 and mouse tissues (strain C57BL/6J, male).
• GMDB (Glycan Mass Spectral DataBase): GMDB currently stores MS2, MS3, and MS4 spectra of N- and O-linked glycans and glycolipid glycans and their fragments (40).
• LipidBank: LipidBank (41) is a freely available database of natural lipids including fatty acids, glycerolipids, sphingolipids, steroids, and various vitamins. It is the official database of the Japanese Conference on the Biochemistry of Lipids.
• GlycoEpitope: The GlycoEpitope database provides information on carbohydrate antigens, such as glycoproteins that express carbohydrate antigens, glycolipids whose partial structure is a carbohydrate epitope, enzymes that take part in the synthesis and degradation of glycoepitopes, the time and site of expression of carbohydrate epitopes, diseases to which carbohydrate epitopes are related, etc.
• GALAXY (Glycoanalysis by the three axes of MS and chromatography): GALAXY contains data on approximately 500 different pyridylamino-glycans, including the structures, HPLC elution positions expressed in glucose units on ODS and amide-silica columns, relative molecular mass, code numbers, sources of samples, and references (42).
• GlycoPOD: GlycoPOD is a collection of experimental protocols used for glycobiology research. Each protocol has been written by experts in glycoscience.

Genomics Data-Glyco-gene Data-(a) JCGGDB provides a list of all experimentally verified glyco-genes in humans (28). Each record is annotated with literary references and provides a graphical view of substrate specificity.
(b) KEGG provides a list of glycosyltransferases and other glyco-enzymes in their KEGG ORTHOLOGY and KEGG BRITE data resources (29).
(c) The CFG provides not only genomic information of glyco-genes, but also expression data as obtained from their microarray experiments.
Pathways-(a) The KEGG resource provides the basic biosynthesis pathways for glycan structures, including N-glycans, O-glycans, glycosaminoglycans, glycosylphosphatidylinositol anchors, glycolipids, lipopolysaccharides, and peptidoglycans. KEGG PATHWAY also provides information on glycan-relevant pathways such as signaling and interactions, cell communication, and the immune system.
Proteomics Data-GBPs-(a) The CFG provides Molecule Pages for GBPs that include genomic, proteomic, and glycomic details for each record and are annotated by experts in the field. Links to CFG resource data (glycan arrays and profiling data) are also provided where available.
(b) The Lectin Frontier DataBase, a part of JCGGDB, provides structural, taxonomic, literary, protein sequence, and glycan-binding affinity data for a variety of lectins.
(c) Lectin3D provides structural, taxonomic, and literary information regarding lectins for which structures are known (as found in PDB). PDB files can be downloaded and structures can be visually modified using Jmol at this web site.
Glycoproteins-(a) UniProt is a major protein database that also includes glycosylation-site annotations. These annotations are marked as predicted from computer simulations or experimentally verified.
(b) GlycoProtDB, a part of JCGGDB, provides N-linked glycoprotein data from C. elegans and mouse tissues. Both predicted and experimentally confirmed glycosylation-site information is provided, along with links to external databases such as GenBank and Swiss-Prot where available.
(c) GlycoSuiteDB provides information about proteins known to be glycosylated as confirmed in the literature. Experimental methods, taxonomic and biological source information, references, and glycan structures are provided for each record.
Glycan Structure Profiling Data-(a) UniCarbKB provides the glycan structures identified using various experimental technologies, including MS and HPLC. The record for each experiment includes the biological source, structure, retention time, experimental conditions, and protein information when available.
(b) The CFG provides glycan structures that have been identified using MALDI-MS on human and mouse tissues and cells. They are available under the "CFG Data" section of the CFG.
(c) The RINGS resource of glycomics analysis tools (6) has recently developed the GlycomeAtlas (30), which is a tool that visualizes the glycan profiling data of the CFG. It also has the functionality to visualize custom-defined glycan profiling data.
Taxonomy Data-(a) For each structural record in GlycomeDB, the organism(s) in which that structure has been found (as presented in the original carbohydrate database) is included. Thus, a search function for all structures that are found in a particular organism is provided.
(b) BCSDB has made its bacterial carbohydrate data searchable by genus, species, and/or strain or serogroup. Options to search by taxonomic families, using NCBI taxonomy IDs, and host organisms are also available.
(c) GlycoSuiteDB provides a search function for queries by species or class by listing all available species or classes in the database, from which the user can select his or her item of interest. It is also possible to enter a (partial) keyword, which can be a species name (e.g. Homo sapiens), a common name (e.g. human), or a class.
Experimental Data-Protocols-GlycoPOD, a part of JCGGDB, provides an organized classification of experimental protocols for glycomics experiments. Experts selected by the steering committee of this database provide the details for each protocol, which are also checked by the committee for consistency.
Raw Experimental Data-(a) UniCarb-DB, a part of UniCarbKB, provides mass spectrometric data and structural assignments based on fragmentation data. Glycan structures can be queried individually, and the results include all experimental evidence for the selected structure. Biological source information, experimental conditions, and annotated peak lists are also provided.
(b) The data from glycan array, glycan profiling, glyco-gene microarray experiments, and knockout mouse experiments are all provided in the CFG Data section. Processed and summarized data, as well as the original raw data files, can be downloaded.
Pathological Data-(a) KEGG includes a resource called KEGG DISEASE and DRUG that might contain links, albeit indirect, to KEGG GLYCAN records.
(b) JCGGDB provides a listing of glycan-related diseases in their newly developed Glyco-Disease Genes Database and Tumor Markers Reference Database. Each database has been curated from the literature and provides references for each record.
Structural Data-Three-dimensional Data-(a) GLYCOSCIENCES.de provides a variety of modeling tools for carbohydrates, and with them computationally predicted three-dimensional structures of glycoproteins and glycans can be generated.
(b) MonosaccharideDB contains three-dimensional structures of monosaccharides that can be downloaded in PDB format.
(c) PolySac3DB is a database of three-dimensional polysaccharide structures extracted from the literature. Educational information regarding polysaccharides and experimental procedures is also provided.
(d) Lectin3D is a database of three-dimensional lectin structures as found in PDB.
Epitopes-(a) GlycoSuiteDB provides a search function for glycan epitopes.
(b) GlycoEpitopeDB is a database of carbohydrate epitopes and antibodies that has been manually curated from the literature.
Mass Data-(a) GLYCOSCIENCES.de provides a search function that allows the user to search for carbohydrates using a list of mass peaks.
(b) KEGG GLYCAN includes mass information for each glycan structure, computed from the composition.
(c) GlycoSuiteDB provides a search function allowing the user to search for glycans with a particular isotopic mass, a list of masses, or a range of masses.
(d) UniCarb-DB provides a search function allowing the user to search the experimental data by mass.
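To make the notions of a mass "computed from the composition" and of searching by a mass, a list of masses, or a mass range concrete, the following Python sketch sums monoisotopic residue masses and adds one water for the free reducing end. It is a sketch under stated assumptions: the masses are for neutral, underivatized glycans without adducts; the two-entry `LIBRARY` and all function names are hypothetical; and it is not the procedure used by any of the resources listed above.

```python
# Monoisotopic masses (Da) of monosaccharide residues (free sugar minus one water).
RESIDUE_MASS = {
    "Hex": 162.052824,     # e.g. Man, Gal, Glc
    "HexNAc": 203.079373,  # e.g. GlcNAc, GalNAc
    "dHex": 146.057909,    # e.g. Fuc
    "NeuAc": 291.095417,
}
WATER = 18.010565

# Hypothetical mini-library of compositions, used only for illustration.
LIBRARY = {
    "Man3GlcNAc2": {"Hex": 3, "HexNAc": 2},
    "Man5GlcNAc2": {"Hex": 5, "HexNAc": 2},
}


def glycan_mass(composition):
    """Monoisotopic mass of a free glycan: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[res] * n for res, n in composition.items()) + WATER


def search_by_range(library, low, high):
    """Names of library entries whose mass lies in the closed range [low, high]."""
    return [name for name, comp in library.items() if low <= glycan_mass(comp) <= high]


def match_peaks(library, peaks, tol=0.01):
    """For each observed mass, list library entries within +/- tol Da."""
    return {peak: search_by_range(library, peak - tol, peak + tol) for peak in peaks}


print(round(glycan_mass(LIBRARY["Man3GlcNAc2"]), 4))
print(search_by_range(LIBRARY, 1234.0, 1235.0))
print(match_peaks(LIBRARY, [910.33, 1000.0]))
```

In practice, observed m/z values would first have to be corrected for charge state, adducts, and any derivatization before being compared with such neutral masses.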
NMR Signal Data-(a) GLYCOSCIENCES.de and BCSDB both provide search functions allowing the user to search NMR data based on atoms, peaks, and chemical shifts.
Summary-In addition to the databases and web resources described in this review, new resources are constantly being developed, and it would be impossible to comprehensively cover them all. This review focused on those that are well known in the glycoscience community and provide sufficient annotation information for researchers to find their data of interest. Because of the variety of tools and interfaces that are provided by each resource, this review also summarized these resources by data type in the hopes of helping the reader determine which is the most pertinent resource for his or her research. It is expected that in the future, these resources will become better integrated with one another so that a consistent interface can be provided, allowing users to most efficiently take advantage of the valuable knowledge that can be gained from them.

‡ To whom correspondence should be addressed: Tel./Fax: +81-42-691-4116; E-mail: kkiyoko@soka.ac.jp.