The glycan alphabet is not universal: a hypothesis

Several monosaccharides constitute naturally occurring glycans, but it is uncertain whether they constitute a universal set like the alphabets of proteins and DNA. Based on the available experimental observations, it is hypothesized herein that the glycan alphabet is not universal. Data on the presence/absence of pathways for the biosynthesis of 55 monosaccharides in 12 939 completely sequenced archaeal and bacterial genomes are presented in support of this hypothesis. Pathways were identified by searching for homologues of biosynthesis pathway enzymes. Substantial variations were observed in the set of monosaccharides used by organisms belonging to the same phylum, genus and even species. Monosaccharides were grouped as common, less common and rare based on their prevalence in Archaea and Bacteria. It was observed that fewer enzymes are sufficient to biosynthesize monosaccharides in the common group. It appears that the common group originated before the formation of the three domains of life. In contrast, the rare group is confined to a few species in a few phyla, suggesting that these monosaccharides evolved much later. Fold conservation, as observed in aminotransferases and SDR (short-chain dehydrogenase reductase) superfamily members involved in monosaccharide biosynthesis, suggests neo- and sub-functionalization of genes led to the formation of the rare group monosaccharides. The non-universality of the glycan alphabet begets questions about the role of different monosaccharides in determining an organism’s fitness.


Figure S1 Number of organisms with different number of strains sequenced
Figure S2 Biosynthesis pathways
Figure S3 Bit score distribution plots for hits of various pairs of profiles
Figure S4 Proteome sizes for different number of monosaccharides
Figure S5 Prevalence of monosaccharides in species versus that in genomes
Table S1 Tools and databases used in this study
References: References cited in Table S1
Table S2 Comparison of the precursor and nucleotide used for the biosynthesis of two enantiomers of a monosaccharide
Flowchart S1 Procedure used to generate HMM profiles
Flowchart S2 Precedence rules for assigning annotation to proteins that are hits to two or more profiles and/or BLASTp queries
References: References to the research articles which describe the pathways (or enzymes of the pathways) of monosaccharide biosynthesis. These formed the basis for generating HMM profiles and choosing BLASTp queries.

MS-EXCEL file provided separately: Supplementary Data.xlsx
Worksheet1 Details of HMM profiles
Worksheet2 Details of BLASTp queries
Worksheet3 Prevalence of monosaccharides in genomes / species
Worksheet4 Abbreviated names of monosaccharides
Worksheet5 Enzyme types, enzymes and monosaccharide groups
Worksheet6 Precursors of various monosaccharides

Figure S1 The number of species for which different number of strains are sequenced.

Six or fewer strains are sequenced for most of the species. On the other hand, more than 50 strains are sequenced for 29 species. Escherichia coli and Salmonella enterica have the highest number of sequenced strains (714 and 602, respectively). Genus and species names are not known for 45 endosymbionts; only their host name is known, e.g., Legionella endosymbiont. Each such case is considered as a distinct species.
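The tally behind Figure S1 can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the input layout (pairs of genome identifier and species name) and the genome identifiers are assumptions made for this example. A genome without a known species name is treated as its own species, following the rule stated above for endosymbionts.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def strains_per_species(
    genomes: Iterable[Tuple[str, Optional[str]]]
) -> Counter:
    """Count sequenced strains (genomes) for each species.

    `genomes` holds (genome_id, species_name) pairs; species_name is None
    when genus and species are unknown (hypothetical input layout).
    """
    counts: Counter = Counter()
    for genome_id, species in genomes:
        # Assumption: each unnamed genome is counted as a distinct species.
        key = species if species is not None else f"unnamed:{genome_id}"
        counts[key] += 1
    return counts


def species_per_strain_count(counts: Counter) -> Counter:
    """Map 'number of strains' to 'number of species with that many strains'."""
    return Counter(counts.values())


# Made-up records for illustration only (not data from the study).
example_genomes = [
    ("g1", "Escherichia coli"),
    ("g2", "Escherichia coli"),
    ("g3", "Salmonella enterica"),
    ("g4", None),
    ("g5", None),
]
example_counts = strains_per_species(example_genomes)
example_histogram = species_per_strain_count(example_counts)
```

Here `example_histogram` maps each strain count to the number of species having that count, which is the quantity plotted in Figure S1.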

Figure S3 Setting bit score thresholds for HMM profiles with varying substrate specificities. The TrEMBL database was scanned using the profiles shown along the X- and Y-axes in the above scatter plots; for these scans, default values set by HMMer were used for all the parameters. Hits that are common to a pair of profiles (shown along X- and Y-axes) were chosen and bit scores of such hits were plotted against each other. Bit score thresholds (indicated by red lines) were chosen such that a protein is a hit for only one of the two profiles. The threshold for the GPE05331 set was revised to exclude PdeG.

Figure S4 Variations in the proteome size of organisms which encode the same number of monosaccharides. Only the smallest and largest proteome sizes are shown. As can be seen, the number of monosaccharides used by an organism is independent of the proteome size. For instance, Helicobacter pylori PNG84A (proteome size = 1353) uses the same number of monosaccharides (7) as Sorangium cellulosum So0157-2 (proteome size = 10480).

analyzed in this study (viz., 12939) and the number of species covered by these genomes (viz., 3384; Figure S1(a)).
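The exclusivity criterion described in the Figure S3 caption (a protein should be a hit for only one of two profiles) can be checked with a short sketch like the one below. It is an illustration under stated assumptions, not the procedure used in the study: the function names are hypothetical, a protein is assumed to count as a hit when its bit score is greater than or equal to the threshold, and the scores are made up.

```python
from typing import Dict, List, Tuple


def classify_common_hits(
    scores: Dict[str, Tuple[float, float]],
    threshold_x: float,
    threshold_y: float,
) -> Dict[str, List[str]]:
    """Group proteins scored by two profiles according to which thresholds they pass.

    `scores` maps a protein ID to (bit score for profile X, bit score for profile Y).
    Assumption: score >= threshold counts as a hit.
    """
    groups: Dict[str, List[str]] = {
        "x_only": [], "y_only": [], "both": [], "neither": []
    }
    for protein, (score_x, score_y) in scores.items():
        passes_x = score_x >= threshold_x
        passes_y = score_y >= threshold_y
        if passes_x and passes_y:
            groups["both"].append(protein)
        elif passes_x:
            groups["x_only"].append(protein)
        elif passes_y:
            groups["y_only"].append(protein)
        else:
            groups["neither"].append(protein)
    return groups


def thresholds_are_exclusive(
    scores: Dict[str, Tuple[float, float]],
    threshold_x: float,
    threshold_y: float,
) -> bool:
    """True when no protein passes both thresholds."""
    return not classify_common_hits(scores, threshold_x, threshold_y)["both"]


# Made-up bit scores for illustration only (not data from the study).
example_scores = {
    "protA": (310.0, 120.0),
    "protB": (95.0, 280.0),
    "protC": (150.0, 140.0),
}
```

With these made-up scores, thresholds of 200 for both profiles separate `protA` and `protB` cleanly, whereas thresholds of 100 would leave `protA` as a hit for both profiles. In the scatter plots this corresponds to placing the red lines so that the upper-right quadrant is empty.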

Flowchart S1 Procedure used to generate HMM profiles

1. Generation of Exp dataset and Exp profile, and setting the bit score threshold
Step 1a Consider only those enzymes which are characterized by direct enzyme activity assay
Step 1b Remove redundancy (80% sequence identity cutoff) and obtain a multiple sequence alignment (MSA)
Step 1c Use the MSA as input to generate an HMM profile
Step 1d Score Exp dataset sequences against this HMM profile
Step 1e Set the bit score of the lowest scoring sequence as the bit score threshold for Exp dataset

2. Generation of Extend dataset and Extend profile, and setting the bit score threshold
Step 2a Add sequences that meet any of the following criteria to the Exp dataset
  (i) SwissProt entries satisfying the threshold
  (ii) SwissProt entries scoring below the threshold, provided they show conservation of active site residues. Active site residues were collated based on site directed mutagenesis studies or ligand-bound 3D structures
  (iii) TrEMBL entries for which molecular function has been inferred from experiments other than direct enzyme assays viz., complementation assays, phenotypic studies, etc.
  (iv) TrEMBL entries with solved 3D structure
  (v) FunFam members (CATH database) but only in the case of CDP-glucose 4,6-dehydratase (FunFam 20603) and phosphomannoisomerase family 3 (FunFam 54112)
Step 2b Remove redundancy (80% sequence identity cutoff) and obtain a multiple sequence alignment (MSA)
Step 2c Use the MSA as input to generate an HMM profile
Step 2d Score Extend dataset sequences against this HMM profile
Step 2e Set the bit score of the lowest scoring sequence as the bit score threshold for Extend dataset

Flowchart S2 Precedence rules for assigning annotation to proteins that are hits to two or more profiles and/or BLASTp queries

# For a protein that is a hit for GPE00620 and GPE00720
# GPE00620 is used only in combination with GPE00720
# A protein can be a hit for GPE00620 or GPE00720, but not for both (non-orthologous)
IF (hit for GPE00620 AND hit for GPE00720) THEN
    Alert: Hit for GPE00620 and GPE00720
ENDIF

# Case 12 of 14
# Isomerases: GPE07030, GPE07130, GPE07230, and GPE07330
# A protein can be a hit for any one of the above four profiles (non-orthologous)
For GPE07030, GPE07130, GPE07230 and GPE07330
IF (hit for more than one) THEN
    Alert: Hit for (list all profiles which appear as hits from above list)
ENDIF

# Case 13 of 14
# For a protein that is a hit for GPE00430 and GPE00530
# GPE00430 is used only in combination with GPE00530
# A protein can be a hit for GPE00430 or GPE00530, but not for both (non-orthologous)
IF (hit for GPE00430 AND hit for GPE00530) THEN
    Alert: Hit for GPE00430 and GPE00530
ENDIF

# Case 14 of 14
# For a protein that is a hit for GPE05332 and Q81A42:1-328
# GPE05332 is used only in combination with Q81A42:1-328
# A protein can be a hit for GPE05332 or Q81A42: