Exploration of O-GlcNAc transferase glycosylation sites reveals a target sequence compositional bias

O-GlcNAc transferase (OGT) is an essential glycosylating enzyme that catalyzes the addition of N-acetylglucosamine to serine or threonine residues of nuclear and cytoplasmic proteins. The enzyme glycosylates a broad range of peptide sequences and the prediction of glycosylation sites has proven challenging. The lack of an experimentally verified set of polypeptide sequences that are not glycosylated by OGT has made prediction of legitimate glycosylation sites more difficult. Here, we tested a number of intrinsically disordered protein regions as substrates of OGT to establish a set of sequences that are not glycosylated by OGT. The negative data set suggests an amino acid compositional bias for OGT targets. This compositional bias was validated by modifying the amino acid composition of the protein fused in sarcoma (FUS) to enhance glycosylation. NMR experiments demonstrate that the tetratricopeptide repeat region of OGT can bind FUS and that glycosylation-promoting mutations enhance binding. These results provide evidence that the tetratricopeptide repeat region recognizes disordered segments of substrates with particular compositions to promote glycosylation, providing insight into the broad specificity of OGT.

Intracellular O-linked-β-N-acetylglucosamine (O-GlcNAc) is an essential posttranslational modification (PTM). Since discovery of the modification more than 3 decades ago (1), various proteomic studies have identified thousands of proteins with the modification (2)(3)(4)(5)(6). Underscoring the importance of the modification, the single enzyme responsible for adding O-GlcNAc onto proteins, O-GlcNAc transferase (OGT), is required for the viability of dividing mammalian cells and embryogenesis (7,8). O-GlcNAc is thought to play essential roles in nutrient sensing and stress response with implications for diabetes, cancer, and diseases of aging, including neurodegenerative disease (reviewed in (9)). O-GlcNAc-modified proteins implicated in neurodegenerative diseases include amyloid precursor protein (10), tau (11), α-synuclein (12), and superoxide dismutase (13). Mutations in OGT have also been implicated in intellectual disability (14). Unlike other glycan modifications, which are often multimeric (15), O-GlcNAc modification involves addition of a single GlcNAc moiety to serine or threonine hydroxyls. O-GlcNAc significantly affects protein thermodynamic and solvation properties and modulates protein thermal stability (16) and aggregation propensity (11,17,18). In addition, O-GlcNAc modification has been recently demonstrated to modulate protein phase separation, based on experimental work with Ewing sarcoma protein (EWS), CAPRIN1, and SynGAP/PSD-95, as well as broader bioinformatics results (19)(20)(21).
Given the significant biological impact of this PTM, several studies have grappled with defining OGT sequence specificity (22)(23)(24)(25)(26) using O-GlcNAc-modified sites identified in cell extracts by mass spectrometry (MS) or by utilizing high-throughput assays performed on peptide or protein microarrays (22,27). O-GlcNAc status in vivo is a convolution of multiple factors. These include the specificity of OGT and the specificity of the enzyme that removes O-GlcNAc moieties, O-GlcNAcase (28). The efficiency and specificity of OGT are also influenced by the expressed splice isoform (29), since different OGT isoforms contain different numbers of tetratricopeptide repeats (TPRs), and the TPRs are involved in peptide substrate recognition (24,26,30-32). OGT may also be recruited to substrates by adaptor proteins (33)(34)(35). The concentrations of glucose and insulin as well as tissue type and developmental stage also contribute (36,37). Finally, OGT modification sites are primarily found in extended loops or intrinsically disordered regions (IDRs) (5) that can access the catalytic site.
Several closely related OGT recognition sequences have been identified. For example, Pathak (22). Nevertheless, a scan of O-GlcNAcylated peptide sequences in the PhosphoSite database (38) indicates that most substrates fall outside of this definition, as well as definitions put forward by other groups. Computational methods that make use of machine learning or neural networks to predict sites of O-GlcNAc modification (23,39-46) have been used to address this shortcoming. These computational methods take both sequence and amino acid combinations into account when making their predictions. Predictors include YinOYang (43), OGTSite (42), O-GlcNAcscan (46), and O-GlcNAcPRED-II (41), although not all of these predictors are still available online. Evaluating the effectiveness of these predictors is very challenging, in part because of the lack of experimentally verified negative sites. Particular attention must be paid to sensitivity when evaluating these O-GlcNAc predictors, because only a small minority of serines and threonines are expected to be modified. While many proteins can be modified by OGT, the proportion of individual serines and threonines that are modified may be as low as 1.4% (45). The small number of modified residues makes it easy to achieve a high level of accuracy (combined proportion of correctly identified positive and negative sites) by setting a very high threshold for positive-site identification. A high threshold enables correct identification of negative sites, which represent the vast majority of sites. The trade-off is that many positive sites will be missed, yielding a low sensitivity (proportion of positive sites correctly identified). Thus, when evaluating O-GlcNAc site predictors, sensitivity is a critical parameter. Taking sensitivity into consideration, O-GlcNAcPRED-II seems to outperform other prediction methods (41,45). 
Despite the extensive effort put into these computational methods, they yield many false positive and false negative sites, so experimental validation is still necessary (47,48).
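The trade-off between accuracy and sensitivity described above can be illustrated with a minimal sketch. The 1000-site toy set with 14 positives is an assumption chosen only to mirror the 1.4% figure; the function names are hypothetical and do not correspond to any published predictor.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Proportion of all sites (positive and negative) classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(y_true, y_pred):
    """Proportion of true positive sites that were identified."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Assumed toy set: 1000 Ser/Thr sites, 14 of them modified (1.4%).
y_true = [1] * 14 + [0] * 986

# A "predictor" with a threshold so high that it never calls a positive site.
never_positive = [0] * 1000

print(accuracy(y_true, never_positive))     # 0.986
print(sensitivity(y_true, never_positive))  # 0.0
```

Under these assumptions a predictor that never reports a site is 98.6% accurate yet finds none of the modified residues, which is why sensitivity has to be reported alongside accuracy.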
The inability to clearly define OGT specificity is the result of at least two contributing factors. First, many of these attempts to define specificity have focused on peptide regions in the immediate vicinity of the glycosylated region, roughly the length of peptide that can be accommodated in the catalytic site of the enzyme. It is now known that efficient substrate recognition can involve a more extended peptide region than what is incorporated into existing predictors (31). For example, efficient glycosylation of the RNA Polymerase C-terminal domain (RNA Pol CTD) requires more than 20 heptad repeats or over 140 residues (30). The largest isoform of OGT (ncOGT) is comprised of a catalytic domain preceded in sequence by 13 TPR repeats (24,49,50). The TPR region forms a large superhelix that has been shown to influence recognition of extended lengths of substrate peptides (30,51,52). Thus, improving predictors will require consideration of a broader sequence context. Second, the inability to clearly define OGT specificity is due to the lack of an experimentally verified negative dataset for prediction purposes. Existing predictors have used protein regions not annotated as glycosylated, but found in proteins that are glycosylated by OGT, as a negative dataset (39,42), or alternatively human proteins from the UniProt database (53) that are not explicitly known to be glycosylated or predicted to be glycosylated (40). Taking protein regions not annotated as glycosylated but found in proteins that are glycosylated has the advantage that the proteins are known to exist spatially and temporally near to OGT, an important consideration (54). Nevertheless, the assumption that the absence of MS data supporting glycosylation at a specific site is evidence that the site is not glycosylated is a poor one. MS studies often do not achieve full coverage for a particular protein and certainly not for the entire proteome (55). 
Coverage of the proteome in higher organisms typically does not exceed 10%. In one study where over 1500 O-GlcNAc-modified proteins were identified, modification sites could only be assigned in 80 proteins (3).
Even when a high degree of coverage is achieved, the identification of PTM sites depends heavily on the abundance of a particular protein (5), especially when the sites are substoichiometrically modified as is often the case for O-GlcNAc. Finally, identification of specific sites is hindered by the lability of the O-GlcNAc modification in MS/MS experiments (56,57). For these reasons, construction of a negative dataset based on sites that are not annotated as glycosylated is not a good strategy. In one recent study the majority of proteins identified as being O-GlcNAc modified were not previously known to be modified (16). The case of Lamin A serves to illustrate the point. Prior to 2018, Lamin A was shown to have two glycosylated sites (57); more recently, an additional nine sites were identified (47). Only two of the Lamin A sites (S612 and T643) are currently listed in the PhosphoSite O-GlcNAc database. Interestingly, the double mutant S612A/T643A of Lamin A is as robustly O-GlcNAc modified as the WT protein. Inclusion of the nine additional sites in the negative database would clearly impair prediction. The EWS protein is another excellent illustration, since it is a well-documented OGT substrate (58) but is not present in the PhosphoSite O-GlcNAc database which was used as the basis for most of the existing predictors.
Here, we explored the utility of an experimentally verified negative dataset for the prediction of likely glycosylation-target sites for the longest isoform of OGT. Initial work on glycosylation of the three FET family proteins, fused in sarcoma (FUS), EWS, and TATA-binding protein-associated factor 15 (TAF15), provided insight and suggested a way forward for site prediction (58). As in previous examples, we used the PhosphoSite O-GlcNAc database as our positive set but extended the length of the peptide region considered. To obtain a negative dataset, we purified a set of IDRs and subjected them to optimized glycosylation reactions with purified recombinant OGT, then used whole protein MS rather than MS/MS to identify proteins that were not being glycosylated. We computationally optimized a scoring matrix to distinguish between the positive set and our experimental negative dataset. The scoring matrix suggests that OGT substrates have an amino acid compositional bias that extends beyond the polypeptide region that can be accommodated in the catalytic site of the enzyme. Compositional mutants of the FUS N-terminal low-complexity region (LCR N) support an OGT compositional bias. We verified that the TPR region of OGT (OGT-TPR) can bind to FUS and an enhanced-glycosylation mutant of FUS, leading us to speculate that the TPR region enables OGT to glycosylate substrates that are not optimally recognized by the catalytic site.

Glycosylation stoichiometry of the FET proteins
Previously, Kamemura reported that, of the three FET proteins FUS, EWS, and TAF15, only EWS is glycosylated with high stoichiometry by OGT (58). Since the FET proteins are homologous, this suggested that further analysis of FET protein glycosylation might inform our understanding of OGT specificity. We assayed glycosylation levels on undigested protein samples using electrospray ionization MS following in vitro glycosylation of purified human FUS, EWS, and TAF15 fragments (Fig. 1, A-C). Specifically, we measured glycosylation of the LCR N of these FET proteins, since this is the region of EWS which is glycosylated (19,59): FUS (aa 1-214), EWS (aa 1-264), and TAF15 (aa 1-210), hereafter referred to as FUS, EWS, and TAF15, respectively. A distribution of glycosylated states with three to ten added O-GlcNAc moieties was observed for EWS. No unglycosylated EWS was observed following the reaction. In contrast, the majority of the TAF15 LCR N protein was not glycosylated, although a small amount of singly glycosylated protein was observed. Modest glycosylation of FUS LCR N was observed, with a median of two sugar groups added. These experiments confirm that EWS can be heavily glycosylated, while FUS is modestly glycosylated and TAF15 is largely unglycosylated in vitro.

Glycosylation sites in EWS
To identify specific EWS glycosylation sites, we subjected the EWS LCR N to chymotrypsin digestion followed by LC-MS/MS. As observed by others, we found that the O-GlcNAc group is labile and is removed during the peptide fragmentation step. Thus, we could determine which peptides are glycosylated but could not identify specific serines and threonines that are glycosylated. Glycosylation sites were spread across the LCR N as indicated in Table 1. We can determine that there are more than 14 glycosylation sites, though we never observed EWS protein modified at 14 or more sites simultaneously. We speculate that glycosylation at some sites might inhibit glycosylation at nearby sites, thus making it unlikely to observe EWS glycosylated at all possible sites.

Amino acid variation in the FET LC regions
The stark difference in FET protein LCR N glycosylation stoichiometry was surprising, since their sequences share many features. These LCR N are primarily comprised of the amino acids glycine, alanine, serine, threonine, tyrosine, proline, and glutamine, a composition that is similar to the RNA Pol CTD, which is also known to be glycosylated by OGT. However, detailed comparison of the sequences suggests explanations for their differing substrate specificity (Figs. 1D and S1). EWS has a much higher percentage of alanine, proline, and threonine residues than either FUS or TAF15. The fractional content of glycine and serine is lower for EWS than either FUS or TAF15. The RNA Pol CTD is noticeably depleted in glycine and glutamine residues relative to the FET proteins. TAF15 has a notably higher proportion of charged residues including arginine, aspartic acid, and glutamic acid. Observing these differences, we decided to investigate the importance of amino acid composition outside of the immediate vicinity of the glycosylated sites.

Computational optimization of OGT substrate prediction
The inability of OGT to appreciably glycosylate TAF15 suggested that it would be possible to develop an experimentally verified negative dataset to improve substrate prediction. To that end, a number of known IDRs were subjected to glycosylation under optimal conditions followed by intact MS. The EWS LCR N served as a positive control for this experiment. High-quality mass spectra for IDRs of SARA (aa 766-822, human) (60), DDX4 (aa 1-236, mouse) (61), TAF15 (aa 1-210, human), CFTR (aa 654-838, human) (62), and FMRP (aa 445-632, human) (63) demonstrated that these proteins were not glycosylated to any appreciable extent even after 15 h of reaction time (Figure 2). Only one of the IDRs that we tested, from the yeast protein Sic1 (aa 1-90) (64), was significantly glycosylated under our reaction conditions (not shown) and thus excluded from our negative dataset. (Hereafter, these IDRs are referred to only by the name of the protein.) Close inspection of the DDX4 and TAF15 spectra showed a small fraction of protein glycosylation on a single site. Nevertheless, we decided to keep DDX4 and TAF15 in our negative dataset, since the site(s) are clearly less than optimal and barely glycosylated even following overnight incubation. All peptides centered on serine or threonine residues were extracted from the relevant TAF15, CFTR, DDX4, FMRP, and SARA sequences to use as a negative dataset. The negative set includes a total of 135 peptides, though many contain overlapping sequences. Because of the small size of the dataset, we did not exclude any of the peptides for subsequent testing of the approach. Our positive dataset consists of the sites listed in the PhosphoSite O-GlcNAc database, which includes 1830 sites primarily from mouse, human, and rat proteins. A simple computational strategy was used to optimize a substrate scoring matrix (Fig. 3, A and B). 
The substrate scoring matrix had position relative to the potential glycosylation site along one axis and the amino acids along the other axis. Tryptophan and cysteine were excluded from the scoring matrix, as they were considered too rare to properly evaluate. Substrates in the positive and negative sets were scored by summing the value of the appropriate matrix positions for each amino acid in the sequence. Random modifications were made to the matrix and were kept if the distance between the median scores of the positive and negative dataset increased. The optimized matrices converged on similar matrices, despite starting from very different starting matrices. Matrices for peptide lengths of 15, 23, 31, and 39 were evaluated. As some of the observed trends spanned the longest 39 residue peptides, we chose this length for our further work, consistent with longer lengths being required for optimal glycosylation of some substrates.
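The optimization described above can be sketched as a simple random hill climb. This is an illustrative reading of the text, not the authors' code: the padding of windows at the termini with "-", the bounding of matrix entries to [-1, 1], the step size, and the use of the signed difference (positive median minus negative median) as the "distance" are all assumptions, and every function name is hypothetical.

```python
import random
from statistics import median

# 18 residue types: tryptophan and cysteine are excluded, as in the text.
AMINO_ACIDS = "ADEFGHIKLMNPQRSTVY"


def center_windows(seq, half_width=19):
    """Yield windows of length 2*half_width+1 centered on each Ser/Thr.

    Termini are padded with '-' (an assumption; '-' scores zero).
    """
    padded = "-" * half_width + seq + "-" * half_width
    for i, aa in enumerate(seq):
        if aa in "ST":
            yield padded[i:i + 2 * half_width + 1]


def score(window, matrix):
    """Sum the matrix value for the residue found at each window position."""
    return sum(matrix[pos].get(aa, 0.0) for pos, aa in enumerate(window))


def new_matrix(length, rng):
    """Random starting matrix: one {residue: value} dict per position."""
    return [{aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}
            for _ in range(length)]


def separation(matrix, positives, negatives):
    """Median positive score minus median negative score."""
    return (median(score(w, matrix) for w in positives)
            - median(score(w, matrix) for w in negatives))


def optimize(positives, negatives, length, steps=2000, step_size=0.1, seed=0):
    """Keep a random single-entry change only if the separation increases."""
    rng = random.Random(seed)
    matrix = new_matrix(length, rng)
    best = separation(matrix, positives, negatives)
    for _ in range(steps):
        pos = rng.randrange(length)
        aa = rng.choice(AMINO_ACIDS)
        old = matrix[pos][aa]
        # Entries are clamped so the objective cannot grow by rescaling alone.
        matrix[pos][aa] = max(-1.0, min(1.0, old + rng.uniform(-step_size, step_size)))
        trial = separation(matrix, positives, negatives)
        if trial > best:
            best = trial
        else:
            matrix[pos][aa] = old  # revert the rejected change
    return matrix, best
```

With `half_width=19` the windows are 39 residues long, matching the longest peptide length evaluated in the text; running `optimize` from several seeds is one way to check whether different starting matrices converge on similar solutions.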
The optimized matrix (Fig. 3B) suggests that glycosylated peptides have a bias for the methyl group-containing amino acids alanine, valine, methionine, threonine, isoleucine, and leucine; proline was also favorable. Glycosylated peptides were depleted in glycine, glutamine, and asparagine as well as the charged amino acids, glutamate, aspartate, and arginine. Trends for glycine, alanine, proline, asparagine, and glutamine seemed to be consistent across the length of the matrix. The results of the optimization are shown in Figure 3C, which demonstrates that the positive and negative sets are largely separated. Notably, peptide scores for individual IDRs in the negative dataset are quite variable. Peptides derived from low-complexity IDRs that readily phase separate, such as DDX4, FMRP, and TAF15, are overrepresented in the negative set due to the bias in protein availability in our lab. The smaller number of peptides derived from CFTR and SARA do not score as poorly as the other peptides in the negative set, indicative of a deficiency in the negative set (see below). The similarities of the matrix values along the length of the peptides suggested that OGT might select IDRs with particular amino acid makeups, rather than exclusively selecting short linear motifs via the active site. We refer to our predictive algorithm that utilizes the compositional bias of 39 residues around the modification site as OGTcomPred.

Testing compositional bias
To measure compositional bias in the positive and negative datasets, we used the program fLPS2.0 (65). The background proportions of amino acid types were those derived from human UniProt records (53) or alternatively a dataset of disordered proteins (Fig. S2). To determine the proportions of amino acids in disordered proteins, we determined their abundance in a MobiDB (66) manually curated version of the DisProt database (67). We further selected only human proteins with greater than 50% fractional disorder. The compositional biases were more evident in the PhosphoSite database, due to the larger size of the database, than the experimental negative set (Table 2). As expected, the positive dataset is biased toward serine and threonine, since these residues are present in every peptide in the database; this bias was observed whether the human proteome or the disordered protein amino acid composition was used as the background. In contrast, the negative dataset does not appear biased for threonine in either case, suggesting that threonines might be more favorable for glycosylation. As suggested by the substrate scoring matrix, there is a bias for inclusion of methyl-containing residues like alanine and valine and for isoleucine and methionine when using disordered protein amino acid composition. The positive set also has a bias for prolines, though this disappears when using the disordered protein composition. The negative set has a notable bias for inclusion of glycine and to a lesser extent glutamine and asparagine. Thus, analysis of the training datasets supports a difference in the compositional biases of the positive and negative datasets and a role for amino acid compositional bias in glycosylation target selection.

Compositional mutations
To test the effect of amino acid composition on glycosylation, we used the trends from our optimized matrix incorporated into our OGTcomPred algorithm and the compositional bias measures to predict mutations in the FUS LCR N that would enhance glycosylation. We made six constructs containing various combinations of mutations. Mutations were chosen on the basis of composition with no regard for local sequence motifs, in order to test the role of composition rather than the role of specific glycosylation motifs. In total, 13 glycines were mutated to alanine, threonine, or proline. Long stretches of alanines were avoided to prevent α-helix formation. The remaining substitutions were glutamines mutated to threonine or proline, serines mutated to threonines, and a single aspartate mutated to threonine. Mut-A contained the mutations Q31T, G34A, Q36A, G40T, Q43P, D46T, and G49A. Mut-B contained the mutations G67A, Q69T, G74A, G76P, G79A, G80P, G82A, S83T, and Q85P. Mut-C contained the mutations G99A, G101T, S107T, S108T, G111A, G114A, and S115T. Mut-D combined Mut-A and B mutations. Mut-E combined Mut-B and C mutations. Mut-F combined mutations from Mut-A, B, and C (Table S2). The total number of serines and threonines increased by less than 10% going from WT to Mut-F. SUMO fusions of the mutants Mut-B through Mut-F were successfully purified, glycosylated, and subjected to LC-MS. We failed to purify Mut-A. Unlike in the initial experiment with the three FET proteins (Fig. 1), SUMO fusions with the LCR N were used in the glycosylation reactions and the MS experiments, as we could not consistently get data without the fusion tag in place. However, the SUMO appears to have reduced the level of glycosylation, possibly via transient steric inhibition. Under the conditions used in this experiment, we observed only a single O-GlcNAc modification on WT FUS. 
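Point mutations written in the notation used above (wild-type residue, 1-based position, new residue) can be applied and checked programmatically. The sketch below is a hypothetical helper, and the seven-residue sequence is a toy placeholder rather than the FUS sequence; checking the wild-type residue guards against numbering errors when combining mutation sets.

```python
import re


def apply_mutations(seq, mutations):
    """Apply mutations such as 'G34A' to seq (1-based positions).

    Raises ValueError if the notation is malformed, the position is out of
    range, or the stated wild-type residue does not match the sequence.
    """
    residues = list(seq)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"position out of range in {mutation!r}")
        if seq[position - 1] != wild_type:
            raise ValueError(f"{mutation!r}: sequence has {seq[position - 1]!r} there")
        residues[position - 1] = new
    return "".join(residues)


def count_ser_thr(seq):
    """Number of potential O-GlcNAc acceptor residues (Ser + Thr)."""
    return seq.count("S") + seq.count("T")


toy = "MSGQGSY"  # placeholder sequence, not FUS
mutant = apply_mutations(toy, ["G3A", "Q4T"])
print(mutant)                                     # MSATGSY
print(count_ser_thr(toy), count_ser_thr(mutant))  # 2 3
```

Combined constructs such as Mut-D can be built by concatenating the individual mutation lists before calling the helper, and `count_ser_thr` gives a quick check on how much the number of acceptor residues changes.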
With an increasing number of mutations, higher O-GlcNAc stoichiometries were observed, with as many as seven on the most highly mutated construct, Mut-F (Fig. 4A). Comparing the maximum number of observed sites with the number of sites predicted by our OGTcomPred algorithm gave a Pearson correlation of 0.84 with a p value of 0.038 (Fig. 4B). In contrast, comparing the number of sites predicted by O-GlcNAcPRED-II, considered to be the best existing predictor (45), gave a Pearson correlation of 0.59 with a p value of 0.21 for this set. The data for individual mutants showed a distribution in the number of glycosylation sites, matching expectation. However, for Mut-D, peptides with one, three, and four added sugars were observed but peptides with two added sugars were not observed (Fig. 4A). The explanation for this is unclear, though it is possible that this peptide simply was not detected in the mass spectrometer. In summary, the FUS glycosylation mutations support the hypothesis that OGT can utilize amino acid composition over significant stretches within IDRs to recognize substrates and suggest that the relatively short peptide sequences used by O-GlcNAcPRED-II to identify glycosylation sites do not fully capture this compositional bias.
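For readers who want to relate the reported correlations to their p values, the sketch below computes a Pearson coefficient and the corresponding t statistic. The sample size of six (WT plus Mut-B through Mut-F) is an assumption inferred from the constructs that were measured; converting t to a p value additionally requires the Student t distribution with n - 2 degrees of freedom, which is not implemented here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Reported coefficients, with n = 6 constructs assumed.
print(round(t_statistic(0.84, 6), 2))  # 3.1
print(round(t_statistic(0.59, 6), 2))  # 1.46
```

With only four degrees of freedom the difference between these two t values is what separates a correlation that passes the conventional 0.05 threshold from one that does not, so the comparison should be read with the small sample size in mind.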

NMR evidence for direct interaction between OGT-TPR and FUS
Since it is known that some OGT substrates require the TPR for efficient glycosylation, we hypothesized that the TPR functions by binding to substrates to increase the likelihood of contact with the catalytic domain. NMR is a reliable means of confirming protein interactions involving conformationally flexible IDRs. Therefore, to test whether the TPR can bind to FUS, we generated NMR spectra of 15N-labeled WT FUS LCR N in the presence and absence of unlabeled TPR region of OGT (OGT-TPR, aa 2-474) fused to a SUMO tag. The 15N labeling allows us to observe spectral crosspeaks (circular signals in Fig. 5) that correspond to individual bonded amide proton-nitrogen pairs in FUS. In the overlay of the WT FUS spectra with and without OGT-TPR (Fig. 5A), we observe that addition of OGT-TPR causes several peaks to largely disappear, specifically, peaks arising from the two SYXGY motifs (motifs found at aa 37-41 and 96-100) in FUS. In Figure 5C, a plot of signal intensity ratios of samples with and without OGT-TPR demonstrates the heterogeneity in peak intensity changes, with intensity losses ranging from none to 90% and an average peak intensity ratio of 0.48 ± 0.24. The simplest mechanistic explanation is that FUS binds to the OGT-TPR, which is 50 kDa in size and is known to form a dimer, causing the rotational motion of FUS-interacting residues to slow dramatically and leading to significant NMR signal loss. Residues further from the directly interacting residues experience less restriction in rotational motion and consequently less signal loss. The heterogeneous peak intensity loss provides solid evidence that the OGT-TPR binds to WT FUS in a dynamic manner, with multiple interacting elements of FUS exchanging on and off the surface of the TPR (68), and suggests that some parts of the FUS sequence are preferred binding sites. 
To test whether the compositional mutations enhance binding to OGT-TPR, we next recorded NMR spectra of FUS Mut-F in the presence and absence of OGT-TPR. The Mut-F overlay shows a more dramatic loss of signal intensity in the presence of OGT-TPR (Fig. 5, B and D), with an average peak intensity ratio of 0.36 ± 0.22, strongly suggesting that the compositional mutations enhance binding to the OGT-TPR. To control for possible binding of SUMO to WT FUS and FUS Mut-F, we repeated the experiment using OGT-TPR not fused to SUMO. The results were qualitatively similar (Fig. S3) providing evidence that changes in the FUS spectra are due to TPR binding and not SUMO binding. However, in the absence of the SUMO, the samples with OGT-TPR phase separated which made quantitative comparison of the apo and plus OGT-TPR samples impossible.

Test case: CREB-binding protein
We next tested our compositional matrix glycosylation site predictor, OGTcomPred, on four known IDRs from human CREB-binding protein (CBP) (69-71) and then measured glycosylation experimentally using MS. The four regions were the ID1 (CBP aa 1-344), ID3 (CBP aa 676-1080), ID4 (CBP aa 1851-2057), and ID5 (CBP aa 2124-2442). We predicted 13, 34, 15, and 6 sites in ID1, ID3, ID4, and ID5, respectively (Fig. 6), whereas only one site in each of ID1 and ID5 and no sites in ID3 and ID4 are listed in the PhosphoSite O-GlcNAc database. Following overnight glycosylation, MS demonstrated glycosylation at a median of 3 sites in ID1 and a median of 2 sites in ID5 (Fig. 6). ID4 was predominantly unglycosylated, which could be due to secondary structure elements unaccounted for by the prediction (see Discussion). We were unable to obtain MS data on the full ID3, so we digested the glycosylated protein with trypsin and submitted the sample to MS/MS (Table 3). We found glycosylation at S709, with a further three sites between residues 715 and 768 and one between residues 972 and 998. Therefore, there are at least 5 possible glycosylation sites in ID3. These results confirm that OGT can glycosylate more sites than are listed in the PhosphoSite database but indicate that our predictor shares a high false positive rate with previously developed predictors.

Dataset analysis
To gain further insight into OGT substrate recognition, we analyzed matrix plots of position-dependent amino acid frequencies for the positive and negative datasets used here and in the O-GlcNAcPRED-II predictor development (Fig. 7). The O-GlcNAcPRED-II positive dataset (Fig. 7B) and the PhosphoSite dataset (Fig. 7A) from which it is derived are highly similar. In contrast, the experimental negative set from this study (Fig. 7C) and the negative set for the O-GlcNAcPRED-II study (Fig. 7D) are quite different. Although published details on how the O-GlcNAcPRED-II negative dataset was obtained are limited, it contains approximately 51,000 peptide sequences. This database is large and, at first glance, appears to have a very limited amount of residual sequence-specific information, with nearly uniform amino acid frequencies along the length of the peptide. As such, the database may have been useful as a way to normalize the positive dataset against expected amino acid frequencies, rather than primarily contributing information on sites that are difficult to glycosylate. Consistent with this, the O-GlcNAcPRED-II negative dataset matrix is very similar to the matrix for all human protein S/T-centred peptides derived from UniProt (Fig. 7E). Examination of the positive datasets demonstrates overall amino acid frequencies that are similar to the O-GlcNAcPRED-II negative set and the human proteome. For example, serines and to a lesser extent prolines, alanines, glycines, and leucines are present with high frequency in the positive datasets and the O-GlcNAcPRED-II negative dataset. However, in the positive sets, one also sees amino acids that are overrepresented or underrepresented in a position-dependent manner relative to the O-GlcNAcPRED-II negative set. These primarily occur within the 4 residues before and after the serine/threonine glycosylation site and likely represent sequence-specific elements that are recognized by the catalytic domain of OGT. 
For example, peptides with prolines in the i-3, i-2, and i+2 positions seem to be favorably selected by OGT. The presence of many threonines in the i+1 through i+14 positions indicates a preference for threonines C-terminal to the serine/threonine glycosylation site, an observation that has previously been reported (72). In contrast to the O-GlcNAcPRED-II negative set, the experimental negative set presented here is extremely small, just 135 peptides with overlapping sequences, and is likely not a very good sampling of the OGT negative site proteome. The small size of the dataset results in a rather noisy dataset. Nonetheless, the amino acid frequencies differ significantly from the human proteome as shown by the compositional bias results above. In the matrix representation of the negative set, glycine and to a lesser extent arginine, aspartate, asparagine, and glutamine have a higher relative abundance compared to the other datasets. At the same time alanine, proline, valine, isoleucine, leucine, and methionine are less abundant than in the other datasets. While the small size and biased nature of the negative dataset make strong conclusions unwise, the compositional bias is suggestive. Of note, the noise in the negative dataset precludes discernment of any sequence-specific information.
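A position-dependent amino acid frequency matrix of the kind analyzed above can be computed from any set of equal-length, site-centered peptides. The sketch below is a minimal illustration with hypothetical names and three-residue toy peptides; it is not the code used to generate Figure 7.

```python
from collections import Counter


def position_frequencies(peptides):
    """Return one {residue: frequency} dict per position.

    All peptides must share the same length so that position i always
    refers to the same offset from the central Ser/Thr.
    """
    if not peptides:
        raise ValueError("no peptides given")
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("peptides must all have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        matrix.append({aa: n / len(peptides) for aa, n in counts.items()})
    return matrix


# Toy three-residue windows (i-1, i, i+1); real windows would be far longer.
freqs = position_frequencies(["PST", "AST", "PTT"])
print(freqs[2])  # {'T': 1.0}
```

Comparing such a matrix for a positive set against one for a background set, position by position, is one way to separate local sequence preferences near the site from a composition that is uniform along the peptide.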

Comparison to other O-GlcNAcylation predictors
Rigorous comparison of site predictors requires definitive knowledge of both positive and negative sites, since specificity and accuracy cannot be calculated without knowing the number of negative sites. Our knowledge of negative sites is still extremely limited, in part due to a focus on high-throughput approaches, which are better at identifying positive sites. Lamin A is an O-GlcNAcylation target that has been studied in a targeted low-throughput approach, giving more confidence that sites not identified as glycosylated are in fact not glycosylated by OGT (47). We used Lamin A as a test case to crudely compare the different predictors (Table 4). Results from our simple predictor OGTcomPred compare favorably with early predictors, though they are not as good as more sophisticated tools such as O-GlcNAcPRED-II (41). Interestingly, preliminary exploration suggests that adding some sequence specificity back into our predictor improves prediction results while decreasing the ability of our predictor to discriminate between our positive and negative datasets (not shown). This supports our suspicion that some sequence specificity is lost due to the small size of our experimental negative set. Nevertheless, the fact that our predictor OGTcomPred compares well with some of the other predictors, despite this loss, supports our contention that OGT substrates have a compositional bias.

Discussion
It is not yet possible to reliably predict O-GlcNAcylation sites, despite significant effort put toward developing predictors. Here, we developed a small, experimentally tested negative dataset, which suggests that OGT has the ability to distinguish between substrates and nonsubstrates based on amino acid composition over an extended sequence length, a factor that should be a consideration in predicting glycosylation. Specifically, methyl group-containing amino acids and proline were favorable, while glycine, glutamine, and asparagine, as well as the charged amino acids, glutamate, aspartate, and arginine, seem to inhibit glycosylation. We speculate that the preference for methyl group-containing amino acids arises because the nonpolar groups provide interaction energy without imposing significant geometric constraints (73), which would reduce the probability of correctly positioning serines and threonines in the catalytic site. A mechanism for a bias against glycine residues is unclear but the higher degree of conformational freedom of glycines might impair interactions with the TPR. Data obtained for compositional mutations introduced into the FUS LCR N with no regard for specific active-site recognition motifs support this idea of a compositional bias. Changes in glycosylation stoichiometry were correlated with the number of glycosylation sites predicted by our compositional matrix predictor OGT-comPred. NMR data provide evidence that glycosylation-promoting compositional mutations enhance OGT-TPR binding to the FUS LCR N. Together, these observations support a model in which interactions between intrinsically disordered substrates and the OGT-TPR can facilitate glycosylation of sites that have a suboptimal interaction with the catalytic site, as has been previously observed (74).
There are many factors that affect OGT recognition in vivo. Here we focused on factors that influence the ability of ncOGT (OGT with 13 TPR repeats) to directly recognize and glycosylate other proteins in vitro, including interactions of the substrate with the catalytic site and the TPR (Fig. 8). While some substrates seem to require the full TPR for efficient glycosylation, others can be glycosylated with minimal TPR repeats. An example of the latter is a 12-amino acid substrate peptide derived from casein kinase II (CKII), which can be glycosylated by a shortened OGT variant that is missing 5.5 TPR repeats relative to the full ncOGT variant (31). Furthermore, addition of TPR in trans does not competitively inhibit glycosylation of CKII (31), suggesting that the TPR region does not contribute significantly to recognition of CKII as a substrate. In contrast, other substrates like TRAK1 (31) and the C-terminal domain of RNA polymerase II (30) require all of the TPR repeats for efficient glycosylation. OGT with a full TPR is known to glycosylate a broader range of substrates than OGT isoforms with fewer TPR repeats (26). This is consistent with OGT requiring a threshold level of affinity for efficient glycosylation, with that affinity being achieved either by optimal interaction between a short peptide segment of an IDR and the OGT catalytic region or alternatively by a combination of many weak interactions between a long IDR and the OGT catalytic region and the TPR.

Catalytic site interactions
From a structural perspective, it is not yet well understood how substrates are recognized by the catalytic site. Consistent with the wide array of OGT substrates, there are relatively few contacts between the catalytic site and the side chains of peptide substrates, with crystal structures demonstrating that most contacts involved the backbone of substrate peptides (2,22). Nonetheless, a screen of randomly generated 13-residue peptides shows a highly specific selection of substrates for this class of short peptide, with less than 10% of the peptides in the screen being glycosylated as efficiently as the positive control. Crystal structures of multiple substrates demonstrate a highly constrained peptide backbone in the −3 to +2 region involving hydrogen bonds to the peptide backbone. These structures suggest the presence of size preferences or steric restrictions in the different positions along the substrate. For example, smaller amino acids are preferred in the −3 and +2 position, while the −2 position seems to disfavor small amino acids such as alanine and glycine.

Figure 8. OGT ligand selection model. The OGT catalytic domain and TPR helix are shown in blue, bound to a peptide ligand that has an optimal fit to the catalytic site and a TPR-interacting region with a compositional bias that promotes interaction with the TPR. Glycosylation is indicated by the orange G. Short peptides with optimal fits for the catalytic site can be glycosylated but short peptides with suboptimal fits are not glycosylated. Extended peptides with suboptimal catalytic site fit can still be glycosylated if the peptide has a compositional bias that is suitable for TPR interaction (green) but not if the compositional bias is less favorable for a TPR interaction (red). TPR, tetratricopeptide repeat region.

Even dramatic substitutions of single residues, for example, replacing serines and alanines in the +2 position with phenylalanine, merely reduce the efficiency of glycosylation, introducing energetic costs that might be overcome through TPR interactions with longer peptides. However, the additive effects of several unfavorable changes could possibly prevent glycosylation. So, rather than trying to define sequences that interact optimally with the catalytic site, it might be more helpful to look for sequences that prohibit glycosylation. The large number of possible sequence combinations will make uncovering prohibitive sequences difficult but may be a key piece of solving the prediction puzzle. A second piece of the puzzle is trying to define the amino acid preferences for interactions with the OGT-TPR. In the crystal structure of OGT-TPR bound to a peptide derived from HCF-1, the substrate is in an extended conformation in the inside of the helix. A series of conserved asparagine residues arranged on the inside of the TPR helix (24) form bidentate interactions with the substrate peptide backbone (72). Mutation of five of these asparagines selectively inhibited a substantial number of substrates that require the OGT-TPR for efficient glycosylation (52). Since these are backbone contacts, they likely play a minimal role in specificity. Structures with an HCF-1-derived peptide bound inside the helix also show four TPR aspartates forming hydrogen bonds with threonine sidechains in the HCF-1 peptide. This explains the prevalence of threonines C-terminal to the glycosylation site in the Phosphosite and O-GlcNAcPRED-II positive datasets. Our work suggests that glycines are unfavorable, possibly because they are less conformationally restricted, which would impose a greater entropic cost for binding to the asparagine ladder. We also found that small hydrophobic residues such as alanine, valine, and proline are favorable. 
We speculate that these can make favorable van der Waals interactions with the concave surface of the TPR helix, possibly via transient, dynamic interactions (64) that allow substrate sidechains to be correctly positioned in the catalytic site. In contrast, amino acids with side chains that can form hydrogen bonds seem unfavorable with the exception of threonines and serines. Hydrogen bond-forming amino acids such as asparagine, glutamine, and glutamate may introduce geometric constraints that are difficult to satisfy. Glutamate may additionally be unfavorable because of the net excess of acidic residues on the concave surface of the TPR. Consistent with this, OGT constructs with fewer TPR repeats more readily glycosylate substrates with polar uncharged and charged residues such as glutamine, asparagine, lysine, glutamate, and aspartate (26).

Substrate structure impact on glycosylation
Of note, based on existing OGT crystal structures, secondary structure elements and folded domains are predicted to be incompatible with glycosylation by OGT due to steric clashes (22). This could explain our observation that ID4 of CBP does not get glycosylated despite our prediction. In fact, the regions that flank the majority of our predicted target serines and threonines in this intrinsically disordered segment form alpha-helical structures with significant populations (70). We speculate that disruption of these secondary structure elements could promote glycosylation of ID4. Amino acids conducive to maintaining an extended or random coil structure significantly affect the propensity of a peptide to be glycosylated and may explain the preference for prolines and beta-branched residues (2). Although the vast majority of OGT substrates are in IDRs (5), there are a few examples of proteins that are glycosylated in folded or ordered regions, including HMGB-1 (75), H2B (76), and αB-Crystallin (77). Glycosylation of these ordered sites could occur if the TPR is able to move away from the catalytic site, as has been suggested by a recent electron microscopy structure (78). Alternatively, the OGT-TPR might be able to unfold a select group of ordered regions, as the TPR-containing karyopherin proteins are known to behave as chaperones (79). Finally, glycosylation of these ordered regions could occur cotranslationally before the proteins are fully folded (80,81). Since most OGT ligands are IDRs, we intentionally picked IDRs to build our negative dataset. When making predictions for proteins for which the structure is unknown, we also couple the prediction to a disorder predictor. Attaining more accurate predictions will likely require incorporation of structural and steric constraints, which may be facilitated by recent advances in structure prediction (82,83).
How proximal structured elements impact glycosylation is not yet well defined. The range of possible OGT-TPR entry points and the effect of adjacent folded domains on TPR entry are unknown. Examining the TPR structure, it appears that peptides do not need to enter the TPR helix from one end, since there is sufficient space for a peptide to enter the TPR interior from points along the helix. However, larger structural elements would not be able to enter the interior without significant rearrangement of the helix. Thus, our observation that the isolated IDR3 of CBP can be glycosylated in vitro does not mean that it can be glycosylated in vivo since it is flanked by folded regions in the full-length CBP protein. These considerations add further complexity to prediction efforts. Prediction approaches to date have taken a structure-agnostic approach but pushing predictions towards higher accuracy will require addressing these structural issues. Overcoming this complexity is a worthwhile goal given the importance of O-GlcNAc modification for modulating protein thermodynamics, aggregation, and phase separation propensity.

Expression and purification of proteins
All DNA constructs were verified by sequencing. Proteins were expressed in E. coli BL21 (DE3) RIPL cells using LB media, unless otherwise stated. Cell cultures were grown to an optical density of 0.8 and then induced with 0.5 mM IPTG and harvested after 16 h at 18 °C. Purifications were carried out at room temperature, unless otherwise stated. Purified protein samples were further verified by MS to ensure that they were the expected molecular weight.

EWS, FUS, and TAF15 LCR N purification
His-tagged SUMO fusions of LCR N fragments of human EWS (aa 1-264), FUS (aa 1-214), and TAF15 (aa 1-210) were expressed, the bacteria were lysed by sonication, and the proteins were then purified by nickel affinity chromatography using a buffer containing 20 mM CAPS, pH 11, 500 mM NaCl, and 4 M guanidinium chloride (GdmCl), with 20 mM imidazole added to the aliquot used for lysis and washing and 280 mM imidazole added to the aliquot used for elution. Proteins were then subjected to size exclusion chromatography using a buffer composed of 40 mM arginine, pH 9. A HiLoad Superdex 75 HR 16/600 column (Cytiva) was used for all of the size exclusion chromatography described here. Only the purest fractions were retained for glycosylation reactions and MS. ULP1 protease (purified in-house) was used to remove the His-SUMO fusion protein. The LCR N protein was then loaded onto a size exclusion column without first concentrating the protein, since concentrating the protein led to significant loss. The same 40 mM arginine, pH 9 buffer was used for this step.

FUS and FUS mutant LCR N purification
We modified the purification to more reliably obtain FUS or mutant FUS LCR N without the SUMO fusion tag. Nickel affinity chromatography and size exclusion chromatography were followed by consecutive purification on a HiTrap Q column (Cytiva) and an 8 ml phenyl Superose column (Cytiva) using buffer with 40 mM arginine, pH 9 and gradients from 50 mM to 1 M NaCl and 1 M to 0 M NaCl, respectively. Following cleavage with ULP1, the protein was again purified by phenyl Superose using the same gradient, to yield highly pure FUS LCR N.

OGT purification
Following expression of human ncOGT (full length, 1-1046) in E. coli, cells were resuspended in a buffer containing 25 mM imidazole, 10% glycerol, 250 mM NaCl, and 25 mM Hepes, pH 7.5, 5 mM β-mercaptoethanol. DNaseI and an EDTA-free protease inhibitor tablet (Sigma) were also added to the lysis buffer. Following lysis by sonication and French press, the protein was purified by nickel affinity chromatography and eluted in the same buffer with 250 mM instead of 25 mM imidazole. Fractions containing pure protein were then dialyzed in 25 mM Hepes, pH 7.5, 40 mM NaCl, 0.5 mM EDTA, and 5 mM β-mercaptoethanol and then loaded onto a 5 ml HiTrap Q-XL column (Cytiva) and purified at 4 °C using a gradient from 0.05 to 1.0 M NaCl. Although the protein appeared pure after this anion exchange step, we further purified the protein using a HiLoad Superdex 200 HR 16/600 size exclusion column using a buffer containing 40 mM KPO4, pH 7.5, 125 mM NaCl, 0.5 mM EDTA, 0.5 mM benzamidine, and 5 mM β-mercaptoethanol to ensure that no contaminating proteases remained.

OGT-TPR purification
A construct containing SUMO fused to residues 2 to 474 of ncOGT representing the TPR region was purified by Ni affinity chromatography as for the full-length ncOGT purification. The SUMO tag was cleaved off using ULP1. NaCl was then added to the sample to bring the total NaCl concentration up to 1 M. This was followed by purification on a phenyl Superose column in a buffer of 25 mM Hepes, pH 7.5, 5 mM β-mercaptoethanol, using a gradient from 1 M to 150 mM NaCl. As a final purification step, the OGT-TPR was subjected to size exclusion chromatography using a buffer of 44 mM KPO4, 137.5 mM NaCl, 0.55 mM EDTA, and 0.55 mM benzamidine, pH 7.2. Purification of the OGT-TPR with the SUMO tag cleaved off was similar.

Production of protein for NMR spectroscopy
Isotopically labeled proteins for NMR spectroscopy were expressed in M9 minimal media using ¹⁵N ammonium chloride as the sole source of nitrogen.

OGT reaction conditions
Protein samples were dialyzed into 40 mM K2HPO4, pH 7.5, 125 mM NaCl, 0.5 mM EDTA, 2 mM β-mercaptoethanol, and 0.5 mM benzamidine. Reactions were performed at a protein concentration of 20 μM. Following addition of 1 μM ncOGT and 1 mM UDP-GlcNAc, samples were incubated at room temperature for 16 h.

Mass spectrometry
MS experiments were carried out at the Structural Genomics Consortium Toronto facility or at The Hospital for Sick Children SPARC Molecular Analysis facility. Samples were prepared by adding formic acid to a final concentration of 0.1% v/v. To determine glycosylation stoichiometry, purified glycosylated proteins and controls were either run on a Thermo-Fisher Orbitrap Q Exactive High Field instrument or on an Agilent ultra-high pressure liquid chromatography-quadrupole time-of-flight 6545 MS system equipped with a Dual JS electrospray ionization source. Samples were desalted online via a C18 column. Raw data were either processed using Agilent MassHunter software (https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software) or Thermo-Fisher software (https://www.thermofisher.com/ca/en/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-software/multi-omics-data-analysis/biopharma-finder-software.html) and deconvoluted using the maximum entropy algorithm with appropriate mass ranges. The deconvoluted data were then plotted using MATLAB. To identify specific glycosylation sites in the EWS LCR N region or the ID3 region of CBP, the protein was digested with chymotrypsin or trypsin, respectively, and then subjected to LC-MS/MS on a Thermo-Fisher Orbitrap Q-Exactive mass spectrometer, using higher energy collisional dissociation. To identify glycosylation sites with confidence, we set the following thresholds: the parent ion error had to be less than 1 ppm and the number of fragment ions with a score of less than 7 ppm had to be greater than 8. As the O-GlcNAc modification was lost during the peptide fragmentation step, we were able to identify peptides that were glycosylated (parent ion had modification) but typically unable to identify exactly which residues were glycosylation sites.
The MS/MS data were analyzed manually, since the software modification site assignment process assumed that the sugar was still present following the fragmentation step.
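As an illustration of the thresholds described above, the following sketch computes a ppm mass error, the expected precursor m/z of a peptide carrying a given number of GlcNAc moieties, and a pass/fail filter. The function names are hypothetical, and the sketch assumes that the fragment ion "score" refers to a mass error in ppm. The mass constants are the standard monoisotopic GlcNAc addition (C8H13NO5) and the proton mass. This is not the authors' analysis, which was performed manually.

```python
HEXNAC_MONO = 203.0794  # monoisotopic mass added by one GlcNAc (C8H13NO5), Da
PROTON = 1.007276       # proton mass, Da


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def precursor_mz(neutral_mass: float, charge: int, n_glcnac: int = 0) -> float:
    """Expected m/z of a peptide ion carrying n_glcnac GlcNAc moieties."""
    return (neutral_mass + n_glcnac * HEXNAC_MONO + charge * PROTON) / charge


def passes_thresholds(parent_ppm: float, fragment_ppms: list,
                      parent_limit: float = 1.0, fragment_limit: float = 7.0,
                      min_fragments: int = 8) -> bool:
    """Parent error below 1 ppm and more than 8 fragments within 7 ppm.

    Assumption: each fragment ion is judged by its absolute ppm error.
    """
    matched = sum(1 for err in fragment_ppms if abs(err) < fragment_limit)
    return abs(parent_ppm) < parent_limit and matched > min_fragments


# Hypothetical peptide of neutral mass 1000.0 Da with one GlcNAc, charge 2+.
expected = precursor_mz(1000.0, charge=2, n_glcnac=1)
print(round(expected, 4), passes_thresholds(0.4, [2.0] * 9))
```

Because the sugar is lost on fragmentation, only the precursor m/z carries the 203.0794 Da shift; the fragment ions match the unmodified peptide.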

NMR spectroscopy
Heteronuclear single quantum coherence experiments (84) were performed at 5 °C in a buffer containing 40 mM KPO4, pH 7.2, 125 mM NaCl, 0.4 mM EDTA, 0.5 mM benzamidine, 5 mM DTT, and 10% D2O. Matched samples were recorded on 20 μM ¹⁵N-labeled samples (below the threshold for phase separation) of either WT FUS LCR N or Mut-F FUS LCR N in the absence and presence of 32 μM SUMO-fused OGT-TPR. Spectra were processed with NMRPipe (85) and displayed in CCPNMR (86) software (https://ccpn.ac.uk). Peak intensities were obtained using Sparky (87) software (https://nmrfam.wisc.edu/nmrfam-sparky-distribution/). Peak assignments for FUS LCR N were obtained from the Biological Magnetic Resonance Bank (88,89). However, since our sample conditions differed from the conditions used by Burke et al. (Biological Magnetic Resonance Bank 26672), only peaks in less crowded regions of the spectrum could be assigned. The experiment was repeated using OGT-TPR with the SUMO tag removed to rule out a significant role for SUMO in the interaction.

Matrix optimization and score calculation
The Phosphosite O-GlcNAcylation database (1829 peptides) was used as a positive dataset to optimize a scoring matrix. The negative dataset consisted of peptides extracted from IDRs that we experimentally determined to not be glycosylated by OGT under optimal conditions, as described in the Results section. It consisted of 135 peptides centered on serines or threonines extracted from these proteins. To start, a matrix was constructed with arbitrary scores for the presence of particular amino acid types at particular positions relative to the potential glycosylation site. All amino acids other than W and C were used in the scoring function; W and C were excluded because they are relatively rare. The matrix consisted of a score for 18 different amino acids at all positions 19 residues prior to and 19 residues following the glycosylation site, yielding a matrix with dimensions 18 by 38. The matrix was then used to score all of the peptides in the datasets. In-house software was then used to optimize the matrix to maximize the scoring difference between the positive and negative datasets, using an iterative process of random changes to the matrix. Random changes were kept if they increased the matrix assessment score (MAS), defined as $\mathrm{MAS} = \frac{(P - P_{\mathrm{std}}) - (N + N_{\mathrm{std}})}{P}$, where $P$ and $N$ are the median values of the scores for the positive and negative datasets, respectively, and $P_{\mathrm{std}}$ and $N_{\mathrm{std}}$ are the standard deviations of the scores for the positive and negative datasets, respectively. The final matrix was used to score glycosylation sites. The threshold for a positive site was set at 160, unless otherwise indicated. (For Table 4, the low, medium, and high thresholds are 123, 148, and 160, respectively.) All existing predictors suffer from high numbers of false positive predictions. Setting a relatively high threshold increases the likelihood that a positive prediction is accurate but results in poor sensitivity.
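The procedure described above can be sketched as a simple random hill climb. This is an illustrative reimplementation under stated assumptions, not the in-house software: the function names, the step size, the uniform starting matrix, the use of the population standard deviation, the treatment of non-positive medians, and the toy peptides are all hypothetical. Peptides are assumed to be 39-residue windows with the candidate serine or threonine at the center.

```python
import random
import statistics

AMINO_ACIDS = "ADEFGHIKLMNPQRSTVY"   # 18 residue types; W and C are excluded
ROW = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
FLANK = 19                            # positions scored on each side of the site


def score_peptide(matrix, peptide):
    """Sum matrix entries over the 38 positions flanking the central residue."""
    total, col = 0.0, 0
    for i, aa in enumerate(peptide):
        if i == FLANK:                # skip the glycosylation site itself
            continue
        row = ROW.get(aa)
        if row is not None:           # W, C, or padding contribute nothing
            total += matrix[row][col]
        col += 1
    return total


def mas(matrix, positives, negatives):
    """MAS = ((P - Pstd) - (N + Nstd)) / P, with P and N the median scores."""
    p = [score_peptide(matrix, s) for s in positives]
    n = [score_peptide(matrix, s) for s in negatives]
    p_med, n_med = statistics.median(p), statistics.median(n)
    if p_med <= 0:                    # assumption: reject non-positive medians
        return float("-inf")
    return ((p_med - statistics.pstdev(p)) - (n_med + statistics.pstdev(n))) / p_med


def optimize(matrix, positives, negatives, steps=2000, step_size=0.5, seed=0):
    """Keep a random single-cell change only if it increases the MAS."""
    rng = random.Random(seed)
    best = mas(matrix, positives, negatives)
    for _ in range(steps):
        r, c = rng.randrange(len(AMINO_ACIDS)), rng.randrange(2 * FLANK)
        old = matrix[r][c]
        matrix[r][c] = old + rng.uniform(-step_size, step_size)
        trial = mas(matrix, positives, negatives)
        if trial > best:
            best = trial
        else:
            matrix[r][c] = old        # revert a change that did not help
    return best


def window(left, center, right):
    """Build a toy 39-residue window from repeating motifs."""
    return (left * 13)[:FLANK] + center + (right * 13)[:FLANK]


# Toy peptides for illustration only (not from either dataset).
positives = [window("PAT", "S", "TVP"), window("AVT", "T", "PTA")]
negatives = [window("GGN", "S", "GQG"), window("GDR", "S", "NGG")]
matrix = [[1.0] * (2 * FLANK) for _ in AMINO_ACIDS]   # 18 x 38 starting matrix
print(optimize(matrix, positives, negatives))
```

A candidate site would then be called positive when `score_peptide` meets a chosen threshold (160 for the published matrix; the toy matrix here is on a different scale).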

Measuring compositional bias
We measured compositional bias as defined by Harrison and Gerstein (90) and implemented in fLPS2.0 (65). Simply put, this definition measures how likely it is to observe a protein/peptide with a particular quantity of an amino acid given the overall frequency of that amino acid in the set of proteins to which it belongs. Unlikely amino acid makeups are considered to have a compositional bias that is quantified by how unlikely that makeup is. The background proportions of amino acid types were those derived from human UniProt records (53) or from a dataset of disordered proteins determined from the MobiDB (66) manually curated version of the DisProt database (67). We further selected only human proteins with greater than 50% fractional disorder. The PhosphoSite database was used without modification. However, the negative dataset is composed of overlapping peptides and is thus highly redundant, which we thought would significantly compromise the bias calculation. Therefore, we used the sequences of the protein regions containing the peptides in the database, rather than the database itself. Only biases related to the whole datasets are reported here.
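The idea of quantifying bias by how unlikely a composition is can be illustrated with a binomial tail probability. This is a simplified sketch only: it treats a whole sequence as one region and a single residue type at a time, and it does not reproduce the search procedure of fLPS2.0. The function names, the toy sequence, and the background frequency are hypothetical.

```python
from math import comb


def binomial_tail(n: int, k: int, f: float) -> float:
    """P(X >= k) for X ~ Binomial(n, f)."""
    return sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k, n + 1))


def enrichment_pvalue(sequence: str, residue: str, background_freq: float) -> float:
    """Probability of seeing at least this many copies of `residue` by chance,
    given its background frequency; smaller values mean stronger bias."""
    return binomial_tail(len(sequence), sequence.count(residue), background_freq)


# Toy glycine-rich sequence and an illustrative background frequency.
toy_sequence = "GGSGGYGGNGGSGGQGGDGG"
print(enrichment_pvalue(toy_sequence, "G", 0.07))
```

A depletion test would use the opposite tail, P(X <= k), in the same way.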

Data availability
The LC-MS/MS data on EWS-LC and CBP ID3 glycosylation, an optimized scoring matrix, and the script for scoring peptides can be downloaded from Zenodo (https://zenodo.org) at https://doi.org/10.5281/zenodo.6986306.
Supporting information-This article contains supporting information.
Ocsenas and Dr Rhea Hudson are thanked for help with protein purification. We thank Drs Suzanne Walker and Peter Tompa for the kind gifts of plasmids for expression of the OGT and CBP intrinsically disordered region proteins, respectively. Drs Paul Taylor and Craig Simpson and The Hospital for Sick Children SPARC Molecular Analysis facility are thanked for their mass spectrometry services. Drs Cangzhi Jia and Quan Zou are thanked for assistance in providing results of their predictor for select proteins.