Deep learning explains the biology of branched glycans from single-cell sequencing data

Summary Glycosylation is ubiquitous and often dysregulated in disease. However, the regulation and functional significance of various types of glycosylation at cellular levels is hard to unravel experimentally. Multi-omics, single-cell measurements such as SUGAR-seq, which quantifies transcriptomes and cell surface glycans, facilitate addressing this issue. Using SUGAR-seq data, we pioneered a deep learning model to predict the glycan phenotypes of cells (mouse T lymphocytes) from transcripts, with the example of predicting β1,6GlcNAc-branching across T cell subtypes (test set F1 score: 0.9351). Model interpretation via SHAP (SHapley Additive exPlanations) identified highly predictive genes, in part known to impact (i) branched glycan levels and (ii) the biology of branched glycans. These genes included physiologically relevant low-abundance genes that were not captured by conventional differential expression analysis. Our work shows that interpretable deep learning models are promising for uncovering novel functions and regulatory mechanisms of glycans from integrated transcriptomic and glycomic datasets.


INTRODUCTION
Glycosylation is a ubiquitous post-translational modification of proteins. Approximately half of all proteins are glycosylated (Apweiler et al., 1999), influencing their physical properties and biological activities Varki, 2017). Cells utilize glycosylation to control their function and fate. For example, in CD8 + T lymphocytes, b1,6-branched glycans of CD8 + T cell surface proteins are upregulated following T cell activation to prevent overstimulation. Downregulation of sialylated core 1 O-linked glycans of CD8 + T cell surface proteins induces apoptosis in the absence of activation, maintaining T cell homeostasis (Van Dyken et al., 2007;Priatel et al., 2000;Smith et al., 2018). Changes in glycosylation have been mechanistically implicated in cancer, infectious diseases, autoimmune diseases, metabolic disorders, and developmental defects (Demus et al., 2021;Ng and Freeze, 2018;Pinho and Reis, 2015;Qin and Mahal, 2021;Reily et al., 2019;Sun et al., 2016;Theodoratou et al., 2014;Vosseller et al., 2002).
Despite many observational studies reporting glycosylation changes in diseases, much remains unknown about the origin of these changes. Glycan biosynthesis is orchestrated in a non-templated manner by hundreds of enzymes encoded by ''glycogenes'', including glycosyltransferases (add sugar residues), glycosidases (remove sugar residues), sugar-modifying enzymes (synthesize phosphorylated, sulfated, or acetylated glycans), enzymes of sugar metabolism pathways, and sugar transporters Neelamegham and Mahal, 2016). Regulation of glycosylation includes transcriptional and post-transcriptional control of glycogenes, substrate availability, and intracellular trafficking of enzymes (Neelamegham and Mahal, 2016). Therefore, identifying factors driving observed changes in certain glycan structures can be a formidable challenge.
The functional significance of disease-associated glycosylation changes can also be difficult to ascertain, due to the multi-modal influences of glycans. Glycans can influence protein structure, interactions between proteins and receptors, recognition by carbohydrate-binding lectins, resistance to endocytosis and protease degradation, etc Johannes et al., 2018;Mimura et al., 2018). One glycan feature can thus have multiple effects. For example, increased a2,3-sialylation of cancer cells contributes to cancer progression and metastasis through mechanisms including (i) immune system evasion by interacting with a2,3sialic acid-binding, immunosuppressive Siglec receptors (e.g., Siglec-9), (ii) increasing metastatic potential via selectins (specific for a2,3-sialic acid-containing sialyl Lewis x and sialyl Lewis a antigens) displayed on iScience Article et al., 2020). Importantly, the b1,6-GlcNAc linkage is synthesized by alpha-1,6-mannosylglycoprotein 6-beta-N-acetylglucosaminyltransferase A (MGAT5), allowing PHA-L binding to act as a reliable proxy for the activity of this enzyme. Based on transcriptomes, T cells were categorized into 9 major subtypes, including naive/naive-like CD4 + T cells, naive/naive-like CD8 + T cells (may include central memory T cells), early active CD8 + T cells, effector memory CD8 + T cells, terminally exhausted CD8 + T cells (Tex), precursor exhausted T cells (Tpex), regulatory T cells (Treg), T helper 1 cells (Th1), and follicular helper T cells (Tfh) (Figures 1B and 1C). We validated this classification via marker gene expression ( Figure S1). Additionally, TIL composition is highly heterogeneous, comprising activated and regulatory cells. Conversely, the LN pool mainly contained resting T cells ( Figures 1B and 1C). Observed compositions of T cells isolated from different sites are consistent with previous studies (Kumar et al., 2018;Szabo et al., 2019).
Surface glycosylation varies by cell type (Agrawal et al., 2014;Holst et al., 2016;Tao et al., 2008). We observed clear and reproducible differences in surface expression of b1,6-branched glycans across T cell subtypes in both datasets ( Figure 1D). Treg and Tex consistently exhibited the highest expression of branched glycans, while naive CD4 + , naive CD8 + , and Th1 exhibited the lowest. Similarly, previous studies reported greater PHA-L binding to stimulated T cells than to their naive counterparts (Cabral et al., 2017;Smith et al., 2018). Glycan expression was also more variable in TIL, consistent with its more diverse composition ( Figure 1E). As surface branched glycans seemed to distinguish T cells of different characteristics and functions, we hypothesized that we could use DL to uncover the biology behind this differential glycosylation.
Training a deep learning model to predict glycan phenotypes from the transcriptome We set out to use DL to model surface glycosylation from transcriptome-wide gene expression data. We assigned binary labels to cells based on PHA-L data, corresponding to high and low ends of reads. Specifically, we made the top 25% cells ''PHA-L high '' and the bottom 25% ''PHA-L low '' to establish robust labels that are comparable between datasets. This robustly separated biologically distinct populations and enabled subsequent analyses. It also transformed our task into binary classification, with gene expression values as input, and the probabilities for the positive phenotype (PHA-L high ) of the corresponding cells as output.
We developed a neural network classifier comprising four hidden layers (Figure 2A), using data from either the TIL or the LN dataset. In the TIL hold-out test set, this classifier achieved a 92.17% prediction accuracy for the PHA-L high phenotype and 95.01% for the PHA-L low phenotype (AUC: 0.9359, F1 score: 0.9351, Table 1; Figure 2B). In the LN dataset, F1 score was lower, yet still exceeded 0.91. Prediction accuracies for both phenotypes remained greater than 90% (Table 1). This slight decrease in performance was most likely due to less variance in gene expression of the LN data, which likely stems from the less diverse cell composition ( Figure 1B). Robustness of the classifier was also indicated by non-overlapping distributions for the predicted probabilities of PHA-L high ( Figure 2C). We further observed comparable performance across different cell subtypes, despite the inherent imbalance in T cell compositions ( Figure 2D).
Using the TIL data, we trained three alternative models (Convolutional Neural Network (CNN), Random Forest, and AdaBoost) on the same task and compared their performance to the above-mentioned neural network model (Table 2). For all investigated metrics, our neural network outcompeted alternative models. Among the alternative models, Random Forest had the highest prediction accuracies (90.34% for PHA-L high , 95.01% for PHA-L low ) and F1 score (0.9251). In contrast, the CNN model had the lowest prediction accuracies (89.82% for PHA-L high , 91.60% for PHA-L low ) and F1 score (0.9065), although the cross-entropy loss was comparable to the standard network. We thus used the neural network model for all subsequent analyses.
Identification of highly predictive genes using SHAP Next, we used SHAP to identify genes important for prediction. For each input sample (cell), the SHAP algorithm calculates an ''SHAP value'' for each feature (gene) that is reflective of the impact of this feature on the expected model output for this input (Lundberg and Lee, 2017). SHAP values can be positive or negative, corresponding to additive or subtractive effects on model output. The median absolute SHAP value is commonly used to assess global feature importance.  Figure 3B), higher expression tended to favor a PHA-L high prediction, whereas the opposite was true for a smaller subset (e.g., Gm42418, Malat1, and Klf2, Figure 3B). As glycogenes directly control glycan biosynthesis (Neelamegham and Mahal, 2016), we also examined the glycogenes (Table S2). The top 30 glycogenes, also ranked by median SHAP value, are shown in Figure 3C, although not all were among the top 10% SHAP genes.
Next, we performed pathway enrichment analysis on the SHAP genes. For both the TIL and LN model, the most enriched pathways (biological processes) were involved in negative regulation of T cell activity and T cell differentiation (TIL: Figures 3D; LN: S3), matching known roles of b1,6-branched glycans in T cells discussed further below. STRING protein interaction analysis also showed the top 10% SHAP genes to be highly interconnected via functional enrichment, co-expression, or direct interaction, arguing for concerted biology ( Figure 3E).
Next, we investigated the variation of SHAP genes with T cell subtypes, generating cell type-specific lists of SHAP genes (Table S3). As shown in Figure 4A, the majority of highly predictive genes were shared across cell subtypes. For example, Ctla4 (cytotoxic T-lymphocyte-associated protein 4) was consistently highly ranked in all cell subtypes (highest rank-naive CD4 + : 13 or top 0.25%; lowest rank-Tex: 165 or top 3.19%). Ctla4 is a Treg marker, with research primarily focusing on its roles in Treg (Rowshanravan et al., 2018;Sobhani et al., 2021;Zheng et al., 2013). Our analysis showed that Ctla4 expression was predictive of glycan phenotype and might influence biological pathways mediated by branched glycans in T cells beyond Treg. iScience Article Nonetheless, we identified some genes that were highly predictive of glycan phenotype only in certain T cell subtypes. A gene is considered specifically important to a T cell subtype if its ranking percentile in this subtype is (i) among the top 2% and (ii) greater than the average ranking percentile in other subtypes by at least 1.25 standard deviations. This identified gene subsets for each cell subtype ( Figure 4B, TIL). Treg, naive CD4, and Tpex exhibited the highest numbers of type-specific highly predictive genes. Some genes play unique biological roles in their corresponding subtypes and are associated with the biology of branched glycans. For example, our cell type-specific comparison ranked Lrrc32 (leucine-rich repeat containing 32; also known as glycoprotein-A repetitions predominant, Garp) as particularly high in Treg. Lrrc32 controls the expression of latent TGF-b (transforming growth factor b) in Treg, and TGF-b signaling is suppressed in Mgat5 knockout mice (Lehmkuhl et al., 2021;Tran et al., 2009;Zhang et al., 2021). We speculate that Lrrc32 may mediate this loss of TGF-b signaling, explaining its importance for prediction in this context. Jund (transcription factor JunD) and Ifngr1 (interferon gamma receptor 1), highly ranked in naive CD4 + T cells, are important for the activation and differentiation of naive CD4 + cells, usually followed by upregulating b1,6-branched glycans (Afkarian et al., 2002;Meixner et al., 2004;Morgan et al., 2004).
Jund and Ifngr1 expression may reflect differential levels of activation in naive CD4 + cells, which could correlate with branched glycan expression. As most other cell subtype-specific genes lack well-characterized roles in the corresponding subtypes of T cells, their biological associations with branched glycans are unclear and constitute opportunities for future research.
SHAP genes explain the biology of b1,6-branched glycans One of our major goals was to see whether the most predictive SHAP genes, identified by this data-driven approach, were implicated in the biology of branched glycans. A highly predictive gene can be biologically associated with b1,6-branched glycans in several ways: (i) encoding a cell surface protein bearing this glycan; (ii) regulating the expression of MGAT5, the enzyme biosynthesizing the b1,6-branch of N-glycans; (iii) being itself regulated by MGAT5/b1,6-branched glycan levels; (iv) being functionally synergistic with b1,6-branched glycans; (v) influencing the biosynthesis of b1,6-branched glycans through, e.g., changing substrate availability. Indeed, many highly predictive SHAP genes (i.e., SHAP ranking in top 2%) were already implicated in b1,6-branched glycan biology in at least one of these ways ( Figures 5 and 6), arguing for high biological relevance of SHAP genes. We detail our findings below.
We primarily focused on interpreting SHAP genes from the TIL dataset because the corresponding model exhibited better prediction, which is likely translated into more reliable interpretations. This T cell population is also biomedically more interesting as they are directly involved in the tumor microenvironment. First, we investigated whether TIL SHAP genes encode T cell surface proteins bearing b1,6branched glycans, as some glycoproteins are more prone to be modified with this glycoform (Breen, 2002;Li et al., 2013;Wu et al., 2019). Out of the top 50 SHAP genes, Pdcd1 (programmed cell death protein 1, PD-1), Ctla4, Itgb1 (integrin beta 1), and Itga4 (integrin alpha 4) can be modified with b1,6branched glycans ( Figure 5A) (Liu et al., 2020;Przybyło et al., 2007;Zhu et al., 2014). Except for Itgb1, these genes had SHAP values increasing with expression, as expected. Others either lack experimental glycosylation data or encode non-membrane proteins. Identifying proteins modified with b1,6-branched glycans among the most predictive genes may inform future investigations on whether other SHAP genes also encode proteins with this glycoform. iScience Article Next, we investigated SHAP genes regulating MGAT5 expression. Only three transcription factors seem to directly upregulate MGAT5 expression: ETS-1, ETS-2, and AP-1 (dimer of c-Jun and c-Fos) (Chen et al., 1998;Ko et al., 1999;Wang et al., 2018). In our SHAP analysis, Ets1, Jun, and Fos were all ranked among the top 2% genes ( Figure 5B), coinciding with their importance in regulating MGAT5 expression. The ranking of Ets2 was lower, potentially due to lower expression ( Figure S4), but still bordering the top 10% range ( Figure 5B). Intriguingly, SHAP values of Ets1 negatively correlated with expression, potentially due to cell-dependent regulation. Indeed, RNA-seq data showed increased Mgat5 transcripts in T cells of Ets1 À/À mouse (Kim et al., 2018), suggesting ETS-1-mediated downregulation in T cells. Beyond transcription factors, we also investigated more upstream cytokine regulators. Three cytokines, interleukin-2 (IL-2), interleukin-7 (IL-7), and interleukin-10 (IL-10), upregulate b1,6-branched glycans in T cells (Grigorian et al., 2012;Smith et al., 2018). Correspondingly, Il2ra, Il10ra, Il2rg, and Il7r, encoding components of the membrane receptors of IL-2, IL-10, IL-2/IL-7, and IL-7, are highly ranked at top 0.93%, 0.95%, 3.79%, and 4.18%, respectively ( Figure 5B). They were also the top four ranked genes out of the 29 interleukin receptor genes here, indicating their strong association with glycan branching. In aggregate, all transcription factors and cytokine receptors regulating MGAT5/branched glycan levels were highly predictive SHAP genes, showing the potency of SHAP analysis to reveal the mechanisms behind the regulation of glycosylation.
Having identified biological associations between highly ranked SHAP genes and branched glycans, we asked whether we could systematically identify functional roles of glycan branching in T cells. We based iScience Article this on previous observations that cells utilize both proteins and protein glycosylation to fulfill their function. For example, fucosylated glycans and sialyl Lewis x on effector T cell surfaces facilitate homing to tumor sites, as do mechanisms mediated by chemokine receptors and integrins (Alatrash et al., 2019;Sackstein et al., 2017). Work over the past two decades has shown b1,6-glycan branching to play an immunosuppressive role in activated T cells. T cell receptor (TCR) activation upregulates MGAT5, yielding more b1,6-branched glycans on T cells. This promotes binding of the multimeric galectin-3 to T cell surface glycoproteins, forming a localized lattice that prevents T cell-activating protein-protein interactions (e.g., TCR-CD8 interaction) (Demetriou et al., 2001;Lau et al., 2007;Morgan et al., 2004). Mice deficient in Mgat5 developed autoimmune disease due to dampened negative regulation of T cell activities (Grigorian and Demetriou, 2011;Silva et al., 2020). In Treg, surface b1,6-branched glycans were positively correlated with immunosuppressive marker expression and the suppressive potency of Treg (Cabral et al., 2017). Ye et al. identified Mgat5 as one of the four hits in a CRISPR screen for targets that enhance T cell-based cancer therapy (Ye et al., 2019). Correspondingly, functional enrichment of SHAP genes showed enrichment in negative regulation of T cell activation ( Figure 3D). Well-established suppressive immune checkpoint receptors were within the top 50 genes, such as Lag3, Ctla4, Pdcd1, and Tigit ( Figure 5D). We also found other highly ranked SHAP genes known for predominantly immunosuppressive roles in T cells, including (i) genes of the killer cell lectin-like receptor family such as Klrc1 (0.31%) and Klrg1 (0.50%) (Huot et al., 2021;Li et al., 2016); (ii) chemokines and chemokine receptors such as Ccl5 (0.10%), Ccr5 (0.21%), and Ccr2 (0.33%) (Aldinucci and Casagrande, 2018;Matsuo et al., 2021;Tu et al., 2020;Zeng et al., 2022); (iii) other genes such as iScience Article Fgl2 (0.64%) and Lgals1 (1.06%) (Corapi et al., 2018;Hou et al., 2021) ( Figure 5D). Overall, SHAP genes had substantial overlap with genes known to be implicated in the biology of branched glycans.
Finally, we examined the SHAP genes of the LN dataset (Table S1, Figure S2). More than half (116 genes) of the 205 SHAP genes of the LN dataset were also found among the 516 SHAP genes of the TIL dataset. Examples include Cd8b1 (0.04% in LN, 0.14% in TIL), Ikzf2 (0.83% in LN, 0.04% in TIL), Gm42418 (0.13% in LN, 0.06% in TIL), and Ets1 (4.33% in LN, 1.16% in TIL). Notably, the negative correlation between the transcript abundance and SHAP values of Ets1 was also seen in the LN dataset. Shared high-ranking genes could be associated with the biology of b1,6-branched glycan in similar ways as discussed above. Some genes had substantially different rankings, which were anticipated since the two datasets had different compositions of T cell subtypes. For example, the highest-ranking gene, Lag3 (lymphocyte activation gene 3), in TIL was ranked at 746 (36.30%) in LN, due to the minimal expression of Lag3 in resting T cells comprising most of the LN dataset (Grosso et al., 2007). The roles of b1,6-branched glycans in resting T cells are much less understood. Mgat5 À/À mice displayed a lowered T cell activation threshold (Demetriou et al., 2001), suggesting that b1,6-branched glycans are also immunosuppressive in resting T cells, by desensitizing them to stimulus. Therefore, we again hypothesized that the SHAP genes of the LN dataset are primarily immunosuppressive, synergizing with b1,6branched glycans. In line with this, SHAP genes of the LN dataset are functionally enriched in the pathway of negative regulation of T cell activity ( Figure S3). The top 50 SHAP genes of the LN dataset also included immunosuppressive genes, such as Cxcr6, Socs1, Cd7, and Klrd1 (Heesch et al., 2014;Pace et al., 2000;Sempowski et al., 2004;Sheu et al., 2005;Takahashi et al., 2011). Stab1, known for its anti-immunosuppression activity in T cells, was ranked at 31 and had SHAP values decreasing with expression (Beyer et al., 2011;Nü ssing et al., 2019;Stephen et al., 2017). One study showed that IL-7 treatment reduced b1,6-glycan branching in resting T cells, in contrast to its effect in activated T cells (Mkhikian et al., 2011). Aligning with this, the SHAP values of Il7r (rank 2.29%) anti-correlated with mRNA expression in the LN dataset in contrast to the correlation observed in the TIL dataset, underscoring the context-aware nature of our approach.
SHAP genes include low-abundance genes that are not captured by differential expression analysis SHAP identifies gene subsets that were not captured by differential expression analysis (DEA) (Yap et al., 2021). We wondered whether this could also be observed here, and whether any SHAP-exclusive genes were biologically relevant. Using the TIL data, we performed DEA between PHA-L high and PHA-L low populations. Applying a false discovery rate threshold of 0.05, we identified 381 differentially expressed genes (7.4% of all genes; Table S4), slightly less than the number of SHAP genes (516 genes or 10% of all genes). iScience Article DEA is known for biasing toward highly expressed genes (Bui et al., 2020;Yang et al., 2019). Among the DEA genes, 18 (4.72%) were highly transcribed ribosomal genes (Petibon et al., 2020), with two even among the top 50 DEA genes. In contrast, only 8 (1.55%) ribosomal genes were found in the SHAP genes, none of which were in the top 50. Other high-abundance housekeeping genes were also found only in DEA genes, such as Gapdh (rank 23), Pgk1 (rank 60), and Actb2 (rank 71). These observations suggest that SHAP is less prone to biasing toward high-abundance genes. Pointedly, transcription factors Jun and Ets2, known to upregulate b1,6-glycan branching as discussed above, did not appear in DEA genes. Moreover, glycogenes in general were absent from DEA genes. Overall, our results indicate SHAP to be more powerful than the more conventional DEA in identifying low-abundance genes that are biologically important for phenotypic glycosylation differences.

Alternative model explaining method resulted in importance genes similar to SHAP genes
To see whether alternative model explaining method can generate similar results to SHAP, we used the permutation feature importance (PFI) method to determine the gene importance values. PFI is a widely used, generic approach for most classifiers (Rajbahadur et al., 2021). In brief, we randomly shuffled the expression values for each gene, while keeping the labels intact. The model then made predictions using the permuted values. Permutation of important genes should be more disruptive, reflected by increases in prediction loss. We used the mean loss from all permutations as the metric of gene importance by this method and generated a ranked list of genes (Table S5). The top 10% genes (517 genes) from this list largely overlapped with the ones from SHAP analysis, with 318 shared genes (61.5%). Most biologically relevant genes mentioned above, such as Ets1, Jun, and Il2ra, were among the top 10% of the PFI-based gene list. However, only one glycogene (St6gal1) was found in the PFI top 10%.
The similar results using PFI align with our expectation, as PFI and SHAP are two approaches to assess the same model. However, SHAP combines global and local explainability, while PFI is a global method and does not offer information on the importance of a gene for a particular prediction. The conventional PFI method also does not yield the direction of the prediction impact of a gene and is computationally costly.

DISCUSSION
In contrast with the profusion of genomic, transcriptomic, and proteomic data, matching glycomic data remain scarce (Qin and Mahal, 2021). Consequently, glycosylation research has long been lacking integrated multi-omics analysis and has not benefited greatly from rapidly evolving computational tools for mapping molecular interaction networks in health and disease (Kellman and Lewis, 2021). State-of-theart artificial intelligence technologies, used widely for drug response prediction or regulatory molecule identification (Adam et al., 2020;Kim et al., 2021;Zheng et al., 2020), remain rarely used in glycosylation studies. Single-cell RNA-and lectin-seq technologies such as SUGAR-seq have started to provide new opportunities to use DL for studying differential glycosylation, as demonstrated here.
It should be noted that the feasibility of predicting glycan features from single-cell transcriptomics data cannot be considered a foregone conclusion. Most genes that would seem relevant from a domain perspective (glycosyltransferases, sugar transporters, etc.) are typically lowly expressed and frequently absent from single-cell transcriptomics data (Nairn et al., 2008;Qiu, 2020). Nonetheless, we show that it is feasible to use the sparse nature of single-cell data to predict a glycan feature with high accuracy using a neural network model. Models predicting ''cross-omics'', in this case from the realm of the transcriptome toward the glycome, will be important in the future to aid in multi-omics integration and are particularly relevant in the context of glycans, as they are technically outside the central dogma of molecular biology. Our results here, however, yet again demonstrate that glycans can be re-integrated with the rest of the central dogma, by using transcriptional information to successfully predict parts of glycan expression.
Using SHAP, we further showed that, next to predicting glycan feature abundance, our model can also be used to extract, at scale, compelling biological associations between the transcriptome and the glycome. Tellingly, the genes most important for glycosylation phenotype prediction significantly overlapped with genes involved in the biology of b1,6-branched glycans. A direct explanation for the high predictivity of some SHAP genes could be that they encode membrane proteins bearing b1,6-branched glycans. iScience Article Although we only identified four genes with experimental evidence for b1,6-branching glycosylation among the top 50 TIL SHAP genes (Pdcd1, Ctla4, Itgb1, and Itga4), branched glycans are involved in functions for two of them: In persisting exhausted T cells, PD-1 interacts with galectin-9, which binds to branched glycans displayed on PD-1, to inhibit galectin-9-mediated cell death . Integrin beta 1 is an essential component of a series of integrin complexes that are critical mediators of T cell adhesion and signaling, and its activities are regulated by b1,6-branching of glycans (Bellis, 2004;Jankowska et al., 2018). We expect that more highly predictive SHAP genes will be discovered to bear b1,6-branched glycans in immune cells, influencing protein function. Lag3, a heavily glycosylated immune checkpoint receptor implicated in many diseases (Graydon et al., 2021), was the most predictive gene in the TIL dataset.
Although the glycosylation profile of this protein has not yet been determined, LAG3 binds galectin-3 in T cells, an immune system lectin with similar binding specificity as PHA-L (Demetriou et al., 2001;Kouo et al., 2015). This indicates that LAG3 function may be regulated by b1,6-branched glycans. Thus, our DL-based approach can inform future investigations on whether other SHAP genes encode proteins with this glycoform and whether it might impact protein function.
While most SHAP genes do not encode surface glycoproteins, they can still be associated with the biology of b1,6-branched glycans in multiple ways, as discussed above. We note that all known transcription factor-and cytokine-regulators of MGAT5/b1,6-branched glycans, as well as some glycogenes impacting b1,6-branched glycan synthesis, were among the top ranked SHAP genes. This emphasizes the potential of a DL-based approach to identify regulatory mechanisms of glycosylation. Examining the top ranked SHAP genes also yielded new candidates for potential regulators of b1,6-branched glycans. The chemokine receptor CCR2 (C-C chemokine receptor type 2; rank 0.33%), when bound to its ligand CCL2 (C-C motif chemokine 2), upregulates the transcription factor AP-1 that increases MGAT5 expression (Fei et al., 2021;Lin et al., 2012). In turn, AP-1 also enhances CCL2 expression (Deng et al., 2013;Fei et al., 2021;Novoszel et al., 2021). Therefore, the CCL2/CCR2 axis may be an alternative mechanism to regulate MGAT5/b1,6-branched glycans in immune cells. In line with the immunosuppressive role of b1,6-branched glycans, CCL2 secreted by cancer cells contributes to an immunosuppressive tumor microenvironment, and blockade of CCR2 in mice improved the efficacy of immune checkpoint therapy (Matsuo et al., 2021;Tu et al., 2020), suggesting a direct biomedically relevant role of this glycan feature.
Our work showed that (i) glycosylation phenotypes can be modeled by neural networks from transcriptomic data, (ii) biological processes linked to post-translational modifications such as glycosylation can be deciphered by DL model-explaining methods such as SHAP, and (iii) this combined approach may facilitate the discovery of new regulators of branched glycans and downstream effectors of branched glycan-mediated pathways. We note that this single model essentially recapitulates decades of experimental work on this aspect of biology, including generated hypotheses for future work. While we report largely overlapping regulatory associations in our set of immune cell types, future work could also compare these results with the regulation and function of b1,6-branched glycans in other cell types or other species, to develop a global understanding of the diverse roles of this glycan feature. We also note that our method yields regulators of b1,6-branching even in the absence of a clear signal for the main enzyme Mgat5, which had low SHAP values (ranked 37.76% in TIL dataset), due to its very low expression level.
We envision a pipeline in which this explainable DL approach is used to analyze data generated by SUGARseq or similar technologies, to decode the biology of less well-studied glycoforms (e.g., high-mannose glycans, sulfated glycans, and I-branched glycans) and their importance in disease (Chuzel et al., 2021;Dimitroff, 2019;de Leoz et al., 2011;Loke et al., 2016;Sun et al., 2022). For this, future work needs to expand the capabilities of the associated lectin-seq, similar to the already reported Glycan-seq technology (Oinam et al., 2021). We also anticipate the integration of more data modalities besides transcriptomes. In RNAseq, measurement of glycogenes and other low-abundance genes is less accurate, potentially skewing the interpretation of the importance of these genes (Tarazona et al., 2011). Adding proteomic or miRNA data (important regulators of low-abundance genes) to the DL model input may provide a more accurate account of the regulation of glycosylation mediated by low-abundance genes (Schmiedel et al., 2015;Thu and Mahal, 2020). We envision that this combination of systems biology and artificial intelligence will provide fresh insights into the complex, interleaved biosynthesis and functional role of glycans in various biological contexts.

OPEN ACCESS
iScience 25, 105163, October 21, 2022 13 iScience Article Limitations of the study It should be carefully considered that our model was trained on a specific kind of cells, murine immune cells, and a specific lectin, PHA-L. We do not expect the trained model as such to immediately generalize to other cell types or lectins, presenting a limitation of this work. For new cell types and/or lectins, the model has to be trained on new data, albeit we envision the same model architecture to be useful for this purpose. As already mentioned, results of our model (SHAP values etc.) will also likely vary with sequencing depth, as important-but-lowly-expressed genes might not always be measured. Lastly, as pointed out in the case of OGT, we are currently unable to conclusively deduce causal mechanisms from these analyses and view the work presented here rather as a method for hypothesis-generation and the synthesis of existing knowledge about a biological process from a systems-perspective.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Permutation feature importance analysis
Each feature (gene) column was randomly permuted while the labels were kept intact. The model made predictions using the permuted expression values and the prediction loss was computed using the binary cross entropy loss function. 25 permutations were performed and the mean loss from all permutations was computed for each gene. The permutation feature importance for a gene is the mean model loss subtracted by the original model loss (i.e., the model loss when making predictions with the original expression values).

QUANTIFICATION AND STATISTICAL ANALYSIS
For differential gene expression analysis, two-tailed Wilcoxon Ranked Sum test was used to determine p values (see Table S4).