Protocol for constructing glycan biosynthetic networks using glycowork

Summary
Glycans, present across all domains of life, comprise a wide range of monosaccharides assembled into complex, branching structures. Here, we present an in silico protocol to construct biosynthetic networks from a list of observed glycans using the Python package glycowork. We describe steps for data preparation, network construction, feature analysis, and data export. This protocol is implemented in Python using example data and can be adapted for use with customized datasets. For complete details on the use and execution of this protocol, please refer to Thomès et al.1


Installation of glycowork and setup of environment
Timing: 10 min
CRITICAL: All code examples in this protocol are intended to be run in a notebook environment such as Google Colaboratory or Jupyter Notebook. If running the code in a regular Python console instead, lines prefaced with a '!' operator must be run in a command-line environment (only relevant for the installation of the necessary dependencies).
Optional: To run the notebook code using Google Colaboratory, skip this optional section and continue at Step 1. Alternatively, perform the following steps to install glycowork into a virtual environment in order to run the notebook locally.
(i) Download and install Miniconda from https://docs.conda.io/projects/miniconda/en/latest/ according to your computer operating system.
(ii) Using the command-line interface of your operating system, run the code below to install glycowork and dependencies into a virtual environment and launch the notebook locally. The files protocol_notebook.ipynb and environment.yml are available as Files S1 and S2, respectively.
(iii) Before running the notebook, comment out the code in Step 1 (as glycowork has already been installed) and modify the code in Step 4 to define an appropriate local file path.
Note: If running the code outside of a notebook environment, generated figures will not be displayed in-line.

1. Install glycowork (Troubleshooting 1).
Optional: If using the built-in module for drawing glycan structures using GlycoDraw,2 the glycowork install needs to be performed as below.
2. Import all necessary helper functions from glycowork.
This step describes how to select a subset of milk glycans from a given species, and how to build and plot the corresponding biosynthetic network.
4. Create a species-specific DataFrame from the full dataset.
Note: Here, the species Lama pacos is used as an example. In the code, 'Lama_pacos' can be replaced by another species of interest. Available species are stored in the species_list variable previously defined.
This step allows pruning of a given network by comparing it to other processed networks.
Note: Evolutionary pruning requires multiple networks stored in a dictionary. Here, we use pre-calculated milk networks from Thomès et al. (2023),1 obtained in step 5 of the Before you begin - Installation of glycowork and setup of environment section.
7. Load the dictionary containing pre-calculated milk networks.

Note:
The net_dic dictionary is structured with species as keys and the corresponding glycan network objects as values. If desired, a similar network dictionary can be prepared by updating an existing dictionary object with the appropriate key-value pairs (i.e., custom_net_dic['key_label'] = network_object).
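The dictionary layout described in this note can be sketched as follows. This is an illustration only: plain placeholder dictionaries stand in for real network objects so that the example runs without glycowork, and the helper add_network, the species keys, and the glycan strings are hypothetical.

```python
# Sketch of a species -> network dictionary, as described in the note above.
# Placeholder dicts stand in for real network objects (an assumption made so
# that the example runs without glycowork installed).

def add_network(net_dic, species, network):
    """Return a copy of net_dic with one species -> network pair added."""
    updated = dict(net_dic)  # shallow copy; the input dictionary is not modified
    updated[species] = network
    return updated

# Hypothetical placeholder networks (node and edge lists only).
net_a = {"nodes": ["Gal(b1-4)Glc-ol"], "edges": []}
net_b = {
    "nodes": ["Gal(b1-4)Glc-ol", "Fuc(a1-2)Gal(b1-4)Glc-ol"],
    "edges": [("Gal(b1-4)Glc-ol", "Fuc(a1-2)Gal(b1-4)Glc-ol")],
}

custom_net_dic = {"Species_a": net_a}
custom_net_dic = add_network(custom_net_dic, "Species_b", net_b)
print(sorted(custom_net_dic))
```

Copying before updating keeps the pre-calculated dictionary intact, which may be convenient when comparing pruning results with and without the added network.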
8. Prune the network from the previous example.
Note: During pruning, alternative reaction paths are compared across all networks and a path is pruned if a significant difference in path likelihood (default threshold, p < 0.01) is observed. For more details, see Thomès et al. (2023).1 The threshold value can be modified to change the stringency of the statistical comparison, and thus, the network pruning.

Analyzing network features
Timing: 10 min
Note: See Table S1: General network features for a full overview of the resulting table.
12. Compare inferred versus observed node degree across networks.
Note: Inferred nodes are referred to as "virtual" nodes in the code. This attribute is encoded as a binary label: 0 = experimentally observed node, 1 = inferred/virtual node.
Optional: Statistical tests can be performed to compare a given feature between observed and virtual nodes using the SciPy module, such as here with a one-sided Welch's t-test:
> ttest_ind(virtuals, reals, alternative = 'greater', equal_var = False)
Note: With a calculated p-value of p < 10⁻⁹, there is a statistically significant difference between the average node degree of observed and virtual nodes.
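To make the comparison concrete, the sketch below splits node degrees by the binary virtual label and computes the Welch t statistic and its Welch-Satterthwaite degrees of freedom using only the Python standard library. The node records are made-up values, and split_degrees and welch_t are hypothetical helper names; the p-value itself requires the t distribution, which SciPy's ttest_ind supplies.

```python
import math
from statistics import mean, variance

def split_degrees(nodes):
    """Split node degrees by the binary 'virtual' label (0 = observed, 1 = inferred)."""
    reals = [n["degree"] for n in nodes if n["virtual"] == 0]
    virtuals = [n["degree"] for n in nodes if n["virtual"] == 1]
    return reals, virtuals

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up node records for illustration only.
nodes = [
    {"degree": 1, "virtual": 0}, {"degree": 2, "virtual": 0},
    {"degree": 3, "virtual": 0}, {"degree": 2, "virtual": 0},
    {"degree": 4, "virtual": 1}, {"degree": 5, "virtual": 1},
    {"degree": 6, "virtual": 1}, {"degree": 7, "virtual": 1},
]

reals, virtuals = split_degrees(nodes)
t_stat, dof = welch_t(virtuals, reals)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A positive t statistic with the virtual nodes passed first corresponds to the direction tested by alternative = 'greater' in the SciPy call.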
13. Extract biosynthetic modules of glycan communities from a single network and highlight in the original network figure.
Note: The get_communities() function returns a merged dictionary of community : glycans in that community.
14. Calculate communities across all milk networks and calculate pairwise Jaccard distances for clustering.
Note: Communities only consisting of lactose are removed as these originate from understudied species where only lactose has been observed.
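The filtering and distance calculation described in this step can be sketched with plain Python sets. This is not the protocol's own code: the community labels, the glycan strings, and the helper names are hypothetical, and a dictionary of community label to member glycans is assumed as input.

```python
from itertools import combinations

LACTOSE = "Gal(b1-4)Glc-ol"  # assumed notation for lactose in this example

def drop_lactose_only(communities):
    """Remove communities whose only member is lactose."""
    return {
        name: members
        for name, members in communities.items()
        if set(members) != {LACTOSE}
    }

def jaccard_distance(a, b):
    """Return 1 - |A intersect B| / |A union B| for two collections of glycans."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def pairwise_jaccard(communities):
    """Return the Jaccard distance for every pair of community labels."""
    return {
        (m, n): jaccard_distance(communities[m], communities[n])
        for m, n in combinations(sorted(communities), 2)
    }

# Hypothetical communities from three species.
communities = {
    "sp1_0": [LACTOSE, "Fuc(a1-2)Gal(b1-4)Glc-ol"],
    "sp2_0": [LACTOSE, "Fuc(a1-2)Gal(b1-4)Glc-ol", "Neu5Ac(a2-3)Gal(b1-4)Glc-ol"],
    "sp3_0": [LACTOSE],
}

filtered = drop_lactose_only(communities)
distances = pairwise_jaccard(filtered)
print(distances)
```

The resulting pairwise distances can be arranged into a square matrix and passed to a hierarchical clustering routine of choice.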

Editing and exporting networks in glycowork
Timing: 10 min
This step describes how to edit networks to highlight select features using built-in glycowork features.
16. Highlighting network conservation by scaling of node size.
a. Construct and prune biosynthetic network using milk glycans from the Diprotodontia order.

Note:
The df_milk table contains glycan information from 172 unique mammalian species.
While we demonstrate highlighting of network features using glycans from the Diprotodontia order, we encourage users to explore the dataset and modify the code to build and annotate different biosynthetic networks.
b. Highlight network conservation across species of the Diprotodontia order.
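As a conceptual illustration of scaling node size by conservation, the sketch below counts in how many species networks each glycan occurs and maps that count linearly onto a size range. It is one plausible way to express the idea under stated assumptions and not the internal logic of highlight_network(); the species keys, the size limits, and both helper functions are hypothetical.

```python
def conservation_counts(net_dic):
    """Count in how many species networks each glycan node occurs."""
    counts = {}
    for nodes in net_dic.values():
        for glycan in set(nodes):
            counts[glycan] = counts.get(glycan, 0) + 1
    return counts

def scaled_node_sizes(counts, n_species, min_size=50, max_size=500):
    """Map each count linearly onto [min_size, max_size] (arbitrary plot units)."""
    sizes = {}
    for glycan, count in counts.items():
        fraction = count / n_species
        sizes[glycan] = min_size + fraction * (max_size - min_size)
    return sizes

LAC = "Gal(b1-4)Glc-ol"
FL = "Fuc(a1-2)Gal(b1-4)Glc-ol"
SL = "Neu5Ac(a2-3)Gal(b1-4)Glc-ol"

# Hypothetical node lists for three species of one taxonomic order.
order_networks = {
    "species_a": [LAC, FL, SL],
    "species_b": [LAC, FL],
    "species_c": [LAC],
}

counts = conservation_counts(order_networks)
sizes = scaled_node_sizes(counts, n_species=len(order_networks))
print(sizes)
```

In this toy example the glycan shared by all three species receives the largest node, which mirrors how conserved structures stand out visually in the highlighted network.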

Note:
The highlight_network() function allows for highlighting of different attributes, including glycan motifs, glycan abundance, glycan conservation (as shown in the above example), and species (for highlighting one species in a multi-species network). See the documentation (https://bojarlab.github.io/glycowork/network.html#highlight_network) for further details.
17. Here, files including information on connectivity and annotation are exported for further modeling and visualization using external software.

Note:
The filepath argument should describe a valid path and file name prefix which will be appended with file description and type upon saving.

Note: The export_network() function generates two files: "prefix_node_labels.csv" containing observed/virtual node label information and "prefix_edge_list.csv" containing connectivity information. The files generated from this function call are included in Table S2: Exported network files.
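The file naming convention in this note can be illustrated with a small standard-library sketch that writes a node table and an edge table from a shared prefix. Only the two file name suffixes come from the note; the function export_sketch, the column names, and the example data are assumptions, and the actual columns written by export_network() may differ.

```python
import csv
import tempfile
from pathlib import Path

def export_sketch(nodes, edges, filepath):
    """Write '<filepath>_node_labels.csv' and '<filepath>_edge_list.csv'."""
    node_file = Path(f"{filepath}_node_labels.csv")
    edge_file = Path(f"{filepath}_edge_list.csv")
    with node_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["node", "virtual"])  # hypothetical column names
        writer.writerows(sorted(nodes.items()))
    with edge_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "target"])  # hypothetical column names
        writer.writerows(edges)
    return node_file, edge_file

# Toy network: node -> virtual label (0 = observed), plus one edge.
nodes = {"Gal(b1-4)Glc-ol": 0, "Fuc(a1-2)Gal(b1-4)Glc-ol": 0}
edges = [("Gal(b1-4)Glc-ol", "Fuc(a1-2)Gal(b1-4)Glc-ol")]

out_dir = Path(tempfile.mkdtemp())  # temporary folder used for the demonstration
node_file, edge_file = export_sketch(nodes, edges, out_dir / "prefix")
print(node_file.name, edge_file.name)
```

Because the prefix is combined with a fixed suffix, the prefix should include the target folder and a base name but no file extension.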

Note:
The exported files can be imported into external graph software, such as Gephi or Cytoscape, for further annotation.

Timing: 10 min
This step describes how glycan types other than milk glycans, including glycolipids and O-glycans, can be used to generate networks.

EXPECTED OUTCOMES
Successfully running the protocol will result in the generation of a list of files and figures. In steps 1-6, an initial biosynthetic network for milk glycans from Lama pacos is generated (Figures 1 and 2). In steps 7-10, this network is further pruned using evolutionary information obtained from the full dataset (Figure 3). Steps 10-15 describe the analysis of network features, including node degree (Figures 4 and 5), general network statistics (Table S1), and community detection (Figures 6 and 7).
Step 16 showcases how to highlight network features within glycowork (Figure 8), while step 17 allows export of network files for further analysis and annotation using external software (Table S2).

LIMITATIONS
As different glycan types are synthesized through different pathways, a network can only connect glycans of the same type. Therefore, any attempt to construct a network from a heterogeneous list of glycan types will result in a failure of this protocol.
In addition to the available computational power, two factors strongly influence the run time when constructing a network: the number of glycans in the initial list, and the maximum metabolic distance between these oligosaccharides. The first factor is intuitive, as more glycans require more calculations and constitute additional entities to handle; the second is less straightforward. If few glycans are observed and they are separated by many biosynthetic steps, meaning that they are of very heterogeneous length and/or composition, computing all possible intermediates to generate a single network becomes more laborious, ultimately resulting in a longer run time. To counteract this, glycowork will not try to connect parts of the network that have more than five missing intermediates in a row (i.e., it would take more than five reactions to connect a large observed structure to any smaller observed structure). This may sometimes result in incompletely connected networks.
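Before building a network, it may help to screen the input list for structures that sit far from every smaller observed glycan. The sketch below uses the difference in residue count as a crude proxy for metabolic distance; this is an assumption, since actual network construction depends on substructure relationships and not on size alone. The helper names and the IUPAC-condensed-style example strings are hypothetical.

```python
import re

MAX_GAP = 5  # limit of consecutive missing intermediates quoted in the text

def count_monosaccharides(glycan):
    """Count residues in an IUPAC-condensed-style string by removing linkages."""
    flat = glycan.replace("[", "").replace("]", "")  # ignore branch brackets
    parts = re.split(r"\([^)]*\)", flat)             # split on linkages such as (b1-4)
    return len([part for part in parts if part])

def flag_large_gaps(glycans, max_gap=MAX_GAP):
    """Return glycans exceeding every smaller observed glycan by more than max_gap residues."""
    sizes = {glycan: count_monosaccharides(glycan) for glycan in glycans}
    flagged = []
    for glycan, size in sizes.items():
        smaller = [s for s in sizes.values() if s < size]
        if smaller and size - max(smaller) > max_gap:
            flagged.append(glycan)
    return flagged

observed = [
    "Gal(b1-4)Glc-ol",
    "Fuc(a1-2)Gal(b1-4)Glc-ol",
    "Gal(b1-4)GlcNAc(b1-3)Gal(b1-4)GlcNAc(b1-3)Gal(b1-4)GlcNAc(b1-3)"
    "Gal(b1-4)GlcNAc(b1-3)Gal(b1-4)Glc-ol",
]

flagged = flag_large_gaps(observed)
print(flagged)
```

Structures flagged in this way are candidates for ending up in a disconnected part of the network, or for increasing run time, and may deserve a closer look before network construction.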

TROUBLESHOOTING

Problem 1
Failure to install glycowork (Before you begin - Installation of glycowork and setup of environment).

Potential solution
Check the version of Python installed.
In notebook: In console.
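The running interpreter version can also be inspected from within Python itself, which works identically in a notebook and in a console. In the sketch below, the minimum version is an assumed placeholder and python_is_supported is a hypothetical helper; the package documentation should be consulted for the actual requirement.

```python
import sys

# Assumed placeholder; consult the package documentation for the real minimum.
ASSUMED_MIN_VERSION = (3, 8)

def python_is_supported(current=None, minimum=ASSUMED_MIN_VERSION):
    """Return True if the (major, minor) version is at least the given minimum."""
    if current is None:
        current = sys.version_info[:2]
    return tuple(current) >= tuple(minimum)

print("Running Python", sys.version.split()[0])
print("Meets assumed minimum:", python_is_supported())
```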

Potential solution
It is necessary to mount Google Drive in order to access files through Google Colaboratory.
Make sure that the fp variable is correctly defined (default: fp = 'drive/My Drive/glycan_networks/') and that all necessary files are placed in a folder named "glycan_networks" in Google Drive. If running the notebook locally, the fp variable must be edited to point to a local folder containing the necessary files.
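A quick way to verify the folder contents is to list which expected files are absent from the location that fp points to. The sketch below demonstrates this on a temporary folder; missing_files is a hypothetical helper, and the second file name is a made-up placeholder to be replaced with the files actually required.

```python
import tempfile
from pathlib import Path

def missing_files(fp, required):
    """Return the names in 'required' that are not files inside folder fp."""
    folder = Path(fp)
    return [name for name in required if not (folder / name).is_file()]

# Demonstration with a temporary folder standing in for 'glycan_networks'.
demo_dir = Path(tempfile.mkdtemp()) / "glycan_networks"
demo_dir.mkdir()
(demo_dir / "TableS2.csv").write_text("placeholder\n")

# 'hypothetical_networks.pkl' is a made-up name used only for this example.
required = ["TableS2.csv", "hypothetical_networks.pkl"]
missing = missing_files(demo_dir, required)
print("Missing files:", missing)
```

An empty list indicates that every expected file was found; any name returned points to a file that still needs to be copied into the folder.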

Potential solution
Check the list of glycans used for network generation and filter if necessary. The construct_network() function expects a set of related glycan structures, all belonging to the same class. Further, a very large set of glycans and/or a set of several glycans separated by many biosynthetic steps may result in prohibitively long run times depending on the available processing power.
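One way to check an input list for mixed glycan classes is to group structures by their reducing end before network construction. The rules below are a simplified heuristic assumed for illustration, covering only a lactose core and a reducing-end GalNAc; they are not a glycowork function, and CLASS_RULES, guess_class, and group_by_class are hypothetical names that should be adapted to the notation of the dataset at hand.

```python
# Hypothetical reducing-end rules, checked in order; adapt to your own data.
CLASS_RULES = [
    ("milk", ("Gal(b1-4)Glc", "Gal(b1-4)Glc-ol")),
    ("O-glycan", ("GalNAc",)),
]

def guess_class(glycan):
    """Return the first class whose reducing-end suffix matches, else 'unknown'."""
    for label, suffixes in CLASS_RULES:
        if glycan.endswith(suffixes):
            return label
    return "unknown"

def group_by_class(glycans):
    """Group glycan strings by their guessed class."""
    groups = {}
    for glycan in glycans:
        groups.setdefault(guess_class(glycan), []).append(glycan)
    return groups

mixed = [
    "Gal(b1-4)Glc-ol",
    "Fuc(a1-2)Gal(b1-4)Glc-ol",
    "Gal(b1-3)GalNAc",
]

groups = group_by_class(mixed)
if len(groups) > 1:
    print("Mixed classes found; build one network per class:", sorted(groups))
```

Each resulting group can then be passed to network construction separately, and any structures labeled "unknown" can be inspected by hand.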

Problem 4
Failure to plot network (step-by-step method details - Build network with inferred intermediate structures).

Potential solution
The plot_network() function requires the optional networkx dependencies pydot and graphviz to be installed.
> conda install graphviz
> pip install pydot
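Whether both dependencies are visible to the current environment can be checked from Python with the standard library. The sketch below looks for the pydot package and for the Graphviz dot executable on the system path; plotting_dependencies is a hypothetical helper, and the check assumes that Graphviz exposes its usual dot command.

```python
import importlib.util
import shutil

def plotting_dependencies():
    """Report whether pydot is importable and the Graphviz 'dot' program is on PATH."""
    return {
        "pydot (Python package)": importlib.util.find_spec("pydot") is not None,
        "graphviz ('dot' executable)": shutil.which("dot") is not None,
    }

status = plotting_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, installing it with the commands above and restarting the notebook kernel is a reasonable next step.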

Figure 1. Screenshot of TableS2.csv (df_milk). This table contains milk glycan structures and associated species information.
Finally, steps 18-19 give examples of network generation using alternative glycan classes (Figures 9 and 10).

Figure 2. Biosynthetic network of Lama pacos milk glycans. Nodes represent glycan structures (blue: observed; orange: inferred). The milk glycan lactose core (Galb1-4Glc) is located at the top of the network. Edges indicate the enzymatic reaction, i.e., the addition of a monosaccharide or chemical modification, separating two nodes. Viewing the generated network in a notebook allows for mouse-over of nodes to reveal the identity of each structure.

Figure 3. Pruned biosynthetic network of Lama pacos milk glycans. Glycan structures and enzymatic reaction steps are represented as described in Figure 2.

Figure 4. Pruned Lama pacos network statistics. Degree indicates the total number of connections a node exhibits. Node degree analysis yields insights into overall network connectivity and can provide hypotheses for biosynthetic flux of intermediate structures.

Figure 5. Comparison of node degree statistics between observed and virtual nodes across the full dataset. Data are depicted as mean values, with box edges showing quartiles and whiskers representing the remaining data distribution. Data points outside the whiskers are outliers and marked with dots.

Figure 6. Lama pacos network communities. (A) Pruned Lama pacos network. (B) Pruned Lama pacos network with node colors indicating community identity. Nodes overlapping multiple communities are colored in light gray.

Figure 7. Hierarchical clustering of pairwise Jaccard distances between communities across the full dataset. The clustering reveals three highly conserved biosynthetic modules containing (i) N-acetyllactosamine-based MOs, (ii) MOs derived from the progenitor GlcNAcb1-3Galb1-4Glc, and (iii) MOs starting from the progenitor GlcNAcb1-6Galb1-4Glc. Further insights into these three modules are discussed in Thomès et al. (2023).1

Figure 8. Biosynthetic network of milk oligosaccharides from the Diprotodontia order. Glycan structures and enzymatic reaction steps are represented as described in Figure 2. Node size is scaled by degree of conservation across the order of Diprotodontia.
These dependencies are bundled with glycowork or come pre-installed when working in a Google Colaboratory environment. When working in a local Python console, install (e.g., pip install ...) the necessary packages prior to importing.
4. Set the working file path, assuming the presence of a folder named "glycan_networks" in Google Drive.