Protocol for hybrid flux balance, statistical, and machine learning analysis of multi-omic data from the cyanobacterium Synechococcus sp. PCC 7002

SUMMARY
Combining a computational framework for flux balance analysis with machine learning improves the accuracy of predicting metabolic activity across conditions, while enabling mechanistic interpretation. This protocol presents a guide to condition-specific metabolic modeling that integrates regularized flux balance analysis with machine learning approaches to extract key features from transcriptomic and fluxomic data. We demonstrate the protocol as applied to Synechococcus sp. PCC 7002; we also outline how it can be adapted to any species or community with available multi-omic data. For complete details on the use and execution of this protocol, please refer to Vijayakumar et al. (2020).

BEFORE YOU BEGIN
Generating a genome-scale view of metabolic activity is a valuable step for many biologists, requiring the construction of a computational model that can be adapted to suit the purpose of each analysis by integrating additional omic data to simulate specific genetic or environmental conditions (Vijayakumar et al., 2018). Metabolic networks must be converted into a mathematical format that is both amenable to predictive modeling and able to effectively convey the functional state or behavior of the cell at a multi-systems level (Yurkovich and Palsson, 2015). To this end, genome-scale metabolic models (GSMMs) are mathematical representations of all known biochemical reactions and transmembrane transporters that occur within a living system. They provide a comprehensive view of all metabolic processes by recording and quantifying their flux, which can be defined as the rate of metabolic turnover or conversion of reactants into products (Palsson, 2015). Several methods for constraint-based reconstruction and analysis (COBRA) can be used to simulate flux through metabolic networks at the whole-genome scale (Bordbar et al., 2014). Among these, flux balance analysis (FBA) is a technique that utilizes linear programming to predict flux through all reactions in the metabolic network by locating a set of values in the solution space that best satisfies a given objective function representing the main metabolic goal of the cell (Reed, 2012; Dusad et al., 2020).
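To make the linear-programming core of FBA concrete, the following sketch solves a three-reaction toy network with SciPy. The network, bounds and reaction names are hypothetical and are not part of the Synechococcus model; this is an illustration of the optimization problem, not the protocol's MATLAB implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Toy stoichiometric matrix (rows: metabolites A, B; columns: reactions R1-R3)
# R1: uptake -> A, R2: A -> B, R3: B -> biomass (all hypothetical)
S = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])

# Flux bounds: uptake capped at 10 (arbitrary units), internal fluxes loosely bounded
bounds = [(0, 10), (0, 1000), (0, 1000)]

# Maximize flux through R3 (the "biomass" objective) = minimize -v3
c = np.array([0.0, 0.0, -1.0])

# The steady-state constraint S.v = 0 defines the feasible solution space
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
v_opt = res.x
growth = -res.fun  # optimal biomass flux, limited here by the uptake bound
```

In this toy case the optimum is fully determined by the uptake bound, so the solver returns a flux of 10 through every reaction.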
With the rapid advent of high-throughput technologies, supplementation of GSMMs with multidimensional omic data describing various levels of biological organization provides the opportunity to trace molecular components across multiple functional states and record their interactions (Blazier and Papin, 2012; Ebrahim et al., 2016; Li et al., 2018). However, the quality of available experimental datasets can severely limit the predictive power of the model (Yurkovich and Palsson, 2018). To this end, many recent studies have combined machine learning analyses with metabolic modeling (Nandi et al., 2017; Yaneske and Angione, 2018; Costello and Martin, 2018; Guebila and Thiele, 2019; Yang et al., 2019; Culley et al., 2020; Zhang et al., 2020a). Given the difficulty of extracting information from multi-omic datasets, machine learning algorithms serve to reduce dimensionality and elucidate cross-omic relationships (Cuperlovic-Culf, 2018). Additionally, machine learning algorithms and constraint-based models share complementary characteristics and common mathematical bases that make them well suited to being combined. On one hand, GSMMs can provide critical data in terms of stoichiometry and the genetic control of biochemical reactions. On the other hand, machine learning can deconstruct biological complexity by extracting relevant outputs from data. Together, they improve omic-based statistical and machine learning analyses by enriching the learning process with biological knowledge and refining phenotypic predictions (Zampieri et al., 2019; Volkova et al., 2020; Kim et al., 2020).
This protocol presents a series of steps that apply the principles of constraint-based metabolic modeling, multi-omic data integration and machine learning to analyze a genome-scale metabolic model of Synechococcus sp. PCC 7002 (summarized in Figures 2 and 3). Following this framework, the main stages comprise regularized flux balance analysis to observe flux response between growth conditions, as well as principal component analysis, k-means clustering, LASSO regression and correlation analysis to reduce dimensionality and extract key features from transcriptomic and fluxomic data. Through this synergistic approach, our goal is to achieve better characterization of metabolic activity across conditions by predicting the phenotypic response. We begin our protocol by presenting a brief summary of the software programs that must be installed prior to completing the main stages of analyses in Installation. Following this, we describe critical steps for the preparation of the chosen genome-scale metabolic model (GSMM) (Preparation of Metabolic Model) and the transcriptomic data (Preparation of Transcriptomic Data) for flux balance analysis. Preprocessing of transcriptomic data involves the conversion of reads per kilobase million (RPKM) into fold change values, which serves two purposes. First, each growth condition is normalized relative to the standard control within its dataset, allowing the integration of profiles relating to each growth condition during FBA. Second, calculating fold changes centered around 1 serves to facilitate comparisons between transcript and flux data when they are concatenated during later stages of analysis (PCA, LASSO and correlation).
Inputs and outputs for datasets used in each analysis are listed in Table 1.

Installation
Timing: 1-2 h

All installations can be run using Linux, Mac or Windows operating systems, but this protocol is mainly based on the Windows platform. For full instructions on installing the COBRA Toolbox on Mac and Linux, we refer the reader to the steps provided at: https://opencobra.github.io/cobratoolbox/stable/installation.html.
1. If needed, install the latest version of MATLAB.
a. The MATLAB programming language can be downloaded from https://uk.mathworks.com/downloads/web_downloads/. Following registration, a free 30-day trial can be requested from https://uk.mathworks.com/campaigns/products/trials.html.
b. For a permanent installation, an associated license can be purchased for use by commercial or government organizations, degree-granting institutions, or individuals. Several universities and research organizations provide access to MATLAB through a centralized, campus-wide license.

Preparation of metabolic model
Timing: 2 weeks to 1 month

Any organism with a baseline GSMM and available transcriptomic data can be analyzed using this protocol. The COBRA Toolbox is a popular module for constraint-based reconstruction and analysis of metabolic networks in MATLAB (Heirendt et al., 2019). In most cases, models are written in the Systems Biology Markup Language (SBML) to ensure compliance with the COBRA modules used for analysis (Keating et al., 2006). In this instance, we convert the model directly into .mat format for analysis in MATLAB using the COBRA Toolbox (the resulting model is shown in Figure 1).
Many GSMMs are publicly available in online repositories such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016), the Biochemical Genetic and Genomic (BiGG) knowledge-base (Norsigian et al., 2020), the BioCyc collection of pathway/genome databases (Karp et al., 2019), MetaNetX (Moretti et al., 2021) and the ModelSEED and PlantSEED databases (Devoid et al., 2013;Seaver et al., 2014). The preparation of these models for flux balance analysis involves the automated reconstruction of all metabolic reactions taking place in the organism, supplemented by the functional annotation of genes, metabolites and pathways. This is usually followed by extensive manual curation and gap-filling (Prigent et al., 2017), the extent of which is subject to the quality of the initial model reconstruction (Lieven et al., 2020). Furthermore, predictions obtained from GSMMs can be reconciled with in vivo findings and used to identify current gaps in our knowledge of metabolism (Mienda, 2017). However, there are often inconsistencies that must be reconciled between models and experimental data that would otherwise result in outcomes that are falsely predicted by the model (false positives) or experimentally observed outcomes that the model fails to predict (false negatives).

OPEN ACCESS
Note: In order to relate genes, metabolites and reactions during FBA, the GSMM must contain a field of logical gene-protein-reaction (GPR) association rules. These rules record the involvement of every gene in every reaction of the metabolic network and must be adjusted when integrating new data that record differential gene expression under various conditions.
Note: Although the field fbamodel.rules already exists within the model, running compute_reaction_expression.m creates the field fbamodel.grRules (a string representation of the GPR rules), which will be solved mathematically at the stage of omic data integration. As these new rules do not contain parentheses, it must be manually ensured that AND is solved before OR when substituting MIN and MAX, respectively; in the final expression, the MINs must be calculated before the MAXs. The function associate_genes_reactions.m, called by compute_reaction_expression.m, substitutes the ORs first (which become MAXs), and then the ANDs inside the MAXs. This generates an expression that first solves the ANDs (within an internal loop) and then solves the ORs (within an external loop).
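The MIN/MAX substitution described in this note can be sketched in Python (the protocol itself performs it in MATLAB; the rule string and expression values below are hypothetical). ANDs are evaluated as MIN in the inner loop, ORs as MAX in the outer loop:

```python
def eval_gpr(rule, expression):
    """Evaluate a parenthesis-free GPR rule: AND -> MIN (inner), OR -> MAX (outer)."""
    or_terms = rule.split(" or ")              # outer loop: ORs become MAXs
    values = []
    for term in or_terms:
        genes = term.split(" and ")            # inner loop: ANDs become MINs
        values.append(min(expression[g] for g in genes))
    return max(values)

# Hypothetical fold-change values for four genes
expression = {"g1": 0.5, "g2": 2.0, "g3": 1.5, "g4": 3.0}

# "g1 and g2 or g3 and g4" -> max(min(g1, g2), min(g3, g4))
reaction_level = eval_gpr("g1 and g2 or g3 and g4", expression)
```

Here the rule evaluates to max(min(0.5, 2.0), min(1.5, 3.0)) = 1.5, matching the precedence where ANDs are solved before ORs.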
6. Create new fields within the model for grRules and two flux objectives (f and g) that will be specified in flux balance analysis:
7. Match the parsing of gene IDs in the transcriptomic data with those listed in fbamodel.genes:
CRITICAL: Steps 6 and 7 only apply when creating a new GSMM, as it must be ensured that a new grRules field is written in the model to link gene IDs in the omic dataset with those in the model. When applying the steps to a new model or data, it is important to ensure the consistency of gene names between external data and the GSMM, but modelers wishing to run the analysis for the Synechococcus GSMM only need to load the variables already saved in the code repository.
8. As stated previously, manual curation of all model fields, including genes, reactions, metabolites and subsystems, is necessary prior to performing FBA to ensure the validity of biological outputs. In particular, subsystems within the model may be known by multiple names or annotated inconsistently. It is also possible, as in our case, that several reactions are assigned multiple subsystems or even none at all. In the case of reactions, we create a new array of subsystem names that are modified to account for reactions classified by more than one subsystem:
9. Since it has been used to differentiate multiple subsystems associated with single reactions in fbamodel.subSystems, the word 'and' can be used as a string delimiter to divide subsystem names across a cell array of separate strings:
CRITICAL: Parsing strings at the correct positions within single subsystems and removing any trailing spaces and blank cells after name replacement are essential to ensure consistency and match strings accurately within subsystem names during Pathway-level PCA (optional) and Pathway-level correlation analysis (optional).
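The 'and'-delimiter parsing in step 9 can be sketched as follows (Python for illustration; the protocol performs this in MATLAB, and the subsystem names below are hypothetical). Note that genuine subsystem names containing the word 'and' would need special handling:

```python
# Hypothetical fbamodel.subSystems entries; " and " delimits multiple
# subsystems assigned to a single reaction, and some entries are blank
sub_systems = ["Glycolysis", "TCA cycle and Oxidative phosphorylation", ""]

parsed = []
for s in sub_systems:
    # Split on the delimiter, strip trailing spaces, drop blank cells
    names = [name.strip() for name in s.split(" and ") if name.strip()]
    parsed.append(names)
```

After parsing, each reaction maps to a clean list of subsystem names, which is what the later pathway-level PCA and correlation steps match against.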

Preparation of transcriptomic data
Timing: 2 weeks to 1 month

The transcriptomic profiles utilized in this study originate from three studies conducted by Ludwig and Bryant (2011, 2012a, 2012b) that sequenced RNA reads for Synechococcus sp. PCC 7002 cells grown under different conditions (detailed in Table 2). Following their generation via SOLiD sequencing, the study by Yang et al. (2015) describes how these data were preprocessed prior to their inclusion in our protocol. The reads obtained from the NCBI Sequence Read Archive (SRA) were filtered to eliminate low-quality reads and aligned against the Synechococcus genome using Burrows-Wheeler Aligner (BWA) software. Following this, sequences that did not map to the reference genome, that mapped to rRNA-coding regions or that aligned to more than one region were eliminated. The remaining uniquely mapped genes were converted into reads per kilobase million (RPKM) and fold change values. Starting from RPKM values (stored in Datasets 1 and 2), we begin by recalculating fold changes as values centered around 1. As outlined in Before You Begin, this ensures a more convenient comparison between transcript and flux data when they are concatenated, and also between all growth conditions, including the standard controls within each separate dataset, which were averaged over three replicates.
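The fold-change recalculation can be sketched numerically (Python for illustration; the RPKM values are hypothetical, and the first three columns stand in for the averaged control replicates):

```python
import numpy as np

# Hypothetical RPKM matrix: rows = genes, columns = samples.
# The first three columns are the standard-control replicates for this dataset.
rpkm = np.array([[ 5.0,  6.0,  7.0, 12.0,  3.0],
                 [10.0, 10.0, 10.0, 10.0, 20.0]])

# Average the control over its three replicates
control_mean = rpkm[:, :3].mean(axis=1)

# Fold change relative to the control; values centre around 1, so an
# unchanged gene scores ~1, facilitating later comparison with flux data
fold_change = rpkm[:, 3:] / control_mean[:, None]
```

Each remaining column now holds fold changes for one growth condition relative to its dataset's control.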

CRITICAL: In this instance, all transcriptomic reads were obtained from studies conducted in tandem (with the same number of samples). For omic data obtained from multiple sources/studies that require additional normalization, see troubleshooting problem one.

MATERIALS AND EQUIPMENT
Throughout this work, a Lenovo G50-30 80G0 model laptop computer running the Microsoft Windows 10 Home operating system was used. This computer has a 500 GB solid-state drive, an Intel Pentium N3530 CPU @ 2.16 GHz (1,333 MHz memory speed and 4 cores) and 4 GB of Random Access Memory (RAM). However, any reasonably up-to-date computer may be used to run all code, and any operating system can be used: Windows, Mac OS, or Unix/Linux.

MATLAB (MathWorks: https://www.mathworks.com/products/matlab)

Alternatives: While the current implementation applies the COBRA Toolbox in MATLAB, the package is extendable to any other platform that supports COBRA (such as Python, Julia and Mathematica, as well as Linux, Windows and Mac binaries). A full list is available from: https://opencobra.github.io/.

STEP-BY-STEP METHOD DETAILS
In this section, a comprehensive step-by-step protocol is laid out for running the flux balance analysis of Synechococcus sp. PCC 7002, followed by principal components analysis, k-means clustering, LASSO regression and finally, correlation analysis. Each of these stages comprises a series of inputs and outputs, as well as intermediary processes that transform each type of data (see Figure 2). Critical steps for running the code and troubleshooting are interspersed between these steps and further elaborated in the troubleshooting section. All steps described in the code are case-specific, but they can easily be adapted to any transcriptomic dataset or GSMM that the user wishes to analyze.

Flux balance analysis
Timing: <15 min

Note: During flux balance analysis, a single objective is usually specified for optimization within the field fbamodel.c. Using different solvers to perform the same optimization can cause solutions to vary, owing to differences in numerical implementation and the existence of multiple optimal solutions in the solution space. Calculating a unique solution using quadratic optimization is therefore more reliable when the flux distribution is intended for use in further analyses. To this end, minimizing the sum of squared flux values (L2 norm) carried by the metabolic network, following maximization of the primary objective, guarantees a unique set of flux solutions drawn from a strictly convex space (Angione, 2019).

This section lists the major processes and steps for running a regularized flux balance analysis that maximizes pairwise objective functions in a bi-level fashion with a penalty term that considers the norm-2 of the flux vector (Heirendt et al., 2019). Bi-level regularized FBA is conducted in MATLAB using the quadratic programming solver Gurobi to compute flux distributions by selecting pairs of reactions in the GSMM to act as flux objectives (i.e., by selecting reactions within fbamodel.f and fbamodel.g, as detailed in Figure 4). Subsequently, 24 condition-specific growth profiles of Synechococcus sp. PCC 7002 are generated by integrating omic data relating to different environmental conditions, and three pairs of reactions are optimized for each of these profiles, namely: (i) Biomass - ATP maintenance, (ii) Biomass - Photosystem I and (iii) Biomass - Photosystem II.
Note: When calculating the flux distribution across conditions, the biomass reaction was chosen as the primary objective, while the secondary objective was set to ATP maintenance, photosystem I or photosystem II reactions in order to reflect the main cellular goals of cyanobacteria. In our case, the carbon-limited biomass reaction has been chosen as a primary objective to represent the maximization of growth rate and cellular yields (Feist and Palsson, 2010;Yuan et al., 2016;Lakshmanan et al., 2019), which is a critical consideration for cyanobacteria as this informs the substrate uptake rates and maintenance requirements that indicate fundamental cellular growth requirements. The chosen secondary objectives are key pathways involved in energy metabolism during photosynthesis. Simulating the cost of ATP maintenance helps to assess the energy required for sustaining metabolic activity even in the absence of growth. The incorporation of the photoexcitation reactions occurring within photosystems I and II serves to characterize how flux under various conditions reflects the light harvesting and energy transfer via photon absorption through these complexes. Thus, solving the quadratic optimization problem for multiple pairs of objectives helped to resolve tradeoffs by considering the conditions and constraints affecting each of these objectives.
It has been established that the activity of biosynthetic and energy-generating pathways increases with the growth rate (Bernstein et al., 2014), which led us to implement multi-level regularized FBA in our pipeline, considering more than one objective function. This allows us to examine the effect of maximizing biomass using regularized flux balance analysis, followed by the maximization of flux through ATP maintenance and photosynthetic reactions. Performing the FBA in this manner has a relatively low computational cost, taking approximately 0.9-1.69 s per growth condition, and 43.53 s to run the entire script.
Note: As an alternative to regularized FBA, we also provide a critical step detailing how users can employ flux variability analysis (FVA) to obtain minimal and maximal flux ranges for each growth condition. The full details for running the analysis are contained in the script RUN_all.m.
1. First, we load the required variables within a local directory available to MATLAB:
2. We then specify variables for the genes within the model and those included in the transcriptomic data:
3. Assign indices for selecting the objective function(s) to be optimized during flux balance analysis:
CRITICAL: Although a large number of studies express the maximization of biomass as the only objective when performing FBA, it is important to recognize that, in reality, most organisms have multiple objectives to satisfy. Depending on the goal of the flux simulation, any reactions within the metabolic network reflecting a property of interest that must be optimized by the cell can be selected as objective functions via vector indexing. Within each pair of objectives, the primary flux objective fbamodel.f is fixed as the standard biomass reaction (fbamodel.rxnNames = 735) since it reflects the universal property of cellular growth maintenance, whereas the secondary flux objective fbamodel.g is manually switched between the reactions for ATP maintenance (fbamodel.rxnNames = 70), Photosystem I (fbamodel.rxnNames = 698) or Photosystem II (fbamodel.rxnNames = 697) to examine processes relating to energy metabolism and photosynthesis. As an alternative approach, users may also wish to force flux by increasing the lower bounds of reactions to ensure a minimum flux through pathways of interest, although in general this would not allow the user to find solutions that maximize their usage.
CRITICAL: Before applying gene-expression-derived constraints during FBA, additional boundary constraints based on the varying metabolic capability of cells under different growth conditions (stored in bounds.mat) are used to modify lower and upper bounds in the model (fbamodel.lb and fbamodel.ub), thus shrinking the solution space and refining the phenotypic prediction of metabolic activity. For all experimental conditions, a series of uptake and secretion rates are adjusted in the GSMM prior to performing FBA, taking into account: (i) composition of growth media, i.e., limitation/supplementation of trace elements such as nitrogen, sulfur, iron and phosphorus, (ii) optical density at which cells were harvested (OD 730nm = 0.4/0.7/1.0/3.0/5.0), (iii) mode of energy utilization (phototrophy/heterotrophy/mixotrophy), (iv) availability of oxygen/carbon dioxide (low O2, low CO2, oxic/anoxic), (v) light intensity (dark or high light), (vi) temperature (22°C, 30°C, heat shock), (vii) salinity (low/high). This enables a more unique characterization of each growth condition.
Note: For example, the bounds adjusted in our model are specified in Table 3, where a list of uptake and secretion rates (i.e., lower and upper bounds recorded in fbamodel.lb and fbamodel.ub, respectively) for various exchange reactions are fixed at different values according to the growth conditions under which the Synechococcus cells were cultured and harvested (Ludwig and Bryant, 2011, 2012a, 2012b). Apart from glycerol in the mixotrophic condition, lower bounds for other carbon sources (maltohexaose, maltopentaose, maltotriose, maltotetraose, maltose) and carbonate are set to zero for all conditions. g represents the photon exchange reaction, whose lower bounds are determined using the calculation specified in Equation 1.
Note: To specify the variation in light uptake across growth conditions, we calculated a photon uptake rate (P_U) for each growth condition using a method similar to Vu et al. (2012). In this calculation, light consumption (LC) under each condition (mmol) is multiplied by the surface area (SA) of the culture exposed to the light source (m^2); the product is subsequently divided by the total available dry cell weight (DCW) of the culture (grams per volume) as follows:

P_U = (LC × SA) / DCW (Equation 1)

In this instance, the surface area of the culture exposed to the light source was calculated using the diameter of the cylindrical culture tube and the volume of the culture medium (Ludwig and Bryant, 2011), but users are advised to consider the shape and capacity of the vessel used to culture the cells in their own experimental setting when calculating this value.
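A numerical sketch of the photon uptake calculation (Equation 1); all values below are hypothetical placeholders, not measurements from the protocol:

```python
# Hypothetical values for one growth condition (units follow the protocol text)
light_consumption = 0.2   # LC: light consumed under this condition (mmol)
surface_area = 0.005      # SA: culture surface exposed to the light source (m^2)
dcw = 0.5                 # DCW: total available dry cell weight of the culture (g)

# Photon uptake rate, used to set the lower bound of the photon
# exchange reaction for this condition
photon_uptake = light_consumption * surface_area / dcw
```

The resulting value would be written into fbamodel.lb for the photon exchange reaction of the corresponding condition-specific model.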
Note: If conducting growth experiments to directly measure light availability and DCW in vivo is not possible, users can refer to the literature to find the closest estimates available for their model species. In our case, we use an approximation for the DCW of marine Synechococci (Myers et al., 2013), which was confirmed to be in the same range of values as other Synechococci (Aikawa et al., 2014; Qiao et al., 2018; Kato et al., 2017). Alternatively, a piecewise linear approximation can be adopted to extrapolate the line, calculate its gradient and obtain the growth rate.
CRITICAL: An example of the expected output for running the script RUN_all.m is provided in Figure 5. After flux rates have been calculated for all growth conditions, the results can be plotted as a simple bar chart where they are re-scaled as values between 0-1 (see Figure 6 for sample plotting commands and Figure 7 for the resulting plot).
14. For flux variability analysis, the mean of minimal and maximal flux vectors for different conditions can be calculated as follows:

Creation of multi-omic dataset

Timing: <10 min

In our analyses, gene transcripts constitute a vital component of the flux balance analysis, since transcriptomic data are integrated into the GSMM to determine condition-specific flux values. Although partially based on transcriptomics, flux rates are additionally subject to condition-specific GSMM constraints, the steady state, and their underlying biochemistry. This automatically creates a component of nonredundant information that does not exist in the transcriptomic dataset. Generating flux data supplies more layers of information to further refine phenotypic predictions. It is thus easier to identify important predictors during machine learning analyses; much of the noise in the gene transcript data is no longer present in the flux data, since gene transcripts with low expression have been 'filtered out': they do not have a large influence on linear constraints in the metabolic model, and consequently have a smaller effect on the flux rates.
Therefore, if a machine learning model can extract the non-redundant information contained in the flux rates, they can contribute new mechanistic information that is not found in the transcriptomic data. Furthermore, the model itself can act as a tool for ranking and noise reduction, since the effect of low-importance genes can be 'filtered out' even if their expression is highly variable across conditions. Without the metabolic model, the importance of these genes would be overstated, and they would be used erroneously to differentiate conditions. For example, in our case study, reactions involved in succinate dehydrogenation (SUCD1Itlm/SUCD1Icpm), efflux (SUCCt2b) or exchange (EX_succ_E) were found to be positively correlated with growth for all three objective pairs and were also identified among the highest positive correlations when analyzing the concatenated dataset of gene transcripts and Biomass - ATP maintenance flux data. These reactions are encoded by A1094 and A2569, which had relatively low gene expression and variability across growth conditions (ranging between 0.33 to 3.74 and 0.14 to 3.66, respectively). Being unrelated to genes already identified as significant during LASSO and correlation analyses of the single-omic (transcriptomic) data, these reactions were only detected as a result of transcriptomic data being used to adjust the constraints for calculating flux rates, showing the importance of the metabolic model in characterizing the phenotype across conditions.
In practice, combining transcript and flux data in a single multi-omic dataset (by converting them into fold change values) provides a direct point of comparison between the two omics and an opportunity to observe in which instances the flux values are more predictive than transcript values. Generally, transcriptomic and fluxomic data produce different outcomes from the modeling and statistical analyses and combining the two omics yields more stable predictions.
In this section, we define how to concatenate transcript and flux data by obtaining fold changes that enable a comparison of their contribution to gene/reaction variables as a result of the conditions under which the cells were grown and harvested.
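The concatenation of the two fold-change matrices can be sketched as follows (Python for illustration; the shapes and values are hypothetical, with rows as growth conditions):

```python
import numpy as np

# Hypothetical fold-change matrices: rows = growth conditions,
# columns = genes (transcripts) or reactions (fluxes)
transcript_fc = np.array([[1.0, 2.0],
                          [0.5, 1.5]])
flux_fc = np.array([[1.1, 0.9, 1.0],
                    [0.8, 1.2, 1.0]])

# Concatenate along the feature axis to build the multi-omic dataset;
# both omics are centred around 1, so their columns are directly comparable
multi_omic = np.hstack([transcript_fc, flux_fc])
```

Because both blocks are fold changes centered around 1, downstream analyses (PCA, LASSO, correlation) can compare gene and reaction features on the same scale.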
15. In MATLAB, create datasets for further analysis by concatenating transcripts and fluxes:

Principal component analysis (PCA)

Timing: < 5 min
Principal component analysis (PCA) can reduce multidimensional datasets to a few latent dimensions known as principal components, allowing the identification of variables responsible for the largest variations within datasets. The reduction of dimensionality within voluminous omic datasets is an important process to achieve successful multi-omic integration and is vital to facilitate their interpretation.
In this analysis, PCA is being used to compare the contribution of each growth condition to the construction of dimensions that summarize the greatest proportion of variance in the dataset. Furthermore, specific genes and reactions contributing to variance between conditions can be pinpointed using Pathway-level PCA, wherein they are classified according to their genetic/metabolic function. The role of these genes and reactions in significant pathways or cellular processes can also be ascertained in a more detailed manner.
Here, principal component analysis is conducted in R using the FactoMineR and factoextra packages. Full details of the code are provided in the script PCA_script.R, which can be found in the GitHub repository listed in the key resources table: https://github.com/Angione-Lab/Synechococcus7002-metabolic-modelling. For users wishing to carry out the full analysis on gene transcripts and/or flux rates in the form of .mat variables in MATLAB, the function pca can be used to carry out PCA on raw data, pcares returns the residuals obtained by retaining a given number of principal components, and pcacov performs PCA on the square covariance matrix. However, we demonstrate our pipeline using the packages in R for improved analysis and visualization of plots that facilitate the biological interpretation. As seen below, the R packages generate detailed plots, lists of variable contributions, principal component scores and the proportions of variance explained by each dimension.
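For readers working outside R or MATLAB, an equivalent PCA can be sketched in Python with scikit-learn. This is an illustrative alternative to PCA_script.R, not the protocol's code; the data matrix below is synthetic, with rows standing in for the 24 growth conditions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 24 growth conditions x 50 gene/reaction features
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=0.3, size=(24, 50))

# Standardize features first (FactoMineR's PCA scales to unit variance by default)
scaled = StandardScaler().fit_transform(data)

# Project the conditions onto the first two principal components
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

# Proportion of variance summarized by each dimension
explained = pca.explained_variance_ratio_
```

The coords array corresponds to the individual (condition) coordinates and explained to the variance proportions that the R packages report.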
The gene transcripts dataset is used as an example below, but the same steps can be repeated for all datasets (transcripts, all_ATP_flux, all_ATPTF, etc.). For an example plot using individual growth conditions, see Figure 8. Other useful outputs resulting from the analysis, such as principal component contributions (Figure 9) or coordinates (Figure 10) relating to all growth conditions or variables within the dataset, can also be saved for further inspection.
16. We begin by navigating to the workspace in R and loading the required packages:
17. We then load transcript/multiomic/flux .csv data files for analysis:

Pathway-level PCA

Timing: < 15 min
In order to carry out a more detailed investigation of specific gene transcripts or metabolic reactions in the model, it is possible to perform a pathway-level PCA that categorizes genes and reactions identified during PCA according to their main biological function. Upon obtaining the results of these analyses, we can plot the sum and average principal component contributions across different pathways as well as principal component coordinates for each growth condition against single reaction fluxes. As in the previous principal component analysis section, there are existing functions for plotting these data in MATLAB. The barh function can be used to generate bar plots displaying sums of subsystem contributions, the polarplot function can be used to display average contributions by subsystem and the scatter function can be used to plot principal coordinates for individual reactions against their corresponding flux values across different growth conditions. In this protocol, we utilize the plotrix and fmsb libraries in R to customize individual pyramid plots and radar charts, facilitating comparisons between different pairs of flux objectives and multiple pathways.
This provides an opportunity to study these components in a more detailed manner by expanding the scope of biological insights detected and establishing connections between genes and reactions within the same functional category or pathway. It is important to account for the varying number of reactions within each pathway; therefore, both the sum and average contributions to variance can be used as measures of comparison from principal components. Additionally, principal component coordinates for each growth condition can also be compared against single reactions selected from the top flux contributors to variance (identified for all three objective pairs during Principal component analysis (PCA)). This helps to quantify the strength of association between these reactions and the principal components they are best summarized by.
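The per-pathway sum and average contributions can be sketched as follows (Python for illustration; the contribution values and subsystem labels are hypothetical):

```python
import numpy as np

# Hypothetical PC1 contributions for five reactions and their subsystems
contributions = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
subsystems = ["Photosynthesis", "Photosynthesis", "TCA cycle",
              "TCA cycle", "TCA cycle"]

# Sum and average contributions per pathway; the average corrects for
# the varying number of reactions assigned to each subsystem
summary = {}
for name in set(subsystems):
    idx = [i for i, s in enumerate(subsystems) if s == name]
    values = contributions[idx]
    summary[name] = {"sum": values.sum(), "mean": values.mean()}
```

Reporting both measures avoids large pathways dominating simply because they contain more reactions.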
22. Within MATLAB, import the table of contributions for the dataset (all_atp_flux is provided as an example):

Note: While gene transcripts can be classified by their Cluster of Orthologous Genes (COG) category, reactions must be classified according to the pathways they are assigned within fbamodel.subSystems. Since each reaction can be classified by multiple subsystems, separate cell arrays can be allocated to store subsystems from each column of fbamodel.subSystems. The number of arrays needed depends on the maximum number of subsystems that a single reaction is categorized by within the model. In this case, each reaction is assigned to a maximum of five subsystems; therefore, a total of five cell arrays are required to store the subsystem names, which are later concatenated into a single array and used to replace the original fbamodel.subSystems in the model.

Finally, we can also analyze principal component coordinates for each growth condition against single reaction fluxes. An example is demonstrated below using Biomass - ATP maintenance flux data in R (with the expected results plotted in Figure 13).

32. We begin by loading the requisite variables:
33. Select only the columns required:
34. Use the data to fit linear models and create scatter plots for both principal components:
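For readers prototyping the pathway-level aggregation outside MATLAB, the grouping logic can be sketched in a few lines. The snippet below is an illustrative Python sketch (not part of the protocol's scripts) that sums and averages per-reaction principal component contributions within each subsystem, using made-up pathway names; it mirrors the use of both the sum and the average as measures of comparison across pathways of different sizes.

```python
from collections import defaultdict

def aggregate_by_subsystem(contributions, subsystems):
    """Sum and average per-reaction PC contributions within each pathway.

    contributions: list of PC contribution values, one per reaction.
    subsystems: list of lists; each reaction may belong to several subsystems.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, pathways in zip(contributions, subsystems):
        for pathway in pathways:
            sums[pathway] += value
            counts[pathway] += 1
    # The mean corrects for the varying number of reactions per pathway
    means = {p: sums[p] / counts[p] for p in sums}
    return dict(sums), means

# Toy example with hypothetical subsystem names and contribution values
contrib = [0.4, 0.1, 0.3]
subs = [["Glycolysis"], ["Glycolysis", "TCA cycle"], ["TCA cycle"]]
totals, averages = aggregate_by_subsystem(contrib, subs)
```

The sums would feed a bar plot (as with barh in MATLAB) and the means a radar chart, as described above.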

K-means clustering
Timing: < 10 min

The purpose of clustering techniques is to partition samples into groups based on hidden patterns in data. They are particularly suitable for detecting underlying associations based on shared characteristics where little prior information is available. Most clustering methods fall within the hierarchical or k-means families. On one hand, hierarchical clustering is an iterative process that progressively merges the pairs of observations that are closest in proximity until all clusters are combined within a single hierarchy. On the other hand, k-means partitions the data into a chosen number of clusters (K) by minimizing the sum of squared Euclidean distances between each observation and its respective cluster mean (McLachlan et al., 2008). K points in the dataset (known as cluster centroids) define the groups to which the remaining data points are assigned; the centroids are then repeatedly relocated to the averages computed within each group until distinctive clusters are formed. When applied to the transcriptomic and fluxomic data in our study, k-means clustering is used to assess whether multi-omic datasets identify clusters of growth conditions according to their respective omic responses, and which trends can be observed between growth-promoting and growth-limiting conditions. In this instance, the results indicate that the single-omic datasets may benefit from being analyzed in isolation, bypassing an increase in data dimensionality that cannot be easily reduced. k-means clustering is run using the script statistics_on_genes.m, which also calls mdscale_robust.m, a script that applies multidimensional scaling to avoid co-location of data points during clustering. Additionally, silhouette plots (Figures 14 and 15) are generated to decide the number of clusters for the final scatter plot (Figure 16). Note that distances are computed between profiles along all the genes, rather than as correlations between genes along the profiles:
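The k-means procedure described above can be sketched compactly. The following is an illustrative Python implementation of the basic algorithm (the protocol itself relies on MATLAB's kmeans function), with centroids repeatedly relocated to the within-group averages until the assignment stabilizes; the toy data points are hypothetical.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to the nearest centroid (squared
    Euclidean distance), then move each centroid to the mean of its group,
    repeating until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # Relocate each centroid to the average of its group (keep it if empty)
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
              for p in points]
    return labels, centroids

# Two well-separated toy groups of observations
pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 4.9)]
labels, cents = kmeans(pts, k=2)
```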

We begin by loading the required variables into MATLAB:
38. The zscore function is used to standardize each of the profiles to have zero mean and unit variance, after which the pdist function is used to compute pairwise distances between observations in the dataset:

Note: K-means clustering requires the user to decide the number of clusters (K) that the data is partitioned into. Prior to clustering, different values of K can be tested using silhouette analysis in order to select the most suitable number of clusters for partitioning the data.

39. In order to establish the optimal number of clusters, a silhouette analysis can be conducted to measure the cohesion of data points within each cluster (given by a silhouette value for each variable). The initial pre-plot in Figure 14 displays silhouette values (y) against the number of clusters selected (x), which indicates the best value to select for K (i.e., the number of clusters with the highest silhouette value):

Note: Upon selecting a value for K, the silhouette function in MATLAB produces a plot (Figure 15) that displays values for each individual cluster within the range [−1, 1]. This gives a measure of proximity for each point in one cluster to points in the neighboring clusters.
40. Upon examination of the silhouette plot, the user is prompted to manually select the number of clusters for the k-means plot:

Note: The closer the silhouette coefficients are to 1, the further that point is from other clusters and the better the separation of clusters. A coefficient close to 0 means that the point lies very close to the decision boundary between two neighboring clusters. After the silhouette coefficients have been calculated for the data points in each cluster, a mean silhouette score can be computed to evaluate the quality of the clustering as a whole.
41. Nonmetric multidimensional scaling can be applied to circumvent errors caused by the co-location of data points by multiplying dissimilarities by a scalar:

Note: mdscale_robust is a variation of the mdscale function where scaling is used to minimize the squared stress criterion with 500 iterations of the algorithm.
42. The kmeans function is used to perform clustering using the following command:

Note: In this instance, the 'dist' metric for clustering is the city block (also called ''Manhattan'') distance. This distance belongs to the general Minkowski family

d(x, y) = ( Σ_j |x_j − y_j|^p )^(1/p)

where p = 1 in the case of the Manhattan distance, but the user is encouraged to choose the metric most suitable for their dataset.
43. Finally, a scatter plot can be created to display the k-means clusters:
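As a complement to steps 39 and 40, the silhouette value of a single observation can be computed by hand as s = (b − a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The snippet below is a minimal Python sketch (illustrative only, not the MATLAB silhouette function), using the Manhattan distance and hypothetical data points.

```python
def silhouette_value(point, own_cluster, other_clusters, dist):
    """Silhouette s = (b - a) / max(a, b) for one observation.

    own_cluster: the other members of the point's cluster (point excluded).
    other_clusters: list of clusters the point does not belong to.
    """
    # a: mean distance to the other members of the point's own cluster
    a = sum(dist(point, q) for q in own_cluster) / len(own_cluster)
    # b: smallest mean distance to the members of any other cluster
    b = min(sum(dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

# Manhattan (city block) distance, matching the metric used in step 42
manhattan = lambda x, y: sum(abs(u - v) for u, v in zip(x, y))

s = silhouette_value((0.0, 0.0),
                     [(0.0, 1.0)],
                     [[(5.0, 5.0), (6.0, 5.0)]],
                     manhattan)
```

A value of s near 1 indicates the point sits well inside its own cluster, as discussed in the note to step 40.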

LASSO regression
Timing: < 10 min

The main purpose of this analysis is to identify the core subset of predictors (genes and/or reactions) with positive or negative nonzero coefficients greater than 0.01 in magnitude that are strongly related to in vivo growth rates, while penalizing the remaining predictors (i.e., setting their coefficients to zero). The script lasso.m performs LASSO regression with α = 1, which returns fitted least-squares coefficients for linear models of transcript, flux or multi-omic data (x) and the growth rates (y) in 12 growth conditions. Following this, the mean predictor coefficient (MPC) can be calculated by averaging across the nonzero coefficients in all vectors for each gene/reaction. In this example, only 12 out of 23 growth conditions had (i) specified growth rates, (ii) specified doubling times, or (iii) standard growth curves that could be used to calculate growth rates from the original studies (Bryant, 2011, 2012a, 2012b), so only the subset of the original datasets corresponding to these growth rates has been selected for analysis. We here describe LASSO regression carried out in MATLAB for the subset of gene transcripts corresponding to these 12 growth conditions, but for the sake of clarity, the generation of the multi-omic and fluxomic subsets is also demonstrated:

% Create transcript data corresponding to 12 growth conditions
transcripts_subset(1,:) = ones(1,size(transcripts,2)); % all ones for standard control
transcripts_subset(2:12,:) = transcripts([10:12,14:16,18,19,21:23],:);
% Load flux data
load('all_atp_flux.mat');
% Create flux data corresponding to 12 growth conditions
all_atp_flux_subset = all_atp_flux([24,10:12,14:16,18,19,21:23],:);
% Load multi-omic data (concatenated transcript and flux data)
load('ATPTF.mat');
% Create multi-omic data corresponding to 12 growth conditions
ATPTF_subset = ATPTF([24,10:12,14:16,18,19,21:23],:);
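The optimization performed here can also be illustrated from first principles. The sketch below is a minimal Python/numpy coordinate-descent implementation of LASSO using the soft-thresholding operator; it is an illustrative stand-in for MATLAB's lasso function (not the protocol's script), run on synthetic data in which only the first predictor drives the response.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the closed-form single-coordinate LASSO update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, iters=200):
    """LASSO via cyclic coordinate descent.

    Minimizes (1/2n)||y - Xb||^2 + lam * ||b||_1, shrinking
    uninformative coefficients exactly to zero.
    """
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            # Partial residual excluding predictor j's current contribution
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return b

# Synthetic data: the response depends on the first predictor only
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = 2.0 * X[:, 0] + 0.01 * rng.standard_normal(50)
coef = lasso_cd(X, y, lam=0.1)
```

As in the protocol, coefficients surviving the penalty identify the core subset of predictors, and averaging nonzero coefficients across runs would give an MPC-style summary.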
Pearson correlation analysis

CRITICAL: Flux values are converted into absolute values prior to calculating their correlations in order to equally represent the activity of reversible reactions.

Using the same data as in LASSO regression, the script corrcoef_tf_gr.m calculates the Pearson correlation coefficients between subsets of transcript/flux data (x) and growth rates (y) across 12 conditions. The example below demonstrates how a table of correlation coefficients calculated between the transcript data and growth rates is generated in MATLAB (corr_transcript_table), but the corresponding tables can also be created for flux data, i.e., corr_ATP_table, corr_P1_table, corr_P2_table. Example plots of the positive/negative correlation between the transcript data and growth rates are provided in Figure 17.
47. In MATLAB, create output vectors to store correlation coefficients, p-values, and the lower and upper bounds of confidence intervals, adjusting the number of rows to match the dataset in question (3187 rows for the transcripts):
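The quantities stored in these output vectors can be reproduced by hand for a single predictor. Below is an illustrative Python sketch (not the protocol's MATLAB code) computing the Pearson coefficient together with an approximate 95% confidence interval via the Fisher z-transform, using hypothetical flux and growth-rate values across 12 conditions.

```python
import math

def pearson_with_ci(x, y, z_crit=1.96):
    """Pearson r plus an approximate 95% CI via the Fisher z-transform."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    # Fisher z-transform: atanh(r) is approximately normal with SE 1/sqrt(n-3)
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    lo, hi = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
    return r, lo, hi

# Hypothetical flux values vs growth rates across 12 growth conditions
flux = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
growth = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10.1, 11.2, 11.9]
r, lo, hi = pearson_with_ci(flux, growth)
```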

Pathway-level correlation analysis
Timing: < 15 min

Similar to the pathway-level PCA, a more detailed functional classification of correlation coefficients can be obtained by performing a pathway-level correlation analysis, where mean absolute PCC values are classified according to the subsystems assigned to each reaction in the GSMM (see Figure 18 for a bar plot of pathway correlations). This provides an opportunity to study these components in more detail by expanding the scope of biological insights detected and establishing connections between reactions within the same pathway. To account for the differing number of reactions in each pathway, the number of reactions within a binned range of PCC values can also be recorded for each subsystem listed in the model (see Figure 19 for a heatmap of pathway correlations). In this way, correlations between flux rates in each pathway and growth rates can be assessed more fairly. In this section, we demonstrate a pathway-level analysis in MATLAB using a table of correlation coefficients calculated between Biomass - ATP maintenance flux values and growth rates (where corr_ATP_table has been generated using the same steps as in the correlation analysis).

CRITICAL: Correlation coefficients are converted into absolute values prior to calculating the mean PCC for all pathways, since only the magnitude of correlation (and not the direction) is considered when plotting the bar chart in Figure 18. However, the heatmap in Figure 19 indicates the signs of individual correlation coefficients as well as the number of reactions within each pathway.
54. Calculate mean PCC values for each subsystem using the same number of reactions recorded within each subsystem (cardinality_subsystems) and the reaction indices obtained for each subsystem (ixs_subsystems) as in Pathway-level PCA (optional):
55. Plot a bar chart using the mean values:
58. Plot the number of reactions in each bin and subsystem using a heatmap:

Note: Similar heatmaps can be plotted for the Biomass - Photosystem I and Biomass - Photosystem II correlation coefficients to evaluate the correlation between metabolic flux and growth rates across various pathways.
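The binned counts behind the heatmap can be prototyped outside MATLAB as well. The following illustrative Python sketch (with hypothetical subsystem names and PCC values) counts the reactions of each subsystem falling into each binned range of correlation coefficients, preserving the sign information shown in Figure 19.

```python
def bin_pathway_correlations(pcc, subsystems, edges):
    """Count reactions per subsystem within each binned range of PCC values.

    pcc: one correlation coefficient per reaction (in [-1, 1]).
    subsystems: one subsystem label per reaction.
    edges: bin edges, e.g. [-1.0, -0.5, 0.0, 0.5, 1.0].
    """
    counts = {}
    for r, s in zip(pcc, subsystems):
        row = counts.setdefault(s, [0] * (len(edges) - 1))
        for i in range(len(edges) - 1):
            # Bins are left-inclusive; the last bin also includes r = 1.0
            if edges[i] <= r < edges[i + 1] or (r == edges[-1] and i == len(edges) - 2):
                row[i] += 1
                break
    return counts

# Hypothetical per-reaction coefficients and subsystem assignments
pcc = [0.9, -0.6, 0.2, 1.0]
subs = ["Glycolysis", "TCA cycle", "Glycolysis", "Glycolysis"]
table = bin_pathway_correlations(pcc, subs, [-1.0, -0.5, 0.0, 0.5, 1.0])
```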

EXPECTED OUTCOMES
The main outcome of this analysis is a procedure for linking specific genes and/or reactions across trans-omic layers of data belonging to the same biological system. Here we present an example of the pipeline applied to Synechococcus sp. PCC 7002, following the workflow laid out in Figure 3.
The process begins with tailoring the GSMM according to available transcriptomic data recorded under different conditions that influence growth and photosynthesis. After performing condition-specific FBA with norm-2 regularized bi-level optimization, comparisons can be made between the results of analyses performed on gene transcription data, metabolic flux data and the multi-omic data resulting from their concatenation. These analyses include PCA, k-means clustering, LASSO regression and Pearson correlation analysis. Features identified through these analyses reflect the coordinated responses shared between different data types, as well as the variability in responses between different growth conditions. Since the flux data is informed by transcriptomic data through the integration of condition-specific growth profiles within the GSMM, the downstream effect of differential gene expression on metabolic pathways can be observed. Analyzing both transcriptomic and fluxomic data therefore provides a more complete picture of cyanobacterial metabolism than single-omic analyses.
The protocol can be applied for numerous purposes such as model-aided discovery, hypothesis testing, identification of targets for metabolic engineering and comparison between multi-omic data across biological conditions. These processes can be optimized by examining the downstream effects of gene expression on metabolism, thereby contributing to expanding knowledge and meaningful outputs from metabolic models as well as lending biological interpretability to machine learning models. Our code and step-by-step methodology are intended to make these analyses more accessible to non-experts or serve as a guide to other investigators for combining in silico flux simulation with machine learning.

LIMITATIONS
During flux balance analysis, a series of boundary constraints were defined to fine-tune the calculation of flux rates and more closely represent the metabolic capability of cells. The bounds for nutrient uptake were set based on metabolite concentrations in the growth medium, e.g. for the nitrate condition, where the medium was supplemented with 12 mM of sodium nitrate, an uptake rate (lower bound) of -12 was assigned to the nitrate exchange reaction (EX_NO3_E). Currently, there exists no standard operating procedure for the definition of nutritional environments for GSMMs, as they are assessed case-by-case by the researchers conducting the study (see troubleshooting problem five). A recent framework proposed a comprehensive set of guidelines in this regard, paying careful attention to the chemical composition of the growth medium as well as the physiology of the organism(s) concerned and various inorganic environmental factors (Marinos et al., 2020).
As photoautotrophs, cyanobacteria absorb light in excess of their biomass and other maintenance requirements, which can be difficult to replicate within a GSMM. Critically, the exact photon absorbances of the Synechococcus sp. PCC 7002 cultures were not measured under the same conditions in which the cells were harvested for transcriptomic sequencing. Therefore, constraints for the photon exchange reaction (EX_PHOTON_E) had to be approximated using values listed in the literature for dry cell weight and photon absorbance in similar species, adjusted based on the availability of light for each growth condition. This process could be improved by specifying directly measured photophysiological parameters (such as light acclimation, cell density, pigment concentration, photon absorbance, oxygen evolution rate and optical density), and using these values to constrain photon uptake more accurately for each culture (Broddrick et al., 2019; Toyoshima et al., 2020).
Hence, we recommend the use of in vivo experimental data for various growth conditions where feasible to constrain the model and yield more precise flux rates. The prediction of internal fluxes can also be improved by using more specialized FBA techniques that consider constraints on resource allocation between biological processes, such as conditional FBA (Rügen et al., 2015), Resource Balance Analysis (RBA) (Goelzer and Fromion, 2011) or Constrained Allocation Flux Balance Analysis (CAFBA) (Mori et al., 2016).
A number of linear methods and transformations were adopted in this study to maximize the interpretability of machine learning predictions, using quadratic terms for regularization only. However, a range of techniques for dimensionality reduction or clustering methods could be implemented here, e.g. to elucidate non-linear relationships among different omics.

TROUBLESHOOTING
Problem 1
Raw multi-omic data originating from various sources (transcriptomic, proteomic, metabolomic) differ significantly in terms of their format and structure. Data transformation, normalization or scaling techniques must be applied as forms of pre-processing prior to integration in order to make these data comparable. In particular, batch effects must be taken into account both before and after conducting experiments, since they give rise to unwanted variation in datasets caused by differences in technical factors across batches (Step 10 of before you begin).

Potential solution
Methods such as ComBat allow users to adjust for batch effects among samples by utilizing parametric or non-parametric empirical Bayes frameworks (Johnson et al., 2007; Zhang et al., 2020b). Other techniques such as SVASeq or RUVSeq also help to eliminate noise from sequencing experiments and adjust for technical interference (Leek, 2014; Risso et al., 2014). These corrections would then be followed by the pre-processing steps. If available, integrating proteomic or metabolomic data into a GSMM can provide a more accurate representation of the cellular phenotype, since these data capture effects downstream of genes and gene transcripts.

Problem 2
There are numerous methods available for integrating multi-omic data within GSMMs, and it can be challenging to choose a single method for data integration (Step 7 of step-by-step method details).

Potential solution
There are many types of approaches to consider for multi-omic data integration, several of which are discussed elsewhere in greater detail (Machado and Herrgård, 2014; Cho et al., 2019).
In summary, the generation of context-specific metabolic models is divided into two main classes: (i) switch-based approaches (such as GIMME), which remove inactive or lowly expressed genes by setting the corresponding reaction boundaries to zero, and (ii) valve-based approaches (such as E-flux), which increase or decrease the activity of highly (or lowly, respectively) expressed genes by adjusting the upper and lower bounds for their corresponding reactions, proportional to their normalized gene expression values (Vijayakumar et al., 2018).
The main advantage of GIMME-like methods is that they can re-enable flux associated with false negative values in inactive reactions and record consistencies between gene expression data and flux predictions. On the other hand, non-discretized relative gene expression values are more indicative of protein concentrations since levels of transcription are more comparable across genes. The approach used in this case study is closer to a valve-based approach based on METRADE (Angione and Lió, 2015), where the expression level of each gene set (represented by the vector W) is mapped to a coefficient for the lower and upper bounds of the corresponding reaction in the GSMM. When using our method, it was important to conduct a sensitivity analysis to select the optimal value for the γ parameter, which magnifies the level of gene upregulation or downregulation, and therefore the metabolic sensitivity, for yielding experimentally feasible flux values for different growth conditions.
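To make the valve-based idea concrete, the sketch below illustrates one possible METRADE-style mapping in Python. The exact functional form and parameter values used in the protocol may differ, so this is an assumption-laden illustration rather than the protocol's implementation: here the bound multiplier is (1 + γ·|log θ|) raised to the power sign(θ − 1), where θ is the normalized expression of a gene set and γ tunes the metabolic sensitivity.

```python
import math

def valve_coefficient(theta, gamma):
    """Map a normalized expression level theta to a bound multiplier.

    Assumed form: (1 + gamma*|log(theta)|) ** sign(theta - 1), so
    theta > 1 (upregulation) widens the bounds and theta < 1 narrows them.
    """
    if theta == 1.0:
        return 1.0
    sign = 1.0 if theta > 1.0 else -1.0
    return (1.0 + gamma * abs(math.log(theta))) ** sign

def scale_bounds(lb, ub, theta, gamma=1.0):
    """Scale a reaction's lower/upper flux bounds by the valve coefficient."""
    c = valve_coefficient(theta, gamma)
    return lb * c, ub * c

# Upregulated gene set (theta = 2): the feasible flux range widens
lb, ub = scale_bounds(-10.0, 10.0, theta=2.0, gamma=1.0)
```

The sensitivity analysis over γ mentioned above would amount to sweeping the gamma argument and checking the resulting fluxes against experimental feasibility.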
In addition to switch- and valve-based integration methods, there are alternative methods that consider the cellular goal specific to each GSMM or remove unnecessary/blocked reactions from the network. Metabolic task derived (MTD) algorithms consider the main objective function(s) that represent the metabolic tasks as the main priority for the cell or community, or utilize omics-guided objective functions, as in omFBA (Guo and Feng, 2016). Network-pruning methods (such as MBA) retain only a core set of reactions in the network by iteratively pruning reactions from the model to derive a sub-network that is consistent with the tissue-specific gene expression, among other data. However, these methods are only used to extract a context-specific model and do not provide a corresponding flux distribution. Therefore, the method chosen for data integration depends on the nature of the data, the approach taken for constraining flux bounds and the optimization problem to be solved. During model extraction, the type of thresholding applied (within samples or genes) and the threshold values for gene expression used can also affect the output models (Walakira et al., 2021). Very few methods automate model extraction and flux prediction without a priori knowledge of context-specific functions or binarization of reactions during data integration. However, RegrEx is one such algorithm that uses regularized least-squares optimization for automated model extraction and unbiased flux calculation (Robaina Estévez and Nikoloski, 2015).

Problem 3
The cut-off value for setting fluxes equal to zero (10⁻⁴) may not be applicable for every model, seeing as fluxes toward biomass building blocks and other important metabolic components are at risk of being eliminated (Step 13 of step-by-step method details).

Potential solution
We advise users of the protocol to conduct a robustness analysis to assess different thresholds for flux and fold change values. Starting from the solver tolerance parameter (10⁻⁶ in our case), we recommend increasing the order of magnitude for setting flux rates to zero until a trade-off can be reached between eliminating noise within the data whilst still retaining the ability to identify and quantify functionally significant contributions of metabolic processes. Values that are below the chosen threshold can then be set to zero based on this adjustment without any statistically significant changes in results.
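The robustness check described above can be prototyped as follows. This illustrative Python sketch (not part of the protocol's MATLAB code) zeroes fluxes below a chosen cut-off and reports what fraction of the total absolute flux is eliminated, which helps verify that the threshold removes only noise rather than functionally significant flux.

```python
def zero_small_fluxes(fluxes, threshold=1e-4):
    """Set |flux| below the threshold to zero; report the flux mass removed.

    Returns the cleaned flux vector and the fraction of total absolute
    flux eliminated — a quick diagnostic for choosing the cut-off.
    """
    total = sum(abs(v) for v in fluxes)
    cleaned = [0.0 if abs(v) < threshold else v for v in fluxes]
    removed = sum(abs(v) for v in fluxes if abs(v) < threshold)
    return cleaned, (removed / total if total else 0.0)

# Hypothetical flux vector: two values sit below the 1e-4 cut-off
fluxes = [12.5, -3.0e-5, 0.8, 7.0e-5, -2.1]
cleaned, frac = zero_small_fluxes(fluxes)
```

Sweeping the threshold argument upward from the solver tolerance and watching the eliminated fraction mirrors the recommended trade-off analysis.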

Problem 4
The correlation analysis may give rise to regression artifacts that do not reflect a true linear correlation between gene transcript/flux data and growth rates, leading to incorrect causal inferences (Step 52 of step-by-step method details).

Potential solution
We advise users to manually inspect each correlation plot to assess the validity of the correlation between variables. Alternatively, preprocessing techniques such as global scaling normalization or dropout imputation can be applied to the data. In some instances where artifacts have been introduced as a result of data oversmoothing or overfitting, reintroducing random noise into datasets has been shown to increase robustness.

Problem 5
There is no standard operating procedure for determining uptake rates (Step 4 of step-by-step method details).

Potential solution
In the absence of in vivo uptake rates obtained from time-course metabolomic experiments, we advise users to approximate uptake rates, starting from the concentration of the organic carbon source in the growth medium (e.g., glucose or glycerol), and to convert these values into flux units of mmol gDW⁻¹ h⁻¹ (Schinn et al., 2021). Methods such as MetaboTools already use extracellular concentrations to calculate and adjust constraints by defining growth media in terms of concentrations of metabolites measured in mM (Aurich et al., 2016).
Although inorganic substrates are not usually constrained, constraining the inorganic carbon uptake rate is accepted in the absence of an organic carbon substrate for photoautotrophic organisms such as cyanobacteria (Qian et al., 2017). Furthermore, as the availability of nutrients has a major impact on the calculation of metabolic fluxes, we incorporate the extracellular concentrations of metabolites and co-factors present within the various growth media for different conditions to constrain the lower and upper bounds of the associated exchange reactions in the model. This application of condition-specific constraints on the exchange reactions ensures that exchange rates emulate the uptake and secretion of metabolites in accordance with the experimental data, and the computational model therefore more closely resembles the experimental conditions in which the cells are cultured.
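As an illustration of this constraint-setting convention (following the protocol's nitrate example, where 12 mM in the medium maps to a lower bound of −12 on EX_NO3_E), the Python sketch below applies medium concentrations to exchange-reaction bounds. The reaction identifiers other than EX_NO3_E are hypothetical, and the direct mM-to-bound mapping is the convention used here rather than a full unit conversion.

```python
def constrain_exchanges(model_bounds, medium_mM):
    """Set exchange lower bounds from medium concentrations (1 mM -> -1).

    model_bounds: dict mapping exchange reaction id -> (lb, ub).
    medium_mM: dict mapping exchange reaction id -> concentration in mM.
    Metabolites absent from the medium cannot be taken up (lb = 0),
    while secretion (the upper bound) remains allowed.
    """
    constrained = {}
    for rxn, (lb, ub) in model_bounds.items():
        if rxn in medium_mM:
            constrained[rxn] = (-float(medium_mM[rxn]), ub)
        else:
            constrained[rxn] = (0.0, ub)
    return constrained

# EX_NO3_E follows the protocol's example; EX_UREA_E is a hypothetical id
bounds = {"EX_NO3_E": (-1000.0, 1000.0), "EX_UREA_E": (-1000.0, 1000.0)}
medium = {"EX_NO3_E": 12.0}   # 12 mM sodium nitrate in the growth medium
new_bounds = constrain_exchanges(bounds, medium)
```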

RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Claudio Angione (c.angione@tees.ac.uk).

Materials availability
The study did not generate new unique reagents or other materials.

Data and code availability
This protocol fully specifies all datasets generated or analyzed during the study. The complete source code relating to all procedures listed within the protocol is freely available on GitHub at: https://github.com/Angione-Lab/Synechococcus7002-metabolic-modelling.