Integrated in silico Analyses of Regulatory and Metabolic Networks of Synechococcus sp. PCC 7002 Reveal Relationships between Gene Centrality and Essentiality

Cyanobacteria dynamically relay environmental inputs to intracellular adaptations through a coordinated adjustment of photosynthetic efficiency and carbon processing rates. The output of such adaptations is reflected through changes in transcriptional patterns and metabolic flux distributions that ultimately define growth strategy. To address interrelationships between metabolism and regulation, we performed integrative analyses of metabolic and gene co-expression networks in a model cyanobacterium, Synechococcus sp. PCC 7002. Centrality analyses using the gene co-expression network identified a set of key genes, which were defined here as “topologically important.” Parallel in silico gene knock-out simulations, using the genome-scale metabolic network, classified what we termed as “functionally important” genes, deletion of which affected growth or metabolism. A strong positive correlation was observed between topologically and functionally important genes. Functionally important genes exhibited variable levels of topological centrality; however, the majority of topologically central genes were found to be functionally essential for growth. Subsequent functional enrichment analysis revealed that both functionally and topologically important genes in Synechococcus sp. PCC 7002 are predominantly associated with translation and energy metabolism, two cellular processes critical for growth. This research demonstrates how synergistic network-level analyses can be used for reconciliation of metabolic and gene expression data to uncover fundamental biological principles.

metabolic and regulatory networks, there are still many fundamental questions that remain unanswered, specifically: (i) what is the relationship between topological and functional importance of genes in GCN and GMN; (ii) are topologically central genes in GCN also functionally essential in GMN; and (iii) if so, what are the characteristics of those genes? Previous studies in this area have addressed similar questions, however, using GMNs alone [16][17][18] or in combination with GMN-derived gene connection structures [19]. The present work is, to our knowledge, the first demonstration of integrative functional and topological analyses of multiple biological networks built independently from disparate data sources.
Availability of well-established model systems, in conjunction with high-throughput "omics" data, provides the scale and resolution to perform integrative computational analyses of regulation and metabolism. In this study, we systematically compared transcript association patterns to metabolic capacity and interrogated the topological and functional importance of genes in a unicellular cyanobacterium Synechococcus sp. PCC 7002 (hereafter Synechococcus 7002). Synechococcus 7002 is a well-characterized microorganism capable of robust growth under a wide range of environmental conditions [20][21][22], in which adaptation to specific conditions is associated with distinct transcriptional patterns indicating tight regulation of metabolic and regulatory modules [23,24]. However, the degree by which transcriptional topology reflects essentiality of genes participating in central metabolic pathways remains unclear due to lack of approaches capable of relating large networked data sets in a quantitative fashion. In this study we present a novel computational framework to concurrently assess the centrality and functional importance of genes through integrative analysis of the GCN and GMN. This approach revealed that topologically central genes with high measures of networked centrality, measured over a broad range of distinct growth conditions, predominantly include those that are also functionally essential for cyanobacterial growth.

Experimental Conditions and Measurements
Expression data for Synechococcus 7002, representing a total of 42 discrete growth conditions, were either generated from continuous cultures or sourced from previously reported studies, which examined growth of the organism under nutrient limitation, varying irradiance levels, extremes of cell density, temperature and salinity, as well as co-cultivation with a heterotrophic partner [23][24][25][26]. The continuous cultures of Synechococcus 7002, operated in chemostat or turbidostat modes, were grown on A+ media in photobioreactor at 30 °C with a dilution rate of 0.1 hr −1 as described previously [27]. In carbon-, nitrogen-, or light-limited chemostats, steady-state growth was limited by 7.7 mM NaHCO3, 0.9 mM NH4Cl, or 140 µE m −2 s −1 , respectively. In turbidostat mode, Synechococcus 7002 was grown under six irradiance levels ranging from 33-758 µmol photons m −2 s −1 with 2% CO2 supplementation to gas (N2) sparged at 4 L min −1 . Two other conditions consisted of a high light and high O2 (up to 60% v/v of sparge gas) adapted strain of Synechococcus 7002 grown under either 7.1% or 16.5% dissolved O2. Total RNA was extracted and sequenced as described previously [23,25] using a phenol chloroform approach and sequencing was performed using SOLiD 5500XL protocol with a read length of 50 bp [25] or with the SOLiD™ 3 or 3Plus protocol [23]. Gene expression levels were analyzed as reads per kilobase per million reads (RPKM) and all transcriptome data sets were compiled using the previously described Rockhopper software [28].

Metabolic Network Analysis of Wild-Type and Knock-Out Strains
Previously developed GMN of Synechococcus 7002 [29] was updated by: (i) changing incorrectly assigned gene-reaction association of r_AGPAT_SYN (1-acylglycerol 3-phosphate O-acyltransferase) from "SYNPCC7002_A0198" to "SYNPCC7002_A0918"; and (ii) separating glycogen from the biomass synthesis equation into the external metabolite pool. As the total number of genes contained in the resulting GMN was 706, we renamed the network model to iSyp706.
In silico flux distributions were calculated for 42 growth conditions using the Expression-Guided Flux Minimization (E-Fmin) algorithm that incorporates RNA sequence data into a genome-scale metabolic network based on the flux minimization principle [30]. An advantage of E-Fmin is that it does not require the specification of a metabolic objective (to maximize or minimize), which is often difficult to reconcile with growth phase and condition. Instead, E-Fmin minimizes the sum of weighted flux magnitudes where the weight is formulated as a decreasing linear function of gene expression levels, ensuring that fluxes associated with genes expressed at low levels are significantly suppressed. In silico fluxes were compared through gene knock-out simulations using the Minimization of Metabolic Adjustment (MOMA) algorithm, which minimizes the Euclidean distance between two flux vectors of parent and mutant strains [31]. While the original formulation of MOMA is a constrained nonlinear optimization problem, we recasted it as a linear optimization problem for computational convenience. A gene was defined as functionally important (FI) if its inactivation produced changes in flux distributions as estimated by both E-Fmin and MOMA ( Figure 1).

Reconstruction of Gene Co-Expression Network (GCN)
Initial reconstruction of Synechococcus 7002 co-expression network were carried out at a genome-scale using all 3236 transcript-associated genes. To allow for cross-network comparisons, a reduced GCN containing only 706 reaction-associated genes included in the genome-scale metabolic network was generated. All network reconstructions were developed using Context Likelihood of Relatedness (CLR) program as previously described [14] and were based on the RPKM values from the 42 experimental conditions. A bootstrapping approach (500 iterations) was used to increase the robustness of the data. During each CLR iteration, two experimental conditions were randomly removed, and only those gene pairs (nodes) with a Z-score of 4.5 (4.5 standard deviations greater than the mean of all mutual information scores of that gene) in at least 375 of the runs were assigned edges between them. This served to remove weak edges and increased the quality of the resulting GCN. Groups of topologically and functionally important genes were subjected to functional enrichment analysis, which calculated the percentage of genes within a given functional category and identified those that are significantly higher than the percentage of genes of the same category in the entire genome with a p-value of >0.05 according to Fisher's exact test.

Quantification of Gene Centrality
To evaluate the relative importance of genes in the GCN, we used four different types of centrality measures, including (i) degree; (ii) eigenvector; (iii) betweenness; and (iv) closeness ( Figure 2). The degree ( i d ), also known as connectivity, is the most elementary concept of centrality and denotes the number of links through which a node is directly connected to its neighbors. Degree centrality of node i is then expressed as: where ij A is the ( , i j ) th element of the adjacency matrix A. The degree of centrality in Figure 2 ranges from 0 to 3.

Figure 2.
Graphical illustration of centrality concepts using a mock network. The nodes and edges, which represent genes and their co-expression relationships, respectively, are shown as an adjacency matrix (A). The graph is disconnected and the individual subgraphs are referred to as components. As centralities are zeros for isolated genes 7 and 8 (shaded in the table on the right), they can be optionally excluded from the adjacency matrix.
A more refined connectivity term is eigenvector centrality ( i e ), which accounts not only for the number of immediate neighbors, but also for their quality (i.e., centrality). Thus the eigenvector centrality of node i is the summation of eigenvector centrality of its neighbors ( j e 's): where  is a constant. Equation (2) can be equivalently represented as a typical form of eigenvalue-eigenvector problem, i.e.,   e Ae . In the example graph, while nodes 4 and 5 are connected to the same number of neighbors, node 4 is more central than node 5 due to the difference in the quality of neighbors.
The term betweenness ( i b ) defines centrality in terms of a number of shortest (or geodesic) paths passing through a given node in a network. Based on the assumption that information in a network flows through geodesic paths, betweenness quantitates the amount of information flowing through a given node (ni) and can be expressed as: where N is the set of nodes in a network and   , j k  and   , \ j k i  denote the number of geodesic paths between nodes j and k passing through node i . As shown in Figure 2, only node 2 has a nonzero value of betweenness. Note that node 1 is not between nodes 2 and 3 because 2-1-3 is not the shortest path between them.
Finally, the term closeness (ci) indicates how close a given node is to the others; for instance, in a graph of n nodes, closeness centrality is measured by summing up the geodesic distances from a node to all other ( 1) n  nodes: where ij g is the geodesic distance between nodes i and j . Note that closeness centrality defined in Equation (4) becomes zero for all nodes in a disconnected graph. To avoid this, we used the modified form of closeness [32] as given below: As shown in Figure 2, nodes contained in the largest component (i.e., component 1) have higher closeness values than those in the component 2. We calculated four centrality values of each gene in the GCN using the Matlab codes developed by Strategic Engineering Research Group (SERG) at MIT [33,34].

Reconstruction and Topological Analysis of GCN
Biological networks often exhibit scale-free properties with inherent robustness to the random disruption of genes, but are vulnerable to targeted deletion of the most central gene nodes [35]. As a fundamental characteristic, the degree distribution of scale-free networks follows a power-law. This principle is shown in Supplementary Figure S1, which compares the 3236-gene (GCN1) and the 706 gene (GCN2) co-expression networks of Synechococcus 7002. Degree distributions of both networks appear as straight lines on log-log plots (Supplementary Figure S1a,b), implying that the developed coexpression networks of Synechococcus 7002 are robust and scale-free. While not apparent from the degree distributions, both GCNs are fragmented and composed of many disconnected subgraphs (or components). As a common feature of these two GCNs, the size of the largest component is significantly large relative to the other components. That is, the number of nodes of the first and second largest components of GCN1 (GCN2) was 985 (52) and 34 (10), respectively with all of the topologically central genes identified in our analysis clustering into the first largest components. As seen in the size distributions of GCN1 and GCN2 components (Supplementary Figure S1c,d), GCNs have a large portion of isolated genes, the number of which was 1842 in GCN1 and 495 in GCN2, respectively. As mentioned above (Material and Methods, Section 2.4), we removed these isolated genes for convenience of topological analysis and used (truncated) GCN2 throughout this work.

Centrality Analyses
The evaluation of gene centrality within GCN provided distinct distribution patterns across different centrality measures (Figure 3). Then, through functional enrichment analysis, we identified three major functional roles of genes, that are associated with different centrality measures. In all cases, translation and energy metabolism were the most dominant roles, though their relative dominance varied depending on the specific centrality measure. The largest fraction of energy metabolism genes that showed significant enrichment was those involved with either photosynthesis, electron transport, or oxidative phosphorylation, while smaller number encoded putative components of the carbon fixation machinery. In the case of translation, the vast majority of genes were part of either the large or small ribosomal subunit. Genes with high degree centrality were almost equally associated with energy metabolism and translation; eigenvector and betweenness centralities were predominantly associated with translation; closeness centralities were also closely associated with translation and energy metabolism at different levels. We defined topologically important (TI) genes as those within the top 50% measure of each centrality.

Identification of Functionally Important (FI) Genes
While gene expression profiles varied with each growth condition, calculated flux distributions were relatively similar. Out of the 42 conditions tested, principle component analysis identified only 5 distinct flux distributions, indicating that the intracellular and exchange fluxes of Synechococcus 7002 are quite invariable (Figure 4, Supplementary Tables S1 and S2). These 5 sets of conditions are interpreted to have a sufficient level of "environmental variation" to profile the range adaptive responses which modulate Synechococcus 7002 metabolism. The limited variability of fluxes across the metabolic pathways of Synechococcus 7002 can be indicative of cyanobacterial growth robustness, which is somewhat intuitive, due to the relatively limited flexibility for energy and carbon acquisition pathways as compared to chemo-and photo-heterotrophs.   The functional importance of genes was evaluated on the impact of metabolic flux distribution, which was predicted in silico by performing comprehensive single-gene knock-out simulations. Within the 706-gene network, we identified three different groups of genes with distinct functional importance. Deletion of any gene in Group 1 (362 genes) had no effect on relative flux distribution. This group was found to be enriched for genes involved in processes that are generally known to be non-essential for growth and/or facilitated by functionally redundant genes in Synechococcus 7002: (i) RNA polymerases (p < 10 −2 ); (ii) folate synthesis (p < 10 −5 ) and genes within the pentose phosphate pathway (p < 10 −3 ). Deletion of any gene in Group 2 (55 genes) altered the relative flux distributions. Group 2 was functionally enriched for genes involved with processes that are known to have genetic redundancy and/or plasticity yet critical, condition specific roles for cyanobacterial growth: (i) glyoxylate/dicarboxylate metabolism (p < 10 −6 ); (ii) pyruvate metabolism; (iii) oxidative phosphorylation (p < 10 −13 ) and (iv) electron transport (p < 10 −3 ). Any deletion of a gene in Group 3 (289 genes) resulted in no growth under at least one specific condition; 279 of the 289 Group 3 genes were lethal deletions under all conditions. Group 3 was functionally enriched for genes involved in critical growth processes such as: (i) photosynthesis (p < 10 −14 ); (ii) carbon fixation (p < 0.05); (iii) tRNA synthesis (p < 10 −10 ) and (iv) ribosome synthesis (p < 10 −23 ). Computationally estimated flux distributions across conditions were provided in Supplementary Dataset S1. Based on this classification, we ranked functional importance of genes as Group 3 > Group 2 > Group 1 and referred to Group 2 and Group 3 as FI. It should be noted, that the results above could be affected to a degree depending on the accuracy of both E-Fmin and MOMA, methodologies employed for the in silico metabolic network analyses. While these methods have proven reliable in several case studies conducted in the original papers, direct validation using 13 C-MFA data of Synechococcus 7002 has not yet been carried out and would be an important future direction for these studies.

Integration of Topological and Functional Analyses
Comparative analysis of topologically and functionally distinct categories (i.e., TI and FI genes) revealed positive correlation, albeit qualitative, between gene centralities and functional importance ( Figure 5). Degree of centrality differentiated Groups 2 and 3 from Group 1 while, eigenvector and betweenness centralities separated Group 3 from Group 1 and Groups 2, showing their usefulness in identifying FI genes. Among four centrality measures considered in this work, however, closeness had the highest correlation with the functional importance of genes, clearly discerning all three groups. Such a proportional relationship was also observed when examining a combined centrality score. To assess the robustness of this analysis, we performed Monte Carlo simulations by randomly generating 5000 sets of genes of the same size as each group and comparing the centrality measures. From the difference in centrality measures between specific (Group 1 to Group 3) and randomly chosen groups, we could infer the same trends displayed in Figure 5 (Supplementary Figure S2). The profile patterns, corresponding to each measure of centrality, were considered with their top three biological roles (summarized in Figure 3) to identify the association of Group 3 and Group 2 with translation and energy metabolism, respectively. Association of Group 1 genes with specific biological functions was relatively weak in comparison to other groups. Details on gene functional enrichment of Group 1, Group 2, and Group 3 were provided in Supplementary Datasets S2, S3, and S4, respectively. The integration of topological and functional analyses provides a basis for the identification of the following three groups of genes: (i) TIs but not FI (here, denoted by Tf) genes; (ii) FI but not TI (Ft) genes; and (iii) intersection of TI and FI (TF) genes. As addressed earlier, we classified genes within the top 50% of the combined centrality scores as TI genes, while isolated genes (centrality values of 0) were classified as topologically unimportant ( Figure 6). Most of TI genes belonged to the Group 3 (i.e., essential genes) (thus, identified as TF), and as an exception, one gene belonged to the Group 1 (thus Tf). Interestingly, an appreciable portion of functionally essential genes (Group 1) were isolated nodes in the GCN as represented with zero centrality in Figure 6a. These topologically unimportant genes were also found in all other groups. The isolated genes among Group 2 and Group 3 thus belong to Ft. Figure 7 summaries the classification of 706 metabolic genes into different categories.  A relatively large number of Synechococcus 7002 genes were grouped within the Ft category (217), enabling a robust functional enrichment analysis. Several categories of genes were enriched when examining this list including those corresponding to processes that are required for production of biomass such as translation, and energy-and carbohydrate-metabolism (Supplementary Dataset S5). In addition, most of the Ft genes (~96%) were included in functionally enriched categories, confirming that the GMN model used in this study is well suited to represent the metabolism of Synechococcus 7002 across the 42 growth conditions considered. In fact, only a few genes on this list were involved in metabolic pathways that were not central to biomass production. All of the Ft genes were determined to be isolated nodes in the GCN implying that: (i) they are not critical for co-expression of other genes (because biomass synthesis belongs to the final process of metabolism); and (ii) the GCN is robust against these genes because the effect of their deletions does not propagate through the network). In contrast, the same classification criteria generated only one gene (atpA; syn7002_A0734) that belonged to the Tf group. The in silico GMN simulations suggested that the deletion of this gene had no effect on metabolite flux distributions, since Synechococcus 7002 has a homolog (atpA; syn7002_G0151) and that potentially function as ATPase in the absence of the syn7002_A0734 genes. Interestingly, genes categorized as TF were almost exclusively functional enriched for translation pathways (i.e., 13 out of the 14 TF genes in total). These genes are topologically central, indicating that they play a key role in co-expression patterns of other genes.

Concluding Remarks
Through integrated computational analyses of co-expression and metabolic networks in Synechococcus 7002, we interrogated biologically interesting questions related to the role of genes in cyanobacteria. The most fundamental result of this research was the interrelationship between the topological and functional importance of genes and that genes with high measures of centrality were predominantly found to be essential for growth and metabolism. Integrated network analysis revealed their overall correlation to be positive. The results from our analysis also suggest that genes with high centrality values in the co-expression network may indicate their important roles in metabolism. This work provides a bench mark foundation for integrated computational analyses of regulatory and metabolic processes and sets the stage for directed experiments to confirm if this approach will be able to provide predictive understandings of cyanobacterial growth, and other model organisms, across an even larger dynamic range of conditions.