Randomizing Genome-Scale Metabolic Networks

Networks coming from protein-protein interactions, transcriptional regulation, signaling, or metabolism may appear to have “unusual” properties. To quantify this, it is appropriate to randomize the network and test the hypothesis that the network is not statistically different from expected in a motivated ensemble. However, when dealing with metabolic networks, the randomization of the network using edge exchange generates fictitious reactions that are biochemically meaningless. Here we provide several natural ensembles of randomized metabolic networks. A first constraint is to use valid biochemical reactions. Further constraints correspond to imposing appropriate functional constraints. We explain how to perform these randomizations with the help of Markov Chain Monte Carlo (MCMC) and show that they allow one to approach the properties of biological metabolic networks. The implication of the present work is that the observed global structural properties of real metabolic networks are likely to be the consequence of simple biochemical and functional constraints.


Introduction
Social networks exhibit ''small world'' characteristics [1,2]; food webs have hierarchies coming from trophic levels [3]; gene networks have small in-degree, broad out-degree, and contain strongly over-represented motifs [4]. These kinds of ''remarkable'' features distinguish natural or even human made networks from random graphs [5,6,7]. However, comparing the networks arising in these different systems to random graphs is unsatisfactory because it ignores all potentially relevant underlying factors that constrain these networks. One should also ask whether these networks are remarkable given where they come from, taking into account the known factors which shape them. A way to address this issue is to perform graph randomization. The most commonly used such approach for biological networks is based on performing edge exchanges [4,8,9]. This algorithm (illustrated in Figure S1) by construction preserves the network's degree distribution exactly.
Our focus in the present work is on metabolic networks. Previous studies have revealed that metabolic networks of living organisms are highly structured. For example, the degree distribution of the metabolites in these networks has a power law tail [10,11]. Metabolic networks seem to have further remarkable features such as a high level of clustering [12]. However, to claim that such features are remarkable, one has to use a benchmark. The use of random graphs has the drawback of ignoring the special nature of the degree distribution. If instead the comparison is made using the edge exchange algorithm, one is confronted with a serious conceptual problem that is specific to metabolism: the randomized ensemble contains meaningless reactions (cf. Figure S1). That is because the edge exchange procedure ignores all biochemistry, and in particular the fact that most biochemical reactions correspond to adding or removing small groups. Furthermore, this naive randomization corresponds to using ''random'' reactions which will not balance mass, charge and even less atomic elements; clearly this is enough to cast a doubt on the relevance of such a procedure. To overcome this problem, we have little choice but to force the reactions to have a minimum of realism; that can be done by using reactions known to arise in various organisms or in vitro. This corresponds to the first level of ''constraints'' that should be imposed when randomizing metabolic networks. Other levels can be introduced based on functionality. For instance, to understand the differences between the metabolic network of a given organism and ''what might have been expected'' in other realizations, one may appeal to the fact that organisms are alive, eat, reproduce etc. The purpose of the present work is to show how, within metabolic networks, one may introduce randomized ensembles; these ensembles can be used as benchmarks, allowing one to meaningfully ask whether a given organism's metabolic network is particularly atypical.
The outline of the paper is as follows. We first examine the fat tail in the metabolite degree distribution when using realistic biochemical reactions and investigate the source of this tail. Then we address the randomization problem in metabolic networks and introduce network ensembles subject to increasing levels of constraints. We study the structural properties of metabolic networks in these ensembles, including the clustering coefficient and sizes of the strong components. These results are discussed in the following section, while detailed methods are provided in the last section.

Degree distribution and the KEGG_Hybrid set of reactions
Given a metabolic network such as illustrated in Fig. 1(a), the degree of a metabolite is the number of reactions in which it participates. With the availability of genome-scale metabolic networks [10], the metabolite degree distributions in a number of organisms have been determined. A striking result is that all organisms show metabolite degree distributions with fat tails well described by a power fall-off [10,11]. Furthermore, the power of this tail varies very little from one organism to another, being always close to 2.2 [10]. Clearly there exist metabolites involved in a large number of reactions; examples include ATP (which provides the transfer of a phosphate group), NADH (which provides the transfer of electrons) etc. This behavior is in fact typical of all metabolites of high degree: they transfer small groups and are therefore generally referred to as ''currency'' metabolites [13,14]. Because these currency metabolites arise in so many reactions, one can expect nearly all living organisms to produce them. If this occurs, one also expects a similar power law to arise in the metabolite degree distribution in different organisms.
To address this point in a quantitative framework, we ask what would be expected in a ''random'' organism, that is, in one using random biochemical reactions? One could introduce artificial reactions in which randomly chosen metabolites would be transformed into others; however this would not preserve atomic species, and even if one could enforce conservation, it would nearly always lead to reactions which have no existence. A more suitable approach is to restrict ourselves to biochemically realizable reactions. We used a database of such reactions compiled by Rodrigues and Wagner [15]. These authors combined the reactions in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [16] with those of the E, coli iJR904 metabolic model [17] and curated the resulting set (see Materials and Methods). This leads to a total of 5870 reactions and 4816 metabolites; we shall refer to this list of reactions and metabolites as KEGG_Hybrid. Given this list of possible reactions and metabolites, Fig. 2 shows the degree distribution of the metabolites within the database on a log-log scale. The power law is clearly a good approximation; fitting these data using the method in [18,19] gives an exponent of 2.31. This value is close to the exponent for the E. coli genome-scale metabolic network [17] which is 2.17; the corresponding distribution is also displayed in Fig. 2.
The metabolic network of E. coli has far fewer reactions than the 5870 in the KEGG_Hybrid database, and so the maximum degree in E. coli must be smaller; this is visible in Fig. 2. Furthermore, our objective is to compare E. coli to ''random'' organisms so we should force the number of reactions to be the same as in E. coli, allowing any of the biochemical reactions in KEGG_Hybrid. This defines a simple ensemble of possible metabolic networks where the biochemical constraint of using real reactions is enforced. We have thus generated 1,000 random genomes (lists of n reactions chosen at random in KEGG_Hybrid, n = n E = 831 being the number of reactions of the in silico E. coli metabolic network) and computed the degree distribution in this ensemble. The result is displayed in Fig. 2 with the label ''Random'', and the distribution again seems to follow a power law but with a slightly higher exponent, around 2.51. Thus this ''Random'' ensemble leads to metabolic networks whose metabolite degree distribution has characteristics rather similar to those of E. coli.
The similarity of the three distributions in Fig. 2 may seem remarkable, but upon reflection it can be understood as follows. The highest degree metabolites in the KEGG_Hybrid set participate in many reactions. Thus they are most likely very important biochemically, so they should be present in E. coli. Quantitatively, we have checked this: among the metabolites of degree at least 50 in KEGG_Hybrid, 94% are also present in E. coli. Furthermore, this same pattern is expected in the Random ensemble simply because choosing reactions at random gives a higher probability of incorporating metabolites that participate in many reactions. Again this can be tested explicitly: for any of the metabolites that have degree at least 50 in KEGG_Hybrid, the Random metabolic networks include them with probability above 0.99. Interestingly, the biochemical nature of these high degree metabolites is quite specific: they are categorized as ''carriers'' or as ''precursors'' in the biochemical literature [13]. Tanaka and Doyle [14,20] have investigated the degree properties of these two classes and found that indeed they are the contributors to the fat tails in the degree distribution for different organisms, but we see here that this also holds for the KEGG_Hybrid set and for our ''Random'' ensemble.
All the above concerns the degree of the metabolites. The same kind of analysis can be performed for the degree of the reactions, where the degree of a reaction is given by the number of its substrates (metabolites it involves). In contrast to metabolites, reactions do not have high degrees: a typical reaction will involve just a few metabolites, the most frequent number being 4, and very rarely will there be more than 6. This situation is illustrated in Fig. 3 where we also distinguish the different kinds of metabolites (see Materials and Methods for details). One sees from Fig. 3 that reactions typically involve a currency pair, sometimes two, but that almost always there are at most 3 metabolites per reaction that are not of the currency type.
Ensembles: from implementing biochemical realism to allowing for functional constraints The ''Random'' ensemble is a first way to define randomized metabolic networks: it takes into account both the need to use meaningful biochemical reactions and the number of reactions in the genome of interest. We shall consider that the metabolic network (of an organism or of a randomized organism) is specified by its set of enzyme coding genes, and we shall refer to this set as its ''genotype''.
The approximate power law tail found in the degree distribution of metabolites in living organisms can be traced to the contribution of currency metabolites (cf. Fig. 2). However, numerous other statistical properties of biological metabolic networks set them aside from those in the Random ensemble; for instance their number of metabolites is significantly lower, they have fewer ''blocked'' [21,22] reactions (reactions that cannot sustain flux for instance because they are not connected to other reactions), etc. Just as we fixed the number of reactions n in the genotypes forming the Random ensemble, it is appropriate to  include further ''macroscopic'' constraints to refine the randomization ensembles. For each added constraint, one can expect the statistical characteristics in the ensemble to become closer to those of living organisms, but the hope is that just a few relevant constraints will be sufficient to have a quite satisfactory randomized ensemble. Beyond the constraints already mentioned, there is the simple fact that metabolism of living organisms allows them to grow and reproduce. Although these are incredibly complex tasks, genome-scale metabolic network models [23,24] take into account the realizable fluxes through biochemical reactions and the possibility that a given set of reactions (catalyzed by enzyme coding genes) may produce all the biomass components necessary for cell growth. In our most constrained ensemble, we shall thus enforce the ''functional'' constraint that the genotype's metabolic network allows for production of these biomass components. The simplest ensemble is the one previously introduced under the term ''Random'' and we shall label it R because it is simply based on using a fixed number of random reactions in KEGG_Hybrid. Adding the constraint of the number of metabolites gives the ensemble we label RM and so forth. We now describe these successive ensembles and the computational tools used to sample them. We shall then examine the statistical properties of the networks in each of these ensembles and see the effects of successively adding these constraints.
The ensembles R, RM, and uRM. The allowed reactions will always be taken from the aforementioned KEGG_Hybrid list. This guarantees that every reaction satisfies atomic conservation laws. Also, these reactions are either reversible or irreversible, and this is taken into account in the modeling. The first ensemble R constrains the number of reactions; it consists of genotypes having exactly n reactions in the KEGG_Hybrid list where n is the number of reactions of the ''reference'' organism which one wants to benchmark in a randomization test; for specificity, the reader can think of this reference organism as being E. coli. The second ensemble further constrains the number of metabolites; the number of metabolites m in a genome is obtained by counting all the distinct metabolites associated with the reactions in that genotype. In practice, the sampling is simpler if one constrains m to be in a range; we shall use m#m E where m E is the number of metabolites in the reference organism E. coli. We denote this second ensemble which imposes two constraints by RM. (Note that to go from a sample of m#m E to one with m = m E , it is sufficient to use the subsample satisfying m = m E .) The third ensemble, denoted by uRM, restricts the nature of the reactions in KEGG_Hybrid that we permit ourselves to consider. The motivation for this restriction comes from the fact that KEGG_Hybrid includes many reactions involving ''exotic'' metabolites which are involved with just one reaction. In such cases, those reactions will necessarily be isolated and thus ''afunctional'' biochemically; unless the associated metabolites are part of the biomass, such reactions would have no reason to be kept in a biological organism. (Note that some of these cases may be due to errors or missing reactions in KEGG_Hybrid; these limitations are expected to be resolved in the not so distant future as databases get improved.) A similar situation arises when a reaction requires a metabolite that can only be produced in particular chemical environment that we do not consider; such a reaction will then be blocked. Our working definition of a ''blocked'' reaction is based on the possible fluxes it can sustain in the steady state; if a reaction is guaranteed to be never used in such conditions, then it is considered as blocked and removed from the KEGG_Hybrid list. In practice, for all the reactions in KEGG_Hybrid we determine whether they are blocked [21,22] (cannot sustain non zero flux; see Materials and Methods for the details); the reduced set of unblocked reactions is then used as input for constructing genotypes. The nomenclature uRM of the ensemble indicates that we use unblocked reactions with constrained numbers of reactions and metabolites.
The ensembles uRM-V1, …, uRM-V10. The essence of living organisms is growth and reproduction. Metabolism plays a central role therein, transforming various nutrients brought in from outside the cell into primary metabolites (amino acids, nucleic acids, fatty acids, etc). These are then used as bricks for building proteins, DNA, lipids, etc. Genome-scale metabolic models provide a tested framework to relate genotypes (lists of enzyme-coding genes) to metabolic capabilities and phenotypes [23,24]. The framework, often called FBA for ''Flux Balance Analysis'', allows one to compute the possible flux distributions through all the reactions assuming the metabolic network is in the steady state [23,24,25]. One may ask whether all the biomass compounds can be produced given a chemical environment, i.e., a set of nutrients defining allowed input fluxes into the cell. If a genotype's metabolic network satisfies this constraint, we say that the genotype is ''viable'' on that chemical environment because the in silico FBA predicts that the cell can grow given those nutrients. This is illustrated in Fig. S2.
There are many possible choices of nutrients; clearly one needs sources of all major elements (H, C, O, N, S, P). It is common practice to focus on the carbon utilization because it is often limiting; we thus work with ''minimal'' environments having a single carbon source. In the lab, it is easy to test whether a microorganism grows on a whole panel of different environments, and the corresponding list of growth/no-growth results is referred to as the growth phenotype of the organism. This growth phenotype can be considered a constraint to impose on a randomization. In this spirit, we consider a succession of ensembles associated with viability on multiple environments. Specifically, given an organism like E. coli and its growth phenotype, we can consider the random genotypes that have the same in silico growth phenotype (as predicted by FBA). The associated ensemble thus takes into account viability constraints. These constraints can be considered as being imposed successively: one can force growth first on one chemical environment, then on two, then on three, and so on. We refer to these ensembles as uRM-V1, uRM-V2, uRM-V3, etc. Interestingly, the first step, namely going from uRM to uRM-V1, turns out to be the most stringent as the ones thereafter give rise to only rather small changes.
MCMC sampling of each ensemble. The ensemble R can be sampled by drawing genotypes with the correct number of reactions, but the constraints inherent to the other ensembles do not allow such a simple procedure. We thus resort to Markov Chain Monte Carlo (MCMC) as a way to sample each ensemble. This requires obtaining an element of the ensemble as a starting point, and then performing random walks within the ensemble. Each trial step involves doing a reaction swap (exchanging a reaction in the genotype with one that is not) to respect the constraint of having a fixed number of reactions. Then the different constraints of the ensemble of interest are checked; if they are satisfied, the trial step (to a new genotype) is accepted, otherwise it is rejected. The schematic representation of this is given in Fig. 4 while the detailed procedures are given in Fig. S3 for the ensemble uRM-V1. Note that by construction, in each ensemble, all of the elements are equiprobable; our ensembles are then just nested sets of genotypes satisfying increasing numbers of constraints. The reduction in size of these sets as each constraint is added can be extremely severe. For instance, when going from all reactions to unblocked reactions, over half of the reactions are discarded, leading to a reduction of the genotype space by approximately a factor 2 n . Similarly, it was shown in previous work [26] that including the viability constraint on the first chemical environment leads to a reduction by at least a factor 10 22 . These numbers drive home the necessity of using MCMC for sampling: direct random sampling is hopelessly inefficient.

Metabolic network statistical properties
Genetic diversity. In any of our ensembles sampled by MCMC, two genotypes taken at random will share some reactions and differ in others. For instance, in the ensemble uRM-V10, we find a total of 106 specific reactions that are necessary for any genotype to have nonzero biomass flux. These 106 reactions are then present in every genotype of the ensemble uRM-V10. Nevertheless, two genotypes taken at random in this ensemble tend to be rather different. Specifically, if one takes two genotypes G 1 and G 2 at random in uRM-V10, we find that on average G 2 will have more than 50% of its reactions that are not in G 1 . This also holds true when we compare a random genotype to that of E. coli. In the ensembles with fewer constraints (eg. uRM-V1), the level of dissimilarity between random genotypes is even higher. Thus, in spite of some shared reactions in all genotypes, the genotypes in our ensembles are not very similar to each other or to E. coli: the ensembles have a high level of genetic diversity.
Global topological properties. To each genotype is associated a list of reactions and their corresponding metabolites; the whole can be represented by a bipartite graph (cf. Fig. 1(a)). From this graph, one may form a reduced graph for only the metabolites, or one for only the reactions; these graphs are called the metabolite-metabolite graph and the reaction-reaction graph respectively. (See the Materials and Methods section for the associated procedures.) It is appropriate to emphasize that the bipartite graph representation of the metabolic network contains more information than its associated unipartite graph representation.
For each genotype generated and saved in the different ensembles, we have constructed its metabolite-metabolite graph. Then we measure several of the standard structural properties of that (directed) graph. These are as follows. (1) The clustering coefficient C which roughly is a measure of the frequency of triangles in the network. (2) The average path length L between randomly chosen nodes. (3) The probability P C that two randomly selected nodes A and B are connected by a directed path from A to B; this gives an indication of whether a metabolite is involved in another's production. (4) The size of the largest strong component (LSC) which measures the connectivity of the network. In a directed graph, a strong component is defined as a maximal subgraph such that there exists a directed path between any two of its nodes; in the case of an undirected graph, it is then just a maximal connected component. We focus on the largest of these strong components in this work. (5) The ''IN'' (respectively the ''OUT'') sub-graph for a given strong component is the set of nodes for which there is a directed path to (respectively from) the strong component [27]. We shall monitor the union of the largest strong component (LSC) and its associated IN and OUT parts.
For each of these indicators of graph structure, we have computed the mean values within the different ensembles for the metabolite-metabolite graph, and have also determined the value for the graph associated with the E. coli genotype. In Fig. 5 we display as a function of the increasingly constrained ensembles three structural quantities related to connectivity: the average of P C , the average size of the LSC, and the average size of LSC+IN+OUT. To the right of the bar associated with uRM-V10 we show the value for E. coli. Clearly the first constraints strongly affect the structure of the metabolic networks, while increasing the number of environments on which one forces viability gives rise only to modest changes. When considering the other structural properties of networks in the ensembles, we see that for the clustering coefficient, the first constraint (going from R to RM) is the most important (cf. Fig. S4). For the average path length, already at the level of R one has quite good agreement with the value in E. coli, just as was the case for the degree distribution (cf. Fig. S5). Thus we have ensembles of randomized metabolic networks that are good benchmarks of comparison for the biological network.
Functional constraints shape global network structure.
The trends described above can be summarized by following the joint statistics of the structural properties in the ensembles as one adds successive constraints. This is illustrated in Fig. 6 for three of the structural properties. Each ensemble is represented by 1000 of its genotypes and for clarity we have displayed only three of the ensembles. We see a systematic change in the structural properties as constraints are added, and that the three clouds associated with the constraints represented here have little overlap. Note that by construction our ensembles are actually embedded sets of genotypes; each added constraint reduces the genotypes allowed. Such a hierarchical structure does not prevent the clouds in Fig. 6 from being rather well separated. MCMC allows one to sample this tiny fraction by generating a random walk restricted to the genotypes in the ensemble of interest. The MCMC starts with the E. coli genotype (shown in the figure as G 0 ) and proceeds as follows. At each trial step, a modified genotype is generated by applying a reaction swap to the current genotype. If the modified genotype satisfies the constraints of the ensemble, the trial move is accepted (shown in the figure as blue arrow) with the modified genotype becoming the next genotype of the walk. If the modified genotype does not satisfy the constraints of the ensemble, the trial step is rejected (shown in the figure by red arrows) and the walk stays at the previous genotype for that step. The advantage of using reaction swaps in our approach is that it leaves the number of reactions constant over time. The genotypes on the boundary of the large circle are in the neighborhood of genotype G k and differ from it by a single reaction swap. doi:10.1371/journal.pone.0022295.g004 Furthermore, the trend with addition of constraints very clearly brings the clouds closer to the point representing the structural properties of E. coli. This was visible too at the level of the individual observables (cf. Figs. 5 and 6).
While computing the above mentioned structural properties, the construction of the metabolite-metabolite graph plays a key role. However, the unipartite metabolite graph construction relies on a classification of metabolites into currency and non currency  metabolites, and such a classification is not clear-cut. Indeed some currency metabolites have a carrier role in some reactions and a non carrier role in others. To check that our conclusions are not sensitive to some level of arbitrariness in the classification scheme, we have repeated the whole calculation for a modified set of currency metabolites, removing 20 currency metabolites from the original list in Table S1. We display in Fig. S6 the analog of Fig. 6; the difference between the two figures is hardly detectable by eye, showing that the trends found above are robust.

Discussion
Over the past decade, genome-scale metabolic networks have been constructed for several organisms. In all cases, the metabolite degree distribution exhibits a fat tail compatible with a power decay [10,11] of exponent around 2.2. We found that this behavior can be traced to the metabolite degree distribution in the set of all known biochemical reactions as given for instance in KEGG. The fundamental source of these fat tails is the large number of currency metabolites that transfer small groups in nearly all real biochemical reactions. It is essential to take into account this fact when testing whether biological metabolic networks have unexpected features. The commonly used edge exchange algorithm has the desirable property of preserving the network degree distribution but it is inappropriate because the procedure introduces fictitious reactions having no meaning. Any sensible testing framework should force the benchmark (the randomized ensemble) to incorporate real biochemical reactions. We showed that this could be done in practice by using a database of real biochemical reactions compiled from KEGG and iJR904. In this framework, we found that the degree distribution of metabolites had a fat tail very similar to what is seen in real organisms.
One may ask whether the observed fat tail is an artifact of the KEGG database itself which summarizes the reactions in today's organisms. It is quite possible that other reactions and cofactors can act as substitutes to the ones occurring in KEGG, and thereby affect the degree distribution. However, it is likely that selection pressures act against such substitutes, for instance because of efficiency of catalysis or availability of molecular species. In effect, the use of KEGG reflects evolutionary constraints in addition to the purely biochemical ones. Thus, the present work is relevant for natural organisms but much less so for synthetic ones.
Another caveat associated with this study is the bias arising due to incompleteness of the KEGG database. Firstly, KEGG is missing transporters of less studied organisms; as a consequence a number of reactions appear to be blocked. Secondly, many biosynthesis pathways are incomplete; this is especially true for rare (or poorly understood) pathways. However, both of these biases can be expected to have only mild consequences within our study. Indeed for our choices of chemical environments, the curation of genome-scale models of different organisms has filled the gaps for transporters. Furthermore, our use of the E. coli biomass reaction formula makes our growth phenotypes insensitive to missing reactions as long as they arise in rare biosynthetic pathways.
Looking at structural properties beyond the degree distribution, we found that the ensemble R showed significant differences with the biological case (where the ensemble R corresponds to choosing randomly a given number of reactions in the database KEGG_ Hybrid). Since understanding the topological properties of networks can give insights into their structure-function relationship, it is appropriate to refine the benchmark ensemble. Thus, we successively added further global constraints, in particular by enforcing metabolic capabilities, in this context biomass produc-tion. Adding such functional constraints takes into account the growth properties of living organisms and thus the ''macroscopic'' forces which shape biological metabolic networks. We find that by adding biochemical and functional constraints, the structural properties of the random networks in our ensembles become very close to what is seen biologically as illustrated in Fig. 6; that this is possible without taking into account any microscopic properties is really remarkable. Depending on the structural feature considered, we find that some features emerge relatively ''early'', that is follow from fewer macroscopic constraints than others.
Perhaps most strikingly, these trends occur within ensembles that maintain a high level of genetic diversity. Indeed even in our most constrained ensemble, uRM-V10, the metabolic networks show large differences in reaction usage. Quantitatively, two randomly chosen networks in the ensemble uRM-V10 will typically differ in half of their reaction content. As a cautionary note, it is important to stress that the observed trends here concern global structural measures commonly used in general network analysis. One cannot exclude the possibility that consideration of metabolism-specific observables based for instance on fluxes may lead to a different picture.
In conclusion, the present work indicates that the observed global structural properties of metabolic networks in living organisms are likely to be consequences of the simplest biochemical and functional constraints. Such a possibility has been previously suggested [28,29] but remained in the spirit of a conjecture; we hope that the direct computational evidence provided in this work will transform conjecture into paradigm.

Materials and Methods
Biochemical reaction sets KEGG_ _Hybrid reaction set. We have used a hybrid database compiled by Rodrigues and Wagner [15] containing 4816 metabolites and 5870 biochemical reactions for this work. This database of 5870 reactions was compiled by merging the Kyoto Encyclopedia of Genes and Genomes (KEGG) LIGAND reaction database [16] with the E. coli genome-scale metabolic model iJR904 [17], followed by appropriate pruning to exclude elementally imbalanced and generalized polymerization reactions [15]. Of the 5870 reactions in the hybrid database, 3369 are irreversible and 2501 are reversible reactions. Also, more than 5500 reactions are contained in the KEGG LIGAND database and so less than 300 reactions are specific to the E. coli genomescale metabolic model iJR904. In this work, we will refer to the set of 5870 reactions contained in the hybrid database [15] as ''KEGG_Hybrid reactions''.
The hybrid database also contains transport reactions for 143 external metabolites in the E. coli iJR904 metabolic model; these can be used to transport such metabolites across the cell boundary. The 143 external metabolites were taken to be the set of possible uptake and secreted metabolites in the network. Further, an objective function Z in the form of a biomass reaction, that requires synthesis of all biomass components of E. coli, as defined in the iJR904 model [17], is also included in the hybrid database. The biomass reaction is used to determine the viability of a network.
Unblocked KEGG_ _Hybrid reaction set. Genome-scale metabolic networks typically contain ''blocked'' reactions that can have only zero flux in every investigated chemical environment under any steady state condition [21,22]. Such blocked reactions cannot contribute to the steady state flux distribution. With the set of 143 external metabolites in the E. coli iJR904 model, we found 2968 of the 5870 reactions in the hybrid database to be blocked under all environmental conditions [26]. We have excluded the 2968 blocked reactions from the set of 5870 reactions in the hybrid database to arrive at a reduced reaction set of 1597 metabolites and 2902 reactions. We refer to this reduced set of 2902 reactions in this work as ''Unblocked KEGG_Hybrid reactions''.
The E. coli metabolic network. The E. coli metabolic model iJR904 [17] contains 931 reactions which are also part of the hybrid database. After having excluded the 2968 blocked reactions from the hybrid database, the unblocked reaction set of 2902 reactions still contains 831 reactions of the E. coli iJR904 model. In this work, we refer to this set of 831 reactions as the E. coli metabolic network.

Graph-theoretic representations of metabolic networks
The metabolic network can be represented as a directed bipartite graph built up of two types of nodes, metabolites and reactions, connected by two types of links. We can distinguish reactant metabolites from product metabolites of a reaction as follows: A link from a metabolite node to a reaction node specifies a reactant while a link from a reaction node to a metabolite node specifies a product. Note that in a bipartite graph, links between similar types of nodes are forbidden. It is important to differentiate between reversible and irreversible reactions in the network. In Fig. 1(a) we have used the bipartite representation to show three reactions in the glycolytic pathway.
Starting from a directed bipartite graph of metabolites and reactions, we can construct an associated directed unipartite graph of metabolites, referred to as a metabolite-metabolite graph. It summarizes the metabolic network structure by assigning links from reactant metabolites to product metabolites in each reaction. In the simplest definition, two metabolites will be ''neighbors'' (connected by a link) if and only if they appear in at least one common reaction [11]. However, a sizeable fraction of metabolites in the network have quite high degree, so this construction leads to very dense graphs whose statistical properties are dominated by the role of the currency metabolites. To overcome this problem and also maintain biochemical relevance, we construct the metabolite-metabolite graph by first removing the currency metabolites, and then assigning links from reactant metabolites to product metabolites in each reaction [13,14]. This representation has the advantage that the (directed) link between two metabolites signifies transformation of one into the other. For reversible reactions, the links between metabolites appear in both directions. See Fig. 1(b) for an illustration.
The different treatment of currency vs. non currency metabolites is based on the fact that biochemical reactions most often consist of adding or removing a small group (proton, phosphate, methyl, etc) of a large compound. Currency metabolites are the co-factors responsible for such transfers, and they are quite ubiquitous. Examples of currency metabolites include ATP, ADP, NADH, NAD + , H 2 O, H + , Pi that are normally used as carriers for transferring electrons or certain functional groups such as phosphate group, amino group, methyl group, one carbon unit, etc. In our construction of the unipartite graph, we omit links arising due to presence of currency metabolites in each reaction. In Fig. 1(b), we show the unipartite graph corresponding to the bipartite graph shown in Fig. 1(a) for the three reactions in the glycolytic pathway. The list of currency metabolites used in our work was based on that in the paper by Ma and Zeng [13] and is given in Table S1.

Structural properties of metabolic networks
Metabolite degree distribution. The degree of a metabolite i (denoted by k i ) is the number of reactions in which the metabolite i participates either as a reactant or a product in the network. The metabolite degree distribution P(k) is defined as the probability that a randomly selected metabolite node participates in exactly k reactions in the network. We use the bipartite graph representation of the metabolic network to compute the metabolite degree and degree distribution.
In Fig. 2, we have displayed several metabolite degree distributions after applying logarithmic binning. It is seen that the metabolite degree distributions approximately follow a power law, P(k) ,k -c [5], and the degree exponents c were extracted by using the maximum likelihood estimate method [18,19] recently proposed by Newman and colleagues rather than by fitting the binned data.
Reaction degree distribution. The degree of a reaction is the number of substrates that participate either as a reactant or a product in it. The reaction degree distribution P(k) gives the probability that a randomly selected reaction has exactly k substrates in it. We use the bipartite graph representation of the metabolic network to compute the reaction degree and degree distribution. Fig. 3 shows this distribution in the KEGG_Hybrid database.
Clustering coefficient. The clustering coefficient quantifies the extent to which the neighbors of a node in a graph are connected to each other [1]. The global clustering coefficient of a graph measures the fraction of triangles among the connected triples [30]. It is given by: where N D is the number of triangles and N 3 is the number of connected triples in the graph. In this work, we have computed the clustering coefficient for each network in our ensemble using the unipartite metabolite graph representation. Note that when computing the clustering coefficient, the graph is considered undirected.
Path length and connectivity. The average path length ,L. is a measure of the overall navigability in a network. It is defined as the average length of the shortest paths between all pairs of nodes in the directed unipartite metabolite graph. When computing the average path length for a disconnected graph, one considers only the node pairs for which a directed path exists. We have also computed the probability P C that a directed path exists between any two nodes in the unipartite metabolite graph. The clustering coefficient C, average path length ,L. and probability P C that a path exists between two nodes in a graph were computed using the igraph library [31].
Largest strong component. Given a directed graph, a strongly connected component is a maximal set of nodes such that for any pair of nodes i and j in the set there is a directed path from i to j and from j to i [32]. In general, a directed graph may have one or many strong components. The strong components of a graph are disjoint sets. The strong component with the largest number of nodes is designated as the largest strong component (LSC). The associated IN component consists of nodes which have access to LSC nodes via some directed path, but lack access from LSC nodes back to them via any directed path. The OUT component consists of nodes which can be accessed from the LSC nodes via some directed path, but lack access to LSC nodes from them via any directed path. Note that the so-called ''bowtie'' architecture of networks is based on these LSC, IN and OUT components; that architecture has been observed both in the World Wide Web (WWW) [27] and in bacterial metabolism [33,34]. In this work, we have computed the fraction of nodes in the largest strong component (LSC) and in the union of LSC, IN and OUT components for networks in our ensembles using the directed unipartite metabolite graph representation.

Genotype-to-phenotype map
A metabolic network genotype is any subset of reactions taken from the global reaction set itself consisting of N reactions. A simple representation of a metabolic network genotype G uses a binary string of length N, e.g., G~(b 1 ,:::,b N ), with each reaction i being either present (b i = 1) or absent (b i = 0) (see Fig. S2 for an example). Each randomized network in our ensemble can be thought of as one genotype existing in a vast genotype space of possible metabolic networks. For any genotype, we can use flux balance analysis (FBA) [23,24,25] to determine whether the corresponding network has the ability to synthesize all biomass components in a given chemical environment or medium. FBA primarily uses information about the stoichiometry of reactions in the network to obtain a prediction for the steady-state fluxes of all reactions and the maximum possible biomass synthesis rate. The predictions of FBA and related approaches are generally in good agreement with experimental results [35,36].
We consider a genotype to be ''viable'' in a given chemical environment if and only if its maximum biomass flux predicted by FBA is non-zero (see Fig. S2). Otherwise, we consider the genotype to be non-viable. We use FBA and the E. coli biomass composition [17] to determine viability of a genotype in different chemical environments corresponding to minimal media. Specifically, we consider only minimal environments that contain a limited amount of a carbon source, along with unlimited amounts of the following inorganic metabolites: oxygen, water, protons, sulfate, ammonia, pyrophosphate, iron, potassium and sodium. Here, we have considered 10 carbon sources: glucose, acetate, succinate, pyruvate, oxoglutarate, glucose-6-phosphate, sucrose, acetaldehyde, glycerol and glycerol-3-phosphate.

Generation of randomized ensembles
Random ensemble R of genotypes with fixed number of reactions. A genotype in the ''random'' ensemble R can be simply generated by uniformly sampling subsets with exactly n E valid biochemical reactions from the KEGG_Hybrid reaction set of N = 5870 reactions, where n E = 831 is the number of reactions in the E. coli metabolic network. Using this procedure, we have generated 1000 genotypes in the random ensemble to compare with the E. coli metabolic network. Our motivation to fix the number of reactions in our genotypes is as follows: The biochemical reactions inside the cell are catalyzed by enzymes which are proteins coded by genes. By fixing the number of reactions in our genotype, we impose the constraint of fixed metabolic genome size.
Ensemble RM of genotypes with fixed number of reactions and metabolites in the KEGG_Hybrid set. The E. coli metabolic network consists of n E = 831 reactions involving m E = 668 metabolites. Though the genotypes in the random ensemble R have exactly the same number of reactions as in E. coli, they typically contain many more metabolites than in the E. coli network. As a next step, we enforce the additional constraint that the genotypes have the same number of metabolites m E as in the E. coli network. Note that we cannot pick a fixed number of reactions at random from the KEGG_Hybrid reaction set if they are to satisfy the additional constraint of fixed number of metabolites. Hence, we use the Markov Chain Monte Carlo (MCMC) method to sample genotypes in the KEGG_Hybrid reaction set with same number of metabolites and reactions as in E. coli.
The MCMC method produces a sequence of genotypes forming a chain, the term ''chain" coming from the property that the (k+1) th element of the sequence is generated from the k th one using a probabilistic transition rule. We start with the E. coli genotype and then propose a small modification in the genotype; if this modified genotype has its number of metabolites # m E (the number in E. coli), one accepts it as the next genotype of the sequence, otherwise the next genotype is identical to the current genotype. In this work, the modification introduced at each transition step is a reaction swap. That is, each modification adds one reaction from KEGG_Hybrid reaction set and removes another reaction from the current genotype, so as to keep the number of reactions n E constant in the genotype. The MCMC thereby produces a walk in the subspace of genotypes of n E reactions and at most m E metabolites. Starting from the initial E. coli genotype, we first carried out 10 5 attempted swaps or Markov chain steps to erase the memory of the starting genotype. After this initial phase, we continued the MCMC procedure to sample genotypes with exactly n E reactions and at most m E metabolites. During this phase, it is not useful to keep all of the genotypes produced because they are strongly correlated. We thus saved only every 1000 th genotype generated, and we did 10 6 steps. We refer to the set of 1000 genotypes with n E reactions and # m E metabolites within KEGG_Hybrid reaction set as the RM ensemble. We find that the procedure is relatively efficient, with an acceptance rate 0.22 that is not small. We also find that a substantial fraction of the networks have in fact m = m E .
Ensemble uRM of genotypes with fixed number of reactions and metabolites in the Unblocked KEGG_Hybrid set. ''Blocked'' reactions can have only zero flux in every investigated chemical environment under steady state conditions, and thus are ''afunctional'' in all genotypes. As a next step, we enforce the constraint that the genotypes in the ensemble are sampled within the Unblocked KEGG_Hybrid reaction set rather than the KEGG_Hybrid reaction set. We refer to this ensemble of genotypes containing the same number of metabolites m E and reactions n E as in the E. coli network within the unblocked reaction set as uRM. We generate the genotypes in the uRM ensemble through a slightly modified MCMC method from that mentioned above to generate the RM ensemble. In this case, at each transition step, we impose a reaction swap to the current genotype that is restricted to the Unblocked KEGG_Hybrid set, i.e., we remove one reaction from the current genotype and add a reaction from the Unblocked KEGG_Hybrid set. The rest of the procedure is exactly the same as above for sampling RM. We have sampled 1000 genotypes in the uRM ensemble with n E reactions and at most m E metabolites.
Ensembles of viable genotypes with fixed number of reactions and metabolites. The E. coli metabolic network has the ability to produce biomass components starting from nutrient metabolites in its environment for growth and maintenance. Thus as a next step, we enforce the additional functional constraint of growth in a chemical environment. We define the ensemble uRM-V1 as that part of uRM in which the genotypes satisfy the functional constraint of non-zero biomass flux in the glucose minimal environment (as determined by FBA; see Fig. S2). We sample the genotypes in uRM-V1 ensemble using a slightly modified MCMC method from that mentioned above to generate the uRM ensemble. In this case, at each transition step, we perform a reaction swap to the current genotype that is restricted to the Unblocked reaction set and accept the swap if the modified genotype satisfies the following two conditions: (a) the number of metabolites in the modified genotype is at most m E , the number in E. coli, and (b) the modified genotype is viable under glucose minimal environment. The rest of the procedure is exactly same as when sampling genotypes in the ensemble uRM; a flowchart of the MCMC algorithm for sampling the ensemble uRM-V1 is shown in Fig. S3. We have sampled 1000 genotypes in this uRM-V1 ensemble.
Since, E. coli is able to survive and grow under diverse environmental conditions (rather than just one chemical environment), we have further generated two ensembles of genotypes satisfying increased functional constraints of (a) viability under 5 specified minimal environments (referred to as the ensemble uRM-V5) and (b) viability under 10 specified minimal environments (referred to as ensemble uRM-V10), respectively. These ensembles can be sampled by a MCMC procedure that is just a slight modification from the one shown in Fig. S3. Figure S1 Edge exchange randomization is biochemically meaningless. The commonly used edge exchange or link permutation procedure for randomizing biological networks is inappropriate for metabolic networks as the method generates fictitious reactions violating balance of mass, charge and atomic elements. Here, starting with two reactions (ASPT: asp-L R fum + nh4; CITL: cit R oaa + ac), we perform an edge exchange associated to metabolites fum and ac that generates two new hypothetical reactions (ASPT*: asp-L R ac + nh4; CITL*: cit R oaa + fum) that violate balance of mass and atomic elements. Note that ac has 2 carbon atoms and fum has 4 carbon atoms. (TIF) Figure S2 Schematic summary of the relationship between genotypes and phenotypes. The genotype specifies the list of reactions in a metabolic network. The phenotype is determined by whether the metabolic network can produce biomass components (growth) in a choice of chemical environments; this condition is computed using FBA. (TIF) Figure S3 Flowchart of the MCMC algorithm to sample genotypes in the ensemble uRM-V1. The Markov chain starts with the E. coli genotype. We then perform 10 5 Markov Chain steps to erase the memory of the initial genotype. After this initial phase, we continue the MCMC procedure to sample the genotype network and save every 1000 th genotype generated. We terminate the Markov chain after saving 1000 genotypes. Note that the length of the run (and the choice of saving frequency) should be long enough to obtain a meaningful and uncorrelated sample of genotypes using this algorithm. (TIF) Figure S4 Clustering coefficient C of the metabolic networks in the different ensembles. Different bars from left to right correspond to network ensembles incorporating an increasing number of constraints and the last bar corresponds to the E. coli metabolic network. The standard deviation is also displayed for each ensemble. (TIF) Figure S5 Average path length ,L. of the metabolic networks in the different ensembles. Different bars from left to right correspond to network ensembles incorporating an increasing number of constraints and the last bar corresponds to the E. coli metabolic network. The standard deviation is also displayed for each ensemble. (TIF) Figure S6 Statistical properties of randomized networks in different ensembles and the E. coli metabolic network using a modified currency list. The three axes are associated with graph characteristics of the networks and are same as in Figure 6. Each cloud represents 1000 randomized networks in the ensemble considered. In order to compute graph characteristics of randomized networks shown in this figure, we have constructed the metabolite-metabolite graph corresponding to each randomized network using a currency list modified from that listed in Table S1. The modified currency list was generated as follows. We first ranked metabolites in the currency list (given in Table S1) based on metabolite degree in the complete reaction database. The lowest degree metabolite in the currency list was designated rank 1. The 20 metabolites of smallest rank in this ranked currency list were then eliminated to generate the modified currency list used for computing graph characteristics shown in this figure. By comparing this figure with its analog (Figure 6), one sees that our conclusions are the same for the two definitions of currency metabolites. (TIF)

Supporting Information
Table S1 List of currency metabolites used to construct the unipartite graph. This list was built mostly from information in the paper by Ma and Zeng (Bioinformatics, 19, 270 (2003)). The metabolites shaded in grey were absent in the modified currency list used to generate Figure S6. (XLS)