Signatures of Arithmetic Simplicity in Metabolic Network Architecture

Metabolic networks perform some of the most fundamental functions in living cells, including energy transduction and building block biosynthesis. While these are the best characterized networks in living systems, understanding their evolutionary history and complex wiring constitutes one of the most fascinating open questions in biology, intimately related to the enigma of life's origin itself. Is the evolution of metabolism subject to general principles, beyond the unpredictable accumulation of multiple historical accidents? Here we search for such principles by applying to an artificial chemical universe some of the methodologies developed for the study of genome scale models of cellular metabolism. In particular, we use metabolic flux constraint-based models to exhaustively search for artificial chemistry pathways that can optimally perform an array of elementary metabolic functions. Despite the simplicity of the model employed, we find that the ensuing pathways display a surprisingly rich set of properties, including the existence of autocatalytic cycles and hierarchical modules, the appearance of universally preferable metabolites and reactions, and a logarithmic trend of pathway length as a function of input/output molecule size. Some of these properties can be derived analytically, borrowing methods previously used in cryptography. In addition, by mapping biochemical networks onto a simplified carbon atom reaction backbone, we find that properties similar to those predicted for the artificial chemistry hold also for real metabolic networks. These findings suggest that optimality principles and arithmetic simplicity might lie beneath some aspects of biochemical complexity.


Introduction
The prominent role of metabolism in any biological process and the fact that a large portion of the environmental factors shaping living systems are ultimately metabolic in nature, suggest that strong selective forces have been acting on metabolic networks throughout the history of life. In laboratory evolution experiments [1][2][3] one can witness mostly short term metabolic adaptations, affecting metabolic enzyme regulation and fine tuning of kinetic parameters. However, especially during major transitions, such as the early stages of life's appearance or the rise of oxygen in the Earth's atmosphere, selective forces must have shaped the metabolic wiring itself [4]. Comparative genomics can provide top-down insight into some long-term evolution of metabolic pathways [5,6]. In addition, studies of prebiotic chemistry scenarios have suggested possible seeds of biochemical organization from a bottom-up perspective [7,8]. Yet, whether the long term evolution of metabolism was dominated by unpredictable frozen accidents, or by inevitable network optimization processes, remains a fundamental open question.
In a 1961 review, Baldwin and Krebs suggested that biochemical network topologies may reflect the adaptation toward optimally efficient metabolic strategies, and that manifold use of certain molecules may be a crucial element of this adaptation, as "it is indeed a general principle of evolution that multiple use is made of given resources." [9]. Some computational studies have proposed that the topology of specific metabolic pathways may have evolved towards maximal efficiency [10], minimal number of steps [11], or that network properties may reflect optimal organization [12]. Here we seek to address this problem by exploring a system that can reach a level of complexity comparable to the one observed in the union of all known metabolic pathways, yet is simple enough to allow efficient computation and analytical calculations. In addition, we wish to explore at an ecosystem-level the potential role of "metabolic multi-tasking", as suggested by Baldwin and Krebs. The increasing evidence of abundant horizontal gene transfer in the history of life suggests that this question may be indeed especially relevant at the ecosystem level, where the interchange of genetic information might have created a free economy of enzymes among simple organisms, allowing for the emergence of species that share common molecular tools [13]. Recent metagenomic studies of microbial consortia [14] also suggest the question of whether metabolic functions, more than individual species distributions, might be directly dependent on environmental conditions. Hence it is possible that hallmarks of metabolic optimality in metabolic network wiring may be observable at the level of global (multi-species) metabolic network [15,16], more than at the individual species level. Do specific molecules, reactions or pathway topologies appear to be universally useful in a biochemical network, i.e., relevant for maximally efficient completion of several possible metabolic tasks, possibly across multiple organisms?
In the present work, we combine the study of an extremely simple artificial chemistry [17][18][19][20] with recent systems biology approaches [21,22] to systematically compute pathways that are optimal for an array of elementary metabolic tasks, converting an input molecule into an output one. Behind the apparent complexity of the ensuing pathways, we identify recurring, modularly organized categories of network topologies, and analytically predictable trends in pathway length. In addition, we observe the emergence of "universal metabolic tools" across all optimal pathways. Finally, despite the huge gap in the underlying chemical rules, we find that some properties of real metabolic pathways are consistent with the patterns detected in the model, suggesting that fundamental optimality principles may have played a role in shaping biochemical networks.

Results
Our artificial chemistry consists of a set of N possible molecules {a 1 , a 2 , a 3 , …, a N }, that can participate in reversible ligation/cleavage reactions of the form a i + a j ↔ a k , with i + j = k. This model could be viewed as the simplest possible string-based artificial chemistry [17]. The reaction network R N that includes all metabolites up to length N and all possible reactions (of the order of 0.25N 2 ) between them ( Fig. 1) can be thought of as the underlying chemistry based on which specialized metabolic tasks could emerge. Here we were concerned with pathways, within the R N network, that can optimally perform a given metabolic task. In particular, we searched for optimal solutions to the problem of producing a specific end-product (e.g., a j , with output flux v out ) from a single available nutrient (e.g., a i , with input flux v in ). We define an optimal pathway as one that satisfies the following conditions: (i) it allows a steady state solution, i.e., a mass-conserving flow from input to output; (ii) it has maximal yield, and no waste [23], such that v out =v in ·j/i; and (iii) it has the fewest reaction steps possible. A pathway satisfying these conditions is termed a minimal balanced pathway (MBP) between a i and a j , and will be denoted a i ⇒ a j . MBPs (also referred to below as optimal pathways) can be thought of as the pathways that are most efficient for a specific metabolic task, in the sense that they require the smallest possible number of different enzymes for producing the maximal possible yield [10,24,25].
Despite the simplicity of our artificial chemistry, identifying the MBPs between all possible input-output pairs in a given artificial chemistry R N is a challenge for large N.
We implement three algorithms to approach this problem: a mixed integer linear programming (MILP) akin to flux balance analysis (FBA) [26]; an algorithm that uses enumeration of elementary flux modes [21]; and finally an iterative algorithm that gradually assembles new MBPs from already identified simple ones (see Methods). The three algorithms differ mainly in their scalability, and in their capacity to predict multiple degenerate solutions (see Table S1). A partial overview of the results of our calculations is shown in Fig. 2A and Fig. S1 (see Table S2 for a comprehensive list of MBPs).
Behind the apparent complexity of the topologies encountered in each of the different pathways, it is possible to observe the recurrence of three fundamental categories: each MBP functions either as a pure "addition chain" [27], where smaller metabolites are progressively added together to build the target molecule, or as an "addition-subtraction chain", in which metabolites are both synthesized and degraded within the pathway.
Addition and addition-subtraction chains are concepts borrowed from the field of cryptography, whose relevance to our question will become apparent later. There is also a third, smaller category of cyclical pathways that cannot proceed unless a certain intermediate molecule is already present in the system. These pathways are autocatalytic cycles (Fig. 2B) that very much resemble autocatalytic cycles found in real biochemistry, such as the reverse TCA cycle [7], or the formose reaction [28]. Our results show that autocatalytic cycles can be simultaneously optimal for multiple tasks (Fig. 2B), suggesting that such types of structure may have a fundamental evolutionary advantage in a biological context. In addition to the recurrence of these topological categories among MBPs, we find that some specific structures are used repeatedly, often in a modular fashion (Fig. 2C). Specifically, many simple MBPs are used hierarchically as a toolkit for the construction of progressively more complex MBPs (data not shown), similar to what has been observed in real metabolic networks [29,13,30].
This modular architecture of recurring graph types provides a topological signature of optimally efficient pathways in our idealized chemistry. Since these pathways are chosen based on their minimal length, one may expect that a systematic analysis of all MBP lengths will display additional distinctive properties. Indeed, pathway lengths increase roughly logarithmically with the size of the input (or output) molecule (Fig. 3), with superimposed sharp jumps. For example, the task a 9 ⇒ a 6 can be performed in 2 steps, but the neighbor task a 9 ⇒ a 7 requires a minimum of 6 steps. Moreover, while most MBPs have only one or a few optimal realizations, selected instances display a peak in possible redundant solutions (Fig. S2), usually due to interconversions between molecules of similar size (e.g., a x ⇒ a x+1 ), or to the inherent complexity of a specific molecule (e.g., a 7 ⇒ a j ). These regular patterns suggest that it may be possible to reproduce the MBP length curves without having to actually compute the MBPs.
A similar search for patterns associated with minimal steps had been previously encountered in the mathematics of addition-subtraction chains, of high importance in cryptography [27]. These are integer sequences, beginning with 1, in which the i-th entry is either the sum or difference of any two previous entries in the sequence. These chains are often used in calculating large exponents of numbers [31]. For example, calculating n 128 can either be performed in 127 multiplications (n × n = n 2 , n 2 × n = n 3 ,…, n 127 × n = n 128 ) or in a chain of 7 exponent multiplications (n × n = n 2 , n 2 × n 2 = n 4 , n 4 × n 4 = n 8 ,…, n 64 × n 64 = n 128 ). The latter can be further simplified by tracking the sums of the exponents in each calculation, which form an addition chain (1,2,4,8,16,32,64,128).
Shortest addition-subtraction chains are commonly used to calculate very large numbers in the fewest number of steps, thus speeding up computation time. These are often applied to methods in cryptography where the calculated exponents can be on the order of thousands to tens of thousands of bits [32].
The pathways explored in our model resemble optimal addition-subtraction chains. For example, the problem of obtaining a 128 from a 1 is formally equivalent to the addition chain example described above. However while typical addition-subtraction chains start with the number 1, in our MBPs we explore minimal paths starting from any molecule a i (i ≥ 1). As described in detail in the Methods, we extended previous work on additionsubtraction chains [27,33] to derive the following analytical estimate of the length of MBPs : where L(i, j) is the number of reactions in the MBP with input a i and output a j , and gcd(i, j) is the greatest common divisor of i and j. As seen in Fig. 3, Eq. 1 reproduces the corresponding pathway lengths obtained by computing individual MBPs. This agreement implies that the number of reaction steps needed to construct an efficient metabolic pathway between two metabolites in our artificial chemistry can be roughly estimated from Eq. 1. The only feature that determines the pathway lengths is the complexity of the input and output molecules.
We can now ask whether similar minimal pathway length signatures are discernible in real metabolic networks. To cope with the gap in complexity between our model and real chemistry, we mapped real metabolic networks onto a single atom backbone [11,12]. For example, the aldolase reaction, which cleaves fructose-1,6-bisphosphate (C 6 H 14 O 12 P 2 ) into dihydroxyacetone phosphate (C 3 H 7 O 6 P) and glyceraldehyde-3-phosphate (C 3 H 7 O 6 P), can be mapped onto a carbon atom backbone, becoming simply C 6 ↔ C 3 + C 3 (see Methods). This reaction is now formally analogous to the a 6 ↔ a 3 + a 3 reaction in the idealized chemistry. Upon performing this mapping onto a carbon atom backbone, we ask whether the structure of real metabolic networks allows interconversions that use the minimal, logarithmic number of steps found for the artificial chemistry ( Fig. 3 and Eq. 1).
Specifically, we identified all shortest pathways between any two carbon compounds in Escherichia coli's metabolic network. This was performed using two methods. The first was an explicit use of elementary flux modes as done in the artificial chemistry. As with the artificial chemistry method, this has the advantage of finding all of the shortest pathways that connect any two carbon compounds, but is limited in computational scope to a smaller network. Because of this limitation, we used the network of E. coli's central carbon metabolism [34,35], modified to remove cofactors and reactions that do not affect carbon transfer (see Methods). After finding all minimal elementary flux modes that connect every pair of carbon compounds in the network, we reduced those compounds to their carbon content alone, as described above. We determined, for each input compound,  (Fig. 4). This last fact, as discussed later, may be due, for example, to energetic constraints, or to the higher complexity of real organic chemistry.
The second method is aimed at identifying all shortest pathways between any two carbon compounds in the whole genome-scale metabolic network of Escherichia coli [36], for which it is still infeasible to apply the elementary flux mode analysis. For this, we implemented a heuristic approach to analyze the set of shortest pathways between every pair of metabolites. We first determined, for each input compound, the minimal path length to reach its closest molecule with j carbons; then, for each value of i, we averaged these path lengths over all input molecules with i carbons. The results (Fig. 3) show that these E. coli minimal path lengths approximately follow the predicted logarithmic trend.
For some curves (e.g. the one with C 5 as an input), the specific peaks and valleys of the predicted function are closely followed by the E. coli network. While this does not prove that MBPs are indeed used in real metabolic networks, it suggests that the logarithmic strategy of MBPs is embedded in their architecture. However, because this second method focuses on shortest paths across a metabolite-to-metabolite network rather than on flow-conserving MBPs, the predicted values are likely an underestimate of the number of reactions necessary to construct one metabolite from another in a mass-conserved manner.
So far, we have analyzed the properties of individual MBPs in our idealized chemistry, as well as analogous minimal length pathways in E. coli metabolism. However, some fundamental aspects of the architecture of metabolism may be visible only at the ecosystem-level, namely by collectively analyzing the metabolic network obtained as the union of the metabolic maps of known individual species (sometimes called the "metametabolome"). For example, previous work using an algorithm of network expansion applied to this meta-metabolome has identified potential signatures of major evolutionary events [37], including the metabolic transition that took place upon the great oxidation event, about two billion years ago [4].
Here, we build a meta-metabolome for our idealized chemistry by considering the collection of MBPs. One could imagine that each task a i ⇒ a j corresponds to a different organism, which has filled a specific metabolic niche (availability of a i ), and found an optimal solution (the MBP) for its main metabolic task (produce a j ). The question we ask next is whether, in this ecosystem of MBPs, all metabolites and reactions are used in roughly the same number of pathways, or if specific metabolites or reactions seem to be essential for many optimal tasks, hence representing "universal tools".
For this analysis we used the set of MBPs calculated on the R 19 network using the MILP method. One first result of this analysis is that every metabolite of an even length is used in many more MBP reactions than their odd length neighbors, compared to the underlying chemistry (Fig. 5B). Thus even-length metabolites are more important in that they can be used for more tasks. A possible explanation for this enhanced importance comes from the logarithmic nature of the MBP path lengths. For example, producing a 8 from a 1 requires only three doubling reactions (a 1 + a 1 → a 2 , a 2 + a 2 → a 4 , and a 4 + a 4 → a 8 ). In addition, this same pathway, with one additional reaction, can also be used to optimally produce a 9 and a 10 (see Table S2), overall increasing the number of pathways in which each of those even-length intermediates is used. Indeed, because similar logarithmic pathways can be used as a backbone connecting distant inputs and outputs, we expect metabolites of an even length to appear more often. Similarly, one can address the relevance of each possible reaction across different MBPs. The existence of ubiquitous reactions is visible in Fig. 2A and Fig. S1, and can be more systematically assessed by plotting a usage distribution (Fig. S2). The most abundant reactions -the "universal tools" in this model chemistry -are the ones that ligate two identical molecules (e.g. a 2 + a 2 ↔ a 4 , see Tables, 1, S2, and S3). Strikingly, the distribution of reaction utilization follows a long-tailed distribution (Fig. 5A), whose fit to a power law gives an exponent of approximately -1.1 (R 2 = 0.99). This value is close to our theoretically predicted value of -1 (See Methods).
As in the case of the artificial chemistry network, we can now search for patterns of metabolites and reactions usage in the collective set of all metabolic reactions known in living systems, obtained from the KEGG database [15]. The presence of such signatures would suggest a long-term selective advantage of molecules and reactions that are useful for multiple tasks across different organisms and environments. By counting how many times each possible carbon backbone reaction is used across this biosphere-level metabolism we obtained a broad distribution, and a fit to a power-law gives the exponent of -0.89, comparable with the analytically predicted value, and with that in the artificial chemistry model ( Fig. 5A and Fig. S4). Most surprisingly, we found that several reactions that are top ranking in their count across MBPs in the artificial network, are also at the top of the list in the KEGG-derived reactions (Spearman correlation p-value<10 -6 ; see also In addition to a preference for specific reactions, we can ask whether the spectrum of metabolite usage across the whole KEGG metabolism reflects the possible optimality criteria encountered in the model (Fig. 5C). The metabolite usage in the hydrogen backbone network (see Methods) is similar to that in the artificial chemistry: each evenlength hydrogen metabolite is used more often than its odd-length neighbors (Fig. S4A).
For the carbon backbone distribution, we see a similar descending periodic behavior, but with a periodicity of approximately 5 ( Fig. 5C and Fig. S5B). Hence, molecules containing carbons in a number that is a multiple of 5 are used more abundantly than other molecules across different metabolic reactions. One possible explanation for this C 5 periodicity is the profuse usage of adenine and nicotinamide adenine dinuculeotide compounds as energy carriers and redox balance molecules, although the removal of such compounds has little effect on the observed periodicity (Fig. S6). Hence, the prominent usage of compounds with specific numbers of carbons might reflect global network optimization principles for the efficiency of multiple pathways, as observed in the artificial chemistry model. The periodicity of 5 that we observe, together with the evidence displayed in Fig. 3C, may suggest that the evolutionary optimization of metabolism has been partially taking place around building blocks of five carbons, compatible with previous observations of prebiotic abundance of terpenoids [38] and pentoses [39]. It is also interesting to note that an unexplained periodicity of two had been previously observed in the distribution of the number of carbons among known organic compounds [40][41][42]. While our analysis is based on the distribution of usage of carbon compounds in different reactions, rather than the total count of molecules, future analyses may investigate possible connections between these trends.

Discussion
We have explored the potential existence of general principles underlying the evolution of metabolic network architecture. Specifically, we studied the properties of pathways [44] might be useful for this enumeration. In addition, while intuitively this approach seems to capture the full degeneracy of MBPs, this still remains to be formally proven.
Another intriguing possibility might be to modify our flux balance MILP approach to identifying degenerate solutions by employing integer cuts, as described in [45].
Alternative avenues for optimization using Linear Programming rather than MILP could be in principle devised to reduce the complexity of calculations. For example minimizing the sum of absolute values of fluxes allows for rapid calculation of pathways up to the R 100 network, though in this case the ensuing pathways are not of minimal length (see Fig.   S7). In any case, for the purpose of the current work, we verified that degeneracy does not affect the statistics of usage of different reactions (Fig S8).
Among the recurrent MBP topologies identified, we encountered numerous autocatalytic cycles. The properties of autocatalytic cycles have been studied previously [9,46], and their self-replication potential has been theorized to be important in the early evolution of carbon fixation [8,7]. Autocatalytic cycles have also been shown to be kinetically stable, even in the absence of regulatory control [47]. We found that some autocatalytic pathways (e.g. the pathway from a 7 to a 8 ) are simultaneously optimal for multiple metabolic tasks. In this specific case, the MBP a 7 ⇒ a 8 is also an MBP for the production of each intermediate in the cycle (Fig. 2B). This special property of autocatalytic cycles in our artificial chemistry may have a parallel in real metabolism. For example, many metabolites in the TCA cycle (which is autocatalytic when run in reverse [7]) are precursors for fundamental anabolic processes [48,49]. Similar properties can be observed in the fundamental autocatalytic cycle known as the formose reaction [28]. Determining to what extent real metabolic networks obey optimality principles like the ones described here will take additional effort. Even if an underlying arithmetic simplicity governs idealized optimal pathways, deviations from ideal behavior should be expected. For example, parallel selection pressures for energy production and biochemical stability would likely sacrifice pathway minimality. However, guiding principles as the ones we are proposing could serve as reference points for future research, including circumstances in which metabolism can be different from what we are used to.
Using synthetic biology techniques, for example, it might be possible to redesign metabolic pathways so as to approach predicted ideal efficiencies and minimal enzyme cost [51,52]. From a totally different perspective, in the field of astrobiology, having a prediction of possible signatures of an evolved metabolism might help select, among the molecular spectra of extrasolar planets, those possibly indicative of biogenic processes [53][54][55].

Artificial Chemistry Model
We define an artificial chemistry inspired by previous string-based artificial chemistries (see also main text and

Flux Balance Analysis
Flux Balance Analysis (FBA) is a steady state constraint-based approach to study the flow of mass through metabolic networks [26,56,57]. Briefly, FBA represents the metabolic network of interest as an n×m stoichiometric matrix S, whose element S ij indicates the number of molecules of metabolite i (i=1,…,m) that participate in reaction j (j=1,…,n) (with a positive sign if the metabolite is produced, negative if it is consumed).
Each reaction can be associated with a rate, or flux, v j . Under the assumption of a steady state the following set of mass conservation constraints on the fluxes is generated: Additional constraints (such as availability of nutrients, experimentally observed irreversibility, maximal or minimal rates, etc.) can be imposed on the fluxes as inequalities of the form where α j is the minimal allowed rate of a reaction and β j is its maximal rate. Taken together, the above constraints define a convex polyhedron (the "feasible space") in the n-dimensional space of fluxes. Linear programming (LP) can be used to identify, within the feasible space, flux vectors that maximize or minimize a given linear objective function. In microbial systems it has been often hypothesized that a biologically meaningful objective is the maximization of the flux through the reaction that represents cellular growth, or biomass production [2,58]. Hence, LP applied to FBA provides a prediction of all metabolic fluxes in a cell. FBA can be applied at genome scale, and corresponding stoichiometric models are available for a number of organisms. FBA predictions have been experimentally validated most thoroughly in Saccharomyces cerevisiae SC288 [59] and E. coli K-12 [36].

Minimal Balanced Pathway discovery algorithms.
Minimal Balanced Pathways (MBPs) are defined as sets of reactions in the R N network that can optimally perform a given metabolic task. A task is defined as the production of a specific end product (e.g., a j, with output flux v out ) from a single available nutrient (e.g., The MBP between a i and a j will be indicated as a i ⇒ a j .
We have developed three different algorithms for computing MBPs, as described below: a. Flux Balance Analysis/Mixed Integer LP algorithm We use a modified FBA approach to formulate the MBP problem in a constrained optimization framework. Specifically, we impose the same constraints used in an FBA problem, and further require that the maximal yield condition v out =v in ·j/i be satisfied. We fluxes is an integer -this problem must be solved using a mixed integer linear programming (MILP) algorithm. Our MILP problem for the optimal MBP between a n and a m can be formulated as follows: Subject to: The optimal solution for this problem will give the flux distribution v that uses the fewest nonzero values to maximize the objective. In our MBP computations, the only flux constraints used were those that limit the uptake of the single nutrient to an arbitrary 3. It must be non-decomposable. There are no two smaller EFMs that can be linearly combined to form the one in question.
Because of these constraints, those EFMs that use the minimal number of reactions satisfy the requirements for being an MBP. We used the METATOOL software package [60] to find all EFMs in the R 10 network, and then identified all of those EFMs that are also MBPs.
c. Iterative additive algorithm We designed and implemented an algorithm to produce most MBPs de novo, without relying on prior steady state stoichiometric modeling methods. The algorithm works in an iterative manner, producing longer pathways from shorter ones. For example, we can start from two trivial MBPs: a 1 ⇒ a 1 (which requires no reactions), and a 1 ⇒ a 2 (requiring one trivial reaction, a 1 + a 1 → a 2 ). To compute a 1 ⇒ a 3 , we identify all the ways in which we can decompose 3 into two smaller addends (in this case, only one: 3=2+1). Next we combine together the previously computed MBPs that progress from a 1 to each of these two addends, giving a new putative MBP for the desired new task (a 1 + a 1 → a 2 , and a 1 + a 2 → a 3 ). This procedure can be then iterated to give a prediction of This algorithm is fast and efficient compared to the previous methods, allowing us to apply it to even the R 100 network. However, it has two main drawbacks. First, it will miss pathways that "overshoot" the target value then subtract down to it. Second, it may miss MBPs that are not built modularly from smaller ones. From a comparison of the MBPs predicted by the different algorithms, one can see that the approximations introduced in this algorithm cause 18 out 361 MBPs (5%) in R 19 to overestimate pathway length by one reaction. Also, this algorithm correctly identifies 204 of the 384 degenerate MBPs that the EFM algorithm finds in R 10 . The reaction usage using this method is highly correlated with that of the MILP method applied to R 19 (Pearson correlation 0.96, p-val 10 -51 ), and the EFM method applied to R 10 (Pearson correlation 0.98, p-val 2·10 -17 ).

KEGG reaction reduction
Data used for the comparison between the R N metabolic network and real metabolism was gathered from the KEGG LIGAND database (July 26, 2009 release) [15]. This database was parsed to convert its compounds and reactions into a single-atom form, as described in the text. Compounds that carried any uncertainty in their atomic makeup, including non-specific side-chains or variable chain length were removed from the current analysis. We also removed from the analysis reactions with no associated formula, as well as reactions involving non-specific molecules (such as generic glycans and nonspecific nucleotide or peptide chains). Finally, a number of reactions were found to leave the atomic composition of the compounds essentially unchanged on either side of the reaction (e.g., C 3 ↔ C 3 ). These reactions were ignored as well, without consequences on the results (data not shown).

Metabolite and reaction usage
We counted how often each metabolite and reaction was used in the artificial chemistry pathways as well as in the KEGG-derived single-atom networks. In the model pathways, reaction usage was calculated by counting how many times each reaction was used across all pathways. Metabolite usage was similarly calculated by counting the occurrence of reactions in which each metabolite participates. For example, in the pathways that convert a 9 to a 10 in Fig. 2C, a 9 participates in only one reaction, but a 10 participates in two.
In the KEGG-derived networks, a similar counting scheme was used. The reaction usage was calculated by counting how many times each reduced reaction appears, and the metabolite usage was calculated by counting how many times each metabolite appears across all reactions.

Shortest paths in Escherichia coli a. Elementary flux modes
The first method used EFMs to find all shortest pathways in the central carbon metabolism of Escherichia coli [34,35]. Because we are interested in the pathways that alter the carbon content of different molecular species, we removed those common cofactors that do not alter the carbon content of other metabolites (ATP, ADP, AMP, NADH, NADPH, NAD+, NADP+, H+, and PO 4 ). Also, we effectively ignored reactions involving transport and exchange. We then used the EFM method described in Methods 3c above to find the number of reactions in each of the MBPs for this reduced network.
For each input compound, we listed the lengths of the MBPs for all output compounds containing a number j of carbons. For each j, we select among these paths the shortest one, giving an estimate of the shortest path between any individual compound and the closest j-carbon compound. Finally, for each value of i, we averaged these path lengths over all input molecules with i carbons. The end result is a matrix that provides the average of the shortest paths from any i-carbon compound to its nearest j-carbon compound.
b. All-pairs shortest paths with Johnson's algorithm The larger,genome-scale metabolic network of E. coli [36] has 761 metabolites and 1075 reactions, and is currently infeasible to study with the EFM method described above, limiting us to a graph theoretical approach. Because there are both metabolites and reactions, metabolic networks are inherently bipartite: metabolites connect to each other only through reactions. Pathway lengths were computed by transforming the metabolic reaction network stoichiometric matrix into an adjacency matrix where metabolites and reactions are all represented by the same type of node.
As described above, we are only interested in the connections between carbon compounds, so we removed any non-carbonaceous metabolites (water, phosphate, ammonia, etc.). Also, we removed the following cofactors that are used in many reactions, but do not participate in the transformation of carbons: ATP, ADP, AMP, NAD+, NADH, NADP+, NADPH, coenzyme A, acetyl-CoA, and the acyl carrier protein.
Next, we used Johnson's all-pairs shortest paths algorithm (available as a Matlab function) to find shortest pathways between any two carbon compounds in E. coli's metabolic network. This set of pathways was collated as in the previous section, to produce a matrix of the average of the shortest paths from any i-carbon compound to its nearest j-carbon compound.

Analytical estimate of MBP lengths in analogy with addition-subtraction chains
We developed an analytical approximation for the expected numbers of reactions to be found in any MBP a i ⇒ a j . We begin with a simplified version of the artificial chemistry model in which only irreversible addition reactions of the form are allowed. Under these restrictions, we first ask what is the smallest number of reactions necessary to produce any a j from a 1 . We shall denote by l(j) the smallest possible number of such reactions (we count the use of each reaction (1) once). This problem is equivalent to the problem of addition chains [27], in which one attempts to compute a positive integer by generating a sequence of integers such that each term in the sequence is the sum of two previous terms. Addition chains have been studied extensively, mainly because of their applications in computer science and cryptography [27]. For addition chains, l(j) grows logarithmically with j: Our artificial chemistry represents a generalization, in which a metabolite of any length i can be used to produce an output metabolite of any length j. If we still assume that only addition reactions are possible (i.e. molecules cannot be broken down), a chain from a i to a j will exist only when i is a divisor of j. The problem can then be reduced to the case with a 1 input and a j/i output. Therefore, in the irreversible case, we can assume that inputs consist of monomers without loss of generality. Let L(j) be the length of the shortest reaction chain in this case. Because not every reactant exists when dividing by the input length i, we have the obvious inequality Sometimes the shortest chain can be found easily. For instance, {2 0 , 2 1 , …, 2 k } is obviously the shortest chain from 1 to j = 2 k whose length is k + 1. This suggests the general lower bound on the shortest length L(j) of the addition chain: where ⎡x⎤ represents the ceiling of x, or the smallest integer not less than x. Likewise, as seen below, ⎣x⎦ represents the floor of x, or the largest integer not greater than x (for example, ⎡3.14⎤ = 4, and ⎣3.14⎦ = 3). The longest minimal addition chain arises when the output length is j = 2 m -1. From this fact, we have the upper bound [27] ( ) ( ) where υ(j) is the number of 1s in the binary representation of j. Since ( ) ( ) the bound in equation (5)  There are various conjectures regarding L(j); one of the most famous [33] asserts that computing L(j) is NP-hard. Nonetheless, the computation of L(j) has been pushed up to n ≤ 2 25 . Two other conjectures [31] predict the general lower bound and the upper bound While algorithms for generating the shortest addition chains are discussed by Thurber [31], these all hold for the specific case of pure addition where the input is always a 1 .
We are interested in the general case involving both addition and subtraction, and specifically the lengths l(i, j) of the shortest reaction chains (MBPs) with a i input and a j output. Addition-subtraction chains have also been studied previously as an expansion of addition chains, although these correspond to MBPs with only a 1 as an input. Sometimes, in these cases, l(j) is readily computable, e.g.
( ) 2 while ( ) remains unknown for sufficiently large k. Both lengths can also be equal, i.e. l(j) = L(j). For example, Note also an inequality: All of these features explain the growth law in equation (2).
The quantity l(i, j) has a rich behavior, e.g., there is only a trivial lower bound since l(j, j) = 1. To ignore this non-interesting effect, let us divide i and j by their greatest common divisor as it never affects the length of the MBP: then we can use an obvious inequality Recalling (2) we finally arrive at an approximation for the number of reactions in an MBP that uses a i to produce a j : The approximation in (15)  4 a 3 + a 3 ↔ a 6 C 1 + C 4 ↔ C 5 5 a 2 + a 4 ↔ a 6 C 5 + C 5 ↔ C 10 6 a 6 + a 6 ↔ a 12 C 1 + C 7 ↔ C 8 7 a 1 + a 2 ↔ a 3 C 1 + C 8 ↔ C 9 8 a 8 + a 8 ↔ a 16 C 3 + C 3 ↔ C 6 9 a 5 + a 5 ↔ a 10 C 4 + C 5 ↔ C 9 10 a 1 + a 4 ↔ a 5 C 1 + C 2 ↔ C 3