Journal Pre-proof A biophysical basis for the emergence of the genetic code in protocells

The origin of the genetic code is an abiding mystery in biology. Hints of a ‘code within the codons’ suggest biophysical interactions, but these patterns have resisted interpretation. Here, we present a new framework, grounded in the autotrophic growth of protocells from CO₂ and H₂. Recent work suggests that the universal core of metabolism recapitulates a thermodynamically favoured protometabolism right up to nucleotide synthesis. Considering the genetic code in relation to an extended protometabolism allows us to predict most codon assignments. We show that the first letter of the codon corresponds to the distance from CO₂ fixation, with amino acids encoded by the purines (G followed by A) being closest to CO₂ fixation. These associations suggest a purine-rich early metabolism with a restricted pool of amino acids. The second position of the anticodon corresponds to the hydrophobicity of the amino acid encoded. We combine multiple measures of hydrophobicity to show that this correlation holds strongly for early amino acids but is weaker for later species. Finally, we demonstrate that redundancy at the third position is not randomly distributed around the code: non-redundant amino acids can be assigned based on size, specifically length. We attribute this to additional stereochemical interactions at the anticodon. These rules imply an iterative expansion of the genetic code over time with codon assignments depending on both distance from CO₂ and biophysical interactions between nucleotide sequences and amino acids. In this way the earliest RNA polymers could produce non-random peptide sequences with selectable functions in autotrophic protocells.


Introduction
The genetic code offers tantalising clues to its own origins: a 'code within the codons' [1].
The first base in the triplet codon links the amino acid encoded with a shared precursor, implying a relationship between codons and biosynthetic pathways [1][2][3]. The second position of the anticodon is associated with the hydrophobicity of the amino acid encoded [4,5]. These associations have been identified both in biological systems with anticodon enrichments in aptamers [6,7] and through experimental methods [8][9][10][11]. Such interactions are, however, difficult to prove definitively by experiment. The third position is agreed to carry less coding information but has previously been linked to amino acid 'complexity' and molecular size [1,12]. These patterns imply that direct biophysical interactions between amino acids and their cognate codons or anticodons fashioned at least some of the genetic code. Both the co-evolution and stereochemical hypotheses have complex histories in the literature, with proponents for and against the proposals [13][14][15][16]. The code within the codons remains deeply enigmatic.
The coevolution of metabolism and the code plainly depends on the actual structure of metabolism. Most hypotheses either assume heterotrophic origins or avoid specifying a particular autotrophic metabolism [1][2][3][17][18][19]. Di Giulio [18] has noted that a heterotrophic origin of amino acids is not easy to reconcile with the coevolution hypothesis, as it implies acid biosynthesis operate in this way. Instead, we favour a possible link with autotrophic protometabolism, which has gained strong experimental support in recent years.
Phylogenetics indicates that the last universal common ancestor (LUCA) was likely autotrophic and chemiosmotic, growing from H 2 and CO 2 via the acetyl CoA pathway and reverse incomplete Krebs cycle, the gateway into the universally conserved core of metabolism [22][23][24][25][26][27][28][29]. Recent experiments are confirming what has long been predicted by theoretical thermodynamics [30][31][32][33]: the chemistry of the universal core of metabolism can indeed occur spontaneously in the absence of genes and enzymes. Starting with CO 2 and H 2 (or reducing equivalents), intermediates in the acetyl CoA pathway and reverse Krebs cycle form spontaneously [34][35][36]. Non-enzymatic equivalents of glycolysis, the pentose phosphate pathway [37,38] and gluconeogenesis [39] have been identified as well. Multiple syntheses of amino acids from α-keto acids by direct reductive amination [40,41], and by transamination reactions [42] can also take place. Long-chain fatty acids can be formed by hydrothermal Fischer-Tropsch-type synthesis which chemically resembles the process of fatty acid elongation [43][44][45]. Recent work suggests that nucleobases might also be formed following the universally conserved biosynthetic pathways, using metal ions as catalysts [46].
The endergonic barrier to the first steps of CO₂ reduction can be lowered by proton gradients across Fe(Ni)S barriers [47], driving CO₂ fixation through vectorial electrochemistry in a similar way to cells [23,29,48,49,50]. Mathematical modelling shows that, in principle, dynamic proton gradients could drive the autotrophic growth of protocells in a 'monomer world' (i.e. lacking RNA and peptides; [24]) in which CO₂ fixation is catalysed by organically chelated FeS clusters associated with the membrane [51,52]. Experimental work confirms that protocells with mixed-amphiphile membranes are stable under the necessary conditions [53,54], and that cysteine at pH 9 acts as a ligand to form redox-active 4Fe4S clusters equivalent to those in ferredoxin [55]. Together, these findings suggest that the coupling of proton gradients to autotrophic protometabolism could drive protocell growth from H₂ and CO₂ in far-from-equilibrium environments such as alkaline hydrothermal vents [23,25,28,29,50]. The key point is that amino-acid biosynthesis would take place via a biomimetic protometabolism in autotrophically growing protocells.
Here we rationalise the possible emergence of the genetic code in the context of autotrophically growing protocells. Protometabolic reactions mimicking the core of extant metabolism and nucleotide polymerization are assumed to take place and can generate short sequences of RNA, which would initially have only possessed ribozymatic activity and lacked genetic information. Beginning with the universal core of autotrophy, we show a close relationship between the first position of the codon and the distance from CO 2 fixation; between purines at the second position and the hydrophobicity of amino acid; and between the properties of the codon and the length of the amino acids encoded. These simple rules allow us to assign the cognate amino acid to the large majority of codons, corroborating the idea that the code coevolved with autotrophic metabolism, with direct biophysical relationships between amino acids and random RNA strings.

The first codon position corresponds to distance from CO₂ fixation
The standard genetic code specifies 20 different amino acids which are universal across all domains of life. Yet abiotic syntheses of amino acids both in laboratory and environmental settings show substantially greater structural variation than in life, raising the question of why specific amino acids were incorporated into a genetic code. A protometabolic perspective on the genetic code neatly resolves this issue. By assuming that the chemistry that constitutes the conserved core of metabolism was the predominant prebiotic chemistry, the amino acid complement available is specified by this network. Further selection may depend on the self-reactivity of chemical species, such as that proposed by Hendrickson, Wood and Rathnayake (2021), discussed further in the SI.
The protometabolism hypothesis places a new emphasis on the importance of extant metabolism for understanding the emergence of the genetic code. Figure 1 shows a metabolic map, beginning with CO₂ and proceeding to the 20 standard amino acids. Rather than reconstructing the pathway through phylogenetic analyses, as in earlier work [57], we instead emphasised the chemistry underlying the reactions. Amino-acid synthesis pathways were manually curated from the MetaCyc database [58] on the basis of their conservation between archaea and bacteria. Structuring metabolism around CO₂ fixation in this way generates a striking pattern. Amino acids encoded by G at the first position of the codon are clearly seen to be the first amino acids produced by the flux of carbon from the acetyl-CoA pathway into the reverse incomplete Krebs cycle. Amino acids encoded by A at the first base of the codon are predominantly the second set of amino acids produced from CO₂ and H₂, followed by C-encoded amino acids; U-encoded species are largely the furthest from CO₂ fixation.
This map makes an important distinction from the code coevolution hypothesis in its classical forms, which as noted earlier assume biosynthetic transformations occurring on peptidated tRNA molecules. Rather than the first codon position reflecting families of amino acids with similar biosynthetic relationships, our map points to a temporal patterning of amino acid recruitment into the genetic code. This has the potential to explain why some amino acids with distinct biosynthetic origins still have the same nucleotide at the first position in the codon, e.g. histidine and the glutamate derived amino acids. Initial weak CO 2 flux (in the context of growing protocells [51,52]) would mean that the highest concentration of amino acids would be those encoded by G at the first position. Note that we have limited the three hexacodonic amino acids (serine, arginine and leucine) to single first codon groups that best reflect their biosynthetic and temporal groups (A, C, C at the first codon, respectively). This is justified in the SI. To exclude bias in visual interpretation we have totalled the number of reaction steps from CO 2 for each amino acid. Figure 2 shows that, while there is overlap between the species, there is a clear rise in the number of reaction steps required for the synthesis of each set of amino acids.
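As an illustrative sketch (not part of the original analysis), the grouping of amino acids by first codon base can be reproduced directly from the standard genetic code. The restriction of serine, arginine and leucine to the A, C and C groups follows the text above; the names `CODE`, `RESTRICT` and `first_base_groups` are assumptions introduced here for illustration, and the number of reaction steps from CO₂ is deliberately not encoded because those values are given only in Figure 2 and Table S1.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hexacodonic amino acids restricted to a single first-base group, as in the text:
# serine -> A, arginine -> C, leucine -> C.
RESTRICT = {"S": "A", "R": "C", "L": "C"}


def first_base_groups(code, restrict):
    """Return {first base: sorted one-letter amino acids} in the order G, A, C, U."""
    groups = {base: set() for base in "GACU"}
    for codon, aa in code.items():
        if aa == "*":
            continue  # stop codons carry no amino acid
        first = codon[0]
        # Skip codons of a hexacodonic amino acid outside its assigned group.
        if restrict.get(aa, first) != first:
            continue
        groups[first].add(aa)
    return {base: sorted(members) for base, members in groups.items()}


groups = first_base_groups(CODE, RESTRICT)
for base in "GACU":
    print(base, " ".join(groups[base]))
```

With the restriction applied, every one of the 20 standard amino acids falls into exactly one first-base group, which is what allows a single step count per amino acid to be compared between groups.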
Given that each set of amino acids is associated with a specific first-position codon, we looked to see if a corresponding temporal ordering could be achieved for the cognate nucleotides. A direct relationship between the biosynthetic emergence of amino acids and nucleotides could explain why the earliest set of amino acids emerged at the same time as guanosine, for example, but we could not identify any temporal ordering with nucleotides.
Guanosine and adenosine nucleotides are derived through the same number of steps, and just one step less than uridine. Worse, cytidine species are derived secondarily from uridine, so their recruitment of an earlier set of amino acids makes little sense.
In reality, the number of steps is a very loose guide to their likelihood. We are currently modelling the kinetic and thermodynamic probability of each step in the core pathways of metabolism, to determine the most likely patterns of flux. Glycine is the key amino acid for purine synthesis in the de novo pathway. The earliest synthesis pathway for glycine was probably direct assimilation from a methenyl-pterin species [59,60]. The CO 2 fixation pathway is rich in guanosine derived cofactors, notably methanopterins and folates, the C 1 carriers in methanogenesis and acetogenesis respectively, as well as flavins and molybdenum cofactors [59,61,62]. Together, these point to a positive feedback loop in which guanosine synthesis feeds back on CO 2 fixation and so glycine synthesis. Greater flux through any C 1 H 2 X species could increase glycine concentration, while in turn improving the yield of purine nucleotides.
Given that the temporal pattern predicts that purine-encoded amino acids were the earliest amino acids to be recruited to the genetic code (Figure 1), combined with the identified positive feedback loop between guanosine cofactors, glycine, and purine synthesis, it seems possible that in the earliest stages of metabolic and code evolution there was an unequal purine to pyrimidine ratio (as others have postulated previously [5,12]). How the specific ordering of G-A-C-U came to exist remains unclear from a purely metabolic point of view. A possible explanation is that the nucleotides which form the strongest binding interactions (G and C both form three hydrogen bonds via classic Watson-Crick base pairing) were recruited before those with weaker interactions (A and U form two hydrogen bonds). If so, then these binding interactions could reflect early (chemical) selection for codon:anticodon interactions with strong binding interactions between small protometabolic equivalents of mRNA and tRNA.

The second codon position reflects amino acid hydrophobicity
Building on the assumption of a temporal phenomenon at the first position, we found a repeating pattern, in which the hydrophobicity of the amino acid encoded corresponds to the hydrophobicity of the base at the second position of the anticodon. This pattern further specifies codon assignments. We initially observed the pattern using the hydropathy scale of Kyte and Doolittle (1982). Specifically, the more hydrophilic the amino acid within the set of amino acids, the more likely it was to be associated with U. Conversely, the more hydrophobic the amino acid, the more likely it was to be associated with A. This broad association corresponds to earlier observations on stereochemical interactions within the code [1,10,19]. What is new here is that the pattern repeats across the four time periods corresponding to the bases at the first position of the codon ( Figure S1). This means that considering the timing of amino acid emergence with its hydrophobicity serves to specify two nucleotides in the codon-anticodon pair.
The repeating pattern is not perfect. In particular, the amino acids cysteine and arginine deviate from their expected trends. This may be addressed by considering that the Kyte-Doolittle hydropathy scale is based on the folding behaviour of modern proteins.
Cysteine forms internal disulphide bridges and reactive thiols are often buried in the hydrophobic core of proteins to protect from oxidation [64]. Arginine's guanidinium group remains charged across all pH ranges ≤ 10 [65], meaning that it is undoubtedly hydrophilic, yet its capacity to form cation-π interactions suggests some degree of hydrophobic interaction. Given these uncertainties relating to protein context, we investigated the validity of our hypothesis by considering a range of other hydrophobicity scales. Specifically, we compared the scales proposed by Hopp [66], Janin [67], LogP values determined by Tayar [68], Engelman [69] and Eisenberg [70]. Figure S1 shows that the pattern, while still present, is obfuscated by a lack of uniform units of hydrophobicity, and variation in hydrophobicity assignments for some amino-acid species.
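The Kyte-Doolittle pattern and its two deviations can be tabulated with a short sketch. The hydropathy values are the published Kyte-Doolittle (1982) indices; everything else (the helper names, the choice to average the amino acids sharing a codon box, and the restriction of serine, arginine and leucine to single groups) is an assumption made here for illustration rather than the procedure behind Figure S1. The anticodon base is taken as the Watson-Crick complement of the second codon base.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
      "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
      "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5}

RESTRICT = {"S": "A", "R": "C", "L": "C"}   # single-group restriction (see SI)
COMPLEMENT = {"U": "A", "C": "G", "G": "C", "A": "U"}


def box_hydropathy(first):
    """Mean hydropathy per second codon base within one first-base group."""
    means = {}
    for second in BASES:
        residues = {CODE[first + second + third] for third in BASES}
        residues = {aa for aa in residues
                    if aa != "*" and RESTRICT.get(aa, first) == first}
        if residues:  # a box can be empty once restricted amino acids are removed
            means[second] = sum(KD[aa] for aa in residues) / len(residues)
    return means


def second_base_order(first):
    """Second codon bases ranked from most to least hydrophobic box."""
    means = box_hydropathy(first)
    return sorted(means, key=means.get, reverse=True)


for first in "GACU":
    means = box_hydropathy(first)
    print(first, ", ".join(
        f"codon {s}/anticodon {COMPLEMENT[s]}: {means[s]:+.2f}"
        for s in second_base_order(first)))
```

Under these assumptions the G and A groups give the same ordering of second codon bases (U, C, G, A), whereas the arginine box in the C group and the cysteine-containing box in the U group fall out of that order, in line with the deviations discussed above.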
To mitigate these variations in hydrophobicity, we took inspiration from Trinquier and Sanejouand (1998), who attempted to determine what effective property was best preserved in the genetic code. While we do not agree with all their reasoning, their determination of a mean hydrophobicity ranking from 43 discrete scales is an inspired choice. Figure 3 shows violin plots based on these data. Three features stand out. First, when considering the mean values (red points) for each amino acid, the pattern of hydrophobic assignment for each second-position nucleotide is consistent across all four first codon positions, with the only deviation being histidine (which we return to later). Second, the pattern of amino acid to nucleotide assignment is robust for the earliest two groups of amino acids, encoded by G and A at the first codon position (upper panels), but is less obvious for the later C and U groups (lower panels). Third, the hydrophobicity rankings exhibit substantial variation, confirming that hydrophobicity scales are highly variable, and so could easily misrepresent the actual biophysical interactions. Whether hydrophobicity itself or some related property such as partition energy [72] is reflected in the genetic code is therefore unclear; but it is nonetheless sufficient to explain some codon assignments. Regardless of whether hydrophobicity or partition energy provides the best measurement of the physicochemical properties of amino acids, we are specifically referring to direct interactions between amino acids and the second position of the anticodon, rather than to a later stage of enzyme catalysis as proposed by Caldararo and Di Giulio [71,72].
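The idea of a mean ranking across scales with incompatible units can be sketched as follows. This is a minimal illustration, not the procedure of Trinquier and Sanejouand: the function name `mean_ranks` is hypothetical, the two toy scales and the labels `aa1` to `aa3` are placeholders rather than real data, and ties within a scale are not given special treatment.

```python
def mean_ranks(scales):
    """Average the rank of each item over several scales.

    scales: list of dicts mapping item -> score, where a higher score means
    more hydrophobic. Scales that run the other way (hydrophilicity scales)
    would need their sign flipped before being passed in.
    Rank 1 is the most hydrophobic item on a given scale.
    """
    ranks = {}
    for scale in scales:
        ordered = sorted(scale, key=scale.get, reverse=True)
        for rank, item in enumerate(ordered, start=1):
            ranks.setdefault(item, []).append(rank)
    return {item: sum(r) / len(r) for item, r in ranks.items()}


# Placeholder scales in different, incompatible units (hypothetical values).
scale_a = {"aa1": 4.0, "aa2": 1.0, "aa3": -2.0}
scale_b = {"aa1": 0.2, "aa2": 0.9, "aa3": 0.1}

consensus = mean_ranks([scale_a, scale_b])
print(consensus)
```

Converting to ranks before averaging removes the dependence on units, at the cost of discarding how far apart two amino acids sit on any one scale.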

The third codon position corresponds to size in non-redundant codons
With our focus on the distinction between purines and pyrimidines, we noticed that the length of the amino acid can specify its allocation between non-redundant codons. Specifically, where sister codons NNR and NNY encode different amino acids (e.g. GAY = aspartate, GAR = glutamate) the identity of the nucleotide at the third position corresponds to the length of the amino acid encoded. Length in this context is the computed distance of the amino acid in its extended linear form [73] and does not consider alternative structural states. Figure 4 shows this simple metric of amino acid length (maximal distance from carboxylic acid to the end of the R group) is consistent in six of these seven cases: the longer amino acid is assigned the NNR codon. This size dependency is direct for the codon (large amino acid, large purine base at the third position) but is inverted for the anticodon (large amino acid, small pyrimidine base at the third position of the codon). That implies an extension of the stereochemical effect at the third position of the anticodon, which could specify the code assignment for an amino acid based on how well it fits into some unknown pocket.
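The codon boxes in which the NNY and NNR halves differ can be enumerated from the standard code, as in the sketch below. It lists every such box, including those whose purine half contains a stop codon, so it is not a reconstruction of the seven cases in Figure 4; amino acid lengths are omitted because the values from [73] are not reproduced in the text. The helper name `split_boxes` is an assumption introduced here.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_boxes(code):
    """List (prefix, NNY products, NNR products) where the two halves differ."""
    boxes = []
    for first, second in product(BASES, repeat=2):
        prefix = first + second
        nny = sorted({code[prefix + third] for third in "UC"})  # pyrimidine third base
        nnr = sorted({code[prefix + third] for third in "AG"})  # purine third base
        if nny != nnr:
            boxes.append((prefix, nny, nnr))
    return boxes


boxes = split_boxes(CODE)
for prefix, nny, nnr in boxes:
    print(f"{prefix}Y -> {'/'.join(nny)}   {prefix}R -> {'/'.join(nnr)}")
```

In every box returned, the pyrimidine half encodes a single amino acid, so any third-position discrimination only needs to separate one NNY product from whatever occupies the NNR half.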
Fascinatingly, this pattern appears to hold for the proteinogenic but non-standard amino acids. Pyrrolysine is encoded at the amber stop codon UAG. Tyrosine is encoded at the UAY codons. Pyrrolysine is longer than tyrosine and, in line with this pattern, should have a Y at the third position of the anticodon. Selenocysteine is encoded by UGA, cysteine by UGY codons. Whilst they have the same structure, the selenium atom is larger than the sulfur and this may be a sufficient difference to dictate a size-based discrimination. Further to this, these amino acids obey the expected rules of stereochemistry at the second position.
Examining this size-based relationship, we looked to see if there was any set of parameters in the structure of codons that could explain a binding pocket for amino acids.
We failed to identify any patterns, but instead noticed that the position of redundant and non-redundant codons in the genetic code is not random. Near identical rules have been reported before [74,75] in the context of translational optimisation, but our interpretation is subtly different as it links to both amino acid selection and code structure.

Predicting the full genetic code through temporal and stereochemical rules
If all 64 possible codons were to encode a separate species, there would be 64 different amino acids; as such, degeneracy is built into the code. The rules outlined above reduce the informational density of the genetic code to 24 units (Figure 6A and B). Given that the standard code consists of 20 amino acids, three of which are hexacodonic (with two locations in the code, taking us up to a coding capacity of 23), and that there is a need to specify stop locations (getting up to 24), this reasoning constrains the 64 possible nucleotide permutations to the level that is observed throughout all life. This in turn suggests that the structure of the genetic code is in part determined by the nucleotide triplets themselves, perhaps due to conformational states they can form and how they interact with amino acids.
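The count of 24 coding units can be checked against the standard code with a short sketch. It simply tallies the amino acids, the hexacodonic amino acids (each needing a second location) and one unit for stop, following the arithmetic in the paragraph above; the variable names are assumptions introduced here, and the sketch does not derive the units from the redundancy rules of Figure 6.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

codon_counts = Counter(CODE.values())

amino_acids = sorted(aa for aa in codon_counts if aa != "*")
# Hexacodonic amino acids occupy two separate locations in the code.
hexacodonic = sorted(aa for aa in amino_acids if codon_counts[aa] == 6)

# One unit per amino acid, one extra per hexacodonic amino acid, one for stop.
coding_units = len(amino_acids) + len(hexacodonic) + 1

print(len(amino_acids), hexacodonic, coding_units)
```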
These codon sets can be used as a template to assign specific amino acids. If we assume that the temporal pattern holds in its entirety, then following the order of G-A-C-U specified by the first position of the codon (Figure 2), we can assign the four groups of amino acids to the quadrants shown in Figure 6C, where the amino acids in each of the four groups are placed around the codon wheel. We assigned the hexacodonic amino acids (Ser, Arg and Leu) to the single codon group most in line with their biosynthetic origins. Next, we used the hydrophobic interactions at the second position of the anticodon to assign the amino acids in each quadrant, so that the most hydrophobic amino acid was assigned to the most hydrophobic nucleotide (A), and conversely, the most hydrophilic amino acid to the most hydrophilic base (U), using the averaged hydrophobicity ranks from Figure 3.
These assignments lead to a 'first draft' code, which nonetheless bears considerable similarity to the standard genetic code (Figure 6D). The main issue here is that histidine (circled in red) is in the wrong place, as noted earlier for its hydrophobicity, which biases the assignments in the entire C codon quadrant. Correcting for the hydrophobicity of histidine is sufficient to correctly assign all C group amino acids (Figure 6E). The reason that histidine specifically deviates is unknown. The question marks remaining in Figure 6E are all in the U quadrant. These U-encoded amino acids are difficult to assign, as there are too many gaps in this section of the code.
Where there are two amino acids of similar hydrophobicity, which differ only at the third position, our assignments are based on size, with the longer amino acid assigned to NNR codons and the shorter one to NNY (Figure 6E). Together, these assignments generate a code that has clear similarity to the extant genetic code. The remaining gaps distributed around the codon wheel are conveniently the codons utilised by the hexacodonic amino acids. The emergence of hexacodonic amino acids could therefore be a simple space-filling phenomenon (Figure 6F), in which the amino acids filling these gaps are assigned to the code as appropriate, still following the stereochemical rules. Given that the AGR codons for arginine are commonly mutable in variant genetic codes [76] and that the reassignments are typically to stop codons or glycine or serine, both of which fulfil the same hydrophobicity position in alternative quadrants, this seems reasonable. There are a number of other factors not elucidated here that we discuss briefly in the SI, such as why methionine and tryptophan become single codon restricted, and why the biological non-coded amino acids generated in metabolism did not become incorporated into the genetic code. Applying these simple rules generates a codon table that is remarkably close to the standard genetic code. In fact, barring the assignment of AGR, our predicted code looks like several of the mitochondrial codes [77].
These observations are all drawn from patterns in the modern genetic code but are here interpreted through an autotrophic origins-of-life lens. Notwithstanding the remaining questions, our consideration of the code from an autotrophic point of view clearly makes sense of the emergence of translation from protometabolism. While we acknowledge some circularity in deriving the code from patterns found within the code, the simplest and arguably the only sensible interpretation of these patterns is that (i) they are real and strong enough to shine through billions of years of evolution; (ii) they reflect direct biophysical interactions between cognate amino acids and specific nucleotide sequences that developed over time as protometabolism itself expanded; and (iii) the precise interactions may not be between a naked anticodon and its amino acid but could be between short RNA aptamers with binding pockets and amino acids (as both codon and anticodon are involved in the patterns, and Watson-Crick base-pairing may play some role).

Conclusion
We have presented here a hypothesis on the origins of the genetic code, which is grounded in a reinterpretation of longstanding observations of patterns within the genetic code [1,2,19].
The rules we have identified effectively recapitulate the extant genetic code on the basis of spontaneous protometabolism from H₂ and CO₂ at the origin of life and some form of stereochemical interactions between nucleotides and amino acids. We predict that these would involve direct non-covalent binding interactions between short aptamers of RNA and amino acids. Further work to confirm each rule in this specific context is underway, using NMR and molecular dynamic simulations. But the fact that longstanding patterns in the code are amplified and clarified by the assumption of strictly autotrophic protocellular origins lends credence to our interpretation. Some complexities arise from the rules identified. In particular, how could a temporal pattern on the codon and a stereochemical pattern on the anticodon co-emerge? The fact that these patterns are observed on both the codon and the anticodon may point to a three-component mechanism of early translation, as observed in life (i.e. codon, anticodon, amino acid). Other important elements that are missing from this hypothesis concern the RNA species involved (proto-mRNA, proto-tRNA, proto-ribosome) which constitute the mechanistic components of the genetic code. We envision that only RNA species with the complexity of small interacting aptamers could ensure that amino acids are incorporated into the code in the temporal ordering observed or could underpin the weak stereochemical interactions that still shine through the code. While these specific interactions remain enigmatic, we note that they require no more than short polymers of RNA that lack any information content. Our scenario therefore offers a new perspective on the origins of information in biology. Any form of direct interaction, in which random RNA sequences interact in a non-random way with amino acids, means that the sequence of amino acids in a nascent peptide would be templated by the RNA sequence.
RNA aptamers with no intrinsic information content formed within growing protocells would be expected to have functions relating to protocell growth, such as CO 2 fixation, interactions with nucleotide cofactors, or copying the RNA sequences themselves (as the RNA polymerase is a Mg 2+ -dependent protein, it is feasible that a short Mg 2+ -binding peptide could partially mimic its function). The assumption of a spontaneous protometabolism in growing protocells therefore makes sense of the code within the codons, and simultaneously offers a framework that enables the transition from deterministic chemistry to genetic information at the origin of life.

Specifying codonic amino acids
A protometabolic hypothesis on the origins of life assumes that prebiotic chemistry has a direct continuity with the universal core of extant metabolism. This assumption indicates that the amino acids that are produced in metabolism would also have been available in a prebiotic system. Further, if this protometabolism was the most reliable, or the simplest, or the most efficient chemistry available then the standard amino acids would have been the most readily available for incorporation into a genetic code.
Deselection of biological non-proteinogenic amino acids has been discussed by Hendrickson, Wood and Rathnayake (2021). We agree with a substantial portion of their work, particularly their arguments for the deselection of ornithine, homoserine and homocysteine. Deselection of citrulline may also be similar. We do not have an explanation for why some α-keto acids that are found in metabolism, such as 2-oxobutanoate, which could form homoalanine, were never transaminated nor incorporated.

Single codon restrictions for hexacodonic amino acids
The temporal pattern we observe, and indeed the original code coevolution hypotheses, are complicated by the hexacodonic amino acids: serine, arginine and leucine. These amino acids have 6 codons: a set of 4 cognate codons, and a second location of 2 cognate codons.
There is always an alternate first codon nucleotide. This is somewhat at odds with a temporal pattern as biosynthetic pathways mostly remain fixed. In order to deconvolute this pattern, we restrict each of the 3 amino acids to a single first-nucleotide group which best reflects their biosynthetic origins. Arginine is derived from glutamate; all other glutamate-derived amino acids are in the C group, so the group that makes the most sense is C. Serine is secondarily derived from glycine, the amino acid closest to carbon dioxide; we therefore assigned it to its earliest group, A. Leucine is in both group C and U. There are no distinctive details that we can use to assess which group is most likely; as such we restrict it to the earlier group, C.

Metabolic Pathway curation for protometabolic map
Metabolic pathways were manually curated from the MetaCyc metabolomics database [58] with appraisal from supporting literature detailed on the site for some specific pathways.
Each metabolite of interest was appraised individually, and synthetic pathways were appraised on specific criteria. First, only pathways that were reported to operate in the autotrophic direction were selected, eliminating degradation pathways; some pathways were reported in mixed direction and were considered as potentially autotrophic. In addition, these pathways were required to connect to an autotrophic system starting with CO 2 fixation and the Krebs cycle.
Second, the pathways must have been reported to be present in both bacterial and archaeal domains of life; pathways restricted to eukaryotes were discounted. Further to this, pathways were appraised on the degree to which they map between domains.
Pathways which are found in a few clades of bacteria and/or archaea were discounted if pathways with greater conservation were present.
Third, pathways were appraised on their chemistry. In some instances, metabolites lack a conserved synthesis pathway between bacteria and archaea. In these instances, the chemical nature of the pathways was assessed for their degree of similarity. If the sequence of chemical reactions was conserved (even if the catalysts themselves were not) those pathways were considered to be equivalent. Examples include the acetogenesis and methanogenesis pathways of CO 2 fixation, which differ in the use of reductant and C1 carriers. In addition, if there are multiple pathways with functionally identical chemical species and similar interconversions, these pathways were considered identical i.e., the succinylase vs acetylase variants of lysine synthesis.
Fourth, the effective simplicity of the pathway was also taken into consideration.
Pathways which utilise inorganic species were selected over pathways that degrade existing biomolecules. This was most relevant for thiol incorporation into cysteine and methionine, where sulfide assimilation pathways were prioritised over thiol transfers between such species.

Finally, in the few instances where there was no rationalising the ancestral pathway between bacteria and archaea these pathways were considered to be separate. The shikimic acid synthesis pathway and the formation of proline are the primary examples here.
A summary of selected pathways can be found in Table S1.
This table was used to determine the minimum number of steps from CO₂ to each amino acid and nucleotide of relevance. The number of steps is the number of enzymatic steps from each species, assuming relevant other species are present; e.g., in the formation of histidine, once AICAR is formed, it is assumed ATP is available to react with it. In instances where multiple products converge, the longer, more representative pathway was selected; e.g., the synthesis of tryptophan was calculated via the indole group rather than the considerably shorter route via serine.
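Counting steps from CO₂ can be pictured as a breadth-first search over a directed graph of reactions. The sketch below is a minimal illustration under stated assumptions: the graph `TOY_REACTIONS` and all of its node names other than `CO2` are hypothetical placeholders, not entries from Table S1, and the search returns the plain minimum number of steps, so the manual choice of a longer, more representative route for convergent pathways described above is not modelled.

```python
from collections import deque


def min_steps(reactions, source):
    """Breadth-first search: fewest reaction steps from source to each species.

    reactions: dict mapping a species to the species it can be converted into
    in one step (co-substrates are assumed to be available).
    """
    steps = {source: 0}
    queue = deque([source])
    while queue:
        species = queue.popleft()
        for nxt in reactions.get(species, ()):
            if nxt not in steps:  # first visit is the shortest route in BFS
                steps[nxt] = steps[species] + 1
                queue.append(nxt)
    return steps


# Hypothetical toy network: placeholder intermediates m1..m4 and two products.
TOY_REACTIONS = {
    "CO2": ["m1"],
    "m1": ["m2", "m3"],
    "m2": ["product_early"],
    "m3": ["m4"],
    "m4": ["product_late"],
}

steps = min_steps(TOY_REACTIONS, "CO2")
print(steps["product_early"], steps["product_late"])
```

A step count of this kind treats every reaction as equally likely, which is why the main text describes it as only a loose guide pending kinetic and thermodynamic modelling.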

Hydrophobicity assignments for temporal amino acid groups with single hydrophobicity scales
We wanted to determine whether amino acids in each temporal group could be assigned into their cognate codons on the basis of hydrophobicity. Multiple hydrophobicity scales exist. We chose 6 scales ( Figure S1). These scales show that hydrophobicity can be used to assign amino acids to relevant second anticodon nucleotides with moderate accuracy. Later

Single codon restriction
Using the rules identified above specifies a genetic code that is highly similar to the extant genetic code. The major outliers are the single codon restrictions for tryptophan and methionine. Given that these amino acids are non-restricted in many mitochondrial variant codes, the single codon restriction may be due to genomic regulation. Alternatively, the restriction may reflect the strong metabolic expense of these two amino acids: tryptophan is the furthest from CO₂ and is also the largest amino acid in terms of fixed carbon, while methionine is the initiation amino acid (for reasons unknown), which comes with its own metabolic implications.