Optimal encoding rules for synthetic genes: the need for a community effort

Mol Syst Biol. 3: 134

A paradigm shift is underway within the methodology of heterologous protein expression. Specifically, researchers are moving away from conventional techniques of cloning genes from cDNA libraries and moving toward the rational design and de novo synthesis of entire protein-coding sequences from pre-annealed oligonucleotides (Libertini and Di Donato, 1992; Gustafsson et al, 2004). It was the invention of the polymerase chain reaction (PCR) that allowed efficient construction of synthetic genes. Since then, the steadily increasing accuracy and decreasing cost of oligonucleotide synthesis (now as low as $0.10 per base; Carlson, 2003; Carr et al, 2004; Kong et al, 2007; see Figure 1) have created a research environment in which gene synthesis offers three main advantages over molecular cloning: cost efficiency, scope and flexibility of redesign (Libertini and Di Donato, 1992). As a result, the emerging field of synthetic biology is highly motivated to improve this approach, as it seeks to expand the sophistication of human-engineered genetic architectures, leading ultimately to the synthesis of entire genomes (Yount et al, 2000; Smith et al, 2003).



Figure 1. 
The cost, per base, of commercial oligonucleotide assembly from 1999 to 2006. The price of gene synthesis has decreased almost 30-fold in the past 7 years. Data for years 1999–2003 are taken from Carlson (2003); data for years 2004–2006 reflect the lowest price found in advertisements placed within Science magazine.



Current research into synthetic gene construction has focused largely on improving PCR-based methods. Areas under active investigation include the following: increasing the accuracy of gene products by reducing errors in oligonucleotide construction and PCR synthesis/amplification (Ciccarelli et al, 1991; Young and Dong, 2004), reducing the relatively high cost of post-synthesis sequencing (Young and Dong, 2004), increasing the length of genes that can be synthesized (Kodumal et al, 2004), developing microchip-based technology and/or microfluidic devices that allow for the simultaneous assembly of multiple genes (Tian et al, 2004; Zhou et al, 2004; Kong et al, 2007), and automating the whole pipeline from gene design to synthetic gene screening (Cox et al, 2007). All frontiers show signs of rapid improvement (e.g., Xiong et al, 2004; Engels, 2005; Wu et al, 2006a); the current challenges for gene synthesis are therefore essentially optimizations of existing concepts.
In stark contrast, it appears that we have much left to learn when it comes to the conceptual design of gene sequences. For a significant fraction of the biologically and commercially important genes that have been redesigned, studies report little or no success in increasing protein expression (e.g., see Alexeyev and Winkler, 1999; Flick et al, 2004; Wu et al, 2004; Hillier et al, 2005). More surprisingly, some of these 'improvements' have led to a direct and observable reduction in protein production (Griswold et al, 2003). Even studies that do report increased protein yield require careful scrutiny, because many have not controlled for altered mRNA levels in their system (e.g., Deng, 1997; Alexeyev and Winkler, 1999; Feng et al, 2000; Humphreys et al, 2000; Nalezkova et al, 2005). Thus, although excellent progress in the practice of gene synthesis enables experimental implementation of the technique, the scientific community remains far from a complete understanding of what constitutes a rational design strategy for a protein-coding gene. Instead, the very concept of a 'translationally optimal codon' has grown to incorporate dimensions of translational speed, translational accuracy and sustainability of yield that could vary from one experiment to another. Meanwhile, we have learned that a codon's position within a coding sequence, its 'neighborhood' of other codons, its structural role within the mRNA sequence and the nature of the genomic system in which it is to be expressed can all influence the effects of 'synonymous' codon choices. Given that we can physically construct any gene, what rules define the appropriate sequence to manufacture? Here, we examine current progress and emerging challenges in both theory and practice, showing how this topic exemplifies the interdisciplinary challenges of 21st century biology.

Why redesign the coding sequence?
Modern expression vectors have undergone extensive manipulation to maximize mRNA transcription. Yet a relatively weak correlation can exist between expression levels of mRNA and those of translated protein products (e.g., Futcher et al, 1999; Nie et al, 2006). Thus, it is now widely understood that persistent poor expression of protein product can result from problems occurring at a post-transcriptional stage, especially at the point of translation (Kurland and Gallant, 1996; Gustafsson et al, 2004). The issue here is that the 'digital' portrayal of translation found in biology textbooks oversimplifies a bio-mechanical process in which different populations of tRNAs essentially compete to translate an appropriate codon of mRNA within the context of a ribosome (e.g., Rodnina et al, 2005). Different organisms can vary enough in their relative contents of isoaccepting tRNAs to change the dynamics of this competition, such that different choices from a suite of synonymous codons can influence the speed and accuracy of translation. For this reason, it can be useful to redesign a protein-coding sequence to suit its new context when moving it between genomes.

What should we build? The theory of synthetic gene design
The most direct method to find an optimal encoding for heterologous expression would be to comprehensively screen all possible alternative sequences. This is, however, impractical for sequences of any appreciable length because of the near-infinite encoding possibilities: approximately 3.7 × 10^71 different nucleic acid sequences (about 3^150, assuming an average of roughly three synonymous codons per residue) could encode a single peptide comprising 150 amino acids. Top-down screening procedures must therefore be guided by bottom-up gene design.
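As a rough illustration of this combinatorial growth, the following Python sketch counts the synonymous encodings of a peptide by multiplying the number of codons available for each residue. It assumes the standard genetic code; the names `CODON_TABLE`, `DEGENERACY` and `count_encodings` are illustrative and are not taken from any published tool.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons per amino acid ('*' marks stop codons).
DEGENERACY = {}
for aa in CODON_TABLE.values():
    DEGENERACY[aa] = DEGENERACY.get(aa, 0) + 1


def count_encodings(peptide: str) -> int:
    """Return the number of distinct coding sequences for a peptide."""
    return prod(DEGENERACY[aa] for aa in peptide)


# Met (1 codon) x Lys (2) x Trp (1) x Leu (6)
print(count_encodings("MKWL"))  # 12
```

Because the count is a product over residues, it depends on amino acid composition as well as length: peptides rich in leucine, serine or arginine (six codons each) have far more encodings than peptides rich in methionine or tryptophan (one codon each).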
To this end, a wealth of software has been developed to help bench scientists achieve reverse translation (Arentzen and Ripka, 1984; Mount and Conrad, 1984; Danckaert et al, 1987; Pesole et al, 1988; Presnell and Benner, 1988; Weiner and Scheraga, 1989; Bains, 1990; Tamura et al, 1991; Libertini and Di Donato, 1992; Makarova et al, 1992; Nash, 1993; Raghava and Sahni, 1994; Withers-Martinez et al, 1999; Hoover and Lubkowski, 2002; Fuglsang, 2003; Gao et al, 2004; Grote et al, 2005; Jayaraj et al, 2005; Richardson et al, 2006; Villalobos et al, 2006; Wu et al, 2006b; Puigbo et al, 2007). Broadly speaking, this software can be divided into two categories according to algorithmic purpose: one seeking gene designs that facilitate empirical sequence manipulations, the other seeking designs that translate well into protein products. Perhaps the two most salient features of this software are the diversity of opinion as to what rules will optimize translation and a general lack of awareness by each software solution that numerous competitors exist (Figure 2).
emphasis is needed to refocus efforts toward a 'narrow and deep' systematic comparison of different recoding strategies for a few genes. Meanwhile, the nearest we have to such a dataset is the set of numerous gene variants produced by evolution. It has long been recognized that codon usage frequency appears to be unequal for most synonymous codons within naturally occurring genomes (Grantham et al, 1980). Much of this bias is a passive reflection of the mutation biases at work in a genome (Sharp et al, 1993; Knight et al, 2001); however, it can be tricky to ascertain which features of which sequences have been shaped by natural selection. Not only do precise predictions from evolutionary theory rely on parameters that we may never know with certainty, but the noise-to-signal ratio implicit within any 'naturally optimized' sequence can confound the most careful analyses.
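Unequal usage of synonymous codons can be measured directly from a coding sequence. The Python sketch below tallies, for each amino acid, the fraction of its occurrences encoded by each synonymous codon. It assumes the standard genetic code and an in-frame, uppercase DNA sequence; the function name `synonymous_codon_fractions` is illustrative, and the toy sequence is invented purely for demonstration.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codon_fractions(cds: str) -> dict[str, dict[str, float]]:
    """Map each amino acid to the observed fraction of each of its codons."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group codon counts into synonymous families.
    families = defaultdict(dict)
    for codon, n in counts.items():
        families[CODON_TABLE[codon]][codon] = n

    return {
        aa: {codon: n / sum(family.values()) for codon, n in family.items()}
        for aa, family in families.items()
    }


# Toy sequence: five leucine codons (CTG x3, CTA x1, TTA x1).
fractions = synonymous_codon_fractions("CTGCTGCTGCTATTA")
print(fractions["L"])
```

Such a tally only describes the bias; as noted above, it does not by itself distinguish mutation bias from selection.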

Where to next? Specific objectives for future progress
Although the major unknowns of synthetic gene technology are mostly those of design theory, the current problem is an excess, not a deficit, of ideas. Major progress thus seems poised to occur when empirical studies start to compare these ideas systematically.
An important step would be to standardize experimental protocols and reports so that the emerging patchwork of results can be examined as a coherent whole. Specifically, experiments must standardize their measurement of mRNA expression levels for the target genes (as a baseline for interpreting protein yields), and measure protein production in absolute rather than relative terms (e.g., mg/l or percentage of total protein rather than 'n-fold increase/decrease') if they are to be compared.
A further step would be to identify one or more standardized (model) experimental systems for use by any and all research groups that are willing to share information. An ideal expression system would not be pre-engineered in any way that could confound interpretation of results (e.g., by containing enriched tRNA pools); it would employ a protein product that is amenable to clear, quantitative assay; and it could include an internal control (such as a dual reporter system in which only one gene has been redesigned) to add further confidence to measurements of protein yields.
The idea of standardization extends into the philosophy of bioinformatics software that predicts gene design. Current software typically requires a combination of logically independent gene optimizing steps as a mandatory, pre-packaged whole. This renders the comparison of results difficult and suggests the need for secondary algorithms designed to isolate specific gene features (e.g., changing codons while maintaining overall GC content, or varying GC content while maintaining RNA structural motifs).
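One way to isolate a single feature, offered here only as a sketch, is to shuffle codons among the positions that encode the same amino acid. The protein sequence, the overall codon usage and the GC content all stay fixed, while the placement of each codon along the gene changes. The code assumes the standard genetic code and an in-frame, uppercase DNA sequence; `shuffle_synonymous_codons` and its helpers are hypothetical names, not functions from an existing package.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds: str) -> list[str]:
    if len(cds) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds: str) -> str:
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def gc_count(cds: str) -> int:
    return cds.count("G") + cds.count("C")


def shuffle_synonymous_codons(cds: str, seed: int = 0) -> str:
    """Permute codons among positions that encode the same amino acid."""
    rng = random.Random(seed)
    codons = split_codons(cds)

    # Collect the positions occupied by each amino acid.
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)

    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


original = "ATGCTGAAACTCAAGTTA"  # toy sequence encoding MLKLKL
variant = shuffle_synonymous_codons(original, seed=1)
print(translate(variant), gc_count(original), gc_count(variant))
```

Comparing expression of the original and shuffled variants would then probe the effect of codon position and context alone, because composition is held constant.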
It is noteworthy that the underlying nature of all gene design software is similar and simple: a user must input a protein sequence and a genetic code. The protein sequence is then reverse translated into a nucleotide sequence using one or more algorithms, and the resulting nucleotide sequence is returned to the user. Independent applications must duplicate at least this much functionality. A promising direction of future software development in this field would be an emphasis on integration into a unified, distributed, modular web service for synthetic gene design. Specifically, programmers could take advantage of purpose-built web technologies, such as XML (a markup language for sharing data) and SOAP (a messaging protocol for wrapping independent applications), to facilitate interconnection of disparate, pre-existing software. New algorithms could be added as pathways through which a synthetic gene might travel en route to final design. This would provide users with a common interface through which they could choose the specific algorithm(s) to use at each step of synthetic gene design. Far from restricting the diversity of independent ideas for design services offered by different groups (on different web-servers), this type of coordination through a common interface would focus attention where it belongs: on the overlapping (and sometimes directly competing) concepts of how to design genes for optimal expression.
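The shared core described above (protein sequence in, nucleotide sequence out, with the algorithm chosen by the user) can be sketched as a small registry of interchangeable codon-choice strategies behind one function. This is an in-process Python illustration only, not a web service, and it does not use XML or SOAP. The names `STRATEGIES`, `design_gene`, `gc_rich` and `at_rich` are hypothetical, and the two strategies are deliberately simplistic placeholders rather than recommended design rules.

```python
from itertools import product
from typing import Callable

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Synonymous codons for each amino acid, in table order.
SYNONYMS: dict[str, list[str]] = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

# A strategy picks one codon from the synonymous options for a residue.
CodonChooser = Callable[[list[str]], str]


def _gc(codon: str) -> int:
    return codon.count("G") + codon.count("C")


def gc_rich(options: list[str]) -> str:
    return max(options, key=_gc)  # placeholder rule: most G+C


def at_rich(options: list[str]) -> str:
    return min(options, key=_gc)  # placeholder rule: least G+C


# Common interface: new algorithms are added by registering them here.
STRATEGIES: dict[str, CodonChooser] = {"gc_rich": gc_rich, "at_rich": at_rich}


def translate(cds: str) -> str:
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def design_gene(protein: str, strategy: str) -> str:
    """Reverse translate a protein using the named strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    choose = STRATEGIES[strategy]
    cds = "".join(choose(SYNONYMS[aa]) for aa in protein)
    # Every strategy must preserve the encoded protein.
    assert translate(cds) == protein
    return cds


print(design_gene("MKW", "gc_rich"))  # ATGAAGTGG
print(design_gene("MKW", "at_rich"))  # ATGAAATGG
```

Keeping the strategies separate from the shared reverse-translation step is what allows competing design rules to be compared on the same input and with the same output format.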

Critical assessment
Our suggested shift in research emphasis toward standardized protocols and integration of competing design strategies would create a foundation with potential that exceeds the capabilities of any one group or traditional collaboration. How then can the diverse interests of those working on synthetic gene design be harnessed into a common framework for progress?
We advocate the introduction of a competitive model, similar to the CASP approach that has been used within the protein folding research community (Moult, 2005). Given a standardized experimental protocol, it would be possible to pick genes of major research interest that are proving problematic for heterologous expression. For example, a recent study of Plasmodium falciparum, the causative agent of the most deadly form of malaria, reported that '12 targets, which did not express in Escherichia coli from the native gene sequence were codon-optimized through whole gene synthesis, resulting in the expression of three of these proteins' (Mehlin et al, 2006). Presumably, malaria researchers would be motivated to call for theoretical predictions of redesign that could help their situation. Theorists and software developers should in turn be motivated to demonstrate their algorithms' worth as the marketplace of redesign ideas becomes increasingly saturated, and those who research the optimization of gene assembly protocols (regardless of sequence content) would be motivated to absorb a significant fraction of the effort required for synthesizing these predictions. The net result would be a distributed (community-wide) version of the direct screening approach favored by early pioneers of synthetic gene technology (Stemmer et al, 1993; Humphreys et al, 2000), in which each segment of the community directly benefits from a united focus. If all designs were deposited within the SGDB (Synthetic Gene Database) (Wu et al, 2007), then this could quickly transform the knowledge base for synthetic gene technology. Fortunately, recent advances in multiplex gene synthesis technology suggest that the simultaneous synthesis of thousands of genes for large-scale experimental tests is feasible (Tian et al, 2004; Zhou et al, 2004; Cox et al, 2007; Kong et al, 2007), so the potential for large-scale comparison of predictions may be nearer than we think.
This is an ambitious vision, but the motivation is strong. Current synthetic gene technology offers the potential to become a foundational tool of systems biology. However, until we know how to optimize coding sequences, we cannot construct a single synthetic gene with confidence, let alone produce a whole synthetic genome.