Deep learning for optimization of protein expression

Recent progress in high-throughput DNA synthesis and sequencing has enabled the development of massively parallel reporter assays for strain characterization. These datasets map a large number of DNA sequences to protein expression levels, sparking increased interest in data-driven methods for sequence-to-expression modeling. Here, we highlight advances in deep learning models of protein expression and their potential for optimizing strains engineered to produce recombinant proteins. We review recent works that built highly accurate models and discuss challenges that hinder adoption by end users. There is a need to better align this technology with the constraints encountered in strain engineering, particularly the cost of acquiring large amounts of data and the requirement for interpretable models that generalize beyond the training data. Overcoming these barriers will help to incentivize the adoption of deep learning by academic and industrial laboratories.


Introduction
Production of recombinant proteins is a central goal in microbial engineering. A quantitative understanding of the relation between DNA sequence and protein expression is key for designing robust and predictable strains. Although the design of nucleotide sequences with bespoke properties is a long-standing goal of synthetic biology, and a prerequisite for accelerating the design-build-test cycle, predictive design is notoriously challenging because it relies on the ability to predict phenotype from genotype specifications.
Owing to recent progress in batch DNA synthesis and sequencing (Box 1), a number of strategies for massively parallel reporter assays have produced datasets with thousands and even millions of genotype-phenotype associations. There is growing interest in using such large datasets to build sequence-to-expression models that predict protein expression directly from nucleotide sequence. Deep learning models, in particular, have found success in a range of applications and are attracting substantial interest from the synthetic biology community. Once trained, such models can be queried in silico to infer relations between sequence and expression levels, which enables the rational design of sequences with improved phenotypes [1,2].
Here, we discuss recent progress in deep learning models for sequence-to-expression prediction. Our focus is on phenotype prediction from libraries of short, noncoding constructs typically employed for optimizing protein yield via control of transcriptional and translational efficiency. These include regulatory sequences such as promoters, operator sites, ribosomal binding sequences, and other genetic elements (Figure 1a). We exclude the large body of work on variant effect prediction [3,4] or whole protein-coding sequences [5], which has been reviewed extensively elsewhere [6-8].

Sequence-to-expression models
The traditional approach to strain engineering relies on phenotyping libraries of sequence variants and selecting a subset of top producers for further downstream validation, scale-up, or iterative design (Figure 1b). Recent years have witnessed the emergence of model-guided strategies that employ sequence-to-expression predictors to learn the shape of the phenotypic landscape (Figure 1c). Such models can then be queried iteratively within a sequence optimization loop [9,10].
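The optimization loop above can be sketched as a simple hill-climb over sequence space, with a trained model acting as the scoring oracle. This is a minimal illustration, not the procedure of any cited work; `toy_predict` is a hypothetical stand-in for a real sequence-to-expression model.

```python
import random

random.seed(0)  # reproducible illustration
BASES = "ACGT"

def mutate(seq, n_mut=1):
    """Return a copy of seq with n_mut random single-base substitutions."""
    s = list(seq)
    for pos in random.sample(range(len(s)), n_mut):
        s[pos] = random.choice([b for b in BASES if b != s[pos]])
    return "".join(s)

def optimize(seed_seq, predict, n_rounds=500):
    """Greedy hill-climb: keep a mutant whenever the surrogate model
    predicts higher expression than the current best sequence."""
    best, best_score = seed_seq, predict(seed_seq)
    for _ in range(n_rounds):
        candidate = mutate(best)
        score = predict(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy surrogate rewarding GC content; a stand-in for a trained model.
toy_predict = lambda s: (s.count("G") + s.count("C")) / len(s)
seq, score = optimize("ATATATATATATATAT", toy_predict)
```

In practice, the published approaches replace this greedy search with more sophisticated strategies (e.g. evolutionary or gradient-based search), but the structure of the loop is the same.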
Among various strategies for sequence-to-expression modeling [11], deep learning has rapidly emerged as a promising technology for building highly accurate predictors. At their core, deep learning models are data regressors that predict a continuous variable from a set of inputs. As in any regression problem, a priori knowledge of sequence features that correlate with expression is key to building accurate sequence-to-expression models. Although there is a rich literature on the genetic determinants of gene expression in various contexts [12-15], such relations do not typically have the predictive power required by synthetic biology applications [16]. Purely data-driven models have emerged as an effective alternative for prediction of heterologous expression, as they can detect highly nonlinear correlations between sequence and phenotype. Though such correlations do not have a mechanistic basis, they are useful for prediction, particularly because such regressors can be incorporated into search algorithms to find sequences with increased expression levels [17-19].
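To illustrate the regression setup, nucleotide sequences are typically one-hot encoded before being passed to a neural network; a minimal sketch of this standard input representation:

```python
import numpy as np

def one_hot(seq):
    """One-hot encode a DNA sequence into a (length, 4) matrix
    with columns ordered A, C, G, T."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, index[base]] = 1.0
    return x

x = one_hot("ACGT")  # each row activates exactly one channel
```

The resulting matrix is the input consumed by convolutional or recurrent architectures, which then regress a scalar expression level.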
Deep learning models made early successes in predicting many phenotypes such as transcription factor binding affinity [20], ribosome loading [21], and RNA splicing [22]. An early work in the field employed deep learning to predict expression from 500 000 untranslated regulatory sequences in BY4741 Saccharomyces cerevisiae grown in synthetic defined media lacking histidine and leucine [23]. A number of more recent works in the synthetic biology community have built models for a variety of design tasks that use short genetic elements to control heterologous expression. For example, a large dataset with over 100 000 000 synthetic yeast promoter sequences, using Y8205 cells grown in rich media (YPD) with glucose, glycerol, or galactose as feedstock, was employed to train deep learning models and study how transcription factors interpret cis-regulatory sequences to control gene expression [24]. The same year, another work built an ensemble of convolutional neural networks (CNNs) to predict expression from approximately 1 000 000 promoter sequences, derived from two natural promoters with varied transcription factor binding sites, in

Box 1 Large-scale genotype-phenotype data
In recent years, there has been enormous progress in high-throughput assays that can generate large amounts of data suitable for deep learning. These generally employ deep mutational screens with thousands of sequences in parallel. Instead of designing the sequence space with handpicked mutations, deep mutational scanning exploits massively parallel reporter assays and examines thousands to millions of variants in a single experiment [14,21,28-31]. To this end, a library of mutated variants is first synthesized, cloned into the appropriate vector, and introduced into a system where the protein encoded by the gene carries out a function that can be selected for; the selection enriches cells with active protein variants and depletes those with inactive ones. The library is retrieved from both input and post-selection cells, and high-throughput DNA sequencing is used to quantify the frequency of each variant: variants with high activity increase in frequency, whereas variants with low activity decrease in frequency. Finally, separation technologies, such as cell sorting, are used to place variants into bins, with the variants in each bin scored by DNA read counts [16,32].
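The read-count scoring described above can be sketched as a log-ratio enrichment computation. This is an illustrative simplification; production pipelines add sequencing-depth normalization and statistical error models.

```python
import math

def enrichment_scores(pre_counts, post_counts, pseudo=1.0):
    """Score each variant by the log2 ratio of its post-selection to
    pre-selection frequency; pseudocounts stabilize low-count variants."""
    pre_total = sum(pre_counts.values()) + pseudo * len(pre_counts)
    post_total = sum(post_counts.values()) + pseudo * len(pre_counts)
    scores = {}
    for variant, pre in pre_counts.items():
        f_pre = (pre + pseudo) / pre_total
        f_post = (post_counts.get(variant, 0) + pseudo) / post_total
        scores[variant] = math.log2(f_post / f_pre)
    return scores

# A variant enriched by selection scores positive; a depleted one, negative.
scores = enrichment_scores({"v1": 100, "v2": 100}, {"v1": 300, "v2": 20})
```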

Figure 1
Strategies for optimization of protein expression. (a) Maximization of yields and titers typically requires the design of bespoke noncoding sequences to control transcriptional and translational efficiency. For training sequence-to-expression models, the variant library must be assembled together with phenotypic readouts, and subsequently cleaned by, for example, removing variants with missing measurements or incomplete sequence readouts. (b) Traditional library-based approach to strain optimization. Libraries of random variants are generated using random mutagenesis and phenotyped to identify the top performers, which are then subject to further optimization and other downstream analyses. (c) Model-guided approach to strain optimization. A trained sequence-to-expression model can predict the landscape of protein yield, which can then be used to optimally navigate the sequence space toward maximal yield.
S. cerevisiae grown in YNB-U medium. Crucially, the work implemented a model-guided sequence optimization strategy (Figure 1c) to generate large and sequence-diverse promoter sets [25].
Recently, the work by Vaishnav et al. explored in detail the relationship between promoter sequence, expression phenotype, and fitness, providing a framework for designing regulatory sequences and addressing a range of questions on the evolution of regulatory sequences [1]. They generated a large genotype-phenotype dataset with more than 20 000 000 randomly generated promoter sequences and their expression levels in Y8205 S. cerevisiae, measured in complex (YPD) and synthetic media lacking uracil (SD-Ura). The dataset was employed to train CNNs that capture fitness landscapes and generalize beyond the sequence space employed for training, which allowed model predictions to be used as a surrogate fitness function in molecular evolution studies.
Beyond promoter sequences, several studies have focused on prediction of expression from libraries designed to control translational processes. Höllerer and colleagues combined phenotypic recordings with deep learning to predict function from genetic sequences in a rhamnose utilization-deficient TOP10 Escherichia coli strain cultivated in lysogeny broth (LB) [26]. They used a site-specific recombinase to record the effect of gene regulatory elements on DNA, enabling readout of both sequence and translation kinetics of over 300 000 E. coli ribosomal binding sites (RBS). This resulted in a high-resolution dataset used to train an ensemble of residual neural networks (ResNets) that predicted RBS activities and quantified the uncertainty of prediction. A different work by Angenent-Mari et al. demonstrated the predictive power of deep learning models trained on RNA toehold switches. Using a dataset with over 100 000 sequences in BL21 E. coli grown in LB medium, they trained deep neural networks that outperformed traditional thermodynamic models. A key innovation was the use of a nucleotide complementarity matrix representation to visualize the learned patterns of RNA secondary structure, which were employed to identify success and failure modes of different designs. A related work [27] built CNNs and language models to re-engineer poorly performing riboregulators.
A recent work explored the relation between model accuracy and data efficiency in a library of ∼200 000 UTRs in recA-deleted MDS42 E. coli grown in rich MOPS medium [2]. Unlike previous studies, this library was not randomized but built with a design-of-experiments approach that balanced coverage and depth of the sequence space [14]. The work trained a large panel of machine learning models of varied complexity, including both deep and non-deep learning models, revealing that accurate models can be trained on as few as a couple of thousand variants. Moreover, using expression screens in E. coli and S. cerevisiae, the authors showed that controlled sequence diversity can improve predictive performance across larger regions of the sequence space, with important gains in data efficiency.

Discussion
The growing interest in deep learning for strain design results from the availability of large screens of genotype-phenotype associations (Box 1). Recent works have demonstrated that accurate prediction together with sequence optimization can substantially accelerate the design of biological circuitry. However, the current literature also reveals a need to better align this technology with the tasks, resources, and constraints encountered in microbial engineering. A number of domain-specific challenges, such as the high cost of data acquisition or the inherent variability of biological data, have not been considered explicitly in the literature and prevent the adoption of deep learning by end users. Next, we explain these gaps in more detail.

Low-N prediction
Current literature shows a strong trend toward highly accurate models trained on large datasets (Figure 2a). This trend appears to be inherited from the machine learning field, where high-scoring models are preferred even if they require very large datasets for training. The rationale behind such large screens is to unbias the genotypic space and expand the high-confidence regions of the models, for example, by including weak regulatory elements that would normally be excluded from small-scale datasets. But in microbial engineering, the cost of acquiring such large datasets is beyond reach for most laboratories. While further progress in laboratory automation [33] or reductions in DNA synthesis and sequencing costs may lower this entry barrier in the future, in the short term there is a need to develop 'low-N' predictors that can help incentivize the adoption of machine learning by end users. One route for improvement is the development of low-dimensional sequence representations that correlate well with expression [34], as this would enable training models with smaller datasets.
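As one hypothetical example of such a low-dimensional representation, normalized k-mer counts compress a sequence of any length into a fixed-size vector; this is a generic illustration, not the specific representation proposed in [34]:

```python
from itertools import product
import numpy as np

def kmer_features(seq, k=3):
    """Represent a sequence by normalized k-mer counts: a fixed
    4**k-dimensional vector, regardless of sequence length."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    x = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        x[index[seq[i:i + k]]] += 1
    return x / max(1, len(seq) - k + 1)

features = kmer_features("GATTACAGATTACA")  # 64-dimensional vector
```

Vectors of this kind can be fed to small regressors such as ridge regression, which often remain trainable with only hundreds of labeled variants.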
Moreover, expression measurements display biological and technical variation that places a hard ceiling on the accuracy that should be expected from deep learning models. It is somewhat surprising that sequence-to-expression models are normally scored with absolute metrics of accuracy, as is common practice in other applications of machine learning, instead of metrics relative to the measured variation across biological replicates. Lower requirements on model accuracy may allow the use of smaller training data, particularly considering that good prediction scores can be achieved with datasets of a few thousand variants [26], and even datasets as small as hundreds of variants can produce models that may be sufficiently accurate for design [2].
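A replicate-aware metric of the kind suggested here could be sketched as follows, expressing a model's accuracy relative to the ceiling set by replicate-to-replicate correlation. The convention is illustrative, not a standard from the cited works.

```python
import numpy as np

def replicate_ceiling(rep1, rep2):
    """Squared Pearson correlation between two biological replicates:
    an estimate of the best R^2 any model could achieve on this assay."""
    r = np.corrcoef(rep1, rep2)[0, 1]
    return r ** 2

def relative_r2(y_true, y_pred, ceiling):
    """Model R^2 expressed as a fraction of the replicate-derived ceiling."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return (r ** 2) / ceiling
```

A model reaching, say, 80% of the replicate ceiling may be effectively as accurate as the assay allows, even if its absolute R² looks modest.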

Design of training data
Sequence-to-expression models developed so far have employed off-the-shelf deep learning architectures, all of which have been shown to deliver high predictive accuracy provided that the training data are sufficiently large (Figure 2a). As is common in applied machine learning, selection of optimal architectures requires extensive benchmarking of different models tested on data relevant to the particular application at hand (Figure 2b). Synthetic biology is no exception, and current literature suggests that such an empirical approach to model selection is highly effective. However, far less attention has been paid to the design of the data itself. Establishing what makes a good training dataset is particularly important when data sizes are severely limited. The majority of models built so far have employed fully randomized sequences for training. While this prevents models from inheriting bias from the training sequences, random libraries require a larger number of samples to balance the depth and breadth of coverage of the sequence space. Such considerations on experimental design have been discussed in the protein engineering literature [36], and have recently arisen in the context of sequence-to-expression models [2,37].
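The trade-off between coverage and sample count can be illustrated with a greedy max-min selection over Hamming distance, one simple way to build a sequence-diverse training library. This is a sketch, not the design-of-experiments procedure used in [2,14].

```python
def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def diverse_subset(candidates, n):
    """Greedy max-min selection: repeatedly add the candidate farthest
    (in Hamming distance) from the sequences already chosen, spreading
    the library across the sequence space."""
    chosen = [candidates[0]]
    pool = list(candidates[1:])
    while len(chosen) < n and pool:
        best = max(pool, key=lambda c: min(hamming(c, s) for s in chosen))
        chosen.append(best)
        pool.remove(best)
    return chosen

# Picks the two most dissimilar sequences from the candidate pool.
library = diverse_subset(["AAAA", "AAAT", "TTTT", "AATT"], 2)
```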

Model generalization across laboratories
The data required for training sequence-to-expression models are typically collected in a few growth conditions and using lab-specific experimental setups and strains. A key caveat of this approach is that it prevents models from generalizing to sequences that have been phenotyped by different laboratories, which likely employ different growth conditions, culture volumes, expression hosts, and a varied range of other experimental design decisions. As a result, laboratories must often train their own models using in-house data. This is inefficient and causes models to be poorly interoperable across laboratories. Moreover, such models cannot be employed to predict expression levels in growth conditions different from those employed for training, which can severely limit their scope of applicability, particularly because protein expression can be highly dependent on growth media.
The success of AlphaFold2 highlighted how highly standardized datasets, such as the structural data in the Protein Data Bank, can enable the construction of widely applicable models [38]. In the case of genotype-phenotype data, the innate biological variation and the current lack of standardized protocols for data acquisition prevent the construction of such general-purpose models. A potential solution may emerge from recent progress in natural language processing [39], where models pretrained on large datasets have shown an exceptional ability to generalize to new data after fine-tuning with small, task-specific data. Such approaches have already shown some promise for genomic data [40].

Interpretability
Purely data-driven models may be sufficient for a specific design task, but in strain engineering, designers often need to identify and interpret the specific molecular processes that contribute to the observed phenotype. This is particularly challenging for prediction of protein expression levels, as these depend on the joint action of transcription, translation, and their multiple interconnected regulatory mechanisms. It is unclear whether individual models for each of these processes, all of which require bespoke screening data, can be coupled into larger predictive models. Moreover, deep learning models are notoriously difficult to interpret. Classic methods based on motif detection have been employed in conjunction with sequence-to-expression models [23], and recently the field of Explainable AI [41] has produced many scoring methods that quantify the importance of input sequences [2,26]. Other methods borrowed from imaging applications (e.g. saliency maps) have also shown some success in explaining the output of sequence-to-expression models [35]. However, interpretability methods have generally been employed on a case-by-case basis, and their ability to provide insights for design is still limited.
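One widely used model-agnostic scoring approach, in silico saturation mutagenesis, can be sketched as follows; `predict` stands in for any trained sequence-to-expression model, and the toy model below is purely illustrative.

```python
def mutagenesis_profile(seq, predict):
    """Per-position importance via in silico saturation mutagenesis:
    the largest absolute change in predicted expression over all
    single-base substitutions at each position."""
    base_score = predict(seq)
    profile = []
    for i in range(len(seq)):
        deltas = [abs(predict(seq[:i] + b + seq[i + 1:]) - base_score)
                  for b in "ACGT" if b != seq[i]]
        profile.append(max(deltas))
    return profile

# Toy model: predicted expression is simply the fraction of G bases.
profile = mutagenesis_profile("AG", lambda s: s.count("G") / len(s))
```

Positions with large scores are those where the model is most sensitive to sequence changes, which is often how importance maps in the cited works are generated.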
The poor interpretability of machine learning algorithms contrasts with some of the thermodynamic models that have been widely adopted by the community for sequence-based prediction [42]. Such models can incorporate sequence information and contain sufficient mechanistic detail for meaningful biological interpretation. Most recently, the newly developed Promoter Calculator demonstrated the benefits of combining such thermodynamic models with machine learning [11]. Thermodynamic models, however, require substantial effort to build, and they are difficult to retrain on data from in-house strains or growth conditions employed by end users. A number of whole-cell models have provided mechanistic descriptions of microbial physiology [43,44]. These have a remarkable ability to reproduce wild-type phenotypes in various growth conditions, as well as various modes of failure for synthetic constructs [45], but they are unable to account for sequence information. A notable exception is the whole-cell E. coli model [46], but the large efforts required for its construction and simulation make it impractical for design tasks [47]. Overall, there is a strong need for sequence-to-expression models that combine the best of the three approaches: the flexibility of user-trainable machine learning models, the mechanistic interpretability of thermodynamic models, and the physiological insight of whole-cell models.

Conclusions
Deep learning methods are being adopted by most disciplines in synthetic biology, including, for example, cell-free systems [48], microbial communities [49], and metabolic engineering [50,51]. Protein production is no exception, and recent work has led to several sequence-to-expression models that map DNA sequences to protein expression levels. These models promise to revolutionize the design-build-test cycle with phenotypic predictions that can be directly linked to sequence design. Moreover, such models can be trained on in-house data, and provide flexibility to incorporate historical data and refine predictions using iterative design approaches. Such progress has largely followed the trends in the general AI and machine learning community, with a focus on complex models trained on large datasets. Current sequence-to-expression models have already demonstrated their ability to deliver highly accurate predictions. The next wave of deep learning models should explicitly address the inherent challenges of biological data and the needs of end users in microbial engineering. Budgetary constraints impose hard limits on the size of genotype-phenotype screens, and few academic or industrial laboratories can afford to acquire large data solely for the purpose of model training. One way forward is to prioritize the design of training data over the design of model architectures. This would mark a departure from the general approach in machine learning, and offer exciting opportunities for the development of application-specific tools for predictive modeling. Moreover, given that phenotype data are difficult to transfer across laboratories due to the use of different strains, growth conditions, or measurement hardware, the more widespread adoption of deep learning may ultimately rest on local models that can be trained with in-house data of feasible size. Failure to do so risks the kind of 'AI winter' that has pervaded other disciplines in the past, and misses an opportunity for laboratories that could otherwise benefit greatly from this technology. While machine learning models are not a replacement for mechanistic approaches, the ability to acquire high-throughput strain characterization data necessitates fresh approaches to extract actionable insights from such large data, a task for which deep learning is particularly well-suited.

Figure 2