Machine learning for metabolic engineering: A review.

Machine learning provides researchers a unique opportunity to make metabolic engineering more predictable. In this review, we offer an introduction to this discipline in terms that are relatable to metabolic engineers, as well as providing in-depth illustrative examples leveraging omics data and improving production. We also include practical advice for the practitioner in terms of data management, algorithm libraries, computational resources and important non-technical issues. A variety of applications ranging from pathway construction and optimization, to genetic editing optimization, cell factory testing and production scale-up are discussed. Moreover, the promising relationship between machine learning and mechanistic models is thoroughly reviewed. Finally, the future perspectives and most promising directions for this combination of disciplines are examined.


Introduction
Metabolic engineering is enjoying an auspicious moment, when its potential is becoming evident in the form of many commercially available products with undeniable impact on society. This discipline has produced: synthetic silk for clothing (Hahn, 2019; Johansson et al., 2014), meatless burgers that taste like meat because of bioengineered heme ("Meat-free outsells beef," 2019), synthetic human collagen for cosmetic purposes ("Geltor unveils first biodesigned human collagen for skincare market", 2019), antimalarial and anticancer drugs (Ajikumar et al., 2010; Paddon and Keasling, 2014), the fragrance of recovered extinct flowers (Kiedaisch, 2019), biofuels (Hanson, 2013; Peralta-Yahya et al., 2012), hoppy flavored beer produced without hops (Denby et al., 2018), and synthetic cannabinoids (Dolgin, 2019; Luo et al., 2019), among others. Since the number of possible metabolites is enormous, we can only expect these successes to significantly increase in number in the future. Traditional approaches, however, limit metabolic engineering to the usual 5-15 gene pathway, whereas full genome-scale engineering holds the promise of much more ambitious and rigorous biodesign of organisms. Genome-scale engineering involves multiplex DNA editing that is not limited to a single gene or pathway, but targets the full genome (Bao et al., 2018; Esvelt and Wang, 2013; Garst et al., 2017; Liu et al., 2015; Si et al., 2017). This approach can open the field of metabolic engineering to stunning new possibilities: engineering of microbiomes for therapeutic or bioremediation uses (Lawson et al., 2019), designing of multicellular organisms as biomaterials that match a specification (Islam et al., 2017), ecosystem engineering (Hastings et al., 2007), and perhaps even fusion of physical and biological systems.
None of these examples are likely to become reality through a traditional trial-and-error approach: the number of genetic part combinations that could produce these outcomes is a vanishingly small fraction of the total possible. For example, engineering a microbiome to produce a medical drug involves not only introducing and balancing the corresponding pathway in one or more of the microbiome species, but also modifying internal regulatory networks so as to keep the community stable and robust to external perturbations. Even for the case of single pathways and teams of highly-trained experts, the trial-and-error approach is hardly sustainable, since it results in very long development times: for example, it took Amyris an estimated 150 person-years of effort to produce the immediate precursor of the antimalarial artemisinin, and Dupont 575 person-years to generate propanediol (Hodgman and Jewett, 2012). An approach that pinpoints the designs that match a desired specification is needed.
The main challenge in more sophisticated biodesign is, arguably, our inability to accurately predict the outcomes of bioengineering (Carbonell et al., 2019;Lopatkin and Collins, 2020). New technologies provide markedly easier ways to make the desired DNA changes, but the final result on cell behavior is usually unpredictable (Gardner, 2013). If metabolic engineering is "the science of rewiring the metabolism of cells to enhance production of native metabolites or to endow cells with the ability to produce new products" (Nielsen and Keasling, 2016), the ability to engineer a cell to a specification (e.g. a given titer, rate and yield of a desired product) is critical for this purpose. Only the ability to accurately predict the performance of a genetic design can avoid an arduous trial-and-error approach to reach that specification.
Moreover, while the flourishing offshoots of the genomic revolution provide powerful new capabilities to discover new DNA sequences, understand their function, and modify them, it is not trivial to harness these technologies productively. The genomic revolution has provided the DNA code as a condensed set of cell instructions that constitutes the main engineering target, and functional genomics to understand the cell behavior. Furthermore, the cost for these data is rapidly decreasing: sequencing cost decreases faster than Moore's law, transcriptomic data grow exponentially (Stephens et al., 2015), and high-throughput workflows for proteomics and metabolomics are slowly becoming a reality (Zampieri et al., 2017). But many researchers find themselves buried in this "deluge of data": there seems to be more data than time to analyze them. Furthermore, data come in many different types (genomics, transcriptomics, proteomics, metabolomics, protein interaction maps, etc.), complicating their analysis. As a result, analysis of functional genomics data often does not yield sufficient insights to infer actionable strategies to manipulate DNA for a desired phenotype. Moreover, CRISPR-based tools (Doudna and Charpentier, 2014; Knott and Doudna, 2018) provide easy DNA editing and metabolic perturbations (e.g. CRISPRi (Tian et al., 2019)). These tools provide the potential to perform genome-wide manipulations in model systems (Wang et al., 2018), and a growing number of hosts (Peters et al., 2019). However, it is not clear how to prioritize the possible targets. Rational engineering approaches have proven useful in the past (George et al., 2015; Kang et al., 2019; Tian et al., 2019), but the detailed knowledge of a pathway can produce on the order of tens of targets, whereas CRISPR-based tools can reach tens of thousands of genome sites (Bao et al., 2018; Bassalo et al., 2018; Garst et al., 2017; Gilbert et al., 2014).
Machine learning (ML) is a possible solution to these problems. Machine learning can systematically provide predictions and recommendations for the next steps to be implemented through CRISPR (or other methods (Paschon et al., 2019; Reyon et al., 2012; Wang et al., 2019)), and it can use the exponentially growing amounts of functional genomics data to systematically improve its performance. Machine learning has already proven its utility in many other fields: self-driving cars (Duarte and Ratti, 2018), automated translation, face recognition (Voulodimos et al., 2018), natural language parsing (Kreimeyer et al., 2017), tumor detection (Paeng et al., 2016), and explicit content detection in music lyrics (Chin et al., 2018), among others. It has the potential to produce similar breakthroughs in metabolic engineering.
However, a change in perspective is required regarding the relative importance of molecular mechanisms. Whereas the machine learning paradigm concentrates on enabling predictive power, metabolic engineers typically define scientific value around the understanding of genetic or molecular mechanisms (see section 4.0). Nonetheless, the biological sciences (including computational biology) have been particularly challenged to make accurate quantitative predictions of complex systems from known and tested mechanisms. Hence, if accurate quantitative predictions are needed for a more transformative metabolic engineering, it may be desirable to shift some of the emphasis from identifying molecular mechanisms into enabling data-driven approaches. This apparent detour may, in the end, more efficiently produce mechanistic models, if we combine the predictive power of machine learning with the insight of molecular mechanisms (Heo and Feig, 2020).
In this review we provide an explanation of machine learning in metabolic engineering terms, in the hope of building a bridge between both disciplines. We explore the promises of machine learning and its current pitfalls, and provide examples of how it has been used so far as well as auspicious future uses. In short, we will make the case that machine learning can take metabolic engineering to the next step in its maturation as a discipline, but it requires a conscious choice to understand its limitations and potential.

What is machine learning?
Machine learning is a subdiscipline of Artificial Intelligence (AI), which attempts to emulate how a human brain understands, and interacts with, the world (Fig. 1). A fully functioning AI would enable us to perform the same processes as human metabolic engineers: choose the best molecules to produce, suggest possible pathways to produce them, select the right pathway design to obtain the desired titers, rates and yield, and interpret the resulting experimental data to troubleshoot the metabolic engineering effort. A fully functioning AI would of course be useful for many other tasks such as: fully autonomous cars and planes, recommending medical treatments, directing agricultural practices, reading and summarizing texts like a human, automating translations between different human languages, and producing music and movies. Obviously, we do not yet have fully functioning AIs (often referred to as strong AI or artificial general intelligence (Pei et al., 2019; Walch, 2019)), and it is a continuing debate whether we will ever have them (Melnyk, 1996), but AI approaches have been quite successful in some bounded tasks such as playing chess and Go better than humans (Silver et al., 2016, 2018), or predicting protein structures from sequence (AlQuraishi, 2019). Since AI and machine learning are generally applicable tools, some of these partial successes can be very useful for metabolic engineering (see section 3 for examples).
Machine learning is the study of computer algorithms that seek to improve automatically through experience (i.e. learning), often by training on supervised examples (Fig. 2), an approach known as supervised machine learning. This works by statistically linking an input to its associated response for several different examples: e.g. promoter choice for a pathway and the corresponding final production, protein sequence and its function, etc. (Figs. 2 and 3). It is important to realize that the emphasis is on predicting the response, rather than on producing mechanistic understanding. In fact, the algorithm linking input and response is not meant to represent a mechanistic understanding of the underlying processes: for example, it does not model the full chain in which promoters drive the expression of enzymes, which then catalyze reactions that transform metabolites and result in a given production. Rather, the algorithm is chosen to be as expressive as possible, so that it can learn any relationship between input and response. Hence, none of the biological information is encoded in the algorithm; all the biological information is provided by the training data, which must be carefully selected (supervised) so the algorithm can learn the desired relationship (promoters to production, protein sequence to function, etc.), generalize it, and be able to predict it for new inputs that were not in the training set (Fig. 3). This difference is crucial with respect to traditional metabolic engineering and microbiology, where understanding the mechanism is considered of paramount importance (see section 2.2.1 for a specific example). In machine learning, we can find ourselves able to predict that, e.g., a given promoter choice will give the best production, without being able to explain the metabolic mechanism behind that optimal production.
This state of affairs has its pros and cons, and efforts have been made to introduce biological prior knowledge in the algorithms (see section 4).
There is a continuous interplay between the complexity of a supervised machine learning algorithm and the amount of data available to train it (Fig. 4). If the model/algorithm is not expressive enough (not enough parameters), it will be unable to describe the data accurately (underfitting). If the model has many more parameters than there are data instances, it will just "memorize" the training data set rather than grasp the underlying general patterns required to predict new inputs (overfitting). In this case, the algorithm will produce exceedingly good results for the training set, but very poor ones for any new input that is used as a test (Figs. 3 and 4). Cross validation (Fig. 3) provides an effective way to choose the number of parameters, because both overfitting and underfitting result in very poor predictions on the held-out data.
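The interplay between model complexity and held-out error described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (a made-up saturating response with noise), the "model" is a polynomial whose degree stands in for the number of parameters, and the helper names `cv_mse` and `train_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 30 "strains", one input (e.g. a scaled enzyme
# level) and a noisy, saturating response (e.g. titer).
x = rng.uniform(0.0, 1.0, size=30)
y = x / (0.2 + x) + rng.normal(0.0, 0.05, size=30)

def train_mse(x, y, degree):
    """Mean squared error on the same data used for fitting."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def cv_mse(x, y, degree, k=5):
    """Mean squared error on held-out folds (k-fold cross validation)."""
    idx = np.arange(len(x))
    folds = np.array_split(rng.permutation(idx), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[held_out])
        errors.append(np.mean((pred - y[held_out]) ** 2))
    return float(np.mean(errors))

# degree + 1 is the number of fitted parameters.
for degree in (0, 2, 10):
    print(degree, train_mse(x, y, degree), cv_mse(x, y, degree))
```

The training error can only go down as parameters are added, so it cannot be used to choose the model; the held-out error is the quantity to compare across complexities.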
There are many supervised machine learning algorithms available in the public domain: linear regressions, quadratic regressions, random forest, support vector machines, neural networks, Gaussian process regressors, gradient boosting regressors (the popular library scikit-learn provides a good starting point with an extensive list and explanations (Pedregosa et al., 2011)). To give a concrete example, a classic machine learning algorithm is the decision tree, which can be used, for example, to predict which protein expression levels result in high production (Fig. 5). As can be observed, this algorithm represents a high-level abstraction of how humans are believed to think. Because no single algorithm is best for every learning task (Wolpert, 1996), a significant endeavor when applying machine learning is choosing the optimal algorithm for your problem (and its hyperparameters, see Fig. 5). Ensemble modeling is an alternative approach that sidesteps the challenge of model selection (Radivojević et al., 2020). Ensemble modeling combines several different models and has them "vote" for a particular prediction. Based on their performance, a different weight is assigned to each algorithm. The examples of the random forest algorithm (Ho, 1995) or the super learner algorithm (van der Laan et al., 2007) have demonstrated that even very simple models can increase their performance significantly when used as an ensemble (e.g., several decision trees in a random forest algorithm).
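As a rough illustration of the decision tree and its ensemble counterpart, the sketch below uses scikit-learn on synthetic data. The data set, the noise level and the hyperparameter values are assumptions chosen only for the example; which model scores higher will depend on the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in (assumption): two protein expression levels per strain,
# with "high production" only inside a window of expression values.
X = rng.uniform(0.0, 10.0, size=(200, 2))
high = (X[:, 0] > 3) & (X[:, 0] < 7) & (X[:, 1] > 4)
flip = rng.random(200) < 0.10            # 10% label noise
y = np.where(flip, ~high, high).astype(int)

# Hyperparameters (set before training): tree depth, minimum split size.
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=0)
# A random forest is an ensemble of decision trees that vote.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=5)      # accuracy per fold
forest_scores = cross_val_score(forest, X, y, cv=5)
print(tree_scores.mean(), forest_scores.mean())

# The split points learned from the data are the tree's parameters.
tree.fit(X, y)
print(tree.tree_.threshold[tree.tree_.feature >= 0])
```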
Learning without supervision also constitutes an important part of machine learning, given the significant effort involved in creating labeled data sets.

Fig. 1.
Deep learning. Machine learning is a subdiscipline of Artificial Intelligence, which attempts to reproduce how human brains think. Symbolic AI (or Good Old Fashioned AI, or GOFAI) is a part of AI devoted to reproducing thought through symbolic representations of the world. In contrast, machine learning mimics thought using algorithms that learn a task (e.g., identify a dog) from data. GOFAI was dominant in the early stages of AI (1950s-1980s) but has now lost relative influence. Machine learning, however, is now the dominant branch of AI and focuses on improving performance through the acquisition of experience in terms of data. Among the many possible algorithmic approaches in machine learning, neural networks (Fig. 7) have become most popular since ~2010 because their performance seems not to saturate as easily as other methods (Fig. 8). Neural networks with many layers (Fig. 8) are called deep neural networks, and constitute the basis for Deep Learning.

Fig. 2.
Machine learning basics. Supervised machine learning algorithms define learning in a narrow way: the ability to predict a response (e.g. the target compound production) from a set of inputs (e.g. protein concentrations for a pathway). The inputs (or features) and response (or output) can be numbers (e.g. protein concentrations) or categories (e.g. different available promoters). All supervised machine learning algorithms follow this general architecture. Because the algorithm linking input and response does not include mechanistic information, but is rather chosen to be as expressive as possible, machine learning can predict relationships between very diverse inputs and outputs: e.g., production and enzyme choice (see section 2.2.2), metabolite rate change and multiomics measurements (section 2.2.1), or protein sequence and protein function. The supervision consists in providing training data consisting of the input and the associated response. This labeling of the input data to teach the algorithm the right associations is the step that is most arduous and costly, particularly for large data sets. This has prompted AI researchers to develop methods that do not require this step (Fig. 6).

Fig. 3.
Machine learning terminology. The standard workflow for supervised machine learning involves first using a training data set (including the inputs, or features, and the corresponding responses, or labels) to train the chosen algorithm. The training data set is composed of instances or examples of the inputs and response to be learnt. Instances depend on the problem to be learnt: they could be different strains and conditions (section 2.2.2 example), time points (section 2.2.1 example) or different proteins. The goal is for the algorithm to be able to predict the response for inputs that it has never seen before (i.e. were not in the training set), which is the ultimate test of its performance. A way to foresee how the algorithm will perform under such a test is to use only part of the training data set (all data except red overlay) for training, and then check the predictions for the remaining inputs (red overlay), to be compared with the known responses. This procedure is called validation and, if performed several times by holding out a different fraction of the training data set each time, it takes the name of cross validation. A 10-fold cross validation, for example, holds out a different 10% of the training set in each of ten rounds and tests the predictions on it. Cross validation is a good way to determine the needed algorithm complexity (Fig. 4). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 4.
Model complexity vs data availability. The number of parameters (model characteristics that can be changed to fit data, see Fig. 5) in a model provides an idea of its complexity (more parameters → more complex). If the number of parameters is much smaller than the number of instances, the model cannot hope to describe the training data (underfit model). This can happen with "long and skinny" training data: few inputs and many instances. If the number of parameters is much bigger than the number of instances, the model can fit the training data closely (overfit model) but will be unable to generalize beyond the training set. The solution for the underfitting case is straightforward: increase the complexity of the model (number of parameters). The solution for the overfitting case is reducing the number of model parameters. However, if the number of inputs/features is high, it may be impossible to do so. This is often the case in metabolic engineering, where omics data sets displaying tens of thousands of features are available, but only for ~100 instances ("short and fat" training data). It then becomes imperative to choose the most informative features through the feature selection methods provided by unsupervised learning (Fig. 6). This feature selection is needed to avoid the "curse of dimensionality": i.e., the amount of data needed to support results in a statistically sound fashion often grows exponentially with the dimensionality. Poor cross validation scores (Fig. 3) can help identify both overfitting and underfitting.
The areas of machine learning that address the cost of labeling data are unsupervised learning and reinforcement learning. Unsupervised learning searches for patterns in a data set with no pre-existing labels, requires only minimal human supervision, and often attempts to create clusterings or representations that aid human understanding or reduce dimensionality (Fig. 6). Examples of unsupervised machine learning algorithms include Principal Component Analysis (PCA), K-means clustering (Sculley, 2010), and Singular Value Decomposition (Manning et al., 2008). Familiar examples in metabolic engineering include identifying patterns in metabolomics profiles that distinguish between different types of cells: healthy vs. sick (Sajda, 2006), stressed vs. non-stressed (Luque de Castro and Priego-Capote, 2018; Mamas et al., 2011), or high-producing vs. low-producing (Alonso-Gutierrez et al., 2015). Reinforcement learning represents a different paradigm of learning from experience, which posits that humans learn not from properly labeled examples, but rather from interacting with and probing their environment. Hence, the aim of reinforcement learning is to use experience and data to update an internal policy that optimizes a desired goal (Fig. 6). A prime example of this approach (Treloar et al., 2020) is controlling a bioreactor which contains a co-culture (environment), through manipulations of the concentration of auxotrophic nutrients flowing into the reactor (actions), and informed by the relative abundances (measurements), to ensure a specified co-culture composition (goal). Perhaps the best-known example of reinforcement learning are the Hidden Markov Models (HMMs) that are commonly used to annotate genes and align sequences (Yoon, 2009).
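The unsupervised dimensionality reduction mentioned above (PCA, computed here through a singular value decomposition) can be sketched as follows. The data are synthetic and the group structure is an assumption made for illustration; the group labels are kept aside and never shown to the algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "short and fat" data (assumption): 20 samples x 200 metabolite
# features. Two hidden groups (e.g. low vs. high producers) differ in 40 features.
n_per_group, n_features = 10, 200
X = rng.normal(0.0, 1.0, size=(2 * n_per_group, n_features))
X[n_per_group:, :40] += 3.0
groups = np.repeat([0, 1], n_per_group)   # used only to inspect the result

# PCA via singular value decomposition of the mean-centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                 # samples projected onto PC1 and PC2
explained = s**2 / np.sum(s**2)           # fraction of variance per component

print(explained[:2])
print(np.sign(scores[:, 0]), groups)      # compare PC1 sign with hidden group
```

Plotting the two columns of `scores` against each other gives the kind of two-dimensional projection shown for unsupervised methods in Fig. 6.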
Reinforcement learning has also been applied to suggest pathways for specific molecules (Koch et al., 2020) or molecules that fit desired properties (Popova et al., 2018), as well as to optimize large-scale bioreactor fermentations using online continuous process data (see section 3.3). However, there is still generally a dearth of reinforcement learning examples in metabolic engineering, which represents an opportunity for this type of machine learning.

Fig. 5.
Example of a supervised machine learning algorithm: a decision tree. Decision trees come from an abstracted view of how human learning works, rather than a mechanistic understanding. Decision trees automatically build a decision "flowchart" that, in this case, predicts high or low production based on the protein expression levels. An example training data set and corresponding decision tree are shown in panels A and B, respectively, based on a set of strains (instances) and their production (response) depending on different protein expression levels (input features). Using the training data set, the algorithm decides on the optimal split points (x1, x2 and y1) to predict the production based on the input features. The split points are the parameters of the algorithm: more parameters will allow the algorithm to describe more instances. The algorithm also has a number of "hyperparameters" which are set before training, including the maximum tree depth, and the minimum number of instances required to split a node, among others (see the scikit-learn library for more details). Decision trees form the base for one of the most popular algorithms: the random forest. The random forest algorithm is just an ensemble of decision trees.

Fig. 6.
Other types of machine learning that do not require labeled data. Unsupervised methods and reinforcement learning were created to avoid the cumbersome process of labeling data for supervised methods. Unsupervised machine learning methods often search for patterns that aid human understanding or reduce dimensionality (e.g. PCA). For example, in this case the algorithm projected the five inputs (e.g. metabolomics data) into a two dimensional plane that groups them according to similarity. This type of dimensionality reduction can be very useful for feature selection (Fig. 4). Reinforcement learning methods attempt to achieve a goal through a continuous interaction with an environment from which they learn through a variety of measurements, and on which they can act through a menu of actions. The result of the actions as viewed by the measurements is used to iteratively update an internal policy that dictates future actions.

Fig. 7.
Artificial neural networks are a particular type of machine learning algorithm that loosely mimics how neurons work (Fig. 1). Neurons are modeled as having a set of inputs (dendrites) and a single long axon that serves as output (A). Artificial neural network cells mimic that: several inputs combined linearly and a non-linear output (B). The output is combined with other cell inputs, creating an artificial neural network (ANN). Here we see a fully connected network where all cells from each layer are connected to all cells in the next layer (C). This type of architecture results in many parameters (w_ij and b_j), which require large amounts of data to determine (Fig. 4). Deep neural networks are ANNs with many layers (Fig. 8). Given the biological origin of ANNs, there is a significant interest in the AI field in obtaining further inspiration from biomimicry.

Deep learning (DL, Fig. 7 (LeCun et al., 2015)) is a specific type of machine learning algorithm that has been particularly successful in the past decade (Fig. 8). This algorithm has been shown to keep improving its performance with the amount of training data when other methods plateau. Deep learning is based on Artificial Neural Networks, which attempt to mimic how neurons work (Fig. 7). In the last decade, deep learning has been the basis of the most celebrated AI achievements. However, compared to more classical machine learning methods, deep learning generally requires far larger amounts of data for training: tens of thousands to millions of instances, as opposed to hundreds or thousands (although that depends on the number of inputs, see section 2.2.2). The reason for this hunger for large data sets is that deep neural networks can include thousands to millions of parameters, which need to be determined from the data (see Fig. 4). In metabolic engineering, the use of deep learning has been sparse for this reason: the data sets tend to be small (<100 instances), with the notable exception of sequence data. Deep learning has been most useful with sequence data: e.g., to predict protein function (Ryu et al., 2019), or translation initiation sites (Clauwaert et al., 2019) (see section 3.0). However, this is expected to change as more high-throughput methods to characterize cellular components become available, provided that data are structured consistently and stored appropriately (see section 2.4). Indeed, techniques to generate high quality omics data are improving rapidly and the cost per sample is decreasing (Stephens et al., 2015), so application of deep learning to metabolic engineering might become commonplace soon.
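To make the parameter count behind this data hunger concrete, the short sketch below counts the weights and biases of a fully connected network. The two architectures are hypothetical and chosen only for illustration; `count_parameters` is not taken from any library.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of cells per layer, inputs first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weights w_ij between consecutive layers
        total += n_out         # one bias b_j per cell in the receiving layer
    return total

# Hypothetical architectures (assumptions, not from any cited study).
small = count_parameters([10, 8, 1])           # 10 inputs, one hidden layer
omics = count_parameters([5000, 256, 64, 1])   # 5000 omics features as inputs

n_instances = 100  # order of magnitude typical of metabolic engineering data
print(small)                 # 97
print(omics)                 # 1296769
print(omics / n_instances)   # parameters per available instance
```

Even this modest network on omics-sized inputs has over a million parameters, roughly four orders of magnitude more than the number of instances usually available.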

A couple of illustrative examples of machine learning in metabolic engineering
We will now illustrate how machine learning algorithms work through two different applications that elucidate particularly important points: predicting the kinetics of a metabolic network, and optimizing cell-free butanol production. We have focused on these examples because we believe they most relate to the day-to-day activities of metabolic engineers: leveraging omics data and improving production.

Kinetic learning: relearning Michaelis-Menten dynamics through machine learning
Our first example uses machine learning to tackle a commonly encountered problem in bioengineering: predicting the kinetics of a metabolic pathway. Predicting pathway dynamics can enable a much more efficient pathway design by allowing us to foresee which pathway designs will meet our specifications (e.g., titers, rates and yields). Classic kinetic models predict the rate of change of a given metabolite based on an explicit functional relationship between substrate/product concentrations (metabolites) and enzymes (protein abundance, substrate affinity, maximum substrate turnover rate). Michaelis-Menten kinetic models (Costa et al., 2010; Heinrich and Schuster, 1996) have historically been the most common choice. In reality, the true functional relationships between metabolites and enzymes are unknown for most reactions due to gaps in our understanding of the mechanisms involved, resulting in poor prediction capabilities.
Costello et al. (Costello and Martin, 2018) showed that supervised machine learning (Fig. 2) can offer an alternative approach, where the relationship between metabolites and enzymes can be directly "learnt" from time series of protein and metabolite concentration data. In a sense, this approach involves relearning the equivalent of Michaelis-Menten kinetics based purely on data. This is a prime example of how a machine learning approach ignores mechanism in favor of predictive power: there is no intention that the function predicting metabolite change rate from proteins and metabolites describes a mechanism, but it offers the best prediction of the final limonene/isopentenol production, which is what we require for our engineering. In this case, the inputs (Fig. 3) were the exogenous pathway protein and metabolite concentrations, and the response was the rate of change of the metabolite. The instances were the time points for which the metabolite rates of change were learnt.
This approach outperformed a classic kinetic model in predictive power using very little data: only three time series of protein and metabolite measurements of 7 time points each (for two different pathways). While it would be desirable to have hundreds of time point measurements, the high cost and time associated with performing multiomic experiments typically constrain data sets to fewer than 10 time points/samples, which is too sparse for training accurate models. Critical to its success, hence, was the use of data augmentation to increase the number of available instances from the initial 7 time points to the final 200 used for learning. Data augmentation simulates additional instances by modifying or interpolating actual data. In this case, data augmentation involved first smoothing the data (via a Savitzky-Golay filter) and then interpolating new data points from the fitted curve. This augmentation scheme only assumed continuity and smoothness between time points, but provided sufficient data to train, from only 2 time series, a machine learning model that accurately predicted the pathway dynamics of the "unseen" third strain. The final predictions of metabolite concentrations for the exogenous pathways, although not perfect by any measure, were more accurate than equivalent predictions by a hand-crafted kinetic model. More importantly, while the kinetic model took weeks to produce through arduous literature search, the kinetic learning approach can be systematically applied to any pathway, product and host with no extra overhead.
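The augmentation step described above can be sketched with SciPy as follows. This is a minimal illustration, not the published implementation: the time series are synthetic, and the filter window, polynomial order and the use of a cubic spline for interpolation are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

rng = np.random.default_rng(3)

# Synthetic measurements (assumption): 7 time points (hours) for one protein
# and one metabolite.
t = np.linspace(0.0, 72.0, 7)
protein = 2.0 * t / (10.0 + t) + rng.normal(0.0, 0.05, size=7)
metabolite = 5.0 * (1.0 - np.exp(-t / 20.0)) + rng.normal(0.0, 0.1, size=7)

def augment(values, t, t_dense, window=5, polyorder=2):
    """Smooth a short series, then interpolate it and its slope densely."""
    smoothed = savgol_filter(values, window_length=window, polyorder=polyorder)
    spline = CubicSpline(t, smoothed)
    return spline(t_dense), spline(t_dense, 1)   # values, first derivative

t_dense = np.linspace(t[0], t[-1], 200)          # 7 -> 200 instances
protein_dense, _ = augment(protein, t, t_dense)
metabolite_dense, metabolite_rate = augment(metabolite, t, t_dense)

# Augmented training set: inputs are concentrations at each time point,
# the response is the metabolite's rate of change at that time point.
X_aug = np.column_stack([protein_dense, metabolite_dense])
y_aug = metabolite_rate
print(X_aug.shape, y_aug.shape)
```

Any supervised regressor can then be trained on `X_aug` and `y_aug`; note that the 200 augmented instances are not independent measurements, so they add smoothness assumptions rather than new experimental information.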
An opportunity to improve the machine learning model predictions of Costello et al. would of course be to collect more data, but deciding which data to collect is not always clear. For example, instead of using protein and metabolite data only from the exogenous pathways as input features, protein and metabolite measurements from the full host metabolism could be added (surely, host metabolic effects like ATP supply must be relevant). However, using these extra data would not necessarily improve machine learning predictions. This is because many machine learning algorithms suffer from the "curse of dimensionality": that is, the amount of data needed to support results in a statistically sound fashion often grows exponentially with the dimensionality of the input (Fig. 4). Hence, machine learning algorithms may struggle to learn from data sets that have many measurements or "input features" (columns), but few instances (rows). Adding host proteins and metabolites will increase the number of inputs without increasing the number of instances. Unfortunately, most multi-omic data sets used in metabolic engineering fit this description, containing more than 5000 measurements (e.g. protein or metabolite abundances), but only tens to hundreds of instances (e.g. different time points, strains, or growth conditions, depending on what your algorithm is attempting to learn) (Fig. 4). Therefore, collecting as many instances as possible should be emphasized early on during experimental design (see section 2.4).
When generating more data is not possible, algorithms that reduce the number of input features to the most important ones can be applied, a process known as feature selection. Feature selection (Pedregosa et al., 2011) was used by Costello and Martin (2018) to identify a subset of the input features based on their contribution to the model's error. This more limited, curated set of features was then used to predict metabolite dynamics. The idea behind this is to remove non-informative or redundant input features from the model. An additional approach used was dimensionality reduction, where "synthetic features" are created that transform the original input features into fewer ones (or "lower dimensions") based on their contribution to explaining the data's variability (for example, via principal component analysis). Similar to feature selection, these algorithms simplify the data set in order to better fit a machine learning model. These approaches were integrated into a machine learning pipeline using the tree-based pipeline optimization tool (TPOT) (Olson et al., 2016;Olson and Moore, 2019), which automatically selected the best combination of feature preprocessing steps and machine learning models from the scikit-learn library (Pedregosa et al., 2011) to maximize prediction performance.
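To make the combination of feature selection and dimensionality reduction concrete, the following sketch chains both steps in a scikit-learn pipeline. It is not the TPOT-generated pipeline from the study: the data are synthetic, feature selection here ranks features by random forest importance (a swapped-in criterion), and the numbers of retained features and components are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "wide" data set: few instances (rows), many input features (columns).
rng = np.random.default_rng(0)
n_instances, n_features = 60, 500
X = rng.normal(size=(n_instances, n_features))
# Only the first two features carry signal in this made-up response.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n_instances)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # Feature selection: keep the 20 features ranked highest by the forest.
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0),
        max_features=20, threshold=-np.inf)),
    # Dimensionality reduction: 5 principal components of the kept features.
    ("pca", PCA(n_components=5)),
    ("model", Ridge(alpha=1.0)),
])
pipeline.fit(X, y)

# Shape of the data as seen by the final model.
X_reduced = pipeline[:-1].transform(X)
print(X_reduced.shape)
```

Placing both steps inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking information from held-out instances.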

Artificial neural networks to improve butanol production in cell-free systems
Our second example involves using deep neural networks to optimize cell-free butanol production (Karim et al., 2020). Here, the authors provide an example of how machine learning can accelerate the design-build-test-learn (DBTL) cycles used in metabolic engineering (Nielsen and Keasling, 2016), by effectively guiding pathway design. In this study, the authors optimized a six-step pathway for producing n-butanol, an important solvent and drop-in biofuel, using a cell-free prototyping approach (iPROBE). iPROBE reduces the overall time to build pathways from weeks or months to a few days (around five in this case), providing the quick turnaround and large numbers of enzyme combinations that can enable successful use of machine learning. Several pathway variants were constructed in vitro and scored based on their measured butanol production through a TREE score which combines titer, rate, and enzyme expression. The challenge, however, lies in analyzing the sheer number of pathway combination possibilities. Testing only six homologs for the first four pathway steps at 3 different enzyme concentrations would result in 314,928 pathway combinations (strain genotypes). Even with the increased turnover provided by the cell-free approach, it would take years for typical analytical pipelines to exhaustively test the landscape of possible combinations. Therefore, a data-driven design-of-experiments approach was implemented using neural networks to predict optimized pathway designs (homolog sets and enzyme ratios) from an initial data set that could subsequently be tested. In this case the input for the neural network was the enzyme homologs used for each of the reaction steps and their corresponding concentrations. The response was the TREE score, and each instance was a pathway design.
The pathways predicted from the neural network model were able to improve butanol production scores over fourfold (~2.5 times higher titer, 58% increase in rate) compared to the base-case pathway. An initial data set of 120 instances (pathway designs) was used to train and test different neural network architectures consisting of 5-15 fully connected hidden layers and 5 to 15 nodes per layer. Genetic algorithms were used to suggest combinations of network architectures, and tenfold cross validation was used to select the best. Once the model was built, the authors used a nonlinear optimization algorithm (Nelder-Mead simplex) to recommend pathway designs that optimized butanol production through the maximization of the TREE score. These machine learning recommendations resulted in 5 of the 6 top performing pathways, and outperformed 18 expert determined pathways selected based on prior knowledge, demonstrating the power of a data-driven design approach for cases in which design choices are numerous.
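The train-then-optimize workflow described above can be sketched as follows. This is an illustration under strong assumptions, not the model of Karim et al.: the 120 designs and their scores are synthetic, only continuous enzyme concentrations are used as inputs (homolog identity is omitted), and the network size and solver are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for 120 pathway designs: 4 enzyme concentrations in [0, 1].
X = rng.uniform(0.0, 1.0, size=(120, 4))
# Made-up score with a single optimum at a concentration of 0.6 per enzyme.
y = 1.0 - np.sum((X - 0.6) ** 2, axis=1)

# Small fully connected network: 5 hidden layers of 10 nodes each.
model = MLPRegressor(hidden_layer_sizes=(10,) * 5, solver="lbfgs",
                     max_iter=500, random_state=0)
# Tenfold cross-validation estimates how well the surrogate generalizes.
cv_scores = cross_val_score(model, X, y, cv=10, scoring="r2")
model.fit(X, y)


def negative_predicted_score(x):
    """Objective for the optimizer: minus the predicted score of a design."""
    x_clipped = np.clip(x, 0.0, 1.0)  # keep designs inside the sampled range
    return -model.predict(x_clipped.reshape(1, -1))[0]


# Start the Nelder-Mead simplex search from the best design measured so far.
start = X[np.argmax(y)]
result = minimize(negative_predicted_score, start, method="Nelder-Mead")
recommended = np.clip(result.x, 0.0, 1.0)
print("mean CV R^2:", cv_scores.mean())
print("recommended design:", recommended)
```

The optimizer only maximizes the model's prediction, so any recommended design still has to be built and tested in the next cycle.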
While the study by Karim et al. only reported 1 DBTL cycle, multiple cycles would have likely resulted in even better production pathways, and also provided more data instances for model training. Indeed, the neural network of 5-15 hidden layers developed by Karim et al. was relatively small compared to state-of-the-art deep neural networks, but this design was limited by having only 120 instances (pathway designs) to train on. If more data were to become available through more DBTL cycles, the neural network could have been made more complex by expanding its depth (hundreds of hidden layers), which would improve prediction performance (Fig. 4). This improved performance, however, comes at a cost: as the number of layers increases, the time to train the network (i.e. learning model weights and parameters) increases considerably. Moreover, the dense hidden layers of deep neural networks render them very difficult to interpret and infer possible mechanisms from. Hence a significant research thrust in machine learning involves new approaches to make models "explainable" (see Section 5.3) (Gunning, 2016). The use of only 1-2 DBTL cycles seems to be the most common case in published projects (Denby et al., 2018;Alonso-Gutierrez et al., 2015;Opgenorth et al., 2019;Zhang et al., 2020). In our experience, this happens not because more DBTL cycles are not expected to be useful, but because results from a single DBTL cycle are often enough for a publication. Often, in the academic world, there is little incentive (or resources) to continue further.

Requirements for machine learning in metabolic engineering
Here we provide a practical guide on the immediate prerequisites to applying machine learning to metabolic engineering. In the next section, we discuss some practical considerations for experimental design once the machine learning project is in progress and, in section 5.1, long-term hurdles for the development of the discipline as a whole. In essence, four requirements need to be aligned for a successful application: data, algorithms, computing power and an interdisciplinary environment. Each of them is critical for a real impact.
Data needs to be abundant, non-sparse, high quality, and well organized. Training data needs to be abundant because machine learning algorithms depend critically on training data to be predictive. There is no prior biological knowledge embedded in them. In general, the more training data, the more accurate the algorithm predictions will be. Data augmentation (see section 2.2.1) can certainly help, and should be routinely used in metabolic engineering due to the scarcity of large data sets, but it is no substitute for experimental data. There is, however, no way to know a priori how much data will be enough. Different problems present different levels of difficulty to being "learnt" (Radivojević et al., 2020), and this difficulty level can only be assessed empirically. A scaling plot of predictive accuracy vs. instances can be very helpful in this regard. Training data can be abundant but still sparse, depending on the phase space (Fig. 9) considered. A total of a hundred instances can be enough if only two input features are considered, or completely insufficient if a thousand input features are considered. The "curse of dimensionality" implies that the amount of data needed to support results in a statistically sound and reliable fashion often grows exponentially with the dimensionality (Fig. 4). The data must be high-quality in the sense that it must avoid biases due to inconsistent protocols and provide quantification for repeatability (see section 2.4). Both goals can be systematically achieved through automation (see section 5.2). Data needs to be well organized, following standards and ontologies, and must include the corresponding metadata (see section 2.4). The alternative is that data analysts will spend 50-80% of their effort organizing the data and metadata for analysis, undermining their efforts (Lohr, 2014).
Since data analysts might be the most effective effort multiplier in your team (Nielsen and Keasling, 2016), and possibly the most expensive (Metz, 2018), it is very useful to optimize their effort.
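The scaling plot of predictive accuracy vs. instances mentioned above can be computed with scikit-learn's `learning_curve`. The sketch below uses a synthetic data set and a random forest as stand-ins; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic data: 200 instances, 5 input features, nonlinear response.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=200)

# Cross-validated score as a function of the number of training instances.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="r2",
)

mean_val = val_scores.mean(axis=1)
for n, score in zip(train_sizes, mean_val):
    print(f"{int(n):4d} instances: mean validation R^2 = {score:.2f}")
```

If the validation score is still rising at the largest training size, collecting more instances is likely to help; a plateau suggests that other changes (inputs, model, data quality) matter more.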
While there are many machine learning algorithms to choose from (Fig. 10), there is no clear best algorithm for every situation. Indeed there is a famous theorem (the no free lunch theorem, NFLT) that proves (under some conditions) that no single algorithm is most effective for every type of problem (Wolpert, 1996). While the utility of the NFLT for machine learning has been cast in doubt (Giraud-Carrier and Provost, 2005), the standard approach remains to try as many algorithms as possible and compare their results. In this effort, it is very useful to count on libraries that collect a large variety of algorithms and have standardized input, output and other standard procedures (e.g. cross-validation). The most popular among them is, without a doubt, scikit-learn (Pedregosa et al., 2011), a Python library that comprises a very wide selection of machine learning methods, is well documented, and easy to use (Fig. 10). These features combined with its open source nature, and its compatibility with Jupyter notebooks (Kluyver, 2016), which facilitate reproducibility and communication, make it our top recommendation for beginners. Furthermore, the open source nature and wide use of scikit-learn means that there are several tools that leverage it to combine and test methods. The tree-based pipeline optimization tool (TPOT), for example, automatically combines all the available algorithms and preprocessing steps in scikit-learn to choose the best option (Olson et al., 2016). Another example is the Automated Recommendation Tool (ART), which leverages scikit-learn, ensemble modeling, and Bayesian inference to provide uncertainty quantification for predictions (Radivojević et al., 2020). A proprietary alternative is to use Matlab, for which a machine learning toolbox is available (Ciaburro, 2017), with possible educational discounts. For artificial neural networks, the best supported (and free) frameworks are TensorFlow and PyTorch, backed by Google and Facebook respectively.
Keras, a framework focused on providing a simple interface for neural networks, is now the official high-level front-end for TensorFlow (Géron, 2019). Keras has its own hyperparameter tuner, Keras Tuner, and an extremely simple interface for deep learning with Keras and TensorFlow, AutoKeras.
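The standard approach of trying several algorithms and comparing their results is straightforward with scikit-learn's uniform estimator interface. The following sketch compares three arbitrarily chosen regressors on synthetic data by tenfold cross-validation; the data set, the candidate list and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic data: 100 instances, 5 input features, nonlinear response.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=100)

# Candidate algorithms share the same fit/predict interface.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "svr": SVR(kernel="rbf", C=10.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# The same folds are used for every candidate so the scores are comparable.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mean_r2 = {
    name: cross_val_score(estimator, X, y, cv=cv, scoring="r2").mean()
    for name, estimator in candidates.items()
}
best_name = max(mean_r2, key=mean_r2.get)
print(mean_r2)
print("best by mean cross-validated R^2:", best_name)
```

Tools such as TPOT automate this kind of search over a much larger space of models and preprocessing steps.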
Computation is another key element, particularly for large amounts of data. Whereas the libraries above (scikit-learn, Matlab toolbox, TensorFlow, PyTorch) can be run on a standard laptop (e.g. 2018 MacBook Pro, 3.5 GHz Intel Core i7, 16 GB RAM), as more training data is added this may be insufficient. This is particularly the case for deep neural networks using TensorFlow or PyTorch, which will benefit from the parallelization obtained through Graphics Processing Units (GPUs). The need to scale up all these Python frameworks for high performance computing (HPC) or deployment on cloud computing environments (e.g. Amazon EC2, Microsoft Azure, and Google's Cloud Platform) has promoted the development of several parallel and distributed computing backends for data analysis and machine learning, such as Ray, Spark, and Dask (Rocklin, 2015). Furthermore, as the general applicability of AI has become more evident, new processor architectures are being created specifically for neural network machine learning, including Google's Tensor Processing Unit (TPU), Nvidia's V100 and A100, Graphcore's Intelligence Processing Unit (IPU), and a variety of FPGA-based solutions.
Since very few people master both machine learning and metabolic engineering, interdisciplinary collaborations are truly necessary. Machine learning practitioners and metabolic engineers are trained very differently, however, and this can produce significant friction (see section 5.1). Both disciplines profess different cultures, which are reflected in how they solve problems, but also which problems are prioritized. It is, hence, very important to foster an inclusive work environment that integrates and values contributors with very different skills, and does not penalize knowledge gaps. It is also important to be very clear about the interfaces: which exchanges (e.g., data, designs, predictions) are expected, and when, in order for both sides to be effective.
[Figure caption] This is AlexNet, one of the first ANNs that leveraged the network depth to improve performance and win the ImageNet image classification contest in 2012. Deep networks lower the number of parameters by sparsely using fully connected layers, which require many parameters. The first five layers in AlexNet are convolutional layers (Rawat and , which only take input from a limited number of cells in the previous layer. Many architectures are possible for deep learning, and finding the optimal one is more of an art than a science. See LeCun et al. (2015) for more details.

Practical considerations for implementing machine learning
As in the case of a genetic selection or screen, machine learning requires careful experimental planning to make it effective. An experimental design that ignores its basic assumptions (e.g., instances are independent and identically distributed) will result in a random walk over possible designs with the same (or even worse) results as a trial-and-error approach.
Here, we offer a succinct list of recommendations to consider when planning to use machine learning to guide bioengineering:
• Choose the right objective/response. When a response for the algorithm is chosen, you are entering a Faustian bargain with your algorithm: it will try to optimize it to the detriment of everything else (Riley 2019). For example, setting final titer as the response might provide high titers in the end for a production strain, but at rates so slow that the result is of little practical use. In the case of Karim et al. (see section 2.2.2), the response was a carefully selected mixture of titer, rate, and enzyme expression precisely for this reason. Deciding on the right response is a bit of an art, and less trivial than often assumed. Be careful what you ask the algorithm for, because you may get it!
• Choose inputs that truly predict your response. Performing small, directed experiments in the lab to verify that the response of interest (e.g. a phenotype) is affected by a given input (e.g. a treatment) can save a significant amount of time and headaches later in the DBTL cycle, by limiting the number of inputs (and the overall complexity of the model) to terms that matter. Omitting this step might give rise to a frustrating chase of a red herring in the form of statistical noise, or cause serious challenges to the interpretability of the model.
• Choose actionable inputs that can be measured. The machine learning process will require you to change your inputs in order to achieve the desired goal (e.g. increase production). Hence, these inputs need to be experiment variables that can be easily manipulated. Since you will need to assess whether you indeed reached the recommended targets, it is highly desirable that these inputs can be easily measured. For example, it is generally better to use as inputs promoter or enzyme choices (Karim et al., 2020), rather than protein levels (Opgenorth et al., 2019).
Promoter or enzyme choices are entirely under the metabolic engineer's control, and their effects on expression may be verified via sequencing; whereas certain target protein levels may be difficult to reach, and usually require specialized mass spectrometry methods to verify.
• Choose very carefully how many experiment variables you would like to explore. Choosing too many variables (i.e. input features, Fig. 3) can make the corresponding phase space too large for machine learning to explore in a reasonable number of DBTL cycles. Choosing too few variables might mean missing important system configurations (e.g. if protein X is not chosen and it needs to be downregulated to improve production, it will be impossible to find the optimum). As a very crude rule of thumb, you should budget for at least ~100 instances per 5-10 variables. This, of course, depends on the difficulty presented by the problem being learnt: more difficult problems will need more instances per variable, whereas easier problems will require fewer instances per variable.
• Verify that your experiment variables can be independently acted upon. Whole-operonic effects can make this unexpectedly difficult (Opgenorth et al., 2019). For example, if recommendations require protein A concentration to be increased three-fold and protein B to be decreased by a factor of two to improve production, but a strong promoter for protein A also produces an increase in protein B, it will be difficult to reach the target protein profile. Hence, modular pathway designs (Boock et al., 2015) that ensure that the full input phase space can be fully explored are highly recommended. Systematic part characterization involving large promoter libraries with a variety of tested relative strengths is a fundamental tool in this endeavor.
• Design your experiment to start with ~100 instances for the initial DBTL cycle.
Although there are examples of success stories with fewer than a hundred instances as starting points (Radivojević et al., 2020), this outcome cannot be guaranteed. Actual success depends on the complexity of the problem (Radivojević et al., 2020), and this complexity can only be gauged by testing predictive accuracy as data sets increase. By starting with ~100 instances, one ensures some progress even if predictions are not accurate: this number of instances goes a long way to ensure statistical convergence. The alternative is a non-predictive model and little understanding whether the problem is lack of data (instances), or other design problems (Opgenorth et al., 2019). Consider automating as much of your process as possible so as to guarantee enough instances. This automation may seem an unnecessary hassle, but it will pay off in the long run.
• Sample the initial phase space as widely as possible. Ensure that you cover wide ranges for both input and response variables. Strive to include both bad (e.g. low production) and intermediate results as well as good ones (e.g. high production), since this is the only way that the algorithms can learn to distinguish the inputs needed to reach any of these regimes. The Latin Hypercube (McKay et al., 1979) is a good way to choose starting points, but other options are also available.
• Consider uncertainty, as well as predicted response, when choosing next steps. As the need to quantify prediction uncertainty becomes more recognized in the biological sciences (Begoli et al., 2019), more algorithms provide it along with response predictions (Radivojević et al., 2020). Using this information can improve the whole process. Choose some recommendations with the lowest possible uncertainty even if the predicted outcome is not so desirable (e.g. low production), so as to establish trust in the approach (see sociological hurdles in section 5.1).
Choose some recommendations with large uncertainty even if the predicted outcome is not desirable so as not to miss unexpected opportunities. In addition, to obtain an empirical view of how uncertainty in the data affects the accuracy of predictions, it may be instructive to create simulated, in silico "ground truth" data sets displaying different levels of noise in order to test the performance of the machine learning algorithm.
• Avoid biases created through inconsistent protocols and beware of hidden variables. Machine learning algorithms learn to map an input to a response (Fig. 3). If different DBTL cycles produce different results for reasons that are not reflected in the input (hidden variables (Riley 2019)), the algorithms will provide poor predictions. Such uncontrolled variables can easily arise in biological data due to lab temperature or climate fluctuations, reagent batch differences, undetected culture mutations, "edge effects" in plate-based assays, and equipment drift. These effects should be assessed and eliminated as part of the experimental design, and are one of the key topics of communication for bench and computational scientists to empower downstream data analysis and predictions. Machine learning can also help by performing simple checks: if an algorithm can predict which well or batch sample the data came from, that means these factors unduly influence the response. Lack of repeatability is the main stumbling block of machine learning.
• Add experimental controls to test for repeatability. Since ensuring repeatability is among the top requirements for machine learning to be successful, it is important to test and quantify it often. Batch, instrument, and operator effects are often the first principal component of data. These effects can be detected by including a few controls of known response in every experiment (e.g., 2-3 base strains in every DBTL cycle).
While this approach consumes valuable analytical resources, it ensures that the data can be trusted and does not need to be discarded, saving substantial labor during modeling and analysis.
• Plan for several DBTL cycles. Machine learning algorithms shine when they can dynamically probe your system, since they are designed to learn from data interactively. While results can be obtained using two DBTL cycles, they are not comparable to what >5 cycles can provide (Radivojević et al., 2020). If only a limited budget of, e.g. 100 instances, is available, it is better to start with a strong first cycle and several weaker ones (e.g. 40 instances for cycle 1, then six 10-instance cycles) than the usual two DBTL cycle study (e.g. 60 instances for first cycle, 40 instances for the second one).
• Standardize your data and metadata. Taking machine learning for metabolic engineering seriously requires large amounts of high quality data. Hence, it is advisable to store it in a standardized manner. There are a variety of data repositories available for this purpose: e.g., the Experiment Data Depot (Morrell et al., 2017) and the Nature Scientific Data journal ("Open for business," 2017), to name a few. Moreover, a labeled data set of high quality is a significant resource for the community, and is more likely to be cited.
• Be careful about how you split your data for cross-validation.
Cross-validation of your model (Fig. 3) assumes data sets are independent and identically distributed (iid). This assumption is basic for machine learning, and presumes that both validation and training sets stem from the same generative processes and have no memory of past generated samples. However, it can be violated in practice due to temporal effects on biological systems or group effects during sample processing (Riley 2019). In these cases, alternatives to random splitting need to be considered. Sheridan (2013), for example, showed that randomly splitting compound libraries used for drug discovery overestimated their model's ability to successfully predict drug candidates. The reason for this difference is that compounds added to the public record at particular dates shared higher structural similarity, resulting in models that had already "seen" compounds in the test set when randomly split. Similar considerations need to be made when sample generation occurs in a biased manner, which is quite common in biological experiments. For example, "batch effects" can be avoided by splitting the data first by group (e.g. each batch) to ensure the same group is not represented in both testing and training sets (see scikit-learn group k-fold). Only worry about this effect if you have a large data set (>100 instances).
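The group-wise splitting mentioned in the last recommendation can be sketched with scikit-learn's `GroupKFold`. The batch labels and measurements below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data set: 6 experimental batches of 20 instances each.
rng = np.random.default_rng(0)
n_batches, per_batch = 6, 20
batches = np.repeat(np.arange(n_batches), per_batch)
X = rng.normal(size=(n_batches * per_batch, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=n_batches * per_batch)

# GroupKFold keeps all instances of a batch on the same side of each split.
splitter = GroupKFold(n_splits=3)
folds = list(splitter.split(X, y, groups=batches))

for fold, (train_idx, test_idx) in enumerate(folds):
    shared = set(batches[train_idx]) & set(batches[test_idx])
    print(f"fold {fold}: test batches {sorted(set(batches[test_idx]))}, "
          f"batches shared with training set: {len(shared)}")
```

The same `splitter` object can be passed as the `cv` argument of `cross_val_score` (together with `groups=batches`) so that model scores reflect performance on unseen batches.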
Perhaps the best way to get familiar with machine learning, and its potential and limitations, is to experiment with it in a tutorial. The recently published Automated Recommendation Tool (Radivojević et al., 2020) includes three synthetic data sets, three real data sets and a software package that can be used for this purpose. Furthermore, some of these cases are explained in detail in several Jupyter notebooks contained in the GitHub repository (https://github.com/JBEI/ART/tree/master/notebooks), and can be used as tutorials.
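As a first hands-on exercise, the earlier recommendation to sample the initial phase space widely can be tried with the Latin hypercube sampler in SciPy. The three design variables, their ranges and the choice of 100 instances are hypothetical and only serve the illustration.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical design variables: two inducer levels and a temperature (°C).
lower = np.array([0.1, 0.1, 20.0])
upper = np.array([1.0, 1.0, 37.0])

# Latin hypercube sample in the unit cube: along every variable, each of the
# 100 equal-width intervals contains exactly one design.
sampler = qmc.LatinHypercube(d=3, seed=0)
unit_sample = sampler.random(n=100)

# Rescale from the unit cube to the experimental ranges.
designs = qmc.scale(unit_sample, lower, upper)
print(designs[:5])
```

Compared with purely random sampling, this stratification avoids leaving large stretches of any single variable's range untested in the first DBTL cycle.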

Applications of machine learning to metabolic engineering
Although application of machine learning in metabolic engineering is nascent, early studies have already shown its potential use for accelerating bioengineering. Here, we highlight examples where machine learning is being used to improve different stages of the metabolic engineering development cycle: gene annotation and pathway design, pathway optimization, pathway building, performance testing, and production scale-up (Table 1). We focus on prime examples that best epitomize the potential of machine learning in metabolic engineering, rather than an exhaustive list of applications. The reason for this decision is that this list is quickly growing and might be outdated soon, and there are recent reviews on the topic that provide that information (Presnell and Alper, 2019;Volk et al., 2020). We also discuss key challenges and opportunities when applying machine learning for metabolic engineering, with particular focus on practices that could formalize data-driven approaches.

Machine learning for design
The goal of metabolic engineering design is to develop DNA parts and assembly instructions to synthesize metabolic pathways and produce a desired molecule (Nielsen and Keasling, 2016;Woolston et al., 2013). This requires completion of several tasks, including gene annotation, pathway reconstruction and design, as well as metabolic flux optimization, which currently rely heavily on domain expertise and enjoy little standardization (Nielsen and Keasling, 2016). Application of machine learning can improve the accuracy and speed of these tasks, offering a standardized approach that fully leverages experimental data.

Pathway reconstruction and design
Locating and annotating protein encoding genes in a genome sequence is essential for metabolic pathway reconstruction and design. This is conventionally done bioinformatically, for example using Hidden Markov Models (HMMs) (Finn et al., 2011;Kelley et al., 2012;Yoon, 2009). Initially, genes are identified in a genome by searching for known protein coding signatures (e.g. Shine-Dalgarno sequences), and this is followed by annotation based on sequence homology searches against a database of previously characterized proteins. More recently, however, deep learning approaches have been used to identify and functionally annotate protein sequences in genomes by leveraging large high-quality experimental data sets (Armenteros et al., 2019;Clauwaert et al., 2019;Ryu et al., 2019). DeepRibo, for example, uses high-throughput ribosome profiling coverage signals and candidate open reading frame sequences (input features) to train deep neural networks to delineate expressed open reading frames (response is part of predicted ORF or not for every nucleotide) (Clauwaert et al., 2019). This approach showed more robust performance compared to a similar tool, REPARATION (Ndah et al., 2017), that uses a random forest classifier instead of deep neural networks. DeepRibo also improved prediction of protein coding sequences in different bacteria (e.g. Escherichia coli and Streptomyces coelicolor) compared to RefSeq annotations, including higher identification of novel small open reading frames commonly missed by sequence alignment algorithms. Another example is DeepEC, which takes a protein sequence as input and predicts enzyme commission (EC) numbers as output with high precision and throughput using deep neural networks (Ryu et al., 2019). 
A data set containing 1,388,606 expert curated reference protein sequences and 4669 enzyme commission numbers (Swiss-Prot (Bairoch and Apweiler, 2000) and TrEMBL (UniProt Consortium, 2015) data sets) was used to train the deep neural networks, which improved EC number prediction accuracy and speed compared to 5 alternative EC number prediction tools, including Cat-Fam (Yu et al., 2009), DETECT v2 (Nursimulu et al., 2018), ECPred (Dalkiran et al., 2018), EFICAz2.5 (Kumar and Skolnick, 2012), and PRIAM (Claudel-Renard et al., 2003). DeepEC was also shown to be more sensitive in predicting the effects of protein sequence domain and binding site mutations compared to these tools, which could improve the accuracy of annotating homologous proteins that have mutations with previously unknown effects on function (e.g. from metagenomic data sets).
The design of metabolic pathways involves identifying a series of chemical reactions that produce a desired product from a starting substrate, and selecting different enzymes that catalyze each reaction. While nature has evolved many pathways for producing diverse molecules, the known and characterized biochemical pathways can still be insufficient to produce certain molecules of interest, especially nonnatural compounds or secondary metabolites. Therefore, retrosynthesis methods that start with a desired chemical and suggest a set of chemical reactions that could produce it from cellular metabolite precursors are being pursued to design new metabolic pathways (Lee et al., 2019). The latest and most sophisticated of these methods use generalized reaction rules to describe possible biochemical transformations (Delépine et al., 2018;Kumar et al., 2018). However, the number of possible reaction combinations is intractable since it grows combinatorially with the number of reactions. Choosing the right reaction combination is a non-trivial problem, which is typically tackled via optimization or heuristic methods. A possible solution to this search problem comes from solving the same problem in organic synthesis, through the use of deep neural networks (Segler et al., 2018). Segler et al. preprocessed 12.4 million reaction rules from the Reaxys chemistry database to train three deep neural networks implemented within a Monte Carlo tree search (heuristic search algorithm used in decision making) to discover retrosynthesis routes for small molecules. This deep learning approach found pathways for twice as many molecules, thirty times faster than traditional computer-aided searches (Segler et al., 2018).
The predicted synthesis routes adhered better to known chemical principles than those from traditional computer-aided searches, and expert organic chemists could not distinguish them from synthesis routes taken from the literature, highlighting the potential of deep learning to be applied for metabolic retrosynthesis (or retrobiosynthesis). Indeed, a similar Monte Carlo Tree Search method has recently been extended to predict synthetic pathways within biological systems (RetroPath RL), enabling systematic pathway design for metabolic engineering (Koch et al., 2020).
Pathways designed via retrosynthesis still face the difficult challenge of finding enzymes for novel biochemical reactions, for which no enzyme is known. In this case, the solution involves enzymes that may catalyze the novel reaction through enzyme promiscuity, or new enzyme functions must be designed or evolved that perform the desired chemistry. While chemoinformatic techniques (e.g. density functional theory, DFT, and partitioned quantum mechanics and molecular mechanics, QM/MM) can be used to predict the interaction between metabolites and proteins in silico (Alderson et al., 2012), these techniques are computationally intensive and require substantial domain expertise. Therefore, the task of searching for promiscuous enzymes is increasingly being performed using more general and computationally efficient techniques from machine learning. For example, given a reaction and enzyme pair instance, Support Vector Machines (Faulon et al., 2008) and Gaussian Processes (Mellor et al., 2016) have been developed to predict whether the enzyme catalyzes the reaction, with the latter model having the added benefit of providing uncertainty quantification. These models predict positive or negative enzyme reaction pairs from protein sequences (e.g. K-mers) and reaction signatures (e.g. functional groups, chemical transformation properties) (Carbonell and Faulon, 2010) by learning patterns about promiscuous enzyme activities through training. They can also be applied to predict substrate affinity for proteins (K_m values) (Mellor et al., 2016), an important kinetic parameter for determining enzyme activity, which is difficult and time consuming to measure experimentally. This is critical for pathway design as sequences with the most desirable kinetic properties can be selected when multiple candidates catalyzing a given reaction are available.
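To illustrate what a K-mer representation of a protein sequence looks like as a model input, the sketch below counts overlapping K-mers in a sequence. The function name, the choice of k = 3 and the short peptide are hypothetical; the cited studies use their own signature definitions, which are not reproduced here.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers (substrings of length k) in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical short peptide used only to show the encoding.
counts = kmer_counts("MKVLAAMKV", k=3)
print(counts)
```

Counts of this kind, collected over a fixed K-mer vocabulary, give every protein a numeric feature vector of equal length that can be concatenated with a reaction signature and passed to a classifier.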
In the case that no enzyme can be found for a target reaction, new enzymes may be designed or discovered through protein engineering. A common laboratory method for protein engineering is directed evolution, where beneficial mutations accumulate in a protein through iterative experimental rounds of mutation and selection until the desired protein function is achieved (Yang et al., 2019). In essence, a series of local searches (via sequence mutation and screening) is performed on an enormous and highly complex functional landscape with the hope of finding a local optimum (i.e. a protein variant with the desired properties). However, experimental approaches can only explore an infinitesimal part of this landscape, and computational approaches are needed to guide experimental efforts. Machine learning can be used to guide directed evolution and decrease the number of experimental iterations needed to obtain a protein with the desired function. This is achieved by leveraging previous screening data to learn a protein's sequence-function landscape and predict new sequence libraries that contain variants with higher fitness. For example, instead of experimentally performing sequential single point mutations or recombining mutations found in the best variants (common directed evolution approaches), Wu et al. (2019) trained a machine learning model to perform in silico evolution rounds that ranked new protein variants by predicted fitness for experimental testing. Instead of relying on a single machine learning method, multiple models (linear, kernel, neural network, and ensemble) were trained in parallel, and the ones showing the highest accuracy were used to perform in silico evolution rounds. This enabled deeper exploration of the possible variant functional landscape, resulting in the successful evolution of an immunoglobulin-binding protein and a putative nitric oxide dioxygenase from Rhodothermus marinus. ML-assisted directed evolution has also been used to maximize enzyme productivity (Fox et al., 2007), change the color of fluorescent proteins (Saito et al., 2018), and optimize protein thermostability (Romero et al., 2013), making it a promising approach for efficiently searching large sequence-function spaces for protein variants with desired properties.
In addition to directed evolution, deep learning has also recently been applied to the rational design of proteins (Alley et al., 2019; Biswas et al., 2020; Costello and Garcia Martin, 2019). For example, Alley et al. (2019) developed UniRep, which uses recurrent neural networks to learn an internal statistical representation of proteins that contains physicochemical, organism, secondary structure, evolutionary and functional information, by training on 24 million UniRef50 (Suzek et al., 2015) amino acid sequences (instances). UniRep-encoded proteins were then used to train models (random forest or sparse linear model) that predicted the stability of a large collection of de novo designed proteins, as well as the functional consequence of single point mutations on wild-type proteins. UniRep encoding was also used to optimize the function, relative to wild-type, of two fundamentally different proteins: a eukaryotic green fluorescent protein from Aequorea victoria and a prokaryotic β-lactam hydrolyzing enzyme from Escherichia coli, highlighting the generalizability of this approach for rational protein engineering (Biswas et al., 2020). Other generative models based on deep learning have been used to suggest protein sequences with desired functionality and location.

Pathway optimization
Following pathway design, metabolic flux optimization is required to maximize product titers, rates, and yields (TRY). In this endeavor, machine learning provides an orthogonal approach to computational approaches leveraging flux analysis and genome-scale models, which have been successfully used in the past to increase TRY (Maia et al., 2016). The combination of both approaches has the potential to be more effective than each of them separately (see section 4 for a discussion).
A common approach to increase TRY involves fine tuning gene expression through the modification of promoter and ribosome binding site (RBS) sequences. Despite decades of progress in understanding the regulatory mechanisms controlling gene expression (Snyder et al., 2014), quantitative prediction of gene expression based on sequence information remains challenging. While computational models do exist to predict gene expression (Leveau and Lindow, 2001; Salis et al., 2009; Rhodius and Mutalik, 2010), they rely on a comprehensive understanding of transcription and translation processes. This knowledge is often unavailable, especially for non-model organisms. Therefore, many gene expression optimization efforts rely on trial-and-error experimental approaches based on promoter and RBS library screening, which also suffer from the large combinatorial space of possible sequences.
Machine learning has also guided the design of promoter and RBS sequences in a data-driven manner for improved control of gene expression. In particular, neural networks have been used to predict gene expression output from input promoter sequences or coding regions (Kotopka and Smolke, 2020;Meng et al., 2013;Tunney et al., 2018). Meng et al. (2013) used a simple neural network trained with 100 mutated promoter and RBS sequences as inputs to predict promoter strength (response). This machine learning model outperformed mechanistic models based on position weight matrix or thermodynamics methods (Leveau and Lindow, 2001;Salis et al., 2009;Rhodius and Mutalik, 2010), and was able to optimize heterologous expression of a small peptide BmK1 (used in traditional Chinese medicine) and the dxs gene involved in the isoprenoid production pathway (Meng et al., 2013). Additionally, optimization of promoter strength and inducer concentration/time has been achieved using partial least squares regression (Jervis et al., 2019a), whereas prediction of riboswitch dynamic range from aptamer sequence biophysical properties has been achieved using a combination of random forests and neural networks (Groher et al., 2019). In this latter riboswitch design example, instead of directly using sequence information to train the random forest, the authors calculated known riboswitch biophysical properties from aptamer sequences (entropy, stem melting temperature, GC content, length, free energy, etc.) and used these as input features for model training, in order to predict switching behavior. This allowed for the interpretation of which input features were most important to the model prediction using variable importance (e.g. melting temperature was more important than free energy), enabling inferences on possible mechanisms.
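The feature-importance reasoning described for the riboswitch example can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline of Groher et al. (2019): the feature names, the simulated response, and all numerical values are assumptions made only to show how a random forest ranks engineered biophysical features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical biophysical features computed from aptamer sequences.
feature_names = ["melting_temp", "free_energy", "gc_content", "length"]
n_samples = 200
X = rng.normal(size=(n_samples, len(feature_names)))

# Synthetic response (assumption): switching behavior depends strongly on
# melting temperature, weakly on free energy, and not on the other features.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based variable importance, one value per input feature.
importances = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)
print(ranked)
```

Because the inputs are interpretable physical quantities rather than raw sequence, the ranking can be read directly as a hypothesis about mechanism, which is the design choice emphasized in the text.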
More recently, machine learning models have been used to optimize multi-step pathways for chemical production (Zhou et al., 2020). For example, Zhou et al. (2018) used neural network ensembles to improve a 5-step pathway for production of violacein (a pharmaceutical) by selecting promoter combinations to tune gene expression. Using an initial training set of only 24 strains (out of a possible 500) containing different promoters for each gene, the model predicted a new strain that improved violacein titer by 2.42-fold after only one DBTL iteration. Their ensemble approach allowed top-producing strains to be predicted from a combination of over 1000 ANNs, which improved model accuracy and also allowed violacein production to be optimized for both titer and purity. In another example, Opgenorth et al. (2019) used an ensemble of four different models (random forest, polynomial, multilayer perceptron, TPOT meta-learner) to optimize a 3-step pathway for dodecanol production over two DBTL cycles. The model was trained using data generated from 12 strains (48 data points total) with different RBS sequences for each gene, and an optimization step was used to recommend improved strain designs to build and test in the second cycle. Additional machine learning models have guided the optimization of multi-gene pathways, including limonene production in E. coli using support vector regression (Jervis et al., 2019b), lycopene synthesis in E. coli using Gaussian processes (HamediRad et al., 2019), and tryptophan production in S. cerevisiae using ensemble models. Together, these examples highlight the potential of systematically leveraging high-throughput strain construction, testing, and machine learning to optimize multi-step pathway expression for improving product TRY.
To enable broader use of ML-driven pathway optimization and design by the metabolic engineering community, Radivojević et al. (2020) developed the Automated Recommendation Tool (ART). ART is specifically tailored to the needs of the metabolic engineering field: effective methods for small training data sets and uncertainty quantification. ART's ability to quantify uncertainty provides a principled way to explore the least-known areas of the phase space, and is critical for gauging the reliability of the recommendations. We expect that further development of tools tailored to the specific needs of the field will enable broader application of machine learning.
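How uncertainty quantification can steer recommendations with a small training set can be sketched in a few lines. The example below is not ART's implementation; it swaps in a simple bootstrap ensemble of quadratic least-squares models, and the single design variable, the 12 simulated strains, and the weight `alpha` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x):
    """Quadratic features for one scaled design variable (hypothetical)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x**2])

# Hypothetical small training set: 12 strains with noisy titer measurements.
x_train = rng.uniform(0.0, 1.0, size=12)
y_train = 1.0 + 4.0 * x_train - 3.0 * x_train**2 + rng.normal(scale=0.1, size=12)

def bootstrap_ensemble(x, y, n_models=200):
    """Fit one least-squares model per bootstrap resample of the data."""
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))
        w, *_ = np.linalg.lstsq(features(x[idx]), y[idx], rcond=None)
        coefs.append(w)
    return np.array(coefs)

coefs = bootstrap_ensemble(x_train, y_train)

# Predict every candidate design with every ensemble member.
candidates = np.linspace(0.0, 1.0, 101)
preds = features(candidates) @ coefs.T          # shape: (candidates, models)
mean, std = preds.mean(axis=1), preds.std(axis=1)

# Score = predicted titer + alpha * uncertainty (exploitation + exploration).
alpha = 1.0
score = mean + alpha * std
recommended = candidates[np.argsort(score)[::-1][:3]]
print(recommended)
```

Setting `alpha` to zero recommends only the designs with the highest predicted titer, whereas larger values favor poorly characterized regions of the design space.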

Machine learning for building and testing cellular factories
Machine learning can also be used to improve the tools that build and test cellular factories. A major challenge in gene editing using CRISPR-Cas technologies, for example, is predicting the on-target knockout efficacy and off-target profile of single-guide RNA (sgRNA) designs. Several approaches exist to make these predictions, including alignment-based methods (Aach et al., 2014), hypothesis-driven methods (Heigwer et al., 2014; Hsu et al., 2013), and classic machine learning algorithms (i.e. non-deep learning) (Chari et al., 2017; Doench et al., 2016). However, their generalizability has been limited by the small size and low quality (high noise) of the training data. Higher-throughput screening methods combined with deep learning have recently improved the accuracy and generalizability of sgRNA activity prediction tools. For example, DeepCpf1 was developed to predict on-target knockout efficacy (indel frequencies) using deep neural networks trained on large-scale sgRNA (AsCpf1) activity data sets. While previous machine learning tools had been trained on medium-scale data (1251 target sequences), the authors' high-throughput experimental approach generated a data set of indel frequencies for over 15,000 target sequence compositions, which was sufficient to train deep neural networks. Seq-DeepCpf1 was shown to outperform conventional ML-based algorithms, and its performance increased steadily with training data size, highlighting the value of data sets with >10,000 high-quality training instances. Seq-DeepCpf1 was also extended by considering input features other than target sequence composition that are known to affect sgRNA activity (in this example, chromatin accessibility (Jensen et al., 2017)), which further improved prediction accuracy and performance on independently collected data sets from other cell types (a metric of model generalizability). This highlights the value of expanding input features beyond the obvious choice.
In addition to predicting on-target knockout efficacy, it is also important to forecast the off-target profile of sgRNA activity, in order to prevent undesirable perturbations that result in genomic instability or functional disruption of otherwise normal genes. This has been performed using both regression models and deep neural networks (Listgarten et al., 2018,Lin and. To combine on-target knockout efficacy and off-target profile predictions into one tool, Chuai et al. (2018) developed DeepCRISPR. DeepCRISPR uses both an unsupervised deep representation learning technique and deep neural networks to maximize on-target efficacy (high sensitivity) while minimizing off-target effects (high specificity). Unsupervised representation learning allows DeepCRISPR to automatically discover the best representation of input features from billions of genome-wide unlabeled sgRNA sequences, instead of requiring the user to specify what the input features should look like (e.g. target sequence composition). This sgRNA representation was then used when training a deep neural network on labeled data consisting of target sequences and epigenetic information (input features) to predict both on-target and off-target activities (responses). Overall, DeepCRISPR outperformed classic machine learning methods and exhibited high generalizability to other cell types, highlighting the value of unsupervised representation learning to automate feature identification.
Machine learning methods could also be used to optimize the DNA assembly and transformation protocols critical for building engineered strains. Although DNA and strain construction has traditionally been accomplished empirically (Chan et al., 2013) or guided by rule-of-thumb approaches (Engler and Marillonnet, 2014), the ability to assemble and test DNA constructs and their transformation efficiencies under different conditions in high throughput could enable data-driven optimization. For example, machine learning could leverage comprehensive overhang ligase fidelity data sets (Potapov et al., 2018) to expand the identification of high-fidelity overhang sets for Golden Gate assembly, potentially allowing more DNA fragments to be assembled in a single reaction. Machine learning could also leverage large data sets that examine transformation efficiency under a range of different conditions (e.g. media compositions, temperatures, incubation times, electroporation conditions, plasmid designs) to improve plasmid delivery and expression. This would be particularly useful for expanding genetic systems to a broader range of host organisms that have potential for industrial applications (Brophy et al., 2018; Wang et al., 2019).
Once cell factories are built, their performance needs to be tested. Cell factories can be assayed for various components such as target molecules, transcripts, proteins, and metabolites. The throughput of these assays varies greatly, from over 10,000 samples per day to fewer than 20 samples per day. Together, the data from these assays provide a comprehensive picture of how the engineered cells function. However, constructing large numbers of strains followed by high-throughput screening often produces noisy data sets, with noise arising from several factors, including small plate-based formats (e.g. edge effects), analytical measurement errors, and laboratory handling errors and biases. One way to reduce these errors is manual inspection, but this approach is not scalable for large data sets and is often not reproducible due to person-to-person variability. Therefore, machine learning methods that predict outliers and biases from data and perform data processing in a standardized and reproducible manner are desirable. For this purpose, unsupervised learning algorithms that do not depend on "good" and "bad" labeled data examples have been used, such as clustering analysis methods (Fig. 6). The scikit-learn library, implemented in Python, provides a set of machine learning tools for outlier detection, including Isolation Forest, Local Outlier Factor, One-Class SVM, and Elliptic Envelope, that can be integrated into workflows to provide rapid and robust data quality processing. Additionally, supervised learning approaches based on deep neural networks have been applied to improve multi-omics data processing, for example protein identification from tandem mass spectra (Gessulat et al., 2019) and peak detection during metabolomic data processing (Melnikov et al., 2020).
Given the large volume of data generated over time from lab workflows and analytical instruments, further efforts to standardize data processing using machine learning should result in improved data sets for cellular factory design and analysis.
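Unsupervised outlier detection of the kind mentioned above can be sketched with scikit-learn's Isolation Forest. The plate data here are synthetic, and the two measured quantities per well, the number of wells, and the `contamination` value are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical plate: 94 typical wells with two measurements each
# (e.g. final optical density and product signal), plus 2 aberrant wells.
typical = rng.normal(loc=[1.0, 0.5], scale=0.05, size=(94, 2))
aberrant = np.array([[2.5, 0.1], [0.1, 1.8]])
X = np.vstack([typical, aberrant])

# contamination is the assumed fraction of outliers in the data set.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)   # +1 = inlier, -1 = outlier

flagged = np.where(labels == -1)[0]
print("wells flagged for review:", flagged)
```

No labeled "good" or "bad" wells are needed, and fixing `random_state` makes the flagging reproducible between analysts, which addresses the person-to-person variability of manual inspection.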

Machine learning for scaling up cellular factories
One of the largest challenges in metabolic engineering is maintaining the performance of laboratory strains when scaling up to commercial production plants (Chubukov et al., 2016; Wehrs et al., 2019). The typical procedure consists of cultivating lab strains in successively larger fermentation systems, from bench scale (~250 mL-5 L), to pilot scale (~20-200 L), to full-scale processes (>1000 L). Critical to successful scale-up is understanding how process variables (feed rate, pH, temperature, fermentation time, mixing regime, media composition, aeration rate, etc.) impact host physiology, cell growth, and product TRY. Accordingly, a central task of bioprocess scale-up is to identify and fine tune these process variables to maintain robust and stable production of the desired chemical. This process is often heuristic, and scale-up process development is often seen as more of an art than a science (Crater and Lievense, 2018; Humphrey, 1998). The fundamental reason for this is that large-scale fermentations are expensive and difficult to predict. A fermentation is a massively multiparametric process that can be affected by the slightest change in any of the many factors that define bioreactor conditions. For example, a change in feedstock or water source, inoculation volume, or even altitude of the bioreactor can impact the progress of the fermentation process. Performing thorough fermentation optimization studies in bioreactors is not only expensive, but also time consuming. Each 2 L bioreactor test can cost over 1000 USD and last over a week. Hence, scientific methods are needed to accelerate fermentation process development in bioreactors, beyond the current artisanal procedure. Fortunately, modern fermentation systems used during scale-up and at commercial plants contain sophisticated process controls, comprehensive data collection and archiving systems, and automation, which can be leveraged for training machine learning algorithms.
Machine learning is commonly used to mine the wealth of online and offline bioprocess data in order to shed light on the causes of scale-up process failures and to improve process outcomes (Charaniya et al., 2008; Baughman and Liu, 2014). For example, Coleman et al. (2003) used historical process data to develop a three-step optimization method using decision trees, an ANN ensemble, and a genetic algorithm to identify which process input variables were most important for fermentation modeling, and to select input values that increased product output. To avoid overfitting, process inputs (13 in total, spanning different fermentation, media, and inoculum conditions) were sub-selected using decision tree analysis on a data set of 69 fed-batch fermentations, which identified the inputs that best corresponded with each process output (biomass density, product concentration, and productivity). This feature selection preprocessing step is common for bioprocess data sets: it removes highly correlated or redundant process inputs prior to model training to prevent overfitting (Melcher et al., 2015; Coleman et al., 2003; Charaniya et al., 2008). The subsetted inputs were then used to train ANN ensembles to quantitatively predict each process output. This resulted in a data-driven process model that was used to identify novel input conditions that maximized process outputs via optimization (genetic algorithm). A similar approach, combining ANN modeling with subsequent optimization using a genetic algorithm, was taken by Pappu and Gummadi (2017) to optimize fermentation parameters for producing xylitol. The model accurately predicted xylose consumption, biomass density, and xylitol production following training on 27 fermentation batches with multiple inputs, and was used to select new process inputs (pH, agitation speed, and aeration rate) that increased xylitol titers from 59.4 to 65.7 g/L.
These examples highlight the ability to generate predictive process models in a data-driven fashion, providing an alternative to more traditional physics-based kinetic models (e.g. the Monod or Droop model) that often fail to capture poorly understood relationships between microbial growth and multiple process variables (Kovárová-Kovar and Egli, 1998).
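The two-stage idea of fitting a data-driven process model and then searching it with a genetic algorithm can be sketched as below. This is not the method of the cited studies: a quadratic least-squares surrogate stands in for the ANN, the genetic algorithm is a bare-bones version, and the 27 simulated batches, variable bounds, and hidden optimum are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical bounds for three process inputs.
bounds = np.array([[4.0, 7.0],       # pH
                   [200.0, 600.0],   # agitation speed (rpm)
                   [0.5, 2.0]])      # aeration rate (vvm)
span = bounds[:, 1] - bounds[:, 0]

def scale(X):
    """Map raw process inputs to [0, 1] per variable."""
    return (X - bounds[:, 0]) / span

def quad_features(Z):
    """Intercept, linear, and squared terms (no cross terms, for brevity)."""
    return np.column_stack([np.ones(len(Z)), Z, Z**2])

# Synthetic "historical" data: 27 batches with a hidden optimum (assumption).
true_opt = np.array([0.6, 0.4, 0.7])
X_hist = bounds[:, 0] + rng.uniform(size=(27, 3)) * span
titer = (60.0 - 20.0 * ((scale(X_hist) - true_opt) ** 2).sum(axis=1)
         + rng.normal(scale=0.5, size=27))

# Stage 1: fit the surrogate process model by least squares.
w, *_ = np.linalg.lstsq(quad_features(scale(X_hist)), titer, rcond=None)

def surrogate(Z):
    return quad_features(Z) @ w

# Stage 2: maximize the surrogate with a simple genetic algorithm.
def genetic_algorithm(fitness, n_dim, pop_size=60, n_gen=80, mut_scale=0.05):
    pop = rng.uniform(size=(pop_size, n_dim))
    for _ in range(n_gen):
        order = np.argsort(fitness(pop))[::-1]
        parents = pop[order[: pop_size // 2]]            # truncation selection
        mates = parents[rng.permutation(len(parents))]
        mask = rng.uniform(size=parents.shape) < 0.5     # uniform crossover
        children = np.where(mask, parents, mates)
        children += rng.normal(scale=mut_scale, size=children.shape)  # mutation
        pop = np.vstack([parents, np.clip(children, 0.0, 1.0)])
    return pop[np.argmax(fitness(pop))]

best_z = genetic_algorithm(surrogate, n_dim=3)
best_inputs = bounds[:, 0] + best_z * span
print("recommended pH, rpm, vvm:", best_inputs)
```

The recommended conditions are only as reliable as the surrogate, so in practice they would be treated as candidates for the next round of fermentations rather than as a final answer.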
Bioprocess data is highly heterogeneous and requires appropriate pre-processing before it can be used for machine learning. Many bioprocess parameters are collected online as continuous measurements (optical (cell) density, pH, dissolved oxygen, oxygen uptake rate, flow rate, offgas production, etc.), while others (e.g. chemical concentrations, substrate consumption rates) are measured offline at discrete time intervals. Additionally, some parameters, such as product concentrations, are only measured at the final time point, while others are categorical or binary (e.g. ON/OFF nutrient feed setting). This results in data sets that are highly heterogeneous with respect to time and between fermentation runs, and that require pre-processing to extract temporal trends that compactly and smoothly represent the data, preventing model overfitting (i.e. many more features than instances). For example, instead of using each time point measurement for model training, first and second order derivatives can be used to more compactly represent temporal trends (Cheung and Stephanopoulos, 1990a, 1990b), as can wavelet decomposition methods (Bakshi and Stephanopoulos, 1994), which outperform more classical smoothing approaches such as Savitzky-Golay. For low and very low signal-to-noise ratios, more recent denoising methods can be applied, such as the mean envelope filter (Merino et al., 2015) or spectral noise reduction by vector casting (Gebrekidan et al., 2020). Other approaches, including the discrete Fourier transform and symbolic aggregate approximation (SAX), can also be applied; these represent temporal trends as representative segments (e.g. the mean over a time window) instead of the entire time series (Charaniya et al., 2008). Beyond the need to reduce the number of time points used for model training, temporal offsets between data sets can arise, for example, due to differences in the growth lag phase between fermentation batches.
This can be corrected using dynamic time warping strategies that align time profiles between data sets to avoid incorrect comparisons (Chakrabarti et al., 2002; Keogh and Ratanamahatana, 2005).
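Dynamic time warping can be illustrated with the classic dynamic-programming recurrence. The sketch below is a textbook version without windowing or lower-bounding, and the two logistic growth curves with different lag phases are synthetic assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series (absolute cost)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two hypothetical growth curves that differ only by a 3 h lag phase.
t = np.linspace(0.0, 24.0, 49)
od_batch_a = 1.0 / (1.0 + np.exp(-(t - 8.0)))
od_batch_b = 1.0 / (1.0 + np.exp(-(t - 11.0)))

pointwise = np.abs(od_batch_a - od_batch_b).sum()
warped = dtw_distance(od_batch_a, od_batch_b)
print(f"pointwise distance: {pointwise:.2f}, DTW distance: {warped:.2f}")
```

The pointwise comparison penalizes the lag as if the batches behaved differently, whereas the warped distance is much smaller because the two profiles have the same shape once aligned.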
The availability of continuous online bioreactor data has also enabled control and optimization of bioprocesses through reinforcement learning. Currently, bioprocesses are controlled manually or using proportional-integral-derivative (PID) controller or model predictive control (MPC) (Qin and Badgwell, 2003) methods that automatically modulate one or more process variables (e.g. feeding rate) to control an output (e.g. temperature, production concentration). While these techniques have been widely used for complex multivariable control applications, they are built upon fixed models of the environment that do not get continuously updated and improved as they see more data. Therefore, there is growing interest in using model-free reinforcement learning methods to learn, through trial and error, the best control algorithm from large online data, and to optimize process operations (for a detailed overview see ). For example, a control policy was learned from online ethanol data to control final ethanol titers during yeast fermentations that had a lower overshoot, faster tracking, shorter transition, and smoother control signal than an advanced PID controller (Li et al., 2011). Reinforcement learning methods have also been demonstrated in simulated systems to control co-culture species biomass abundances and optimize product yields (Treloar et al., 2020), control reactor temperatures (Xie et al., 2020), and to optimize a downstream product separation unit (Hwangbo and Sin, 2020). However, current reinforcement learning methods alone still suffer from requiring large amounts of data for complex multivariable processes, and are often impractical or too costly to implement in real world applications (Shin et al., 2019). 
Therefore, approaches to improve the sample efficiency of reinforcement learning methods are needed; promising examples include combining them with model-based controllers (Xie et al., 2020) or through transfer learning, where offline model simulations are initially used to train control policies followed by the efficient adaptation of these policies with real online data (Petsagkourakis et al., 2020).
In sum, despite the challenges of high experimental cost and unpredictable nature of fermentations, the wealth of data generated in a single fermentation makes application of machine learning to scale-up an appealing proposal. Machine learning can be used to identify optimal fermentation parameters (i.e. selecting the most appropriate process conditions) and recommend appropriate responses during process upsets (via adaptive process monitoring and control) using the large amount of data that is available. This area may benefit significantly from coupling machine learning with mechanistic modelling (see next section) such as computational fluid dynamics simulations (Haringa et al., 2016, 2017).

Two paradigms at odds
Whereas the machine learning paradigm concentrates on enabling predictive power, metabolic engineers typically define scientific value around the understanding of mechanism, because it is perceived to be the road to better performance. Mechanisms are defined as the causally related set of processes and parts that result in the observed phenomena. Understanding these mechanisms has been crucial in the history of microbiology because it results in knowledge that can indeed be leveraged to predict the behavior of a biological system (pathways, strains, products, etc.) and can also be transferred to different systems where the same mechanism is involved. For example, if fosmidomycin is toxic and inhibits 1-deoxy-D-xylulose-5-phosphate reductoisomerase (DXR) in the non-mevalonate (MEP) pathway in E. coli, you would expect fosmidomycin to inhibit DXR in another host (Murkin et al., 2014). The kinetics of this inhibition mechanism can also be used to quantitatively predict the corresponding changes in MEP pathway flux, based on a Michaelis-Menten equation that relates fosmidomycin concentration and DXR reaction rate through the inhibitory dissociation constant, K_i.
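The kind of quantitative prediction a mechanistic rate law provides can be made concrete with a short sketch. It assumes simple reversible competitive inhibition, which may not describe the real enzyme, and every parameter value is hypothetical rather than a measured constant for DXR or fosmidomycin.

```python
def competitive_inhibition_rate(s, i, vmax, km, ki):
    """Michaelis-Menten rate with a competitive inhibitor.

    v = vmax * s / (km * (1 + i / ki) + s)
    s: substrate concentration, i: inhibitor concentration,
    km: Michaelis constant, ki: inhibitory dissociation constant.
    """
    return vmax * s / (km * (1.0 + i / ki) + s)

# Hypothetical parameters in arbitrary but consistent units.
vmax, km, ki = 1.0, 100.0, 0.05
substrate = 100.0

for inhibitor in (0.0, 0.05, 0.5):
    rate = competitive_inhibition_rate(substrate, inhibitor, vmax, km, ki)
    print(f"inhibitor = {inhibitor:5.2f} -> relative rate = {rate:.3f}")
```

With the substrate at its K_m, the uninhibited rate is half of the maximum, and an inhibitor concentration equal to K_i lowers it to one third, which shows how a single mechanistic parameter transfers to any host where the same mechanism holds.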
While there are a variety of different mechanistic mathematical models that are useful for guiding design, including gene expression models (Ay and Arnosti, 2011), genome-scale models (GSM) (King et al., 2016; Thiele and Palsson, 2010), kinetic models (O. D. O.D. , whole cell models (Karr et al., 2012; Macklin et al., 2020), and process models (Koutinas et al., 2012), many of them fail to provide the accurate quantitative predictions needed to systematically drive metabolic engineering projects in practice. For example, predicting metabolic flux changes due to gene knockouts with GSMs remains challenging (O'Brien et al., 2015), even after attempts to improve prediction accuracy by deriving constraints or objective functions from experimental data such as transcriptomics (Machado and Herrgård, 2014). Moreover, kinetic model predictions based on assumed quantitative relationships between inputs (e.g. fosmidomycin concentration) and outputs (e.g. DXR reaction flux) often do not hold in reality (Costa et al., 2010; Heijnen, 2005), and such models are nearly impossible to parameterize for every enzyme across all growth conditions. A key reason why these models fail is that their mathematical relationships between inputs and outputs are based on ideal conditions (e.g. in vitro for the Michaelis-Menten equation) that do not capture the complexity of the intracellular environment (e.g. regulation). They also lack the ability to automatically leverage more data to learn and improve prediction performance. If the model predictions fail, it takes a human head to creatively figure out how to correct the model, which often happens too slowly, leaving design to rely on trial-and-error experimental approaches. Therefore, new quantitative prediction frameworks are needed to drive the commercial success of metabolic engineering projects in industry, and bring about the field's full potential (as discussed in the introduction).
Machine learning's flexible data-driven framework can help overcome the challenges facing predictive biology. Machine learning links inputs and outputs (Fig. 2) without needing to understand what happens in between (i.e. the mechanism). Instead of using knowledge-derived mathematical relationships, machine learning models empirically derive input/output relationships (equations) through training on data that can be collected in a higher-throughput manner (titers, rates, yields, expression levels, etc.) and can automatically improve prediction performance as more data becomes available. Of course, machine learning approaches have their own limits. They require a large amount of data that is expensive to collect, which currently constitutes the largest practical bottleneck (see Section 5.1). Moreover, most machine learning algorithms, particularly deep neural networks, are black boxes and difficult to interpret, although this is also improving (see Section 5.3). Therefore, troubleshooting machine learning models to achieve further predictive power once performance has plateaued is challenging, especially since a clear connection to mechanism is not available. Accordingly, the preferred type of model is both predictive and mechanistic, and it is by combining machine learning with mechanistic models that such models can be created.

Integrating biological knowledge and machine learning
It is by integrating machine learning and mechanistic models that the benefits of both approaches can be combined: predictability that systematically increases as more data is available, and mechanistic insight. It is not entirely clear how to reach this goal, but there are some budding attempts (Fig. 11). A more comprehensive list of approaches that integrate data- and knowledge-based models can be found in the review by Zampieri et al. (2019).
One interesting avenue to explore is whether machine learning can be used to parameterize mechanistic models. A couple of studies (Andreozzi et al., 2016;Heckmann et al., 2018) demonstrated the potential for this by leveraging a set of machine learning models to predict enzyme catalytic turnover numbers from input features composed of network properties, enzyme structural properties, biochemistry, and assay conditions. Enzyme turnover numbers were then used to parameterize genome-scale models, which improved proteome predictions. Similarly, Chakrabarti et al. (2013) used a machine learning approach to identify feasible kinetic parameters for an ORACLE (optimization and risk analysis of complex living entities) kinetic model of metabolism. More generally, deriving biological knowledge from machine learning methods would provide an efficient way to advance scientific understanding from the growing deluge of data coming from multi-omic approaches. While it is not obvious how an actual mechanism can be learnt from purely data-driven machine learning approaches that are based on correlations rather than causation, some recent examples have demonstrated promising results in identifying relationships that are candidates for follow-up experiments to distill mechanisms. For example, Ma et al. (2018) developed a visible neural network (VNN), which couples the model's inner workings to those of a real system, by incorporating knowledge from gene ontologies to simultaneously simulate cell hierarchical structure and function. The resulting VNN was optimized for functional prediction (e.g., growth rate) while respecting biological structure (subsystem hierarchy) and was capable of identifying subsystem activity patterns. Another study (Zelezniak et al., 2018) leveraged metabolic network information to predict metabolite concentrations (response) from protein levels (input) in S. cerevisiae mutants through a multilinear regression: metabolite concentrations were expressed as a function of expression levels of the closest enzyme neighbors in the metabolic network. A more general approach called explainable artificial intelligence, XAI (see Section 5.3), presents enormous potential for providing mechanistic insights within data-driven machine learning models. An algorithm of this type was able to detect enhancer activity in the Drosophila embryo and alternative splicing in human-derived cell lines by systematically capturing high-order interactions between features that are predictive of the response (Basu et al., 2018).
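The idea of restricting regression inputs to the closest enzyme neighbors of a metabolite can be illustrated with a short sketch. This is not the procedure of Zelezniak et al. (2018): the function name `nearest_enzymes`, the definition of distance (number of enzyme steps in a breadth-first search), and the four-enzyme toy network are all assumptions made for illustration.

```python
from collections import deque


def nearest_enzymes(target, enzyme_to_metabolites, max_steps=1):
    """Return {enzyme: distance} for enzymes within max_steps of a metabolite.

    Distance is counted in enzyme steps on the bipartite
    metabolite-enzyme graph (an assumption of this sketch).
    """
    # Reverse map: metabolite -> enzymes that consume or produce it.
    metabolite_to_enzymes = {}
    for enzyme, metabolites in enzyme_to_metabolites.items():
        for metabolite in metabolites:
            metabolite_to_enzymes.setdefault(metabolite, set()).add(enzyme)

    distance = {}
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        metabolite, steps = queue.popleft()
        if steps == max_steps:
            continue  # do not expand beyond the requested radius
        for enzyme in metabolite_to_enzymes.get(metabolite, ()):
            if enzyme in distance:
                continue
            distance[enzyme] = steps + 1
            for neighbor in enzyme_to_metabolites[enzyme]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, steps + 1))
    return distance


# Hypothetical toy fragment of upper glycolysis (illustrative only).
toy_network = {
    "HXK": {"glc", "g6p"},
    "PGI": {"g6p", "f6p"},
    "PFK": {"f6p", "fbp"},
    "FBA": {"fbp", "dhap", "g3p"},
}

first_shell = nearest_enzymes("g6p", toy_network, max_steps=1)
second_shell = nearest_enzymes("g6p", toy_network, max_steps=2)
```

The protein levels of the selected enzymes would then serve as the input columns of a per-metabolite linear regression, which keeps the number of fitted coefficients small relative to the number of strains.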
Another possible approach is to incorporate input features derived from mechanistic models into machine learning models to improve their predictive power. For example, Culley et al. (2020) developed a machine learning pipeline for predicting S. cerevisiae growth rate that leveraged transcriptomic data and fluxes predicted by a genome-scale model as input features. They showed that using fluxes predicted from parsimonious flux balance analysis (pFBA) as features, combined with transcriptomics data, improved the predictive power of neural networks over using transcriptomics data alone. In a similar direction, it would be worthwhile exploring whether synthetic data augmentation based on mechanistic simulations can increase the predictive accuracy of machine learning models while learning hypothesized mechanisms underlying the data. Mechanistic models can also be a useful tool for feature selection for machine learning models. It has been shown that GSMs can be fruitfully leveraged to identify a subset of reactions to then be optimized through machine learning methods.
Finally, incorporating known physical or biological constraints on the solution space of machine learning algorithms can ensure biologically meaningful solutions or rule out possible machine learning solutions that are known to be biologically infeasible. In one study, various machine learning algorithms were used to predict central carbon metabolic fluxes measured through 13C Metabolic Flux Analysis (response) from culture and genetic information (input). The flux predictions of the best performing machine learning model were then changed as little as possible to satisfy the stoichiometric constraints provided by a GSM. Similarly, machine learning was used to reconcile empirical genetic interaction data with FBA model predictions (Szappanos et al., 2011).
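The minimal adjustment of predicted fluxes to satisfy stoichiometric constraints can be sketched as a least-squares projection. The sketch below is an illustration under stated assumptions, not the method of the study above: it enforces only the steady-state equality S v = 0 (no flux bounds), measures the change with the Euclidean norm, and assumes the stoichiometric matrix S has full row rank. Under these assumptions the adjusted fluxes are v = v_pred - S^T λ, where λ solves (S S^T) λ = S v_pred. The function names and the linear four-reaction toy network are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: S must have full row rank")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def project_to_steady_state(S, v_pred):
    """Closest flux vector (Euclidean norm) to v_pred with S v = 0."""
    n_met, n_rxn = len(S), len(v_pred)
    # Build S S^T and S v_pred.
    sst = [[sum(S[i][k] * S[j][k] for k in range(n_rxn)) for j in range(n_met)]
           for i in range(n_met)]
    sv = [sum(S[i][k] * v_pred[k] for k in range(n_rxn)) for i in range(n_met)]
    lam = solve_linear(sst, sv)  # Lagrange multipliers
    return [v_pred[k] - sum(S[i][k] * lam[i] for i in range(n_met))
            for k in range(n_rxn)]


# Hypothetical linear pathway: -> A -> B -> C ->
# Rows are metabolites A, B, C; columns are reactions r1..r4.
S = [
    [1, -1, 0, 0],
    [0, 1, -1, 0],
    [0, 0, 1, -1],
]
v_ml = [10.0, 9.0, 11.0, 10.0]  # made-up machine learning predictions
v_adjusted = project_to_steady_state(S, v_ml)
mass_balance = [sum(S[i][k] * v_adjusted[k] for k in range(4)) for i in range(3)]
```

In this toy pathway every reaction must carry the same flux at steady state, so the projection replaces the four inconsistent predictions with their mean. A realistic application with flux bounds would instead require a quadratic programming solver.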

Inspiring new machine learning from metabolic engineering
Metabolic engineering can also provide inspiration for new machine learning and AI algorithms. Biomimicry was the inspiration for neural networks (Fig. 7), so it is not unreasonable to think that biology can be the inspiration for more and better algorithms. Gene regulatory networks, for example, involve a sophisticated network of molecular interactions that regulate and determine the cell behaviour to sense and react to environmental cues and optimize survival. A full mechanistic understanding of the general principles of how this is achieved for different cells, environments, and threats, could provide valuable insights for new machine learning approaches.
Indeed, metabolic engineering is in a better situation to inspire new machine learning algorithms than other disciplines. While there is no hope that understanding how a neural network identifies a cat will reveal physiologically meaningful information on the brain's identification processes, in metabolic engineering we are quite close to the mechanism. Indeed, some of the mechanistic models provide predictions that may not be completely accurate, but are qualitatively acceptable (Lerman et al., 2012;Karr et al., 2012;Macklin et al., 2020). We believe using machine learning to complement the parts of mechanistic models that are less tested can significantly increase their accuracy. These hybrid models can, in turn, inspire new machine learning architectures and general approaches.

Major bottlenecks for further application
While the need for improved predictive power fosters the further application of machine learning in metabolic engineering, there are some fundamental obstacles to a wider application. These obstacles are both technical (data and algorithmic challenges) and sociological.
The foremost challenge is undoubtedly the scarcity of the large data sets needed to train machine learning algorithms. The majority of metabolic engineering projects involve far fewer than 100 strains/conditions. Whereas training instances can be multiplied by shrewd data augmentation (see section 2.2.1), it seems unlikely that the current status quo will be able to provide the amount of data routinely found in other fields (e.g., several million images in ImageNet (Deng et al., 2009)). This will undoubtedly limit the benefit that metabolic engineering can derive from machine learning. Another challenge is data quality, which is as important as quantity. High repeatability and low uncertainty are critical characteristics of high-quality data: an experiment must produce similar responses under identical inputs, or there is little hope that an algorithm can be predictive. Furthermore, data sharing is often hampered by the lack of the biological data standards needed for this exchange. For example, in the case of multiomics data, there are databases for genes (e.g. GenBank IDs (Benson et al., 2011)), proteins (e.g. UniProt IDs (The UniProt Consortium, 2017)), metabolites (e.g. PubChem IDs (Kim et al., 2016)), and reactions (e.g. BiGG database (King et al., 2016)), but these databases are often not comprehensive (e.g. not all proteins are submitted to UniProt) and are not fully interlinked (e.g. BiGG metabolites do not always have a PubChem entry). While there are efforts to solve this problem (e.g. MetaNetX (Moretti et al., 2016), or BioCyc (Karp et al., 2019)), this issue rarely reaches the high profile needed to attract the investment required to completely solve it. Moreover, if the state of data standardization is not good, metadata standardization is in an even worse state. Without an investment in this piece of infrastructure, there is little hope for a disruptive impact of machine learning in metabolic engineering (Fig. 12).
A possible solution to several of these problems involves automation (see section 5.2).
A second hurdle involves the adaptation of machine learning algorithms to the special needs of metabolic engineering. Uncertainty quantification is one need of a discipline with small training data sets that is beginning to be met (Radivojević et al., 2020). Explainable AI (XAI) involves creating models such that the reasons for their predictions can be understood by humans (see section 5.3). This is particularly important in metabolic engineering, where we often have, or can easily investigate, the mechanism responsible for a given response. This investigation is much more complicated in other fields such as artificial vision or astrophysics. The integration of prior biological knowledge into machine learning algorithms, and its extraction from machine learning results, is also an area that could provide significant advances in both metabolic engineering and machine learning, as discussed in section 4.
Another, often overlooked, obstacle involves the sociological challenge of having two very different groups working together: machine learning researchers and metabolic engineers. These two communities are typically trained very differently and there is little overlap between them. Communication is, hence, often complicated by these differences. Furthermore, they differ not only in their skill toolbox, but also in which problems arouse their interest. This creates problems in aligning interests and managing projects. Interaction is, however, necessary: it is becoming impossible even for machine learning researchers to keep abreast of the literature in their field, and the new metabolic engineering tools (e.g., CRISPR-based gene editing, cell-free engineering) are posing a similar challenge in this field. Only through an interdisciplinary effort can the best of both disciplines be combined to create something bigger than the sum of the parts.

Integrating machine learning and synthetic biology with automation
As indicated above, the training data for machine learning must be of high quality, in the sense that it must avoid biases due to inconsistent protocols and provide a quantification of repeatability (see section 2.4). Both goals can be systematically achieved through automation, which is one of the main reasons the intersection of machine learning, synthetic biology, and automation is thriving (Carbonell et al., 2019). Biological and chemical sciences data are nowadays growing at an unprecedented pace, but the databases aggregating biological and chemical findings are usually biased (Rodrigues, 2020). To avoid this bias, it is highly desirable to start veering away from the traditional approach of one entire PhD per molecule, or one scientist performing the full metabolic engineering process, and instead adopt the creation and maintenance of integrated engineering pipelines (Fig. 13). This is the path embodied by biofoundries (Hillson et al., 2019). This goal can be achieved by extending current automation pipelines for machine learning (Olson and Moore, 2018). Pipelines are fully or semi-automated infrastructures that carry out a procedure in a systematic manner: e.g., phenotyping through proteomics, strain construction, fermentation. Automated pipelines facilitate consistent protocols and reproducibility in synthetic biology (Jessop-Fabre and Sonnenschein, 2019), and have the capability to produce the amount of data required by machine learning. Fully automated and integrated DBTL pipelines have already been successfully adopted for the identification and optimization of biosynthetic pathways. In general, we expect machine learning, biochemical analytical techniques and automation to follow a path of parallel development and keep symbiotically interacting in pipelines so that machine learning will be a pillar in every step of biosystems design (Volk et al., 2020).
Automating metabolic engineering often involves multiplexing the bioengineering efforts to parallelize a set of combinatorial experiments. For example, digital microfluidics is a high-throughput liquid handling technique able to quickly automate diverse biological experiments at micro and nanoscales, thus accelerating the DBTL cycles and making synthetic biology programmable (Gach et al., 2017;Kothamachu et al., 2020). The combination of microfluidics with nanofluidics and optoelectronics has been used for the automated growth and analysis of thousands of cell lines in parallel on a single chip. Another implementation of these technologies enables the parallel construction and optical screening of tens of thousands of synthetic microbial communities per day (Kehe et al., 2019). Other efforts focus on parallelizing experiments while maintaining their potential to efficiently scale up. That is the case of automated workflows for media optimization, induction profiling, or microbial bioprocess optimization leveraging the Biolector, a microtiter plate-based cultivation device (Rohe et al., 2012). Another example of productive scale-up involves the Automated Microscale Bioreactor (Ambr 250), which can generate cell growth and protein production profiles comparable to those obtained in 1000-L industrial-scale bioreactor fermentations.
Some automation technologies focus on the human-to-system interface and embrace AI to further accelerate the experimentation processes. Robotic Process Automation (RPA) is an alternative approach that provides agents (bots) that operate on user interfaces the way a human would (van der Aalst et al., 2018). RPA is meant to replace humans in repetitive work that is frequent enough to make full automation economically feasible. Intelligent RPA (IRPA) is the current effort to fuse RPA with advanced AI methods to drastically extend its scope (Syed et al., 2020). Combining experimentation platforms with AI to accelerate experimental research is at the core of the so-called self-driving laboratories (Häse et al., 2019), which typically use multi-objective optimization techniques (Häse et al., 2018) and iterate over the design, execution, and learning steps of the experiments with complete autonomy (MacLeod et al., 2020). The use of AI-driven automation technologies with hardware robotics represents a step further. In that sense, a very recent automation effort has used state-of-the-art robotics to completely move the focus from automating the instruments to automating the researcher (Burger et al., 2020).

Fig. 12. The hierarchy of needs for leveraging machine learning in metabolic engineering. It is futile to rely on machine learning to guide metabolic engineering without first establishing the basic infrastructure that it depends on. The very base consists of creating the infrastructure to physically collect large amounts of high-quality data. The next step is to have the databases, standards and ontologies to structure and store the data appropriately. Data cleaning and outlier detection follow. The base for simple machine learning algorithms (linear regression), feature selection and algorithm training is at this point set. It is only at this stage that sophisticated machine learning and deep learning can significantly improve the metabolic engineering practice. Adapted from Rogati (2017).

Fig. 13. Traditional metabolic engineering vs pipeline.
The traditional metabolic engineering process involves a single researcher doing all phases of the project from pathway choice to strain building, fermentation, and data analysis. The pipeline approach instead focuses resources on creating a single, flexible, semi-automated pipeline consisting of different connected services supported by specialized teams. The pipeline approach favors repeatability, data quality and the stream of data required by machine learning. Furthermore, the pipeline allows for simultaneous development of multiple strains, so knowledge obtained from one design can immediately be leveraged for all others. BioCAD: Biological Computer-Aided Design; BioCAM: Biological Computer-Aided Manufacturing of Synthetic DNA (Oberortner et al., 2020).
Cloud labs are tools based on cloud technologies (Xu, 2012) that allow a scientist to remotely conduct biological research through robotic control, using a high-level interface that reduces the need for programming knowledge. As an added benefit, researchers usually get all the intermediate and final results stored on the cloud in digital formats prepared for downstream analysis by local or cloud computing (Mell and Grance, 2011). Recent years have seen the emergence of cloud labs (Check Hayden, 2014) and tools (Bates et al., 2017), a trend recently boosted by the social distancing requirements of the SARS-CoV-2 pandemic. Thus, a remote or distributed manner of experimentation is arising as an alternative to the classic local or centralized model.
Interestingly, COVID-19 has also promoted the do-it-yourself (DIY) approach to lab automation. In 1981, IBM introduced the personal computer (PC), democratizing computing with an open architecture model (Miller, 1989;O'Regan, 2012), and producing a paradigm shift. An equivalent shift for automation seems to be in motion, due to the combination of the maturity of the open source model with the rise of free open scientific hardware (FOSH), now accelerated by the SARS-CoV-2 pandemic (Maia Chagas et al., 2020). This trend in automation emerged from the use of 3D printing for a growing number of scientific and engineering applications in the laboratory (Silver, 2019). This pure DIY approach has already produced successful high-throughput automation platforms for bioengineering and is amenable to improvement by machine learning techniques such as deep reinforcement learning (Treloar et al., 2020). Some companies, such as Opentrons, are taking advantage of this new market niche and are providing open automation solutions halfway between the extreme DIY and the classical automation (May, 2019) based on proprietary and expensive equipment and consumables (Maia Chagas, 2018). There are already some open automation systems built on top of Opentrons liquid handling robots and devoted to synthetic biology applications, such as the DNA-BOT for automated DNA assembly (Storch et al., 2020). On the other hand, a low-cost modular FOSH liquid handler has recently been combined with machine learning to automate droplet experiments with AI-enabled computer vision (Faiña et al., 2020). Considering all of the above, technological developments seem to be quickly converging towards open hardware and software automation solutions based on machine learning and specific to synthetic biology.

Novel machine learning techniques to watch
Deep learning, with applications using several interconnected layers of ANNs (see Figs. 7 and 8), has been the subfield of machine learning driving the recent boost of AI. The number of such layers of ANNs is the depth of the neural network. With increasing depths, deep neural networks often have a large number of parameters. For example, a state-of-the-art system for natural language processing (NLP) (Manning, 1999), the autoregressive language model GPT-3, has almost one hundred layers and 175 billion parameters. These DL systems are intricate black boxes making decisions that are not easily interpretable from a human perspective. If a prediction deviates from the expected answer, it is generally not easy to understand why it failed, or how to correct the issue. These algorithms are only as good as the data they are trained with, so biases in the data have a significant impact on the predictions (Rodrigues, 2020), and there is a growing need for bias quantification metrics along with methods for overfitting detection and data debiasing (Ellingson et al., 2020).
The lack of interpretability has prevented machine learning in general and DL in particular from expanding in some fields that require trust in the underlying technology, such as in defense, healthcare, and other sensitive applications. Different novel approaches are under active research to overcome this critical drawback. Some of these try to make classic machine learning methods such as random forests more interpretable without a loss of efficacy (Basu et al., 2018). Another technique is even able to extract explicit physical relations by applying symbolic regression to components of a Graph Neural Network (GNN) trained by encouraging sparse latent representations in a supervised setting (Cranmer et al., 2020). In drug discovery, the lack of transparent and reproducible workflows has hindered widespread adoption of machine learning models, but this is being solved by novel scalable pipelines with traceable models stressing uncertainty quantification (Minnich et al., 2020).
In 2017, DARPA launched its explainable artificial intelligence (XAI) program as a comprehensive strategy to tackle the machine learning interpretability problem. DARPA's XAI aims at developing superior AI systems able to have a symbiotic relationship with humans (Gunning and Aha, 2019). A recent evolution on top of the XAI paradigm is the concept of Responsible AI (Barredo Arrieta et al., 2020), which imposes further constraints on the implementation of AI systems, like transparency, accountability, and ethics. However, the movement towards greater interpretability involves significant trade-offs in terms of performance, with a toll on fidelity and accuracy. Ultimately, that compromise could be rendered unnecessary by advances in high performance computing (HPC), since AI and HPC are converging in approaching the exascale era (Gwynne, 2019). Indeed, the joint effort of XAI developments with exascale computing, by bridging the gaps between cutting-edge research and sustainable policies, could pave the way for designing practical solutions to global challenges such as climate change (Streich et al., 2020).
XAI has numerous applications in unraveling the profound mechanics of natural or artificial systems, such as the molecular mechanisms underlying genome biology (Basu et al., 2018). A related DL framework is the use of physics-informed neural networks (PINN), which are trained to solve supervised forward and inverse problems involving nonlinear partial differential equations (PDE), thus supporting the union of data-driven and mathematical models (Raissi et al., 2019, 2020). In the case of very noisy data, Bayesian Neural Networks can be combined with PINNs (then called B-PINNs) to both avoid overfitting and quantify uncertainty (Yang et al., 2020).
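The core of the physics-informed idea is a loss that adds a penalty for violating a known differential equation to the usual data misfit. The sketch below illustrates only that loss, under several assumptions that depart from a real PINN: the governing equation is a simple ordinary differential equation for exponential growth, dx/dt = mu * x, rather than a nonlinear PDE; the derivative is approximated by central finite differences rather than automatic differentiation; and the model is any Python callable rather than a trained neural network. All names and numbers are hypothetical.

```python
import math


def physics_informed_loss(model, t_data, x_data, t_colloc, mu, weight=1.0, h=1e-4):
    """Data misfit plus weighted residual of dx/dt = mu * x.

    model    : callable t -> predicted x(t)
    t_colloc : collocation times where the equation is enforced
    """
    # Mean squared error against the measurements.
    data_term = sum((model(t) - x) ** 2 for t, x in zip(t_data, x_data)) / len(t_data)

    # Mean squared residual of the growth equation at collocation points.
    residuals = []
    for t in t_colloc:
        dxdt = (model(t + h) - model(t - h)) / (2 * h)  # central difference
        residuals.append(dxdt - mu * model(t))
    physics_term = sum(r ** 2 for r in residuals) / len(t_colloc)

    return data_term + weight * physics_term


mu = 0.5                                    # assumed growth rate (1/h)
t_data = [0.0, 1.0, 2.0]
x_data = [math.exp(mu * t) for t in t_data]  # noise-free synthetic biomass
t_colloc = [0.5, 1.5]

loss_exponential = physics_informed_loss(lambda t: math.exp(mu * t),
                                         t_data, x_data, t_colloc, mu)
loss_linear = physics_informed_loss(lambda t: 1.0 + 0.5 * t,
                                    t_data, x_data, t_colloc, mu)
```

A candidate that satisfies both the data and the equation yields a loss near zero, whereas a candidate that ignores the growth law is penalized by both terms. Training a PINN amounts to minimizing a loss of this kind over the network weights.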
Transfer learning (TL) (Ando and Zhang, 2005;Caruana, 1997;Pan and Yang, 2010) is the technique of knowledge transfer from a domain with enough training data to another related domain of interest that lacks such data. This transfer considerably enhances the learning performance by avoiding costly data-labeling efforts. This area is under rapid expansion but already offers many consolidated models from which to choose carefully depending on the type of application and its data (Zhuang et al., 2020). For example, TL has been used to tackle the problem of predicting associations between genotype and phenotype (Petegrosso et al., 2017). Clearly, TL could be key for different metabolic engineering projects if used to transfer predictive capabilities from one organism to another, avoiding the cost and time of generating large multiomics data sets from scratch. Finally, TL can be combined with XAI methods, for instance, to gather pathway and metabolic information in model organisms and translate it to others so as to obtain comprehensive genome-scale metabolic models in an efficient manner.
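A deliberately simple sketch can convey the parameter-transfer idea behind TL. It assumes a linear model in which the slope learned on a data-rich source organism is kept fixed, and only the intercept is refitted on a handful of target-organism measurements. This is loosely analogous to freezing the shared layers of a neural network and retraining the output layer. The data are synthetic and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def refit_intercept(slope, xs, ys):
    """Keep the transferred slope and refit only the intercept."""
    return sum(y - slope * x for x, y in zip(xs, ys)) / len(xs)


# Source organism: plenty of (synthetic) data, response = 2 * x + 1.
source_x = [0.0, 1.0, 2.0, 3.0, 4.0]
source_y = [2.0 * x + 1.0 for x in source_x]

# Target organism: only two measurements, same slope but shifted baseline.
target_x = [1.0, 2.0]
target_y = [3.5, 5.5]

slope, source_intercept = fit_line(source_x, source_y)
target_intercept = refit_intercept(slope, target_x, target_y)


def predict_target(x):
    """Prediction for the target organism using the transferred slope."""
    return slope * x + target_intercept
```

The benefit depends entirely on how similar the two domains are: if the true slope differed between organisms, the transferred model would be biased, which is why the choice of source domain matters.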

Conclusion
Machine learning provides an opportunity to make metabolic engineering more predictable and efficient. In this review, we have attempted to provide an introduction to this discipline in terms that are relatable to metabolic engineers, as well as providing illustrative examples along the traditional phases of metabolic engineering (from pathway choice and construction to scaling). We have also included practical advice including library suggestions and experimental design recommendations. Finally, we have examined the perspectives for this combination of disciplines, which are particularly relevant and difficult to predict, given the current explosive growth of both machine learning and synthetic biology.
In our opinion, metabolic engineering could take one of two courses in the future: incremental or disruptive. In the first, traditional methods prevail, progress is incremental, and more molecules are arduously brought into commercial use at an increasing rate. In the second, metabolic engineering fully embraces and integrates the possibilities afforded by automation and machine learning. This choice leads to a disruptive change that makes production of new molecules a relatively easy task, dwarfed by the more ambitious goals enabled by the new predictive capabilities. In this scenario, metabolic engineering is used to engineer microbiomes, create new biomaterials, provide fundamental understanding of emergent properties and evolution, and suggest new artificial intelligence approaches.
The fundamental challenges for the disruptive path involve enabling streams of high-quality data, developing new algorithms to integrate the advantages of data-driven and mechanistic approaches, and fully leveraging novel tools in machine learning and synthetic biology. In our view, solving these challenges is only possible through a multidisciplinary collaboration of scientists including metabolic engineers, biochemists, microbiologists, computer scientists, electrical engineers, chemical engineers, mathematicians, statisticians, and physicists, among others. We hope to have provided in this review a helpful resource for that multidisciplinary collaboration.