Machine Learning: A Suitable Method for Biocatalysis

Biocatalysis is currently a workhorse used to produce a wide array of compounds, from bulk to fine chemicals, in a green and sustainable manner. The success of biocatalysis is largely thanks to an enlargement of the feasible chemical reaction toolbox. This materialized due to major advances in enzyme screening tools and methods, together with high-throughput laboratory techniques for biocatalyst optimization through enzyme engineering. Therefore, enzyme-related knowledge has significantly increased. To handle the large volume of data now available, computational approaches have been gaining relevance in biocatalysis, among them machine learning methods (MLMs). MLMs use data and algorithms to learn and improve from experience automatically. This review intends to briefly highlight the contribution of biocatalysis within biochemical engineering and bioprocesses and to present the key aspects of MLMs currently used within the scope of biocatalysis and related fields, mostly with readers non-skilled in MLMs in mind. Accordingly, a brief overview and the basic concepts underlying MLMs are presented. This is complemented with the basic steps to build a machine learning model and followed by insights into the types of algorithms used to intelligently analyse data, identify patterns and develop realistic applications in biochemical engineering and bioprocesses. Notwithstanding, and given the scope of this review, some recent illustrative examples of MLMs in protein engineering, enzyme production, biocatalyst formulation and enzyme screening are provided, and future developments are suggested. Overall, it is envisaged that the present review will provide insights into MLMs and how these are major assets for more efficient biocatalysis.


Bioprocesses in Biotechnology
Bioprocess engineering is an interdisciplinary science that combines biology and engineering to optimize the growth of organisms and/or the generation of target materials. Traditional approaches to develop and improve bioprocesses, e.g., the microbial production of antibiotics, have evolved based on different techniques, namely genetic engineering and molecular biology [1].
Biochemical engineering is a multidisciplinary research area that combines natural sciences (e.g., biology, chemistry) with engineering (e.g., chemical engineering, process engineering) to address global challenges and strategic national priorities, including energy, health and sustainability. It involves several topics related to the development of biological processes, from the preparation of raw materials to the synthesis of bioproducts and recovery of biowaste, fostering a circular economy approach, with several applications relevant mainly for industrial biotechnology and medical and health biotechnology [2].
Landmark developments in bioprocesses throughout the XX and XXI centuries include: the introduction of large-scale aerated bioreactors operating under aseptic conditions; insight into the molecular machinery of bacteria and cells, e.g., DNA structure and the mechanisms and control of protein synthesis; from the 1980s onwards, the production of recombinant biopharmaceuticals, e.g., monoclonal antibodies, interferons and vaccines; the production of biopolymers (e.g., bioplastics), biofuels (e.g., biohydrogen) and messenger RNA vaccines; process design in a circular economy perspective; the introduction of recombinant DNA technology and the rational design and directed evolution of proteins, creating biomolecular diversity and enhanced therapeutic proteins and enzymes; and trends in automation, process integration and data-driven process optimization. These advances were accompanied by significant improvements in hardware, microbial sources, standardization and modelling, and by trends revived from the 1970s onwards in the wake of the fuel crisis and environmental concerns.
Plenty of mechanistic (chemistry- and physics-based), phenomenological and empirical models have been presented so far to simulate the kinetics of microbial biomass growth and product synthesis [7][8][9]; to quantify metabolic flux analysis [10,11]; for bioprocess optimization and biomanufacturing [12][13][14][15][16]; for bioreactor design, modelling and scale-up [17][18][19][20]; for the design of bioseparation units [20][21][22]; for protein and enzyme design [23][24][25][26]; and to characterize the kinetics of enzyme catalysis [24,[27][28][29][30]. Nevertheless, and despite advanced mathematical theories such as optimization and statistical analysis, both the research community and industry are still short of efficient and robust modelling tools that can accurately translate the complex knowledge of the bioengineering domain into mathematical formulations [2]. In recent years, however, there has been a shift from physical modelling to data-driven modelling, driven by the application of machine learning techniques to the large volumes of data recorded and stored by the biochemical industry. Such an approach, which combines machine learning algorithms and datasets, shows significant potential for the identification of intricate relationships [31][32][33][34][35][36][37].
Natural products can have different chemical structures and, according to studies carried out on the biosynthesis of these natural products, a wide range of biosynthetic enzymes have been identified [38][39][40]. Biosynthetic enzymes of natural origin are therefore a fundamental source of catalysts for a wide range of applications. From a biocatalytic point of view, the selection of a potential biosynthetic enzyme depends on substrate specificity, cofactors, turnover, stability, functional expression and the ability to perform its function autonomously outside its natural pathway in a specific cell. Still, the biosynthetic enzymes involved in secondary metabolism are often only moderately efficient catalysts, with turnover numbers (kcat) typically about 30-fold lower than those of primary metabolism, which is probably due to their evolutionary histories [41,42]. Moreover, it has been highlighted that the average enzyme is far from the kinetic perfection (kcat ~10^6 to 10^7 s^-1; kcat/KM ~10^8 to 10^9 s^-1 M^-1, KM: Michaelis constant) of textbook examples of enzymes [41,43]. Additionally, native enzymes also display several other notable limitations, among them poor catalytic efficiency over non-cognate substrates that are often of interest for practical applications; limited activity and stability under harsh industrial conditions (e.g., relatively extreme pH, temperature, ionic strength); and substrate and/or product inhibition [42,44,45]. These limitations, and strategies to overcome them through enzyme engineering, have been summarized recently [46]. Nonetheless, biosynthetic enzymes have been deemed starting points for directed evolution efforts to improve catalysis rates, substrate tolerance and scope, and stability in given solvents, e.g., ionic liquids or organic solvents, and to minimize or eliminate cofactor requirements [47].
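The kinetic benchmarks above can be made concrete with a short calculation. Below is a minimal sketch (with illustrative, made-up values rather than data for any specific enzyme) of the Michaelis-Menten rate law and the catalytic efficiency kcat/KM:

```python
def michaelis_menten_rate(kcat, km, enzyme_conc, substrate_conc):
    """Michaelis-Menten rate law: v = kcat * [E] * [S] / (KM + [S])."""
    return kcat * enzyme_conc * substrate_conc / (km + substrate_conc)

# Illustrative values: a kinetically "perfect" enzyme vs. an average one
kcat_fast, km_fast = 1e6, 1e-2   # s^-1, M
kcat_avg, km_avg = 10.0, 1e-4    # s^-1, M

# Catalytic efficiency kcat/KM (s^-1 M^-1)
eff_fast = kcat_fast / km_fast   # 1e8: textbook-level efficiency
eff_avg = kcat_avg / km_avg      # 1e5: three orders of magnitude lower

v = michaelis_menten_rate(kcat_avg, km_avg, enzyme_conc=1e-6, substrate_conc=1e-3)
print(f"efficiency (fast): {eff_fast:.1e} s^-1 M^-1")
print(f"efficiency (avg):  {eff_avg:.1e} s^-1 M^-1")
print(f"rate: {v:.2e} M/s")
```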
Several strategies based on machine learning methods (MLMs) have been presented to make enzymes perform better in real-life applications. As MLMs can be used to gain insight into relationships involving the sequence, structure and function of enzymes in a data-driven manner, they enable the identification, from characterized enzymes, of sequences likely to display the improved properties sought after [48,49], such as structural stability [50]; the prediction of the effect of mutations on catalytic power and selectivity [51]; the screening of potential substrates for a given enzyme [52]; the identification of substrate promiscuity from sequence data [53]; and the prediction of new beneficial mutant combinations in engineered enzymes that broaden the substrate range of a given enzyme [54]. These last three foster enzyme promiscuity. Regarding this particular matter, novel algorithms to predict enzyme-substrate compatibility that rely on curated information from metagenomic enzyme family screens have been proposed. These are expected to contribute to establishing the basis and standards for reliable enzyme-substrate compatibility models and to enhance our ability to predict enzymes' capacity to act on non-cognate substrates. It has previously been hinted that there might be an optimal range in which residues should position themselves relative to the active site, which may be pertinent for promiscuity [55]. Moreover, it has been highlighted that widening the catalytic site may diversify the binding orientations of different substrates, impair selectivity and ultimately lead to the formation of multiple isomers. Thus, care has to be taken when increasing the space in the catalytic site for enhanced substrate tolerance, so as not to compromise selectivity [56]. Research dedicated to gaining insights into enzyme promiscuity, in particular to better understand the structure-function relationship, has gained relevance in the last decade.
Accordingly, a database dedicated to gathering information on promiscuous activities and helping to identify new catalytic activities and their underlying mechanisms has been recently presented [57]. Improving operational stability may be achieved through enzyme immobilization. By collecting and processing information from published works involving enzyme and carrier properties, a predictive model can be developed to assist in the design of robust immobilized biocatalysts [58]. A package to assist in the rational immobilization of enzymes was developed based on an algorithm that collects information on enzyme features, e.g., active site, surface and residue clustering, and retrieves the literature on immobilization to assist the user in identifying the proper immobilization approach for a given enzyme [59]. Models to identify peptides that bind to specific materials, more specifically polystyrene, have also been presented, which negate the need for expensive and time-consuming wet-lab experiments [60].
In the following sections, the basics underlying MLMs will be presented and illustrated in addition to their diverse applications in bioprocesses (Section 2), with a particular focus on biocatalysis (Section 3).

An Overview and Basic Concepts
Artificial intelligence (AI) gives computers the ability to make decisions by analysing data independently, following predefined rules or pattern recognition models. In the field of biotechnology, AI is widely used for various research challenges, most notably for de novo protein design, where new proteins with envisaged functions are assembled using amino acid sequences not found naturally, according to the physical principles underlying intra- and intermolecular interactions [61][62][63][64], and in protein engineering, where selected proteins are manipulated to tailor key properties, e.g., activity, selectivity and stability [61,65]. Here, AI has been particularly useful when used to assist directed evolution experiments, namely by enabling a reduction in the number of wet-lab iterations required to generate a protein with the intended features [61,66,67]. Additionally, AI has been used in the field of biopharmaceuticals (drugs and vaccines) to develop new drugs, repurpose existing and marketed drugs, understand drugs' mechanisms and modes of operation, design and optimize clinical trials and identify biomarkers. AI is also used in the analysis of genomic interactions and the study of interaction pathways, protein-DNA interactions, cancer diagnosis and genetic analysis, among other applications [68][69][70]. Machine learning (ML) is a subfield of AI that allows the development of computer programs that learn and improve their performance automatically based on experience, without being explicitly programmed. In various studies, ML strategies applied to large datasets generated by different techniques have been advantageously used for different purposes, such as the identification of weight-associated biomarkers [71], discovery of food identity markers [72], elucidation of animal metabolism [73] and investigation of many other areas of metabolomics [74,75].
Many studies highlight the essential advantage of using ML and systems biology in pathway discovery and analysis, enzyme identification, modelling of metabolism and growth, genome annotation, the study of multiomics datasets and 3D protein modelling [76]. Based on the available data, ML algorithms find patterns among data points, each described by several characteristics or descriptors, e.g., enzyme sequences, their secondary and tertiary structures, substitutions, physicochemical properties of amino acids, etc. These descriptors usually number from tens to thousands, and are thus hard and extremely time-consuming to handle using conventional approaches.
ML can be implemented through unsupervised and supervised learning. Unsupervised learning reduces high-dimensional data to a smaller number of dimensions or identifies patterns in the data. In turn, in supervised learning, algorithms use data labelled in advance (designated the training set) to learn how to classify new, unseen data. Labelled data thus consist of a set of training examples, each composed of an input and a sought-after output value. In this way, major features or combinations of features are identified, which can improve label accuracy in the training set, and the gathered information can then be used for labelling future inputs. To put it another way, one or several target characteristics, e.g., enzyme activity, specificity or stability, can be designated as labels. The goal is to design a predictor that will return labels for unseen data points based on their descriptors, using a properly tagged training dataset. Supervised and unsupervised methods can be combined under specific conditions to yield semi-supervised learning [77,78].
Supervised learning is by far the preferred approach in enzyme engineering, as the focus is on improving one or more properties of the enzyme [78]. Overall, the process flow of machine learning can be divided into three stages (Figure 1). Stage 1, which involves data collection, recording and preparation of the input to be fed to the algorithm, is often considered the most laborious phase. The databases BRENDA, EnzymeML, IntEnzyDB, the Protein Data Bank (PDB) and UniProtKB (these and further examples are given in [78][79][80][81]) are by far the preferred sources for acquiring information. However, to extract useful information from the retrieved dataset, the data must be adequately pre-processed or cleaned, e.g., by managing errors and missing data and detecting and removing duplicates, outliers and irrelevant information, as the quality of the data heavily influences the precision of the final outcome [82]. Within this scope, and with the aim of facilitating the use of information throughout multiple scientific areas, steps towards the standardization of data and semantics have recently been achieved [83]. In stage 2, algorithms process the data fed to the selected model. The final stage involves model validation using test data. Between stages 1 and 2, the available experimental data are split into two parts: part of the data are used for training subsets and adjusting the parameters of a predictor (stage 2); the remaining data are diverted to stage 3 for the final evaluation [78,84]. The algorithm also has to learn how to classify input data, e.g., assign a label such as spam/not spam. For classification with binary labels, or labels with a finite number of options, this evaluation is usually based on the counts of true/false positives and negatives, which are gathered in a confusion matrix. A confusion matrix can be described as a summary of the prediction results for a classification problem [78,85].
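The split between training and final-evaluation data that separates stages 2 and 3 can be sketched in a few lines of Python. The dataset and the one-parameter threshold "predictor" below are deliberately naive stand-ins, assumed for illustration only:

```python
import random

random.seed(0)

# Synthetic dataset: (descriptor value, label) pairs, e.g., a single
# enzyme feature versus an active/inactive label (illustrative only)
data = [(x, 1 if x > 0.5 else 0) for x in [random.random() for _ in range(100)]]

# Stage 1 -> 2/3 boundary: hold out 20% of the data for final testing
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Stage 2: "train" a one-parameter threshold predictor on the training set
threshold = sum(x for x, _ in train) / len(train)
predict = lambda x: 1 if x > threshold else 0

# Stage 3: evaluate on data unseen during training
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"train size: {len(train)}, test size: {len(test)}, accuracy: {accuracy:.2f}")
```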
Classification is evaluated based on the sensitivity and specificity of the results [77]. For regression, where the relationship between independent variables (features) and a continuous dependent variable (outcome) is addressed, the quality of the prediction is typically evaluated using the root mean square deviation [77,78]. The final assessment (stage 3) is carried out on the test dataset. This is paramount, as the goal is to ensure the robustness of a model through its successful application to datasets other than those used for training. In enzyme engineering, the occurrence of sequence similarities within the training and testing datasets must be scrutinized. An overrepresentation of a given enzyme family in the training set is likely to lead to a biased predictor that identifies patterns for that sole family. Additionally, similarity between sequences in the training and testing datasets is prone to produce overoptimistic results when the performance is evaluated in stage 3 [78].

Figure 1. The first stage involves raw data collection and cleaning. The cleaned data are split into two sets, one providing training data (stage 2) and the other testing data (stage 3). In the second stage, several algorithms are evaluated to find the one that best fits the matter at hand. In stage 3, the model selected in stage 2 is tested again with a new set of data to establish how it performs. Depending on the outcome, the model may require adjustments, so going back to previous stages may be required.

Stage 2, which involves either adjusting a predictor or selecting a predictor among several possibilities, is often performed during the training step through K-fold validation. Thus, the training data are subdivided into K equally sized subsets, and each subset in turn is used for testing while the remaining K-1 subsets are used for training.
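The classification and regression metrics just mentioned are straightforward to compute. A minimal sketch (with made-up labels and values) of a binary confusion matrix, the derived sensitivity and specificity, and the root mean square deviation for a regression task:

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Illustrative predictions, e.g., active (1) vs. inactive (0) enzyme variants
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_matrix(y_true, y_pred)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# Root mean square deviation for a regression task (illustrative values)
obs = [1.0, 2.0, 3.0]
est = [1.1, 1.9, 3.2]
rmsd = (sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs)) ** 0.5

print(f"sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}, RMSD: {rmsd:.3f}")
```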
This process is repeated K times, and finally an average of the performance scores across the subsets is determined for each array of hyperparameters evaluated. K-fold validation is intended to mitigate both underfitting (usually high bias and low variance) and overfitting (usually high variance and low bias). Underfitting often takes place when a predictor fails to seize the underlying pattern of the training data and is thus unable to generalize; overfitting takes place when the predictor picks up the details and noise of the training data too well and is hence unable to generalize when exposed to unseen data. Underfitting may result from an overly simple model or a too-short training duration. Overfitting is likely to occur due to excessive noise, irrelevant or missing data, data bias or poor data quality [78,86,87].
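The K-fold procedure described above can be sketched as follows; the "score" computed per fold is a placeholder standing in for a real model's validation performance:

```python
def k_fold_splits(data, k):
    """Yield (train, validation) splits: each of the K subsets is used once
    for validation while the remaining K-1 subsets are used for training."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

data = list(range(20))
scores = []
for train, validation in k_fold_splits(data, k=5):
    # Placeholder "score": fraction of points held out per fold; in a real
    # workflow this would be the metric of a model fitted on `train`
    scores.append(len(validation) / len(data))

average_score = sum(scores) / len(scores)
print(f"folds: {len(scores)}, average score: {average_score:.2f}")
```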

Building a Machine Learning Sequence-Function Model
In the training stage of a machine learning model, the goal is to tune its parameters to optimize its predictive accuracy. Accordingly, training aims at the accurate prediction of labels for inputs unseen during training; thus, model performance evaluation must be carried out using data absent from the training set (e.g., 20% of the data should be saved for performance evaluation). Besides the parameters, i.e., values that are learned directly from the training data and estimated by the algorithm with no manual contribution, building an ML model requires hyperparameters. These are values required to establish the complexity of the model. Unlike parameters, hyperparameters (e.g., the number of layers in a neural network or the scaling of input and/or output data) cannot be learned directly from the training data; thus, they have to be set by the practitioner, either by hand or, more typically, using procedures such as Bayesian optimization, grid search and random search [49,[87][88][89]. The selection of proper hyperparameter values is critical, since even minor changes can significantly impact model accuracy [90]. Hence, the optimization of hyperparameter values is typically computationally intensive, as each candidate set of hyperparameters requires training a new model [49]. In practice, the selection of hyperparameters involves splitting the data remaining after selection of the test set into a training set and a validation set. The former is used to learn parameters, while the latter is used to select hyperparameters, which are then validated through a proper estimate of the test error. K-fold cross-validation, as described in Section 2.1, is often used, although it requires significant training time. Alternatively, a constant validation set may be used, at the risk of a poorer estimate of the test error [49,87].
The proper selection of hyperparameters is considered paramount to ensure the success of neural networks, since it determines the correct number of hidden layers, neurons per hidden layer and the activation function. Different strategies have been proposed, from a basic random search of hyperparameters to more advanced techniques, such as Bayesian optimization. Irrespective of the method, the successful implementation of a neural network model depends on the correct selection of hyperparameters, albeit this is often not given proper attention [2].
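A random search over hyperparameters, as mentioned above, can be sketched in a few lines. The hyperparameter names, ranges and the toy validation-error function below are illustrative assumptions, not tied to any specific model or library:

```python
import random

random.seed(1)

# Hypothetical hyperparameter space for a small neural network
# (names and ranges are illustrative)
space = {
    "hidden_layers": [1, 2, 3],
    "neurons_per_layer": [8, 16, 32, 64],
    "learning_rate": [1e-1, 1e-2, 1e-3],
}

def validation_error(config):
    """Stand-in for training a model with `config` and scoring it on a
    validation set; here a deterministic toy function of the settings."""
    return (abs(config["hidden_layers"] - 2)
            + abs(config["neurons_per_layer"] - 32) / 32
            + abs(config["learning_rate"] - 1e-2) * 10)

# Random search: sample a fixed budget of configurations, keep the best
budget = 10
best = min(
    ({k: random.choice(v) for k, v in space.items()} for _ in range(budget)),
    key=validation_error,
)
print("best configuration found:", best)
```

A grid search would instead enumerate every combination in `space`; random search is often preferred when only a few hyperparameters actually matter.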

A Brief Overview of ML Algorithms
ML relies on the use of algorithms, programs that receive and analyse data (input) and predict values (output) within a suitable span. As new data are fed to the algorithm, it learns, enhances its operation and concomitantly performs better over time. Algorithms encompass accurate, probabilistic techniques that enable a computer to take a given reference point and identify patterns in vast and/or intricate datasets [91,92]. Different algorithms fostering different ways to achieve this goal have been developed. For instance, the simplest machine learning models apply a linear transformation to the input features, such as the amino acid at each position, the presence or absence of a mutation [93] or blocks of sequence in a library of chimeric proteins made through recombination [94]. Linear models were commonly used as baseline predictors prior to the development of more powerful models. On a different level of complexity and concept, neural networks stack multiple linear layers connected by nonlinear activation functions, which allows the extraction of high-level features from structured inputs. Neural networks are hence well suited for tasks with large, labelled datasets.
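The linear sequence-function models described above can be sketched as a one-hot encoding followed by a linear transformation. The sequences and weights below are made up for illustration:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode a protein sequence as a flat one-hot feature vector:
    20 features per position, one per amino acid."""
    features = []
    for residue in sequence:
        features.extend(1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS)
    return features

def linear_model(features, weights, bias=0.0):
    """Simplest sequence-function model: a linear transformation of the
    one-hot input features."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Illustrative 3-residue "sequence" and made-up weights (3 x 20 features)
x = one_hot("ACD")
weights = [0.1] * 60
print(f"features: {len(x)}, prediction: {linear_model(x, weights):.2f}")
```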
Irrespective of their intrinsic nature, MLMs display both merits and limitations. Among the former, MLMs can mine intricate functions and relationships and therefore efficiently model underlying processes; they are able to take on extensive datasets, e.g., protein databanks and data from analytical methods such as LC-MS, GC-MS or MALDI-TOF MS, which have been paramount within the scope of multiomics research, offering insights into enzymes' roles and structures; and they can find hidden structures and patterns in data, identify novel critical process parameters and control those. This is paramount to warrant validated ranges of the critical quality attributes of bioproducts, which determine the value ranges that must be met for a bioproduct to be released. Among the demerits of MLMs, the requirement of large datasets for proper model training, the need for high computational power and the complexity of the set-up, with the concomitant risk of faulty design, may be singled out as the major shortcomings [2,16,[95][96][97].
Multiple ML algorithms have already been applied to enzyme engineering. Random forests, for example, are used to predict protein solubility and thermostability, while support vector machines have been used to predict protein thermostability, enantioselectivity and membrane protein expression. K-nearest-neighbour classifiers have been applied to predict enzyme function and mechanisms, and various scoring and clustering algorithms have been employed for rapid functional annotation of sequences [78]. Gaussian processes, in turn, have been used to predict thermostability, enzyme-substrate compatibility, fluorescence, membrane localization and channelrhodopsin photo-properties. Deep learning models, i.e., many-layered neural networks, are well suited for tasks involving large, labelled datasets with examples from many protein families, such as protein-nucleic acid binding, protein-MHC binding, binding site prediction, protein-ligand binding, solubility, thermostability, subcellular localization, secondary structure, functional class and even 3D structure. Deep learning networks are also particularly useful in metabolic pathway optimization and genome annotation [49].
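As an illustration of one of the simpler algorithms mentioned, a K-nearest-neighbour classifier can be written in a few lines. The 2D descriptors and functional-class labels below are hypothetical:

```python
from collections import Counter

def knn_classify(query, labelled_points, k=3):
    """K-nearest-neighbour classification: the query point takes the
    majority label among its k closest labelled examples."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbours = sorted(labelled_points, key=lambda p: distance(p[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical 2D descriptors (e.g., two physicochemical features)
# labelled with an enzyme functional class -- illustrative values only
training = [
    ((0.1, 0.2), "hydrolase"), ((0.2, 0.1), "hydrolase"), ((0.15, 0.25), "hydrolase"),
    ((0.8, 0.9), "oxidoreductase"), ((0.9, 0.8), "oxidoreductase"), ((0.85, 0.95), "oxidoreductase"),
]

print(knn_classify((0.2, 0.2), training))   # nearest examples are hydrolases
```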
Further examples of ML algorithms, details and illustrative applications are given in Table 2.

Table 2. An overview of ML algorithms: basic aspects and illustrative applications. Specific issues and comprehensive details are out of the scope of this review and can be found in recent publications [2,91,95,[98][99][100].

Algorithm and Key Issues Examples of Applications
Multivariate analysis
Encompasses a set of machine learning algorithms, such as principal component analysis (PCA), linear regression (LR), multiple linear regression (MLR) and partial least-squares regression (PLS) [2,101,102]. These have been largely used, and remain dominant, as ML tools in the bioprocessing industry since their inception in the late XX century [2]. PCA is an unsupervised method that reduces the dimensionality of a dataset, yielding new uncorrelated variables (i.e., latent variables) that maximize the variation captured. It can be used to discriminate components, find hidden patterns, identify abnormalities, etc. MLR is a supervised method that uses several independent variables to predict the outcome of a single dependent variable; when a single independent variable is used to predict the outcome of the dependent variable, MLR reduces to LR. PLS is a supervised dimensionality-reduction algorithm that can directly relate an input dataset and a corresponding output dataset, establishing a linear correlation between the input and output variables within their latent space [103].
PCA: bacterial cell behaviour in the presence of organic solvents [104], bioreactor monitoring [105,106], protein sequence clusters [107], enzyme screening [108], mode of action of antibiotics and discovery of new bioactive compounds [109] and analysis of cereals [110]. MLR: prediction of secondary protein structure [111], screening of protease inhibitors [112]. LR: effect of active metabolites in a population [113], effect of a linear transformation of the input features, as achieved by placing an amino acid at each position or the presence or absence of a mutation [93], effects of blocks of sequence in a library of chimeric proteins made through recombination [94]. PLS: monitoring [114] and control of bioreactors [115], development of a biosensor device for analysis of binary mixtures of phenols [116], and prediction of steroid diffusion across artificial membranes [117].
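The LR special case mentioned above has a simple closed-form solution, sketched below with illustrative data:

```python
def linear_regression(xs, ys):
    """Closed-form least-squares fit of y = a*x + b for a single
    independent variable (the LR special case of MLR)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data: one process variable vs. one measured response
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]
a, b = linear_regression(xs, ys)
print(f"y = {a:.2f}*x + {b:.2f}")
```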

Support vector machines (SVMs)
A supervised algorithm that can be used for both classification and regression purposes (more commonly the former). It aims to find a hyperplane that optimally divides a dataset into two classes and is able to extract complex nonlinear relationships, as typically observed within bio-applications. SVMs have been used in bioprocessing since the late XX century. Limited use in the presence of large datasets, questionable model interpretability and the lack of uncertainty estimates associated with predictions hamper further dissemination. SVMs have gradually been replaced in several settings by other methods, e.g., artificial neural networks and random forests, as these also provide more accurate predictions [2,[118][119][120].
SVMs: prediction of the secondary structure of a protein [121], prediction of protein binding sites [122], identification of antioxidant proteins [123], chemotaxonomy studies based on secondary metabolites (diterpenes) [124], analysis of metabolic fluxes in microbial cells [125,126] and optimization of the permeability of a membrane used in a bioreactor for wastewater treatment [127].
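The hyperplane decision rule at the core of a linear SVM can be sketched as follows; the weights and bias are illustrative values, not the output of actual SVM training:

```python
def hyperplane_classify(x, weights, bias):
    """Decision rule of a linear classifier such as an SVM: which side of
    the hyperplane w . x + b = 0 does the point fall on?"""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Hypothetical "trained" hyperplane separating two classes in 2D
weights, bias = [1.0, -1.0], 0.0

print(hyperplane_classify([2.0, 0.5], weights, bias))   # one side: +1
print(hyperplane_classify([0.5, 2.0], weights, bias))   # other side: -1
```

SVM training chooses `weights` and `bias` to maximize the margin between the two classes; kernels extend the same rule to nonlinear boundaries.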

Artificial neural networks (ANNs)
Mimic the way brain cells process information. Used in either supervised or unsupervised learning. An ANN is a topological structure formed by processing elements (artificial neurons) connected by coefficients (weights) and organized in layers [128,129]. ANNs provide a flexible regression structure to predict the relationship between inputs and outputs and can approximate any function [130]. Given this flexible model structure and a set of input and output data, the parameters of the neural network can be adjusted iteratively so that the inputs match their correct outputs and the estimates come ever closer to the training data [2]. Roughly, ANNs can be presented as single-layer perceptron (SL) and multi-layer (ML) networks. The SL contains only two layers (input and output) yet fails to handle complex patterns; hence, additional layers (ML), termed hidden layers, can be introduced [131]. To vary the weights so as to approximate an underlying function, the derivative of the error between the training output and the predicted response with respect to the weights of the network is determined, allowing gradient-based optimization solvers to minimize the error [2]. Several network structures have been proposed, e.g., convolutional neural networks, which accept a matrix or tensor of inputs such as an image [20,132,133]; recurrent neural networks, which use so-called internal memory [134][135][136]; and deep neural networks, where many hidden layers facilitate the modelling of intricate underlying functions due to the large number of parameters [137][138][139] and clearly embrace the deep learning concept, since more than three layers are involved. ANNs are gradually replacing PCA and PLS methods, given the relatively poor accuracy of the latter when simulating nonlinear biochemical reaction systems [2].
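The layered structure described above can be sketched as a forward pass through a minimal multi-layer network; the weights are hand-picked for illustration, whereas in practice they would be learned by gradient-based optimization:

```python
import math

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a minimal multi-layer network: one hidden layer of
    sigmoid neurons followed by a linear output layer."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Illustrative weights for a 2-input, 2-hidden-neuron, 1-output network
y = forward(
    x=[0.5, -0.2],
    hidden_weights=[[1.0, -1.0], [0.5, 0.5]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.0, -1.0],
    output_bias=0.2,
)
print(f"network output: {y:.3f}")
```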
ANNs: modelling and optimization of enzymatic treatment for nutritional enhancement of rice [140], optimization of fermentation conditions for the production of a lipopeptide antibiotic [141], optimization of algal biofuel production [20], prediction of the toxicity of ionic liquids towards enzyme activity [142], liquid level control for bioreactor management [118], classification of 3D enzyme structure [132], prediction of protein structure [49,133,139,143], recognition of amino acids in protein engineering [135], de novo protein design [138], learning of the protein function-structure relationship [144], protein thermostability [145,146], protein subcellular localization [147], protein functional class [148], protein solubility [149], recognition of promoter sequences [134], calibration of biosensors [136], prediction of flux in metabolic pathways given enzyme concentrations [150], tapping into the relationship between the chemical structure of given molecules and their biological activity for drug design [151].

Gaussian processes (GPs): A probabilistic machine learning algorithm in which the estimates obtained are probability distributions as opposed to scalar values. Can be used in both supervised and unsupervised learning. Usually defined as a class of machine learning interpolation techniques with no assumed measurement noise, in which case a GP provides an exact fit to the dataset. Estimates are typically made based on the weighted sum of the output data, weighted by the distance of the predictions from the existing data in the input space. The resulting probability distributions provide insight into the uncertainty of a forecast. GP models are attractive given their flexible non-parametric nature and computational simplicity [2,152]. A GP is a distribution that, instead of returning unique values, returns functions. This distribution is conditioned on the training data using Bayesian reasoning, ultimately leading to a predictive distribution [153]. The run time for exact GP regression scales with the cube of the number of training examples, which makes it unsuitable for large (>10³) datasets, but fast and accurate approximations are currently available [154]. Gaussian process prediction is hampered by the inversion of a covariance matrix, which computationally scales with the number of data points. Alternative processes have thus been developed, namely sparse Gaussian processes that approximate the posterior predictive distribution or the precision matrix and scale to much larger datasets [155].

GPs: prediction of protein stability upon mutation [156], screening of the Michaelis constant (KM), and hence substrate affinity, for a given enzyme-substrate pair [157], assistance in directed evolution in a model system where protein function is altered and green fluorescence is transformed into yellow fluorescence [158], identification of channelrhodopsins that express and localize to the plasma membrane and conversion of a channelrhodopsin unable to localize into one that localizes well to the plasma membrane [159], engineering channelrhodopsins to obtain a mutant with high light sensitivity and potential application in optogenetics [160], real-time monitoring of cell culture processes through prediction of glucose and lactate concentrations [161], determination of the dynamics of a metabolic pathway with no need for time-dependent flux measurements [162].
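As a minimal illustration of these ideas, the sketch below fits a GP regressor to a small synthetic dataset with scikit-learn; the kernel choice and the data are assumptions made for the example, not taken from any of the cited studies. Note how the prediction returns a distribution (a mean and a standard deviation) rather than scalar values.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D dataset: noisy observations of an underlying activity curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(25, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=25)

# An RBF kernel plus a white-noise term; hyperparameters are fitted by
# maximizing the marginal likelihood on the training data.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# Unlike scalar-valued regressors, a GP returns a full predictive
# distribution: a mean and a standard deviation at every query point.
X_new = np.linspace(0, 10, 50).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
```

The WhiteKernel term models measurement noise; dropping it recovers the noise-free, exactly interpolating GP described above.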

Ensemble learning (EL): Combines supervised learning methods by merging predictions from several inducers into a decision. Thus, the errors of an individual inducer are counterbalanced by the others. An inducer, also called a base learner, is an algorithm that relates input and output data. The often-improved predictive performance of ensemble learning methods prevents overfitting. This minimizes the risk of obtaining locally optimal models and widens the search space in which an optimal fit is sought [2,163]. EL methods are divided into dependent and independent frameworks, depending on the relationship between the inducers [164]. Random forests (RFs) are among the latter and rank as the most common EL method in biochemical engineering [2]. RFs encompass decision trees, a flowchart-like parallel structure where if-else statements on inputs estimate output predictions, as inducers [2,164]. Gradient boosting (GB) follows a dependent framework, where the construction of each inducer depends on the previously trained predecessor. It typically requires over 10³ trees, is memory-demanding and has a high computational cost [164]. Given their different structures, RFs and GB should be used primarily for classification and regression studies, respectively [165,166].
RFs: prediction of protein-ligand docking affinity [167], prediction of flux in a membrane bioreactor [120], prediction of protein structure [168], protein function prediction [169], model for automatic classification of live and dead cells in Chlorella vulgaris [170], classification of compounds with key fuel properties [171], classification of enzymes [172], predictive models for drug combination therapy for tackling microbial infections, amino acid identification for health diagnostics [173], prediction of medium-chain carboxylic acid production from waste biomass [174], development of an environmentally friendly polyester dyeing process upon enzyme- and chitosan-driven surface modifications of the polyester [175].

GB: development of a broad KM predictive model from structural features [176], prediction of the mechanical functionality of protein networks [177].
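To make the independent/dependent distinction concrete, the following sketch (scikit-learn on synthetic data; purely illustrative, not from the cited applications) trains a random forest for classification and gradient boosting for regression, mirroring the typical division of labour noted above.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Classification with a random forest: independent trees grown in parallel,
# with the final label decided by majority vote.
Xc, yc = make_classification(n_samples=300, n_features=10, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xc_tr, yc_tr)
rf_acc = rf.score(Xc_te, yc_te)

# Regression with gradient boosting: a dependent framework in which each new
# tree is fitted to the residual errors of its predecessors.
Xr, yr = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
gb = GradientBoostingRegressor(n_estimators=300, random_state=0).fit(Xr_tr, yr_tr)
gb_r2 = gb.score(Xr_te, yr_te)
```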
Reinforcement learning (RL): RL differs from supervised and unsupervised learning. RL fosters a trial-and-error approach where the algorithm learns continuously through iteration and feedback, based on a reward and penalty strategy for each tested sequence. The obvious goal is for the algorithm to maximize the cumulative reward through a series of adequate decisions [178]. RL is a relative newcomer to biochemical engineering but has been present in the chemical process industry since at least the early 2000s. Its adaptability without the need for large labelled datasets suggests it may be easily disseminated in the near future [2,96], particularly (but not exclusively) for fermentation process control and optimization [96,179–181].

RL: identification of the structure of a kinetic model and prediction of the kinetic parameters of a microbial fermentation [182], tuning of metabolic enzyme levels to improve production in microbial fermentation (e.g., synthesis of L-tryptophan) through a model-free approach and with no knowledge of the microbial metabolic network or its regulation [183], search for pathways for the production of valuable compounds by exploring the bioretrosynthesis space [184], addressing protein-ligand docking [185].
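The reward-and-penalty loop can be sketched with tabular Q-learning on a toy set-point problem; the environment, states and reward scheme below are hypothetical and vastly simpler than any real fermentation control task.

```python
import numpy as np

# Toy set-point problem (hypothetical): states 0-9 are discretized temperature
# set-points and the optimum is state 7. Actions: 0 = decrease the set-point,
# 1 = increase it. A reward of 1 is paid only when the optimum is reached, so
# the agent must learn, by trial and error, to move towards it.
n_states, n_actions, optimum = 10, 2, 7
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(500):
    s = int(rng.integers(n_states))
    for _ in range(25):
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(n_states - 1, max(0, s + (1 if a == 1 else -1)))
        reward = 1.0 if s_next == optimum else 0.0
        # Standard Q-learning update: reward plus discounted best future value.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# The greedy policy should now steer every state towards the optimum.
policy = Q.argmax(axis=1)
```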
In the following subsections, some representative examples of MLMs in biocatalysis and related fields are presented and discussed to complement the information given in Table 2.

Machine Learning Applications in Protein Engineering
Protein engineering focuses on the modification of existing proteins, either through changes in the amino acids of the protein sequence, typically via replacement, insertion or deletion of nucleotides in the encoding gene, or through the design of new proteins. Machine learning has also been used to assist directed evolution for protein engineering, allowing protein functions to be optimized with no requirement for prior knowledge of the underlying physics or biological pathways [49]. Briefly, MLMs speed up directed evolution by gaining information on the properties, namely the sequence-function relationship, of thoroughly characterized proteins that are available in databases. Once an adequately accurate algorithm is selected to model the sequence-function relationship, it is trained, so that its parameters are tuned to maximize predictive capacity, and evaluated to assess its performance. Afterwards, model-based optimization can proceed with the selection of the sequence or sequences that optimize the function. For instance, models of mutational effects can be used that ultimately classify mutations as positive, negative or neutral, or the model may identify a combination of mutations with a high likelihood of improving the function (Figure 3). Thus, ML enables a shift from tedious, repetitive and costly wet-lab cycles of mutation/screening, where the best variant in each cycle is selected and used in a new cycle until the intended goal is achieved, to a dry-lab environment, which is less demanding in terms of costs and resources [49,67,210–212].
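The workflow just described (learn a sequence-function model from characterized variants, then rank candidate mutants by predicted function) can be sketched as follows. The one-hot encoding is a standard choice, but the sequences, the synthetic fitness function and the model are purely illustrative stand-ins for real assay data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Flatten a sequence into a binary position-by-residue feature vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

# Hypothetical training data: characterized variants with a measured fitness
# (e.g., activity or half-life). Real data would come from databases/assays.
rng = np.random.default_rng(0)
length, n_train = 8, 200
train_seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=length))
              for _ in range(n_train)]

def true_fitness(s):
    # Synthetic ground truth used only to generate labels for this demo.
    return sum(AA_INDEX[aa] for aa in s) / (19.0 * len(s))

y = np.array([true_fitness(s) for s in train_seqs])
X = np.array([one_hot(s) for s in train_seqs])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Model-based optimization: score every single-point mutant of a parent
# sequence and keep the one predicted to improve the function the most.
parent = train_seqs[0]
candidates = [parent[:p] + aa + parent[p + 1:]
              for p in range(length) for aa in AMINO_ACIDS if aa != parent[p]]
scores = model.predict(np.array([one_hot(c) for c in candidates]))
best_variant = candidates[int(scores.argmax())]
```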
In a recent example, an unsupervised neural network model, MutCompute, was used to predict mutations in two previously engineered poly(ethylene terephthalate) (PET) hydrolases, aiming to improve their overall catalytic activity at mild temperatures [213]. More specifically, the model can predict positions within a protein where the amino acids can be optimized considering the microenvironment surrounding them. The predictions ultimately suggested three mutations, which, in addition to two previous mutations in the protein scaffold, led to a final mutated enzyme with a hydrolytic activity as high as ~140-fold that of the wild type and the scaffold mutants. Moreover, the resulting mutated enzyme outperformed other PET hydrolases when used to process different commercial polyester products. Finally, the mutated enzyme was successfully used in a full, closed cycle of enzymatic degradation/chemical repolymerization, starting with tinted postconsumer plastic waste and ending in a clear, virgin PET film after a few days. Jia and co-workers relied on a PLS approach to predict the thermostability of 64 mutants of an ω-transaminase, using experimental data from six single-point mutated enzymes as a learning dataset and the half-life (t1/2) of the enzyme as the target result [214]. The authors were able to predict that an enzyme mutated at four specific positions with the suggested residues would display an ~8-fold increase in t1/2 when compared with the wild type and an over 2-fold increase in t1/2 when compared with the most stable single-point mutated enzyme. The observed pattern was related to both the physical-chemical properties of the residues involved and their position in the protein. The work also highlighted the feasibility of using MLMs to achieve good prediction of enzyme behaviour with a reduced set of experimental data, cutting down on operational costs.
The same broad conclusions were reported by Yoshida and co-workers [215] while screening ~8000 enzymes for a promising mutant Burkholderia cepacia lipase with improved thermostability. Using data from ~200 selected mutants, the authors relied on multivariate analysis to decrease the number of possible combinations to 20 candidates. This was based on the relationship between each residue and a set of physical-chemical properties, which was used to establish explanatory variables and train the model with the resulting thermostability activities as objective variables. The data were then split into two groups, designated improved and non-improved. From the 20 candidates, which were experimentally prepared, a triple mutant emerged with significantly higher initial and residual activity upon incubation at 60 °C when compared with the wild type as well as the other mutant candidates. MLMs were also recently used to engineer the halogenase WelO5* to alter the selectivity and activity of the enzyme and produce mutants able to functionalize soraphens A and C [216]. These macrolides, which have an anti-fungal role, are not substrates of the wild type. Using Gaussian processes, the authors picked a double-mutated WelO5* active over soraphen A and narrowed down the number of variants to be screened for improved activity and selectivity, ultimately predicting and experimentally obtaining a mutant that displayed increases of two and three orders of magnitude in the apparent catalytic constant and total turnover number, respectively, when compared with the initial hit. Again, the authors highlighted the role of MLMs as a swift and cost-saving approach to produce effective biocatalysts.
Further representative examples of the use of MLMs for enzyme engineering can be found in recently published reviews [195,217].
Figure 3. Protein engineering assisted by machine learning. Information on the sequence-function relationship of proteins is retrieved from databases and curated, and an algorithm is used to identify the sequences most likely to display the desired property; these are then used to synthesize the optimized recombinant protein.

Process Optimization (Enzyme Synthesis)
The optimization of microbial enzyme production has traditionally been performed with the one-factor-at-a-time (OFAT) approach, which is time-consuming and fails to identify interactions among variables. Hence, it has been gradually replaced with statistically designed experiments, e.g., response surface methodology (RSM), where several factors are varied simultaneously using empirical multivariate models [218,219]. More recently, MLMs have been gaining relevance, namely to address situations where detailed insight into the process is missing or the formulation of the reaction mechanism is not feasible [219,220]. Thus, MLMs have been used to determine the correlation between operational (input) conditions (e.g., pH, temperature, substrate concentration and nature, flow rates) and (output) microbial metabolism (e.g., enzyme synthesis) and then predict the output for given inputs. This is typically carried out through supervised learning algorithms, whereas the identification of hidden underlying patterns in data, outlier detection and dimensionality reduction are left to unsupervised learning algorithms, which have been used for various applications, e.g., process control. Overall, the algorithms used enable the prediction of relevant output variables, e.g., enzyme yield, or of the most adequate parameter for scaling up or down [2,31,95,221,222] (Figure 4).
Figure 4. Operational conditions (e.g., temperature, pH, medium composition) are provided, or scale-up criteria are selected, and a target value (e.g., productivity) is defined. The selected algorithm processes the information to identify the conditions that optimize the target value or enable its reproduction throughout different process scales.
In a recent example, an ANN was selected among other MLMs to improve the fermentative production of a cellulolytic complex through optimization of nitrogen content (titre of yeast extract), inoculum size and duration of fermentation [223]. The ANN was singled out based on the improved accuracy of its predicted outcome, as illustrated by the lowest mean square error among the MLMs tested. Moreover, under ANN optimization, cellulase productivity improved ~2.8-fold, which bested the ~2.4-fold increase achieved when RSM was used, in either case relative to the cellulase productivity reported after initial screening studies. Additionally, the ANN model exhibited a coefficient of determination much closer to 1 than the RSM model, again highlighting the superior predictability of the former.
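To illustrate how such ANN/RSM comparisons are typically set up, the sketch below fits a second-order polynomial model (an RSM analogue) and a small neural network to the same synthetic "fermentation" data and compares their held-out mean square errors. The data, inputs and architectures are invented for the example and do not reproduce the cited study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical fermentation data: three scaled inputs (e.g., nitrogen titre,
# inoculum size, time); the output is a productivity with a non-linear optimum.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.exp(-((X - 0.6) ** 2).sum(axis=1) * 8) + rng.normal(0, 0.02, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RSM analogue: a second-order polynomial response surface.
rsm = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)
# ANN: a small multilayer perceptron on standardized inputs.
ann = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                                 random_state=0)).fit(X_tr, y_tr)

# Held-out mean square error: the metric used to single out models as above.
mse_rsm = mean_squared_error(y_te, rsm.predict(X_te))
mse_ann = mean_squared_error(y_te, ann.predict(X_te))
```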
Sarmah and co-workers singled out a GB method among other MLMs to model the growth kinetics of Candida antarctica (currently Moesziomyces antarcticus) for lipase production [224]. The selection was based on the high predictability of this method compared to the remaining ones, as illustrated by its low root mean square error, which was selected as the objective function. The selected GB method was shown to require less than half the number of samples of a fully conventional experimental approach to deliver a predictive Monod growth model, which translates into noticeable savings. Additionally, the kinetic model parameters proved highly accurate, as the coefficient of determination was close to 1. The authors identified the peak of enzyme activity production, although no further comparison was performed.
Despite the acknowledged limitations of the OFAT approach, it can be advantageously used if combined with a suitable MLM, as exemplified by Das and Negi [225]. These authors generated data through the OFAT to identify the operational parameters that improved alkane hydroxylase productivity in a submerged fermentation, including carbon source (hexadecane) concentration, metal ion concentration, incubation time, inoculum size, pH and incubation temperature. The dataset was then divided and used to train, validate and test an ANN model to further optimize operational conditions, ultimately achieving a ~1.8-fold increase in enzyme productivity as compared with the best result obtained with the OFAT. A similar strategy combining the OFAT and an ANN was employed by Kumar and co-workers to model and predict xylanase production by Penicillium citrinum xym2 [226]. Thus, a dataset was generated using the OFAT method that comprised the optimization of incubation pH, temperature and time, nitrogen source and titre and additional carbon source titre, using xylanase activity as the output variable. The dataset was again used to train, validate and test two ANN models for different input conditions. Models with a single hidden layer and with two hidden layers were developed that enabled the accurate prediction of xylanase activity, as reflected by low mean square errors of ~0.0046 and ~0.0022, respectively, and provided a good correlation between the observed and predicted data, as exemplified by correlation coefficients of ~0.938 and ~0.944, respectively. Not surprisingly, the double-layered ANN allowed for more accurate prediction of the actual values of enzyme activity. Although submerged fermentation is the most widely used approach for microbial enzyme production, solid-state fermentation has been gaining relevance, particularly when filamentous fungi are used as the producing strain.
Accordingly, Silva and co-workers evaluated ANN and SVM models to establish the effect of several operational parameters, e.g., incubation time, inducer concentration, inoculum concentration, moisture titre and pH of the nutrient solution, on alginate lyase production, the output variable, with Cunninghamella echinulata as the enzyme producer [227]. Both models ruled out inoculum concentration as a relevant variable within the range of parameters tested. As a predictive model, the ANN slightly outperformed the SVM, as the former led to a marginally higher coefficient of determination. This was somewhat unexpected, since SVMs are considered more suitable when relatively small datasets are available, as was the case in this study.
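A comparison of this kind can be mocked up as follows, scoring both models by the coefficient of determination on held-out data. The small synthetic dataset, its inputs and the hyperparameters are assumptions for illustration, not those of the cited work.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Hypothetical small solid-state fermentation dataset: inputs stand in for
# incubation time, inducer concentration, moisture titre and pH; the output
# stands in for enzyme activity.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 4))
y = 4 * X[:, 0] * (1 - X[:, 0]) + X[:, 3] + rng.normal(0, 0.05, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# ANN: a small MLP (lbfgs is a common choice for small datasets).
ann = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                                 max_iter=5000, random_state=0)).fit(X_tr, y_tr)
# SVM regression with an RBF kernel.
svm = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)).fit(X_tr, y_tr)

r2_ann = ann.score(X_te, y_te)   # coefficient of determination for each model
r2_svm = svm.score(X_te, y_te)
```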
A different approach to the enhancement of enzyme production using an MLM is illustrated in the work of Beier and co-workers, who used an MLM to associate a network of six selected genes with the induction of cellulase produced by Trichoderma reesei and thus establish the basis for overexpression of cellulase [228].

Biocatalyst Formulation
Enzyme immobilization, i.e., the confinement of the enzyme in a restricted region of space, is known to potentiate the use of the enzyme, since it enhances stability and resistance to environmental factors, improves control of the reaction and enables reusability [229]. However, despite being consistently used for several decades, enzyme immobilization is still largely empirical, relying on tedious and time-consuming trial-and-error methodologies [230]. The use of rational design procedures is seen as a way to decrease the costs of immobilization protocols, but their implementation requires knowledge of several physical, chemical and biological variables related to the method of immobilization, the carrier and the enzyme itself, and of how to correlate these with output variables illustrative of immobilization efficacy, e.g., expressed activity, also known as activity recovery [230,231]. Thus, MLMs are an attractive way to deal with the vast combination of variables. MLMs can be used to retrieve information from databases that harbour information on enzyme structure and function and the physical-chemical features of carriers and to infer key properties of the immobilized biocatalyst, e.g., activity and stability [232]. Additionally, MLMs can be used to process the huge amounts of data generated when a multidisciplinary approach combining physical-chemical and biological properties is used to determine the optimal immobilization procedure. Accordingly, Ralbovsky and Smith used multivariate analysis to process a complex dataset on pantothenate kinase immobilization onto two commercial acrylamide- and methacrylate-based carriers that was generated through Raman hyperspectral imaging [205]. The latter can provide information on the chemicals involved in enzyme immobilization and their spatial distribution.
Processing the image dataset with multivariate analysis and PCA allowed the researchers to identify and spatially resolve six distinct elements critical for the successful immobilization of the enzyme and to quantify the coverage of the surfaces by the enzyme. Ultimately, the authors proposed that the output variable, quantitation of enzyme coverage, can be used as a metric to evaluate enzyme immobilization. For the case study, the authors established that the acrylamide carrier allowed 1.3-fold higher enzyme coverage than the methacrylate carrier, hence rendering the former more effective for pantothenate kinase immobilization. The authors assumed that the often-observed correlation between enzyme loading and enzyme activity held, but no experimental validation was performed. Notwithstanding, the strategy presented provides a direct quantitative metric with which to evaluate enzyme immobilization efficacy. In a follow-up to this work, Ralbovsky and Smith established that the combined use of Raman hyperspectral imaging and PLS could be used to identify and classify samples of immobilized pantothenate kinase formulations. As an example, the setup developed allowed the identification of the carrier to which the enzyme was bound. The authors suggested that the work could be further expanded to identify and gain insight into how other carriers interact with the enzyme [233].
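The role of PCA in resolving such spectral datasets can be illustrated on synthetic "spectra" built from a few pure-component signatures, a crude stand-in for Raman hyperspectral data; all dimensions and names below are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical hyperspectral stand-in: each "pixel" is a spectrum mixing a few
# pure-component signatures (e.g., enzyme, carrier, buffer) plus noise.
rng = np.random.default_rng(0)
n_pixels, n_wavenumbers, n_components = 500, 120, 3
signatures = rng.normal(size=(n_components, n_wavenumbers))   # pure spectra
abundances = rng.uniform(size=(n_pixels, n_components))       # per-pixel mixture
spectra = abundances @ signatures + rng.normal(0, 0.05, (n_pixels, n_wavenumbers))

# PCA compresses each spectrum to a few scores; pixels dominated by the same
# chemical component cluster together in score space, enabling the kind of
# spatial resolution of components described above.
pca = PCA(n_components=3).fit(spectra)
scores = pca.transform(spectra)
explained = pca.explained_variance_ratio_.sum()
```

Since the synthetic data are (noisy) mixtures of only three components, the first three principal components capture nearly all of the variance.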
Chai and co-workers also opted for an MLM to predict the immobilization of several enzymes onto metal-organic frameworks (MOFs) [58]. The authors used as input variables properties of metals (e.g., metal ion concentration), ligands (e.g., ligand precursor functional group) and enzymes (e.g., dominant amino acid group), generating up to 12 input variables; as output variables, activity retention, enzyme loading, immobilization yield and reusability were used. However, the method of enzyme immobilization was not included in the model inputs. The dataset was obtained through a literature survey. RF and GP algorithms were used to handle the vast combination of available information, with the former emerging as the most adequate to predict enzyme loading, immobilization yield and reusability. However, neither model was able to accurately predict activity retention (coefficients of determination ~0.6). This is understandably the most difficult property to predict, and the poor fit was attributed by the authors to the lack of information regarding input parameters such as the orientation and degree of exposure of the active site of the enzyme upon immobilization. Again, this outcome highlights the need for complex (and often costly and thus not widely available) techniques and equipment for the detailed characterization of the enzyme (and possibly the remaining components of the immobilized biocatalyst) to implement reliable predictive models for rational enzyme immobilization.

Enzyme Screening
Screening for enzyme activity using databases and/or data from laboratory experiments involves properly handling a vast array of data, and accordingly ML techniques have been developed/adapted for such a role [190]. There are still some limitations, since not all databases are machine-learning-friendly, and some are misannotated, populated with disproved results or no longer maintained. Hence, choosing the right datasets for machine learning is critical to avoid feeding inaccurate data to the model. The more a database complies with the FAIR (findable, accessible, interoperable and reusable) guiding principles for scientific data management and stewardship, the better the data usage and the lower the effort required for data cleaning [2,78,212,234]. Some recent examples of the application of MLMs for enzyme screening are presented below. Poly(ethylene terephthalate) (PET) is among the most frequently discarded plastics and thus presents a major environmental concern; hence, strategies for its biodegradation have been sought. These include PET hydrolases that are able to depolymerize PET to its monomers close to its glass transition temperature, ~65–70 °C, in an aqueous environment [235]. PET hydrolase activity is considered scarce in nature; thus, Erickson and co-workers focused on expanding the range of putative thermotolerant PET hydrolases [236]. Upon examining several databases, e.g., the NCBI and BacDive databases, and selecting features, e.g., physical-chemical properties and residues, the authors tested several MLMs to identify thermophilicity. Ultimately, an SVM method [237] was chosen, which enabled the identification of 74 putative thermotolerant PET hydrolases that were then experimentally assessed for activity. Roughly half of these proved active on amorphous PET, which highlights the success of the approach.
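A classify-then-shortlist screen of this kind can be sketched as follows; the features, the synthetic labelling rule and the thresholds are illustrative assumptions, not the pipeline of the cited study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical per-sequence features (e.g., scaled composition of charged
# residues, hydrophobicity, length); label 1 = thermotolerant.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 4))
# Synthetic rule: thermotolerance correlates with the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, n) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train an RBF-kernel SVM classifier and check its held-out accuracy.
clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)

# Rank unlabelled candidates by predicted probability of thermotolerance and
# keep only the most promising ones for wet-lab assessment.
candidates = rng.normal(size=(50, 4))
ranked = np.argsort(clf.predict_proba(candidates)[:, 1])[::-1]
shortlist = ranked[:10]
```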
The concept of absolute enzyme specificity is a textbook example, yet the search for promiscuous enzymes is currently a hot research topic [237]. Enzyme promiscuity refers to the ability of an enzyme to catalyse reactions other than its native one [238]. Usually, an enzyme's secondary activity is significantly lower than its primary activity under natural conditions [239], but this can be tuned using suitable engineering approaches [240,241]. Broad substrate acceptance by a given enzyme is most appealing from a biotechnological standpoint, as it widens the range of applications without the added costs of producing multiple enzymes [237,242]. Esterases have wide applications in industry, e.g., in detergents, food and pharmaceuticals [243]. In a recent study, Xiang and co-workers presented a bioprospecting strategy for the identification of promiscuous esterases from sequences in databases through the combination of different MLMs [53]. The method ultimately led to the identification of ten sequences that were experimentally tested and validated as promiscuous esterases, thus establishing the validity of the proposed strategy.
Prediction of substrate scope for bacterial nitrilases using a structure-based approach was developed by Mou and co-workers [244]. The different MLMs evaluated by the authors to predict the substrate scope performed similarly, although the RF method provided marginally higher precision and sensitivity. The authors thus suggest that their RF approach could be used to predict substrate scope for other enzyme classes.
Many methods used for determining enzyme activity from databases rely on the use of amino acid sequences (either annotated or unannotated, the latter being favoured, as they foster the identification of novel enzyme functions) to retrieve Enzyme Commission (EC) numbers and enzyme families [79,245–250]. However, methods based solely on amino acid sequences have been deemed unable to predict more elaborate enzyme reactions and to establish the function of non-characterized enzymes with poor homology to annotated sequences [249]. An alternative approach to predicting enzyme activity relies on knowledge of the chemical structure of substrates and products [251–255]. However, in some of these methods, information about the enzyme sequence is needed to elucidate the chemical structure of the compounds involved. In a recent work, Watanabe and co-workers presented an approach that combines the two previously described strategies [249]. Briefly, upon the collection of information on the enzyme sequence and substrate/product chemical structure, the authors tested several MLMs to predict both EC numbers and full enzyme reactions. Ultimately, an RF-based model emerged as the most adequate, as the area under the receiver operating characteristic curve (AUC), one of the metrics chosen to compare the models, reached a score of 0.94 (out of a possible maximum of 1).
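For readers unfamiliar with the AUC metric mentioned above, the sketch below computes it for a random forest on a synthetic binary stand-in for such a prediction task (the data are invented; the cited study's task and features are far richer). An AUC of 1.0 corresponds to a perfect ranking of predicted probabilities, while 0.5 corresponds to chance level.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic binary classification task standing in for, e.g., predicting
# whether a sequence/substrate pair belongs to a given EC class.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# AUC scores how well the predicted probabilities rank positives above
# negatives on the held-out set.
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```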

Conclusions
The impact of bioprocess engineering on our daily life has steadily increased, a trend that has been reinforced in recent years and is expected to continue due to public awareness of the need for sustainable and environmentally friendly production processes. Still, for bio-based production methods to thrive, they must be technically sound, economically competitive and sustainable, which requires the optimization of current processes or the introduction of novel processes and goods. This places a huge burden on the bio-based industry, as in established methods of bioprocess development, the translation of lab research to industrial reality is slow and computational power is used only to a limited degree. To effectively handle and process the huge amounts of data gathered by the bioprocessing community and to identify and establish relationships, the use of MLMs has gained relevance. MLMs rely on algorithms that collect, pre-process and analyse data to predict a given output, identify hidden patterns and foster probabilistic techniques that allow patterns to be established from a reference point. Algorithms with differentiated structures and complexities, succinctly described in this review, have been developed and effectively implemented in bioprocess engineering fields, e.g., monitoring/control and analytical roles, downstream processing and prediction of metabolic pathways and of protein structure-function relationships, among others. In the specific field of biocatalysis, notable developments have been recently reported in enzyme synthesis and engineering, screening for enzyme activity and biocatalyst formulation, fostering an increasingly rational approach towards the intended goals, namely improved activity, stability and substrate diversity. These have been achieved through the use of adequate algorithms and interaction with duly constructed databases.
Still, for MLMs to be further disseminated, some challenges must be resolved, namely the need for more open-source methodologies and databases as well as for the proper storage and management of data. Additionally, MLMs require significant effort to generate precise labels for a given set of data; hence, other learning methods are being explored. The interpretability of MLMs is often questionable; hence, they should be made more explainable to support decision making during operations. This issue is typically exacerbated as the structure of the MLM becomes more complex, hence the need for user-friendly MLMs that can identify the proper model structure for a given bioprocess.
Author Contributions: Conceptualization, P.S.S. and P.F.; writing-original draft preparation, P.S.S.; writing-review and editing, P.S.S. and P.F. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.