Key words

1 Introduction

Natural enzymes are attractive templates for sustainable chemistry [1, 2] that need to be redesigned to meet industrial requirements. The most popular approach is directed evolution, where mutations are introduced randomly in the parental enzyme , mimicking natural evolution [3]. Although directed evolution is a revolutionary technique (awarded with the Nobel Prize for Chemistry in 2018), randomness and limited sequence sampling restrict the number of potential beneficial mutations. Computational techniques can drive directed evolution, recycling its results with machine learning [4] and/or selecting residue positions with high potential of improvement (hot spots) [5]. More radically, computational design can also replace directed evolution altogether in what is called rational design: in this case the sequence space screening is done in a computer, leaving only a few candidates for experimental validation [6]. The scope of this book chapter is to illustrate the principles of computational enzyme engineering adopted at Zymvol, our company, both when accompanying (hot spot prediction) or replacing (rational design) directed evolution. The main conceptual steps are described and reconducted to the existing literature.

Both our enzyme search (ES) and in silico design (ISD) pipelines combine multiple solutions from bioinformatics, protein design algorithms, and molecular modeling. The objective of ES is to find suitable biocatalysts for an input reaction on a target substrate, either as is or as a favorable template for ISD. The objective of ISD is to redesign (simulate the effect of mutations) the input enzyme to efficiently implement the target reaction over the target substrate, i.e., to achieve/improve a target chemical transformation. Properties other than activity/specificity/selectivity can be improved, like enzyme stability (to temperature, pH, solvent, etc.) and solubility.

The cornerstone of our technology is the combination of bioinformatics with protein design algorithms and empirical/physics-based simulations that model the interactions between the substrate and the enzyme . Such simulations follow a funnel like scheme with increasing complexity to gradually filter enzyme candidates (Fig. 1). In implicit solvent molecular modeling, both enzymes and ligands are fully flexible, i.e., both the protein side chains and backbone can move. On the other hand, the solvent (water) is modeled as a continuous electrostatic field, ignoring its detailed action as an ensemble of molecules [7]. This approximation is corrected in explicit solvent molecular modeling, where water is treated as an ensemble of molecules which can perturb the system dynamics or participate in the target reaction [7]. Quantum chemical algorithms can model any electronic rearrangement that occurs along the target reaction, improving accuracy [8]. Throughout the pipeline simulations take into consideration not only mutations in the active site but also the potential effect of mutations located far away from the active site. Long range mutations can affect the encounters between the substrate and the catalytic residues in the active site or the migration of a substrate/product in/from the active site [9] (Fig. 2).

Fig. 1
figure 1

Computational workflow for enzyme engineering

Fig. 2
figure 2

Levels of theory adopted at Zymvol

In ES, only bioinformatics and implicit solvent molecular modeling are adopted. In ISD, all the steps are included. However, depending on the nature and complexity of the task in hand, some of them may be excluded. For instance, quantum mechanics is not included when other parameters need to be optimized first, like binding the substrate in the right conformation to convey the desired reaction.

The program input typically comprises: a target transformation for the desired substrate and an initial enzyme sequence to be optimized (in case of ES the latter is not required). The ES output consists of 8–16 enzymes. Typically an ISD round delivers 90 enzyme variants ranked by priority, but customized solutions are available (like residue positions for saturation mutagenesis). Both ES and ISD (one round) typically take up to 4 weeks to deliver results.

In ISD, more prediction rounds (followed by experimental validation) are possible. Each iteration allows us to include experimental data into our platform, aiming at improving the accuracy of our predictions for the task in hand. Three ISD rounds are usually recommended.

For a list of definitions of concepts addressed in this chapter, see Table 1.

Table 1 Definitions of concepts introduced in the manuscript

2 Bioinformatics and Molecular Modeling

In this methodological framework, two steps are included: diagnosis and mutagenesis.

In diagnosis, the design principles and mutagenesis library are the main output. The included steps of diagnosis are the following.

  • Sequence analysis. The amino acid sequence of the parental enzyme is blasted against UniRef90, the resulting sequences are aligned [10] and refined [10, 11]. The latter also provides an estimation of residue entropies, occupancies, and mutual information (MI) analysis. In such a way, only residues with enough entropy are retained for the final library and consensus mutations are suggested to stabilize the protein [12]. Also, potential synergistic effects are captured by MI [13].

  • Protein dynamics . The structure of the parental enzyme is prepared and its dynamical behavior is estimated with an internal protocol that merges information from normal modes [14], concoord [15], and loop sampling algorithms. A thorough analysis of the resulting structural ensemble can pinpoint potential hot spots far away from the active site. This analysis includes but is not limited to dynamical cross correlation matrix [16] and network techniques [17].

  • Substrate docking. The next step is to verify whether the substrate is capable of binding productively to the active site, i.e., respecting all the necessary catalytic contacts. If that were not the case, the first step is to introduce single mutations to alanine (Ala) to facilitate the formation. All noncatalytic residues within a certain threshold from the substrate are included in the computational mutagenesis library, provided that they are not highly conserved.

  • Substrate/product migration. Both the substrate and the product bound in the active site (in separate models) are perturbed with an in-house Monte Carlo algorithm (built within Rosetta [18]) where also the side chain and backbone of the surrounding residues are allowed to move. Simulating the exit of these molecules can pinpoint residues acting as kinetic bottlenecks, which usually display many contacts with the substrate. If they are mutable (i.e., their entropy is large enough), they are introduced in the computational mutagenesis library.

All the selected residues included in the library are allowed to mutate to populated amino acids, i.e., those that appear in the multiple sequence alignment (MSA) at those positions with enough frequency (threshold decided by the user) to preserve protein expression and robustness. The principle is somewhat similar to that behind PROSS [19] and FuncLib [20].

In the second step, mutagenesis, millions of enzyme variants are modeled and screened with an in-house protocol that includes proprietary and third-party software like Rosetta [21] and FoldX [22]. Then, the system is refined and relevant structural and energetic properties are extracted to evaluate every variant. A crude approximation is the simple substrate binding energy vs. catalytic distances plot.

In the last step, hundreds-thousands of variants are selected for the next phase. It should be noted that this is true only if the crystal structure or a high quality model of the parental enzyme is available. As a matter of fact, low/medium quality homology models have a tendency to unfold during molecular dynamics (MD) simulations [23], and the sampled ensemble might not be reliable. With bad homology models, the final list of mutants to be delivered to the lab needs to come from this single stage.

3 Molecular Dynamics (Explicit Solvent)

The selected variants are filtered based on MD simulations. Many features of designed enzymes can be inspected, including hydrogen bond strength, structural integrity and preorganization, solvent exposure. The main feature used to filter variants is the population of near attack conformations (NAC), i.e., structures that resemble the transition state (TS). This technique proved to be successful in tuning selectivity with high-throughput MD (HTMD) [24]. In HTMD many short simulations are run, which was shown to be more effective than traditional MD for NAC sampling [25].

As anticipated, MD is often not a viable solution in industrial applications because of the lack of a suitable crystal structure. Currently, we are evaluating alternative ways to account for NACs, based on Monte Carlo simulations.

4 Quantum Mechanics

Finally, chemical reactions are determined by electron rearrangements that lead to bond forming/breaking. These events cannot be accounted for by bioinformatics, molecular modeling and dynamics. Instead, quantum chemical calculations are necessary. Unfortunately, traditional techniques are as slow as informative so they are not suited to screen hundreds of variants [8]. Therefore, smart solutions are needed to decrease the resource and time burden. This passes through calculating properties, instead of TS energies, that: (1) correlate with activity; (2) converge much faster than energy during geometry optimization. One example from literature is our work with laccases [26, 27]. We developed a scoring method based on the total spin density localized on the substrate after five geometry optimization steps (the electron density roughly converged in that time) and a distance-dependent dielectric potential. Such a shortcut turned out to work well, as can be seen in Fig. 3. Testing four substrates against two laccases, a linear correlation between activity and spin density was found. In 2015, each calculation took ~4 h with six threads. With today’s cloud computing facilities and better hardware, ~500 variants can be tested in 1 day for less than €500. In the past years, this solution was adopted in the successful design of a few oxidoreductases [28,29,30,31].

Fig. 3
figure 3

Electron density-based quantum chemical scoring of activity

5 Areas of Development

At Zymvol we are currently working/evaluating the following areas of development to improve our technology.

  • Machine learning . As nicely shown by Siegel and co-workers for glycosyl hydrolases in a recent machine learning (ML) study, no feature strongly correlates with the experimental data [32]. This suggests that enzymatic function is shaped by many different molecular properties, of which binding energies are not even included. This poses a serious question on the actual knowledge of how enzymes work and whether the key parameters to label activity are actually family and reaction dependent. These doubts were confirmed by ML analysis that we conducted on two data sets: a full mutagenesis study for a small protein (~60 residues) [33] and the activity data of 96 ligands against 16 esterases [34]. Although both systems can be predicted with a good degree of accuracy with their respective ML models, the selected features are distinct. Therefore, we aim at gathering thousands of data points for key enzyme families to conduct further research and develop family-dependent models. On a final note, machine learning techniques are also in the way to speed up quantum chemical estimations to unprecedented levels [35].

  • Long range mutations. It is a matter of fact that mutations that are far away from the active site can lead to remarkable increases in activity, improving catalytic preorganization in evolved enzymes [36]. Despite this success of MD in characterizing long range mutation effects [37,38,39,40,41], no reliable tool seems to be available for their quick detection. In our company we are developing tools that do not involve long and resource intensive MD simulations, shrinking the prediction time for hot spots from weeks to minutes. Moreover, we aim at not only singling out hot spots but actually predict the correct amino acid substitutions. This technology is also expected to have an impact on predicting the effect of distinct immobilization techniques on enzyme activity, which is of great importance in industry.

  • Ancestral reconstruction. Ancestral proteins often show exceptional properties (including promiscuity and thermal stability) due to the variegate environmental conditions that shaped our planet to what it is today. Statistical models are available to reconstruct the phylogeny of a given enzyme and trace it back to its ancestors [42].

  • Quantum computing. Finally, with a look toward a perhaps more distant future, one of the most promising suggested applications of quantum computing is solving classically intractable biochemistry problems [43]. This includes quantum mechanics calculations but also sampling problems. However, building a sufficiently large quantum computer remains a prohibitive challenge nowadays [43].