Between Protein Fold and Nucleophile Identity: Multiscale Modeling of the TEV Protease Enzyme–Substrate Complex

The cysteine protease from the tobacco etch virus (TEVp) is a well-known and widely utilized enzyme. TEVp’s chymotrypsin-like fold is generally associated with serine catalytic triads that differ in terms of a reaction mechanism from the most well-studied papain-like cysteine proteases. The question of what dominates the TEVp mechanism, nucleophile identity, or structural composition has never been previously addressed. Here, we use enhanced sampling multiscale modeling to uncover that TEVp combines the features of two worlds in such a way that potentially hampers its activity. We show that TEVp cysteine is strictly in the anionic form in a free enzyme similar to papain. Peptide binding shifts the equilibrium toward the nucleophile′s protonated form, characteristic of chymotrypsin-like proteases, although the cysteinyl anion form is still present and interconversion is rapid. This way cysteine protonation generates enzyme states that are a diversion from the most effective course of action, with only 13.2% of Michaelis complex sub-states able to initiate the reaction. As a result, we propose an updated view on the reaction mechanism catalyzed by TEVp. We also demonstrate that AlphaFold is able to construct protease–substrate complexes with high accuracy. We propose that our findings open a way for its industrious use in enzymological tasks. Unique features of TEVp discovered in this work open a discussion on the evolutionary history and trade-offs of optimizing serine triad-associated folds to cysteine as a nucleophile.


■ INTRODUCTION
Proteases are enzymes capable of catalyzing the hydrolysis of peptide bonds. 1 They do it differently by harboring a variety of catalytic site architectures in a variety of folds. The catalytic capabilities of proteases make them central to natural metabolism as well as biotechnology.
A cysteine protease from the tobacco etch virus (TEV protease, or simply TEVp) is one of the most frequently used proteases in the latter context. 2 It is instrumental due to a number of appealing features. TEVp has a simple monomeric structure of just 27 kDa without any cofactors. Moreover, TEVp is highly sequence-specific with a natural predisposition to a seven amino acid sequence ENLYFQ|G/S. It should be noted, however, that such notation does not mean that G/S is a Cterminus, since TEVp is essentially an endopeptidase that cleaves itself out of the viral polyprotein. This opens the possibility of using TEVp as a controllable release-by-cleavage agent in biotechnological applications. 3 It has been widely explored, leading to the emergence of complex tools with different modes of activation. 4−7 With the biotechnological use of an enzyme comes a need to improve or adjust its properties. Variants with improved solubility and yield, 8,9 reduced self-cleavage, 10 and improved oxidative stability 11 as well as immobilization strategies were reported. 12 However, the biggest interest lies in the improve-ment of TEVp catalytic activity, since the wild type is slow compared to other widely used proteases that commonly implement serine as a nucleophile. To date, several directed evolution studies have yielded a handful of variants with up to an order of magnitude higher activity. 13,14 No rational computational design has yet been applied to any TEVp property. Such approaches rely on the detailed knowledge of the exact events along the reaction pathway, that is, the mechanism. 15 Rational computational design strategies were previously successfully used to yield enzymes with higher catalytic activity, including PETases, 16 organophosphatases, 17 cephalosporin acylases, 18 and KEMP eliminases. 19 Substrate specificity can also be engineered computationally. 20,21 When applied to TEVp, it may yield a family of high-specific tag cleavage agents that pose interest to biotechnology. However, the main concurrent task is not to damage catalytic activity, since TEVp is already slow. Thus, this objective yet again requires detailed knowledge of the catalytic mechanism. Applying this knowledge to previously discovered variants will allow us to extract concrete origins of improved activity and build upon them. 22 This knowledge also has a fundamental side. By learning why beneficial substitutions did not occur naturally, we can gain a better understanding of evolutionary factors specific to proteolytic enzymes in general.
Hypotheses about the TEVp mechanism were only ever postulated, not demonstrated in simulations. Major differences in assumptions concern the acylation stage and preceding events. Views diverge on what protonation state is characteristic of catalytic Cys and His at the Michaelis (enzyme−substrate) complex of TEVp and related proteases. 23 It is not a cosmetic difference, since it results in a different mechanism overall ( Figure 1). It is also intertwined with the nature of the products; however, this question is discussed much less often.
TEVp belongs to the PA clan of proteases and shares a structural similarity with chymotrypsin-like proteases that harbor the canonic Ser−His−Asp triad. This similarity reaches the very orientation of catalytic triad residues in space, despite TEVp having Cys instead of Ser as a nucleophile. 24 As a result, TEVp is sometimes assumed to follow the same catalytic mechanism as Ser-utilizing PA clan proteases. These enzymes catalyze acylation, which is a rate-limiting stage, through a tetrahedral intermediate stabilized by interactions with oxyanion hole H-bond donors ( Figure 1A). Ser is protonated at the Michaelis complex and deprotonation occurs alongside the formation of the tetrahedral intermediate.
The other proposed mechanism for TEVp, papain-like, follows the fact that the nucleophile is a cysteine (Figure 1B), not a serine. Papain-like proteases are the most well-studied cysteine proteases and are commonly believed to have a Cys in the deprotonated anionic state at the Michaelis complex. 23,25,26 Activation barriers for these enzymes were found to be roughly equal for the acylation and deacylation stages. The acylation part of the papain-catalyzed reaction was most commonly shown to be absent of the tetrahedral intermediate, with proton transfer to peptide nitrogen occurring alongside S−C(pept) bond for- mation and C(pept)-N(pept) bond breakage ( Figure 1C). 26−28 Oxyanion hole in papain was shown not to interact with the substrate at the Michaelis complex but only in the transition state and afterward. However, TEVp is very different in terms of both the tertiary structure and catalytic site architecture from papain-like proteases ( Figure S1).
The question that arises is as follows: what dominates the TEVp mechanism, nucleophile identity, or structural composition? Such a question may be answered with the help of molecular simulations. 29 In this work, we perform a first-timeever molecular simulation of the TEV protease. We focus our attention on the structure of the Michaelis complex with an optimal peptide substrate as a necessary starting point for further advancement. We conclude that there exists an equilibrium of Cys protonation states at the Michaelis complex of TEVp, while without a bound substrate, Cys is strictly deprotonated. Our attention was also drawn to the unusual conformation of a Ser residue that commonly interacts with catalytic Asp in PA clan proteases. In any other PA clan protease X-ray structure, this Ser faces outward with the associated N−CA−CB−OG torsion close to 180°. In TEVp X-ray structures, however, it faces inward, and the torsion value is close to −80°. We found that TEVp Ser may adopt a 180°-like conformation in dynamics and investigated the influence of such transition on the equilibrium of substrates within the Michaelis complex. With regard to this Ser, the TEVp protease turned out to differ even from the most closely related tobacco vein mottling virus (TVMV) protease. We discuss the structural origins of such a difference; however, to discover its functional meaning, modeling of the whole reaction process for both enzymes shall be performed in the future.

■ MATERIALS AND METHODS
Model Building. The TEVp complex with the EN-LYFQSGTV peptide was constructed with AlphaFold2-multimer-v2 (AFm) 30 implemented in ColabFold 31 v. 1.2.0. Multiple sequence alignment was built in ColabFold with MMSeqs2. 32 Three recycles for all five models were used, and the best prediction according to the multimer score was picked for further work. Built-in energy minimization in Amber was not performed. For an expert assessment of model reliability, the 1LVM X-ray model of TEVp was taken. 33 It also served as a donor of positions of coordinated water molecules.
PDB entries closest to TEVp in terms of fold were identified using the protein structure comparison service PDBeFold at European Bioinformatics Institute. 34,35 The TVMVp complex with the NNVRFQSLDTIV peptide and the chymotrypsin C complex with the NIEVLEGNEQ peptide were built similarly, with reference X-ray models 3MMG and 4H4F, respectively. 36,37 Cleavable peptide sequences were identified through MEROPS. 38 AFm models used in the study are available in the Supporting Information. For kallikrein 4 (KLK4), an X-ray model of a complex with a modified sunflower trypsin inhibitor was taken (4KEL 39 ) with Ala reverted to catalytic Ser using the mutagenesis tool in PyMol. 40 For TEVp, TVMVp, and chymotrypsin C without a peptide, mentioned X-ray models were used as starting points. For KLK4, the 7JQO model was used. 41 TEVp-T180V and TVMVp-V180T were generated using the mutagenesis tool in PyMol.
Four starting models for TEVp, both bound and free, were constructed: with either protonated/deprotonated Cys or Xray/AFm Ser168 conformation to be subjected to molecular simulations and analysis together to prevent the initial state bias.
For all other proteins, two starting models were constructed with different conformations of the Ser residue corresponding to Ser168 in TEVp (see the Results section).
For all models, PROPKA was used to generate hypotheses about residue protonation states in corresponding optimal pH ranges. 42 Histidine tautomer assignment and correction of amide flips were verified manually on top of Molprobity recommendations. 43 The SARS-Cov-2 main protease (MPro) complex with the ATVRLQAGNA peptide was built similarly to other systems. For reference, PDB model 7TA4 was taken. 44 For runs with trimmed peptides, initial peptide sequences after P2′ were deleted, resulting in FQSGTV for TEVp and LQAGNA for SARS-Cov-2 MPro.
Trimmed alignments for AlphaFold were constructed as follows: The TEVp sequence was retained and supplemented with 20 random sequences from the initial alignment. Thirty such alignments were generated and submitted to AlphaFold.
Simulation Software. All simulations were performed with Gromacs. 45 For MM simulations, we used Gromacs 2021.3. For QM/MM simulations, we used an interface between Gromacs and DFTB+. 46,47 Both were patched with Plumed 2.8.0. 48 MM Model Preparation and Equilibration. All previously listed models were parameterized in the Amber19 force field 49 ported from Amber to Gromacs notation in-house. 50 Each system was placed in a cubic box with periodic boundary conditions and solvated with tip3p-FB. 51 All crystallographic water molecules were retained, while all added were manually filtered in case of incorrect solvation. Na + and Cl − ions were added to neutralize the net charge and reach 0.15 M ionic strength. Systems were minimized with 5000 steps of the steepest descent. The equilibration phase consisted of seven steps and was performed in three replicates for each combination "protease-Ser conformation-bound/free", resulting in 12 runs per protease. First, an NVT run of 100 ps was performed while positionally restraining heavy atoms by 1000 kJ/mol/nm 2 . A velocity rescale thermostat was utilized 52 for temperature coupling. Then, for five rounds of NPT equilibration of 100 ps each, restraint strength was gradually decreased as follows: 1000, 500, 200, 100, and 10 kJ/mol/nm 2 . A stochastic cell-rescale barostat was used for pressure control. 53 A time step of 2 fs was used in all steps.
QM/MM Model Preparation. QM/MM modeling was performed only for free and bound TEVp models. An intrabackbone boundary between QM and MM regions cannot be done in Amber19sb due to cmap usage to describe rotation in the phi/psi space instead of bond-specific torsion potentials. Thus, equilibration results were reparameterized in Amber99sbildn. 54 Coordinates after the first NPT step were used as starting to prevent dealing with already manifested drift due to possibly unrealistic protonation. The QM subsystem consisted of 149 and 49 atoms ( Figure S2) for bound and free systems, respectively. To prevent link atom hyperpolarization, a "charge shifting" scheme was used. The QM region was described with DFTB3 55 with the 3ob-3-1 parameter set 56 similarly to what was used before in various enzymological tasks. 57−61 Additionally, D4 dispersion correction 62 was utilized.
Systems were subjected to 1000 steps of conjugate gradient minimization and 10000 steps of unrestrained NVT QM/MM dynamics in eight replicates. A time step of 0.5 fs was used. To this point, Cys spontaneously reprotonated in seven out of eight replicates with Ser168 in X-ray conformation and all replicates with AFm conformation.
QM/MM Metadynamics. QM/MM well-tempered metadynamics simulations were performed for each of the four bound and four free models of TEVp. 63 Two collective variables were used. The first one was defined as d(Cys151SG-H)− d(His46NE2-H). H here is either Cys151HG or His46HE2. The second CV was a torsion angle defined by atoms of Ser168: N−CA−CB−OG. The Gaussian potential with a height of 1 kJ/ mol and a width of (0.02 nm, 0.25 rad.) was deposited every 100 steps. The bias factor was set to 12. Detailed Plumed input files can be found in the Supporting Information.
For each model, we initiated eight walkers starting from eight QM/MM equilibration replicates described earlier. Each walker ran for 150 ps with a 0.5 fs time step. Aggregate information from walkers was used to assess and confirm the convergence of simulations for each individual starting model ( Figures S3 and  S4). Aggregate information between different starting models was used to assess and confirm reproducibility (Figures S3 and S4) and build final free energy profiles. This way, for both peptide-bound and -free forms, 4.8 ns of simulations were aggregated. In each scenario, Tiwary reweighting 64 was used to build profiles with the first 20 ps of a walker run discarded as the initial transient. The same scheme was utilized later to project profiles to different variables.
QM/MM calculations were carried out using the equipment of the shared research facilities of HPC computing resources at the Lomonosov Moscow State University. 65 MM Metadynamics. Prior to metadynamics, a 10 ns unrestrained NPT run was performed for each "proteasebound/free" combination starting from six independent restrained equilibration runs as described before. Since the starting bias for the driving variable was removed by the design of starting systems, these runs were made only to ensure the absence of any orthogonal processes that may influence the subsequent main production metadynamics run. We found that 10 ns equilibration is enough for this task. A single 2 μs run was performed for the TEVp−peptide complex, and six states after 1 μs were extracted, three for each Ser168 conformation similar to a general setup. For the X-ray conformation, frames from 1220, 1420, and 1780 ns were taken. For the AFm conformation, frames from 1300, 1600, and 1920 ns were taken (see the Results section). The comparison of free energy profiles after metadynamics runs starting from these states and ones taken after 10 ns showed no qualitative difference ( Figure S5).
MM metadynamics for each system was performed by starting an independent run from each corresponding equilibration. Only one collective variable, Ser168 N−CA−CB−OG torsion, was used. The Gaussian potential with a height of 1 kJ/mol and a width of 0.25 radian was deposited every 1000 steps. The bias factor was set to 12. Simulations were run for 100 ns per replica with a 2 fs time step. Similar time scales and parameters were previously used in similar tasks with a driving variable expressed as a torsion angle. 66−68 Profiles were built with Tiwary reweighting by discarding the first 20 ns as the initial transient. The mean profile constructed from six runs was used as a final estimate ( Figure S6). Convergence was assessed by tracking the free energy difference between two minima corresponding to N−CA−CB−OG torsion being −180°(AFm conformation) and −70°(X-ray conformation, Figure S7).
Phylogenetic Analysis. Phylogenetic tree building and ancestral sequence reconstruction were performed with Fire-Prot-ASR. 69 The tree was visualized in iTOL. 70  A starting point to discern the TEVp catalytic mechanism is a description of the structure of its complex with the substrate, that is, a Michaelis complex. Naturally, no such structure could be obtained by X-ray crystallography or CryoEM without making the enzyme inactive. A plausible complex may be constructed by protein−peptide docking, but this is a notoriously difficult problem with a high possibility of failure. 71 On the other hand, AlphaFold was recently shown to produce high-quality models of protein−peptide complexes. 72 This is why we started with the AlphaFold-multimer-v2 (AFm)produced model of the actual active Michaelis complex. To assess its adequacy, we compared it to the X-ray structure of an active TEVp with the product peptide ENLYFQ (Figure 2). Positions of all residues present in the X-ray model were perfectly reconstructed by AFm (Figure 2A), and the binding pattern of P1′−P4′ residues corresponds well to what would be expected ( Figure 2B). Peptide residues along the scissile bond were correctly positioned in the oxyanion hole with the leaving group nitrogen unoccluded to receive a proton from catalytic histidine further on ( Figure 2B).
While the peptide pose was reconstructed well, we observed a deviation from the X-ray model in the structure of the enzyme itself. The conformation of Ser168 clearly differs between the two models ( Figure 2C). In X-ray models, this residue faces inward, making H-bonds with both Asp81 (oxygen−oxygen distance 2.8 Å) and Thr180 (2.6 Å). N−CA− CB−OG torsio n in X-ray model 1LVM is −76.5°. In AFm models, Ser168 faces outward and H-bonds only Asp81 with an oxygen−oxygen distance of 2.5 Å and the associated N−CA−CB−OG torsion of −173.1°.
To gain support for either of these conformations, we searched for PA clan proteases closest to TEVp in terms of fold and came up with kallikrein 4 (KLK4), chymotrypsin C, and tobacco vein mottling virus (TVMV) protease. In all three enzymes, the conformation of corresponding Ser matches closely the one from the AFm model of TEVp. Moreover, we did not find any mention of the conformation like the one from TEVp X-ray models in the literature. Therefore, at this point, It was unclear whether there is an artifact in the X-ray model or if AFm is simply biased toward a more prominent conformation from PDB. 73,74 Despite our primary focus being on the protonation state of catalytic Cys151, the influence of Ser168 conformation on it cannot be ruled out. Since Ser168 forms H-bonds with catalytic Asp81, their conformational landscapes may be linked. Changes in the relative positions of catalytic Asp81 and His46 may in their turn lead to alterations in His46 pKa, thus directly affecting the catalytic Cys151 protonation state. Thus, these two phenomena may be linked and should be studied together.
We performed a QM/MM metadynamics study to reconstruct the energy profile of two presumably linked processes�Cys deprotonation and Ser rotation along its CA− CB bond (Figure 3). We found that all three theoretically possible Ser conformations can occur in practice and interchange, although the two discussed previously are prevalent. X-ray Ser conformation in the immediate prereaction state is 0.6 kcal/mol less probable than the AFm one (Table 1 and Figure 3A,E,H). With any Ser conformation, the barrier of Cys deprotonation is under 3 kcal/mol, and the deprotonated state ( Figure 3D,F,I) is 0.2−0.8 kcal/mol less probable than the protonated one (Table 1). Cys in the protonated state is highly mobile and capable of forming a stable self-inhibited state by donating an H-bond to the backbone oxygen of S168 ( Figures  3B and4G). This state is especially prominent with Ser168 N− CA−CB−OG torsion values of approximately 180°(AFm conformation) and 60°(minor conformation).
Cys deprotonation is associated with subtle conformational rearrangements in the active center (Figure 4), and these events depend on Ser168 conformation. In the self-inhibited state, catalytic His46 exclusively forms H-bonds with the catalytic Asp81 Oδ2 oxygen, the one closer to Ser168 ( Figures 3B,G  and4A,B,E−G). It is especially so for AFm conformation ( Figure  4B). At the prereaction state, His mostly forms a bifurcated Hbond with both Asp oxygens ( Figure 4C). This H-bond is symmetrical in minor Ser conformation, skewed to Oδ2 in AFm conformation and to Oδ1 in X-ray conformation. In a deprotonated state, His shifts to mostly form an H-bond to Oδ1 oxygen ( Figure 4D). This is most clearly pronounced for minor Ser conformation and less so for AFm conformation. Interestingly, while at the deprotonated state, a transition between AFm and X-ray conformations requires His to shift back to form an exclusive H-bond to Oδ2 ( Figure 4D). With all three Ser conformations, deprotonation occurs more preferably with His switched to Oδ1 (Figure 4A,E−G).
A more dramatic influence of Ser168 conformation is on Asn44 (Figure 4H−N). This residue is also a part of the active site and can form H-bond with either catalytic Asp81 Oδ2 ( Figure 3B) or its backbone oxygen ( Figure 3F). In minor and AFm Ser168 conformations, it is generally the former ( Figure  4L,N), while in X-ray conformation, its behavior changes along the deprotonation process ( Figure 4M). In the self-inhibited state, Asn44 interacts with backbone oxygen, while in the prereaction state, the balance shifts to donate an H-bond to Asp81. It makes evolutionary sense, since Cys deprotonation preferably proceeds with this interaction pattern of Asn44 ( Figure 4M). In the deprotonated state, an equilibrium of Asn44 conformations is present ( Figure 4K,M).
The situation changes in the absence of a bound peptide. In this state, Cys is strictly deprotonated, His46 donates an H-bond to Oδ2 oxygen atom of Asp81, and Asn44 interacts with Asp81 backbone oxygen ( Figure 5). Peptide binding, therefore, plays a major role in bringing Cys and His close to each other and shielding them from water. Without a peptide, the transition between AFm and X-ray conformations also becomes much easier. There are no differences in the system′s behavior dependent on Ser168 conformation.
To this point, using QM/MM metadynamics, we discovered that in the structure of TEVp at the Michaelis complex, two Ser168 conformations are almost equally possible and that the catalytic Cys151 is highly mobile. The same was also found to be  true in a 2 μs MM run ( Figure 6A). Such a conclusion could not be reached based on the analysis of available X-ray models alone. It is therefore possible that analogous Ser residues in other PA clan proteases can also be highly dynamic and feature such a stable alternative state. To explore this possibility, we performed a metadynamics run on both peptide-bound and -free models of previously mentioned proteases: KLK4, chymotrypsin C, and TVMVp. To be able to compare several systems, we were forced to use pure MM treatment. By comparing data for TEVp, we concluded that the MM description is capable of correctly ranking stable states by their energy ( Figure S8). This allows for drawing only qualitative conclusions on the minima. MM treatment is clearly much stiffer, and therefore, barrier heights do not match. Neither of these three proteases displayed the same Ser conformational pattern in the peptide-bound complex as TEVp, making it indeed a unique feature of this enzyme ( Figure 6B,C). There was a pronounced difference even compared to another cysteine protease in the set, TVMVp, which is the closest TEVp homologue with a known X-ray structure. In a peptide-bound  form, in both proteases, Ser168 in the X-ray conformation competes with Asn44 for providing an H-bond to Asp81 Oδ2. However, in TEVp, this conformation is stabilized by the Hbond from Thr180 that cannot be formed in TVMVp, since it has Val in this position ( Figure 6D). The difference between the two cysteine proteases is however diminished when the free models are compared ( Figure 6C). In TVMVp without the peptide, Asn44 interacts with Asp81 backbone oxygen instead of Oδ2, which frees Ser168 from competing with it.
Virtually substituting Thr180 for Val in peptide-bound TEVp makes it very similar to TVMVp in terms of Ser behavior ( Figure  6B). Likewise, restoring Thr in TVMVp brings it closer to TEVp and further from two serine PA proteases under consideration ( Figure 6B). It should be noted that KLK4 and chymotrypsin C originally harbor Thr corresponding to TEVp Thr180, and nothing competes with the Ser168 analogue for aspartate. Yet, their behavior is different possibly due to the exposure of the Ser168 analogue to water even in the peptide-bound form ( Figure S9).
We performed phylogenetic analysis and ancestral sequence reconstruction of TEVp homologues that are viral Cys PA clan proteases. The tree topology correlates well with the identity of residue 180 ( Figure S10). Threonine at this position is the oldest variant (Thr probability in the ultimate ancestor at 79%), and clades with predominant Ala180 and Val180 emerge. The biggest Val180 clade is strongly associated with Ala at position 168 instead of Ser. Hence, any influence of Ser conformation on other parts of the active site is not necessary for proteolytic activity arising from implementing a Cys nucleophile in a chymotrypsin-like fold.
TEVp and TVMVp are indeed very closely related and belong to a mostly Thr180 clade. Val in TVMVp is a novel feature, which correlates well with our virtual mutagenesis results. It however remains unclear whether Ser mobility in TEVp has any connection to function at all or is just a tolerable neutral feature. To arrive at a definite conclusion, an insight into the full reaction process, both acylation and deacylation stages, must be gained in further work.

■ DISCUSSION
In this work, we took an unconventional approach to construct a protease−substrate complex by utilizing the capabilities of AlphaFold-multimer. No assessment of AFm ability to predict protease−peptide complexes was previously made. Here, we show that it performed near perfect for a range of PA clan proteases.
It is intriguing how and why it achieved this result. To fold a protein, AlphaFold utilizes information on residue coevolution that comes from multiple sequence alignment (MSA) embeddings and is contained in an MSA representation. 75 It is then used to construct and update a pair representation. This second representation contains information on residue−residue distances and is processed to ensure triangle inequality. Processed pair representation is then used to update MSA representation in a new update round. AlphaFold-multimer is a follow-up system based on the ideas of AlphaFold designed to predict the structures of multichain protein complexes. As inputs, AlphaFold-multimer utilizes paired alignments constructed for each individual chain in the assignment. 30 Pairing is performed based on phylogenetics using the UniProt species annotation.
Our use of AlphaFold-multimer is, however, different in a way that no alignment is constructed for a short substrate peptide. The same is true for a previously proposed way of using AlphaFold for peptide−protein docking. 72 By attaching a peptide to a protein via a polyglycine linker, researchers were able to obtain good-quality complexes. No alignment was built for polyglycine and peptide regions. In both ways, protein− peptide coevolution is still taken into account normally, since peptide′s sequence is effectively a degenerate alignment with no variation. Both methods work since the physicochemical interactions that drive protein folding and protein−peptide interactions are the same. 76 Protease binding sites′ evolve to better recognize alien substrate loops and cognate substrate loops coevolve with their respective proteases. 77−79 In the paper describing the approach with a polyglycine linker, researchers conclude that the good quality of complexes does not arise from just memorization. We, however, decided to verify this conclusion for our method and application field. AlphaFoldmultimer was trained on PDB structures with a maximum release date of 2018-04-30. First-ever complexes between Sars-Cov-2 MPro (or 3CLPro), a PA clan Cys protease, and its substrates were resolved in 2020. 80 AlphaFold-multimer was able to produce a high-quality prediction nevertheless ( Figure  7A).
The absence of a peptide alignment means that no information on peptide conformation comes from the peptide sequence itself. Therefore, a protein here completely guides the folding of a substrate peptide, which is in accordance with how it happens in nature. 81 Substrate peptides in PA clan proteases come from unstructured loops. A β-sheet is formed between a protease and a region of the substrate loop downstream of the scissile bond. We speculate that no adequate complex can be built by AlphaFold when this region is not provided. Our runs on TEV and Sars-Cov-2 MPro with truncated peptides support our hypothesis ( Figure 7B,C).
A further assessment of a broader selection of various proteases is needed to reach a generalized conclusion on whether AlphaFold-multimer can be broadly recommended in similar studies on protease−substrate complex structure prediction. However, its overall decent ability to perform protein−peptide docking 72 combined with data presented herein makes it possible to speculate that AlphaFold-multimer might be a general solution to such a task. Once its ability to predict Michaelis complexes of proteases is proven, it may even be used to screen for protease decoys against a particular substrate peptide for further fine-tuning using methods of computational protein design. 82,83 AlphaFold also turned out to be right about the possibility of another Ser168 conformation. However, all five models produced the same result, thus giving no hint on the possibility of X-ray conformation and strengthening assumptions that AF is biased toward available structural data in PDB. 73,74 Thus, without X-ray structures, starting from AF models, we could have easily overlooked this alternative state. Could this limitation be overcome? We took inspiration from recent works, suggesting that trimming the alignment may result in different conformations, 84 and indeed obtained both Ser forms discussed in this work ( Figure S11). This shows that AF may be a powerful tool for providing hypotheses on cryptic but inherent conformational switches in protein structures not immediately obvious from X-ray models.
Since the Cys deprotonated state turned out to be quite possible on its own without anything happening to the substrate, the acylation stage of the proteolysis reaction must proceed similarly to the papain-like mechanism ( Figure 8). It is believed to normally feature only one stage of highly concerted simultaneous breakage of C−N(pept) and N(His)−H bonds and formation of S−C and N(peptide)−H bonds. 26−28 Due to the complications of orchestrating such a process, the associated barrier is high and it is usually a limiting stage. TEVp, however, faces another obstacle on the path to becoming a fast enzyme. As we show here, it demonstrates a tendency to fall back into two states with protonated Cys that are effectively inhibited forms of the enzyme. Our QM/MM calculations show that only 13.2% of configurations in the Michaelis complex are productive. Careful considerations of the exact populations of substrates within what may be otherwise considered to be a single state are necessary to construct correct kinetic schemes. This, in turn, may lead to revolutionary new findings on the enzyme′s efficiency as was recently shown for the staphylokinase−plasmin complex. 85 TEVp is not unique in the ability of its nucleophile to lose a hydrogen bond to the catalytic histidine. The same behavior was proposed to be the cause behind the existence of the "slow" form of thrombin. 86 Allosteric activation by a Na + ion turns it into a 10 times more active "fast" form by shifting the populations of conformations of the catalytic nucleophile. 86,87 Therefore, a possible strategy to enhance TEVp and similar enzymes would be to discourage such a process or to even lock Cys in a constantly deprotonated state. While Cys reprotonation is likely a manifestation of a fold being underoptimized for such a nucleophile, the surroundings of active sites of Cys PA clan proteases are clearly different from their Ser counterparts with Asn44 being an obvious example. Thus, some degree of optimization was nevertheless achieved, and it is an intriguing further direction to correlate it with concrete features. Inspiration might be taken from studying the opposite process�rolling back to a Ser nucleophile. A series of such TEVp mutants, leading to improved activity in Cys151Ser, was previously obtained. 22 Overall, the following questions emerge: is there even space to further fine-tune the chymotrypsin-like fold for cysteine as a nucleophile, and why was serine swept out of natural selection in the first place? It is possible that first came crucial alterations in the fold itself and cysteine acted as a compensatory substitution.
TEVp and TVMVp are not the only proteases featuring a cysteine nucleophile in a chymotrypsin-like fold. It is a quite common feature among virus proteases including 3CLpro from SARS-CoV-2 mentioned earlier. It may be tempting to transfer results presented herein for enzyme−substrate complexes on other such proteases. However, caution is advised, since many of them deviate significantly in terms of how acid function is implemented ( Figure S12). Differences such as Glu instead of Asp or water occupying its place may significantly alter histidine basicity, and the extent of such alterations should be better investigated in a separate study. On the other hand, for the free form of such enzymes, evidence mounts suggesting the zwitterionic state of the catalytic site. Neutron crystallography performed for SARS-Cov-2 3CLpro corresponds well to the data we acquired herein. 88

■ CONCLUSIONS
The TEV protease is the most extensively studied member of PA clan proteases with cysteine as a nucleophile. We expand on previous knowledge by providing a detailed insight into the equilibrium of states of the TEVp enzyme−substrate complex as well as the free enzyme. We perform enhanced sampling QM/ MM calculations to show, in agreement with previous works, that catalytic cysteine is deprotonated if the substrate peptide is not present. Substrate binding, however, turned out to induce the possibility of cysteine protonation. The barrier between two protonation states is small, thus prompting rapid interconversion of states, with the protonated state slightly more stable than the deprotonated one. Cysteine deprotonation does not result in a spontaneous attack on the substrate carbon atom. We, therefore, suggest that starting from this deprotonated state, a papain-like catalytic mechanism might be valid for TEVp.
We demonstrate that protonated catalytic cysteine is highly mobile and may switch its H-bonding partner from catalytic histidine to the serine nearby. This state is effectively a selfinhibited state away from the proposed reaction path. The existence of this state is likely an outcome of under-optimization of preferentially serine triad-harboring folds for a nucleophile switch. It remains to be investigated whether this underoptimization is an apex of what can be achieved without complete fold change or if there is still space for minor improvement.
We also uncovered the existence of two almost equally possible conformations of the same serine that TEVp catalytic cysteine may form an H-bond to. The analysis of structurally similar proteases, both serine and cysteine, as well as homologues, indicated that this is not a general feature of cysteine PA clan proteases and may be an individual feature of TEVp. We showed that these conformations affect the structure of the active site as it proceeds with cysteine deprotonation, however, with no meaningful effect on the possibility of the process or populations of states. Such effects may manifest themselves during later stages of the reaction.
Through the calculation and analysis performed in this work, we uncover new details about how cysteine proteases may perform their functions and propose a plausible strategy to improve the activity of TEVp and similar enzymes. Beyond that, we demonstrated a surprisingly sound ability of AlphaFoldmultimer in constructing meaningful complexes of proteases with their respective peptide substrates. This work presents a case highlighting that a combination of different modeling techniques can facilitate advances in the understanding of structure−function relationships of enzymes. ■ ASSOCIATED CONTENT
Fold and active site differences between PA proteases and papain ( Figure S1); QM region composition ( Figure S2); convergence and reproducibility in QM/MM metadynamics runs (Figures S3); comparison of MetaD results when initiating from six independent 10 ns equilibration runs and from six frames taken after the 1 μs of a 2 μs run ( Figure S4); comparison of QM/MM and MM reconstructions of the Ser168 torsion free energy profile ( Figure S5); free energy profiles of Ser CA−CB rotation from MM runs ( Figure S6); convergence of MM metadynamics runs tracked by the free energy difference between Ser states ( Figure S7); comparison of free energy profiles of Ser168 rotation obtained from QM/MM and pure MM calculations ( Figure S8); hydration in the active site of serine PA clan proteases in a peptide-bound form ( Figure S9); phylogenetic tree of viral cysteine PA clan proteases ( Figure S10); AlphaFold2 models from trimmed alignments (Figure S11); and showcase of different acid identities and architecture across distant viral PA clan cysteine proteases ( Figure S12). Supplemental Data including example states from QM/ MM metadynamics runs, AlphaFold2-multimer-produced models, and inputted multiple sequence alignments, input files for PLUMED, and FireProt-ASR output are deposited on Zenodo and can be accessed by DOI: 10