Identifying the Enzymatic Mode of Action for Cellulase Enzymes by Means of Docking Calculations and a Machine Learning Algorithm

Docking calculations have been conducted on 36 cellulase enzymes and the results were evaluated by a machine learning algorithm to determine the nature of the enzyme (i.e. endoor exoenzymatic activity). The docking calculations have also been used to identify crucial substrate-enzyme interactions, and establish structure-function relationships. The use of carboxymethyl cellulose as a docking substrate is found to correctly identify the endoor exobehavior of cellulase enzymes with 92% accuracy while cellobiose docking calculations resulted in an 86% predictive accuracy. The binding distributions for cellobiose have been classified into two distinct types; distributions with a single maximum or distributions with a bi-modal structure. It is found that the uni-modal distributions correspond to exotype enzyme while a bi-modal substrate docking distribution corresponds to endotype enzyme. These results indicate that the use of docking calculations and machine learning algorithms are a fast and computationally inexpensive method for predicting if a cellulase enzyme possesses primarily endoor exotype behavior, while also revealing critical enzyme-substrate interactions.


Introduction
Cellulosic biomass, the most abundant natural polymer on Earth, can be used as a feedstock for chemicals and liquid transportation with a significantly reduced carbon footprint as compared to petroleum [1][2][3].Cellulosic biomass is an inherently versatile feed stock consisting of glucose units linked through a β-1,4-glycosidic bond.This renewable, green feedstock can be used as a cheap and nontoxic raw material in a myriad of chemical and biological reactions for the production of transportation fuel and value added chemicals and materials [1][2][3][4][5][6][7][8].However, present technologies for the biodegradation of crystalline cellulose are not sufficiently optimized for general commercial scale production and industrial applications [9,10].This shortcoming is primarily due to the recalcitrant nature of lignocellulosic material [2,3].The difficulty in degrading cellulose is better understood through examining its molecular level structure and role in nature.Cellulose, which is produced biosynthetically and is vitally important to the global carbon flow, is composed of D-glucose units joined in a linear fashion by β-1,4-glycosidic bonds, Figure 1 [2].The natural resistance of cellulose to degradation is in part due to the stability of the o-glycosidic bond, and to a larger extent the overall structure of cellulose, which is precisely arranged to maximize the strong hydrogen bonds between adjacent cellulose chains and the relatively weaker hydrophobic interaction between cellulose sheets [11,12].This natural resistance to deconstruction (i.e.biomass recalcitrance) is primarily responsible for the high cost of biomass conversion to glucose [13].
In nature, biological organisms utilize several enzymes for the purpose of hydrolyzing crystalline cellulose.An efficient mixture of these enzymes (i.e.cellulases) work synergistically to hydrolyze cellulose as it is degraded from crystalline and/or amorphous cellulose to small, soluble cellulose fragments (e.g.cellobiose) and finally to glucose.In general, these enzyme mixtures consist of three different classifications of cellulase enzymes; endoglucanases (EG; EC 3.2.1.4),cellobiohydrolases (CBH; EC 3.2.1.91),and β-glucosidases (BG; EC 3.2.1.21),which are collectively referred to as glycosyl hydrolases (GH) [14,15,16].The EGs are responsible for disrupting the crystalline structure of cellulose by randomly hydrolyzing accessible internal β-1,4-glycosidic bonds and thereby exposing individual cellulose polysaccharide chains.The CBHs act on the reducing (CBHI) and non-reducing (CBHII) ends of the exposed polysaccharide chains in a processive manner, releasing predominantly disaccharides (cellobiose) and to a lesser extent trisaccharides.The final component of the enzyme mixture is the BGs, which further hydrolyze the cellobiose to glucose.The slowest (i.e.rate limiting step) and most complicated aspect of hydrolyzing cellulose to glucose is believed to be the disruption of the crystalline cellulose substrate by EG and the CBHs [17].
Cellulases are classified into different families according to the primary amino acid sequence homology, secondary and tertiary structure, and their catalytic residues (CAZY database at http://www.cazy.org)[16][17][18][19].All cellulases have multiple glucose binding sites (e.g.-7 to +2) with negative numbers representing non-reducing sugar binding sites, and positive numbers representing reducing sugar binding sites.In this numbering system, the -1 and +1 binding sites are responsible for catalysis of the cellulose fragments through hydrolysis of the o-glycosidic bond [20].Across all cellulase families, a commonly accepted standard for classification of endo-or exo-behavior involves the ability of the enzyme to hydrolyze either carboxymethyl cellulose (CMC) or highly crystalline cellulose (Avicel) [17,21,22,23].If the enzyme of interest shows measurable kinetics with CMC, a bulky derivative of cellulose (Figure 1), the enzyme is classified as an endo-cellulase, while significant activity on Avicel is given an exo-cellulase classification.The ability to discriminate between endo-and exo-activity based on the presence or absence of a bulky substrate (i.e.CMC) is due to the shape of the endo-or exo-active site groove [18,21,24].Endo-and exo-cellulase active sites are formed by the closure of two distal loops flanking the active site region.In the case of exo-cellulases, these extensive loops remain in a closed conformation that results in a stable active site tunnel, while in endo-cellulases, these loops are shorter in length, resulting in a more open active site groove [21].In exo-cellulases, the substrate is threaded through the active site tunnel, which contributes to the processive catalytic action observed in CBH enzymes [24,25].In the case of the endo-cellulases, the distal loops have a reduced length resulting in an open active site resembling a groove or cleft, which allows the enzyme to adsorb onto a cellulose surface and bring its active site into close contact with a cellulose polymer.The critical importance of the distal loop length on the catalytic activity has been shown by experimental studies in which a deletion of 15 amino acids in one of the distal tunnel forming loops of a CBH (Cel6B from Cellulomonas fimi) resulted in a significant enhancement of endo-type activity for the enzyme [26].
Establishing if a cellulase enzyme has predominantly exo-or endo-cellulase behavior is important for future studies including efforts to bioengineer more efficient cellulase enzyme systems for industrial applications.Structural characteristics such as the loops enclosing the active site provide useful insights for specific enzymes, although it is difficult to quantitatively classify all cellulase enzymes using these structural features alone [27,28].Kinetic measurements involving CMC and Avicel provide a better criterion for establishing endo-or exo-type activity, however these experiments are costly and time consuming, thus data is currenty unavailable for many cellulase enzymes.Furthermore, there will undoubtedly be many more cellulase structures available in the coming years from both experimental (i.e.x-ray) and computational (i.e.homology modeling), thus it is of interest to develop a quick, efficient method for classifying exo-or endo-type behavior.One possibility is to utilize available computational tools to develop models that can cheaply and easily predict the mode of action of cellulase enzymes.Furthermore, the use of computational methods allows for the investigation of other enzyme characteristics critical to functionality, which is difficult to study experimentally.
Here, we report the use of docking calculations between cellulase macromolecules, and the CMC and cellobiose ligands.The main objective of this study is to determine if docking calculations can be used as a quick and efficient means for accurately predicting endo-or exo-cellulase behavior.In total, 36 enzymes from 8 families were studied, all of which have been previously classified as possessing endo-or exo-activity by means of kinetic and/or X-ray studies.In addition to predicting the enzyme mode of action efficient computational studies such as docking allows for the identification of enzyme characteristics that may warrant further investigation, such as identification of cellulase mutants that may possess favorable characteristics (e.g.reduced product inhibition).

Docking Calculations
The AutoDock 4.2 [29] program was used to conduct docking calculations on the cellulase enzymes, while the AutoDockTools [29] (ADT) software was used to prepare the input files.Two separate docking calculations were performed on each enzyme, one with cellobiose, and the other with CMC.The cellobiose ligand contained a total of 12 active torsions and the CMC ligand contained a total of 24 active torsions, with the torsion tree root of both ligands residing on the o-glycosidic bond oxygen.The ligands were then modified by adding polar hydrogens, merging all the non-polar hydrogens, and adding Gasteiger charges.The cellulase macromolecule structures were obtained from the protein data bank (PDB accession numbers are found in Table 1) and were modified by removing all crystallographic waters, metal ions, and associated ligands using the ADT program.The resulting macromolecule was then subjected to 1000 steps of steepest-descent and 1000 steps of conjugated gradient energy minimization using the UCSF Chimera package [30].The resulting macromolecular structures were then used to prepare input files for performing docking simulations using the ADT software.In all docking calculations, the macromolecule was kept rigid while the ligand was allowed to sample the specified torsional parameters.The grid box created in ADT was sufficiently large to encompass the catalytic active tunnel/groove region.The dimensions for the grid boxes are given in Table S1 of the Supporting Information, and a representative grid box used for Cel7A [20] is shown in Figure 2. AutoGrid 4 [29] was used to produce grid maps with a default grid spacing of 0.375 Å.The docking calculations used the standard Autodock force field and the Lamarckian Genetic Algorithm (LGA) to search for the best docked ligand conformers.Each docking experiment consisted of 100 independent LGA runs with a population size of 150 and a random initial geometry for the ligand.The maximum number of energy evaluations for each LGA run was set at 25,000,000 with the maximum number of generations ranging from 1000 to 27000 depending on convergence.The maximum number of top individuals that automatically survived was set to 1, the mutation rate was set to 0.02, the crossover rate was set to 0.8, the translational step size was set to 2 Å, and the quaternions and torsion step size was set to 50°.For the analysis of the docking calculations, 100 conformers were considered for every complex, and the resulting docking clusters were calculated with a tolerance of 2.0 Å for the root mean squared deviation (rmsd) on the heavy atoms.Eight docking experiments were repeated four times to check the reproducibility of the docking patterns, and it was verified that consistent results are obtained.Furthermore, for cellulase enzymes that have a cellobiose bound in the crystal structure (14 total studied here), the lowest binding energy cellobiose pose was compared to the crystal structure, and the resulting RMSDs were all < 2.5 Å, indicating the docking results are accurately reproducing experimental data.

Coarse Grained subsite docking analysis
To systematically analyze the CMC or cellobiose docking results, a protocol is required that will effectively work on a wide range of cellulase enzymes and substrates.Therefore, a distance dependent method was created that utilized the center of mass for each of the glucose rings to describe the spatial position and orientation for each of the docking solutions.The binding sites of the macromolecule, ranging from -7 to +2, were represented by spheres of equal diameter.The positions of the spheres were determined such that the spheres encompassed binding site residues identified by X-ray crystallography.The resulting active sub-site spheres smoothly and continuously tracked along the active site tunnel/groove much like beads on a string, as shown in Figure 3.The distances between an active sub-site sphere and the center of mass of the glucose ring were calculated for all of the identified ligand-macromolecule binding poses.The ligand for a particular binding pose was then assigned to the closest active sub-site sphere or pair of spheres.To eliminate assigning a ligand to an active site sphere when the ligand was not bound in the active site tunnel/groove, a distance cutoff of 6 Å was applied (i.e. if the closest active site sphere to a particular ligand is greater than 6 Å then that ligand is removed from the resulting histogram).

Carboxymethyl cellulose docking classification
The coarse grained subsite docking analysis was utilized on the carboxymethyl cellulose (CMC) docking poses, which can be used to indicate endo-or exo-cellulase behavior.Specifically, if the CMC docking study results in one or more conformations with a favorable binding energy residing inside the active site, the enzyme is classified as having endo-cellulase behavior, while the absence of a CMC docking conformation is indicative of exo-cellulase behavior.For all enzymes, a docked conformation is considered 'inside' the active tunnel/groove if it resides within the loops covering the active tunnel/groove sub-sites and has a favorable binding energy.As the length of the active groove may vary substantially between cellulase enzymes, the exact sub-site location that is considered part of the active tunnel/groove will vary between enzymes.

Cellobiose docking pattern classification
Using the coarse graining technique, two general patterns were identified for the docking of cellobiose in the catalytic tunnel/groove of cellulases.The two patterns contained either a single maximum in the distribution, or a bi-modal distribution, which are indicative of exo-and endo-activity, respectively.To be considered a uni-modal pattern, the distribution must contain a single maximum, with the number of conformations decreasing or remaining approximately constant on either side of the maximum conformation sub-site.For a bi-modal pattern, the defining characteristic is that two relatively high occupancy sub-sites must be separated by one or more sub-site(s) with a lower occupancy.One of the high occupancy sub-sites must be at least 20% greater than the intervening low occupancy sub-site(s) and the other high occupancy sub-site must be at least 2% greater than the intervening low occupancy sub-site(s).It is important to note that because the length of the active groove can vary substantially between cellulases (e.g.Cel7 has 9 sub-sites while Cel9 has 6 sub-sites), the location of the distinguishing features of the binding distributions will vary between enzymes.Representative histogram plots of the binding motifs indicative of exo-/endo-activity are shown in Figure 4.

Machine learning algorithm
In addition to the cellobiose docking pattern classification process it is beneficial to have an automated algorithm that can evaluate large amounts of docking data and predict the enzyme mode of action without the need for time consuming graphing and manual analysis.To this end a sparse representation based learning model was utilized to explore the association between the docking results and the experimentally determined enzymatic mode of action.The machine learning algorithm (MLA), based on our previous work [31,32], analyzed the binding patterns for cellobiose in terms of the number of binding events at each binding sub-site and/or the presence of CMC in the binding tunnel/groove for the various cellulase enzymes.This method addresses the group structure of binding events such that the learned regression model has better predictive performance for the enzymatic mode of action for cellulase enzymes (i.e.endo-vs.exo-cellulase activity).
The benefit of this MLA is its ability to select features across multiple tasks by enforcing additional structure sparsity for jointly selected features across multiple tasks via a  2,1 -norm regularization [33,34].In equation 1, the first term measures the regression loss, the second term couples all the regression coefficients of a group of features, and the third term penalizes the regression coefficient of each individual feature to select features across multiple learning tasks.W is the weight matrix that measures the relative importance of a particular piece of docking data in predicting the enzyme mode of action, γ is a trade-off parameter, and  [31,32].It is noted that this new method is very close and motivated by our earlier work [32].However, the MLA utilized here is redesigned to specifically solve the classification between endoand exo-mode of action.This type of learning model naturally leads to feature (e.g.binding sites) selection, feature type selection, as well as the explicit feature relevance (e.g.how relevant a binding site is with respect to predicting the mode of action).

Exo-/Endo-Structural Features
Generally, cellulases from the same family have high primary sequence identity, structural similarity, and conserved catalytic residues.However, the length of the loops and their flexibility over the catalytic tunnel differ between cellulases in the same family, and it is believed that the nature of these loops plays a significant role in the different catalytic activities observed in the enzymes (i.e.endo-vs.exo-) [25,26].To illustrate this point, the amino acid sequence and overall structures are compared for the family 7 cellulases Cel7A [20] and Cel7B, [35] which have exo-and endoactivities, respectively.On the basis of amino acid sequences, it is evident that some of the loops in the endo-Cel7B are significantly shorter than in the exo-Cel7A.Specifically, residue segments 191-204, 244-249, and 381-392 in Cel7A are missing in Cel7B, which results in structural changes that can be seen by comparing the superimposed structures of the two enzymes, Figure 5   Utilizing the primary amino acid sequence and characteristic secondary structure for the 36 enzymes evaluated in this study the machine learning algorithm (MLA) predict the enzymatic mode of action with 81% accuracy.Additionally, grouping the amino acids (e.g.positively, negatively, and uncharged side chain) also resulted in an 81% predictive accuracy.The MLA analysis of the primary amino acid sequence identified loop residues at position 192 (loop I), 244 and 245 (loop II), and 391 (loop III) as possessing positive relative significance in correctly predict the enzyme mode of action.The analysis of the primary and secondary structure aspects represents a rapid method for classifying the mode of action with only the primary amino acid sequence and without the need for docking studies, kinetic experiments, or crystal structure data.Although for a more accurate predictive model additional calculations are found to be necessary.

CMC Docking
A commonly accepted experimental method for determining endo-cellulase activity is to either measure the cellulase enzymatic ability to hydrolyze CMC and/or obtain a crystal structure of the enzyme with CMC bound in the active groove [21,23].Therefore, docking calculations were performed with CMC as a substrate to determine if such calculations could be used to indicate endoor exo-cellulase behavior in silico.It was found that a favorable (i.e.negative binding energy) docked conformation of CMC within the active tunnel/groove of the enzyme can be used as a criterion for predicting exo-or endo-type cellulose behavior, which is represented in Figure 6.As the figure shows, the docked CMC is found inside the groove of Cel6B, while the docked CMC is outside the active groove of Cel6A, indicating these enzymes have endo-and exo-type behavior, respectively [21,36].The results of the CMC docking calculations with all 36 enzymes utilized in this study are shown in Table 1.It is clear from Table 1 that the predicted mode of action is in good agreement with the experimental classification for all but three of the cellulase enzymes investigated; 2 exo-cellulases (PDB IDs: 2RFW and 2RFY) [37] and 1 endo-cellulase (PDB ID: 1QI0) [38].The results of the CMC docking and MLA code reveal a 100% agreement with the 3 kinetically verified modes of action, 87% agreement with the 15 X-ray verified modes of action, and a 92% agreement with all 36 structures, which includes modes of actions determined by structural analysis.The results of the MLA code are found in Figure 7 where larger relative significance values correspond to larger favorable contributions to the predictive accuracy.The results from the MLA code indicate that the endo-enzymes have a larger favorable impact on the predictive model, most likely due to their larger presence in the MLA code (22 endo-vs.14 exo-).This good agreement with experimental classification over a broad range of cellulase families is encouraging given the computationally inexpensive nature of these calculations.

Cellobiose Docking
Although the CMC docking results can predict endo-and exo-behavior with high accuracy, CMC is not a natural substrate of cellulase enzymes, thus it is difficult to extrapolate docking results with CMC to the identification of substrate specific enzyme characteristics.Therefore, docking calculations with cellobiose were performed, which is a better representation of the natural cellulose substrate and is the main product of cellobiohydrolase enzymes.It was identified, as shown in Figure 4, that the docking modes for cellobiose in the catalytic tunnel/groove have either a uni-modal or bi-modal shaped docking distribution.Analysis of the docking results show that 12 of 14 (86%) enzymes that have been experimentally classified as exo-cellulase gave a uni-modal shaped docking distribution, while 17 of 22 (77%) enzymes experimentally classified as an endo-cellulase gave bi-modal shaped docking distributions for an overall 81% accuracy in identifying the enzymatic mode of action (100% kinetic and 87% X-ray).Along with the CMC results, the docking results with cellobiose as a substrate are also given in Table 1.For the endo-enzymes, five systems gave uni-modal/exo-distributions, which are accounted for by the closed nature of the loop region in the X-ray structure.The ability of cellobiose docking to predict the enzymatic mode of action (29/36, 81%) is found to be 11% lower than the CMC docking results.In general, the average binding energies at the various sub-sites in the exo-cellulase enzymes is almost equal or contain one minima at the sub-site with an increased number of conformations in 12 of 14 exo-enzymes, but this trend is not seen in the endo-cellulase enzymes.In 12 of 22 endo-cellulases, the two average binding energies at sub-sites with increased number of conformations are separated by a relatively more unfavorable binding energy with a low number of binding conformations.
Utilizing the MLA code and the cellobiose docking results for all available sub-binding sites yields an overall predictive accuracy of 86%.In addition, the MLA code is capable of indicating the relative importance for each sub-binding site as seen in Figure 8.The MLA code results indicate that the most significant contributors to the predictive accuracy are the sub-binding sites -7 to -4.These particular binding site, while found in both endo-and exo-type enzymes, are more prevalent in the exo-cellulases.The increased occurrence of these substrate binding sites in the exo-cellulases is commonly associated with their threading of the cellulose substrate into the active site tunnel and processive action.These sites are not as critical in the endo-cellulase enzymes due to their role in randomly clipping cellulose strands and the absence of processive ability.It is noted that some endo-cellulase enzymes have been reported to have exo-processive activity, although this is significantly reduced when compared to exo-celluase enzymes.Evaluation of Figure 8 also indicates that sub-binding sites -2, -1, and +2 are important in classifying endo-or exo-activity.These binding sites, while close to the catalytic site (i.e.sub-binding site -1 and +1), also reside under loop II, which has been implicated in defining the enzyme mode of action.Exo [45] Exo [36] Exo [46] Endo [47] Exo [48] Exo Endo [60] Endo Endo Cel45A 1L8F 1OA7

Clostridium Cellulolyticum
Endo [63] Endo Endo Similar to the CMC dockings, the cellobiose docking method described here seems to provide a quick and accurate method for determining if an enzyme has predominately endo-or exocharacteristics.Furthermore, the cellobiose docking results can provide insights into unique enzyme structure-function relationships leading to the observed endo-/exo-behavior, as well as serving to predict other useful enzyme characteristics which may warrant further study, such as inhibition patterns, and residue interactions critical to substrate binding.Also, further investigation of the cellobiose docking results lends insight into why the predicted mode of action was incorrect for a select few enzymes.This is examined in greater detail in the following Cel7 and Cel6 family specific sections.

Family Cel7
Examining the overall cellobiose docking distributions provides useful insights into the different binding mechanisms utilized by endo-and exo-cellulase enzymes.The docking modes for cellobiose at the binding sub-sites of two Cel7 enzymes are depicted in Figure 9.It is evident that Cel7A possess a uni-modal shaped docking histogram that is in agreement with the experimentally classified exo-cellulase mode of action [20].Inspection of the docking results indicate a gradual increase in the number of docked conformations at each sub-site, starting from the active site tunnel entrance (-7 and -6 sites) and ending at sub-site -4.The docking distribution reaches a maximum at sub-site -4 and then starts a gradual decrease until the catalytic active site is reached (-1 and +1).The uni-modal shape of the docking distribution is due to the increasingly favorable interactions from one site to the next as the substrate is threaded from the active site tunnel entrance (-7 sub-site) to sub-sites deep in the catalytic tunnel (-4 site).The uni-modal binding pattern suggests that in order to stop the cellulose strand from exiting the active site tunnel and re-annealing to the micro fibril substrate, the enzyme has evolved a ratchet-like mechanism to coax the substrate strand into the active site tunnel.It is hypothesized that this feature is necessary for the efficient threading of the substrate into the active site tunnel and the subsequent retention of the substrate for a long enough time to facilitate the observed processive nature of the enzyme.The overall shape of the docking distribution is clearly uni-modal and reveals a general structure-function relationship that illuminates how these particular enzymes thread the substrate into the active site tunnel and move in a processive fashion on the surface.
The decrease in the number of docking conformations between sub-sites -3 to -1 indicate the substrate in this region has a reduced amount of enzyme stabilization or enhanced conformational strain.The decreased enzyme stabilization near the catalytic sub-sites (-1 and +1) may allow the substrate to change from the standard chair configuration to the skew boat configuration, which is found to be necessary for catalysis [36,64].A distortion of glucose from chair to half-chair, boat, or skew boat has also been observed in quantum mechanical/molecular mechanical (QM/MM) and QM calculations carried out on the glycosidic bond hydrolysis of a cellulose chain in Cel7A [65].The increased flexibility would also be in line with the hypothesis that flexibility (usually enzyme flexibility) is correlated with increased enzyme activity [66].
While the histogram for Cel7A has a uni-modal shape, the docking histogram shape for the endo-cellulase Cel7B is significantly different.The most notable difference is that the maximum number of conformations is almost equally split between sub-sites -2 and +1, creating a bimodal distribution.This distribution indicates the enzyme may initially absorbs onto the substrate through strong interactions on the reducing side of the active site, which could temporarily 'lock' the enzyme onto the substrate, followed by the attachment of the +1 sub-site on the non-reducing side to the surface or vice versa.Such a binding mechanism would allow the entire active groove to absorb onto a cellulose sheet and bring the active site into close proximity with the cellulose polymer.The lower average binding energy for cellobiose conformations at product sub-sites in Cel7B, as compared to Cel7A, suggests that more cellobiose conformations docked at product sub-site.
Beyond providing insights into general substrate binding mechanisms, further examination of the docking distributions can provide details on key residue interactions.A complete list of amino acids interacting with cellobiose, and the number of conformations at each sub-sites are given in Table S2 and Figure S5A & 5B of the Supporting Information.Examination of the tables reveals four critical Trp residues at the -7, -4, -2 and the +1 sub-sites of the exo-cellulase Cel7A.These particular Trp residues were previously identified and shown to play an important role in the binding of cellulose/cellobiose units [25].The importance of these Trp residues was also seen in all atomistic molecular dynamics simulations on aromatic-carbohydrate interactions in a processive cellulose [67].The ability of the current docking calculations to identify these residues as critical stabilizing interactions is encouraging given the docking calculations are orders of magnitude less computationally intensive than all atom molecular dynamics simulations.
Given the polymeric nature of cellulose, an important characteristic of cellulose binding is the ability of a cellulase enzyme to bind across multiple sub-sites.In the current study, this idea translates to a cellobiose having the ability to bind across two sub-sites, thereby maximizing the binding energy.The values shown in the parentheses in Figure 9 indicate the percentage of cellobiose conformations at the sub-site that span (i.e.bridge or bind) across two sub-sites, with the first cellobiose glucose unit at the site shown in Figure 9 (e.g.sub-site -5) and the second glucose unit binding to the sub-site to the right (e.g.sub-site -4).As the values indicate, with the exception of the -7 entrance site, all sub-sites contain conformations that span across two consecutive sites.It is thought that this continuous nature of Cel7A aids in the exo-processive action of the enzyme by maximizing the enzyme contact with the substrate.
Comparing the binding modes of exo-and endo-cellulase Cel7 enzymes can also serve to predict inhibition patterns.For example, the spatial distribution for the total number of cellobiose conformations docked at each sub-site for Cel7A (PDB ID: 7CEL [20]) and Cel7B (PDB ID: 2A39 [35]) are shown in Figure 10.The increased number of docking conformations found at the Cel7B sub-sites +1 and +2 (product) as compared to the Cel7A sub-site +1 and +2 (product) indicates Cel7B has increased product inhibition characteristics as compared to Cel7A.This is also seen in the docking histogram in Figure 9.This general trend holds for all Cel7B enzymes, which can be seen by examining Figure S5A   In addition to inhibition patterns, our studies can also reveal binding affinity trends.Specifically, Cel7B (PDB: 2A39 [35]) has a significantly increased number of docking poses at the reducing end sub-sites as compared to Cel7A (PDB: 7CEL) [20], Cel7D (PDB: 1GPI) [43], Cel7B (PDB: 1EG1) [42], and Cel7B (PDB: 2RFW) (see Figure 9 and Figure S5A &5B) [37].Furthermore, the average binding energies for cellobiose conformations at reducing end sub-sites are lower in Cel7B (PDB: 1EG1) compared to the non-reducing sub-sites in other Cel7B enzymes (PDB: 1EG1 and 2A39).Specifically, the numbers of cellobiose conformations for 1EG1 are lower and have a higher binding energy as compared to 2A39.In general, these results indicate that Cel7B (2A39) has an increased affinity for the microfibril substrate.This can be experimentally observed as an increase in the equilibrium adsorption constant, K ads , an enhanced amount of time residing on the microfibril, and/or an enhanced product inhibition pattern.Langmuir adsorption measurements by Ding et.al. on T. reesei Cel7A and Cel7B have shown that Cel7B has 80% higher adsorption rates on amorphous phosphoric acid-swollen cellulose (PACS) than Cel7A, confirming that the docking experiments correctly identify enzymes with increased substrate affinity [69].

Family Cel6 and Other Families
The cellobiose docking distributions of Cel6 cellulases are given in Figure 11.As opposed to the Cel7 family, the Cel6 family cleaves cellulose strands from the non-reducing ends, thus the cellulose enters the active tunnel from the +4 sub-site instead of the -3 subsite.The binding pattern for Cel6A [36] shows a uni-modal like structure similar to Cel7A, with the maximum number of conformations residing at the +2 or +1 subsites for Cel6A.Cel6B [21] shows a bi-modal distribution similar to that of Cel7B with the maximum distributions located at -2, +2, and +3 sub-sites.Similar to the CMC docking results, the predicted mode of binding for Cel6C (in Figure S5D of the Supporting Information) is exo-, while experimental evidence shows this enzyme possesses both endo-and exo-type activity.A common trend seen in both docking studies is that certain enzymes previously classified as endo-or 'hybrid' endo-/exo-results in an exo-type docking classification.The reasoning for this anomaly appears to involve the loops flanking the active groove, which are believed to be highly flexible in endo-cellulases.This flexibility may be necessary so that the endo-cellulase can adopt an open conformation to adsorb its entire active groove onto a cellulose sheet.Once adsorbed onto the cellulose sheet, the loops then adopt a closed conformation, clamping the active groove onto the cellulose sheet such that hydrolysis can occur.A consequence of this flexibility is that endo-cellulase enzymes may possess a varying degree of exo-type activity if the loops are capable of closing.That is to say, an endo-cellulase may bind to a cellulose polymer, adopt the closed loop conformation, and then subsequently spend longer in this conformation than is necessary for a single catalytic event.This would allow the enzyme to proceed with a certain amount of exo-processive action.It is hypothesized that the degree of flexibility and the stability of the closed conformation is related to the degree of exo-cellulase activity seen in predominantly endo-cellulase enzymes.
From the above discussion, it is apparent that the static conformation seen in the crystal structure may be in two very distinct conformations for an endo-cellulase enzyme.One is a more open structure that allows the enzyme to absorb onto a microfibril, and is unique to endo-cellulases.The other is a closed conformation for performing hydrolysis on a polymer chain, which is common to both endo-and exo-cellulases.Thus, it is hypothesized that the reason Cel6C and certain other endo-or endo-/exo-cellulases studied here resulted in an exo-type docking classification is due to the static crystal structure being in a more closed conformation.That is to say, the enzyme is in a conformation that is indicative of a catalytic event that is the same in both endo-and exo-cellulases enzymes, instead of a conformation representative of the unique substrate binding performed by endo-cellulases.Given that Cel6C has been found to possess substantial endo-and exo-activity, it is expected that the enzyme when bound to substrate would adopt the closed conformation, resulting in docking and MLA classification as an exo-.Furthermore, the crystal structure of Cel6C used here was solved with a cellobiose bound in the active site, which again will induce or increase the probability of the enzyme adopting a closed conformation for the subsequent catalytic event.Overall, given that Cel6C has been experimentally shown to have substantial endo-and exo-activity coupled to the fact that the crystal structure had a cellobiose substrate bound in the active site, it is reasonable to assume that the enzyme is in a closed conformation, leading to the observed exo-type docking motif.The list of amino acids interacting with cellobiose, and the number of conformations at each sub-sites are given in Table S3 and Figure S5C & 5D of the Supporting Information.
The docking modes for cellobiose at each sub-site for the other cellulase families studied here; Cel5, Cel9, Cel12, Cel44, Cel45, and Cel48, are discussed in specific family sections and Figure S5 of the Supporting Information, and the predicted modes of action from CMC and cellobiose docking are given in Table 1.To verify the reproducibility of the docking experiments, eight dockings were repeated four times.All of the repeating docking calculations resulted in similar docking patterns as seen in the original docking calculations with CMC and cellobiose ligands.These results indicate that docking calculations are a fast, reproducible tool to identify the mode of action for cellulase enzymes.

Conclusion
Both the CMC and cellobiose docking methods reported here can be used to identify an enzyme as an endo-or exo-cellulase.CMC as a docking substrate is found to identify the endo-or exobehavior for all enzymes in this study except three.A second set of calculations were conducted using cellobiose as a docking ligand to directly probe the substrate interactions.The binding distributions have been classified into two families; distributions with a single maximum and bi-modal distribution that are indicative of exo-and endo-cellulase enzymes, respectively.The cellobiose binding motifs and MLA code indicates a comparable accuracy as the CMC docking with 86% and 92% in identifying endo-or exo-behavior, respectively.These results are an improvement over analyzing the primary and secondary structure components, which resulted in the MLA code predicting the enzymatic mode of action with 81% accuracy.These results indicate a fast and computationally inexpensive method for identifying the nature of the cellulase enzyme.Furthermore, the docking results were able to identify important enzyme characteristics in agreement with previous studies, indicating autodock calculations and the MLA code can serve as a cheap, efficient means of identifying cellulase characteristics that may warrant further study in future cellulase bioengineering efforts.

Figure 1
Figure 1 Licorice representation of glucose (left), the repeat cellobiose unit of a cellulose strand (middle), and the repeat carboxymethyl cellulose unit of amorphous cellulose (right).The carbon numbering system is depicted for the glucose molecule while the β-1,4-glycosidic bond is identified for the cellobiose repeat unit.

Figure 2
Figure 2 Representative docking grid box around the active site groove for Cel7A.The enzyme is represented in ribbon form while the surface of the active site groove is represented by the gray surface.The subsites of the catalytic tunnel/groove are indicated above the docking grid box.

Figure 3
Figure 3 Coarse grained spherical representation of the macromolecules active sub-site regions.The macromolecule, Cel7A, is represented by the orange ribbon while the spherical active sub-site beads are represented by the gray spheres.

Figure 4
Figure 4 Representative docking modes of cellobiose with Cel5A (PDB ID: 3AZR) in orange and Cel12A (PDB ID: 1UU4) in dark blue.
(regions I, II and III).It is evident from Figure5that the loops marked I, II, and III are reduced or absent in Cel7B.The reduced or absent loop regions results in a more open groove structure around the catalytic site of Cel7B as compared to Cel7A, which is thought to allow endo-type mode of action on the cellulose surface.The presence of these loops in Cel7A leads to the formation of an active site tunnel, favoring an exo-processive mode of action.However, such structural characteristics are difficult to quantify due to the large variability in the size and position of the loop regions between enzymes.

Figure 7 .
Figure 7. Machine learning algorithm results for the carboxymethyl cellulose docking calculation.A larger relative significance indicates a larger favorable contribution to the predictive model.

Figure 8 .
Figure 8. Machine learning algorithm results for the carboxymethyl cellulose docking calculation.A larger relative significance indicates a larger favorable contribution to the predictive model.

Figure 9
Figure 9 Number of docking conformations for cellobiose at sub-sites in the active site tunnel of Cel7A (orange), and Cel7B (blue).The percentages of conformations that are docked over neighboring subsites are given in brackets.PDB IDs: Cel7A: 7CEL, Cel7B: 2A39.
and S5B in the Supporting Information.As the figures indicate, in comparison to Cel7A enzymes, all Cel7B enzymes have an increased amount of docking poses at sub-sites +1 and 2, indicating Cel7B endo-cellulases have elevated product inhibition patterns when compared to Cel7A exo-cellulases.This docking observation is in agreement with Voutilainen et.al.kinetic studies of cellobiose inhibition on cellulase enzymes that found the inhibition constant of Cel7B (K i = 6 µM) to be three times less than Cel7A (K i = 19 µM)[68].In addition, experimental measurements of Cel7A and Cel7D indicate K d values of 20 mM and 115 mM for cellobiose respectively, indicating the exo-cellulase enzymes do not suffer from severe product inhibition, again in agreement with the docking studies[44].