Molecular Modelling Study of Heteroarylamide/Sulfonamide Compounds with Antitrypanosomal Activity

According to the World Health Organization (WHO), Chagas disease (CD), whose etiological agent is the Trypanosoma cruzi (T. cruzi) parasite, affects about eight million people, mainly in Latin America. The cruzain enzyme is highlighted among the main biological targets, since it is the most abundant of the cysteine protease class from T. cruzi and is involved in the entire life cycle of the parasite, essential in regulating the interaction between parasite and host. The drugs available for the treatment of CD usually have strong side effects, and the nitro(triazole/imidazole)based heteroarylamide/sulfonamide compounds (HA/S) emerge with high antitrypanosomal potential. In this study, the quantitative structure-activity relationship (QSAR) were built using partial least squares (PLS) regression, and the results were robust and adequate for predicting and proposing five new derivatives according to the statistical parameters. The docking results suggest that the best-scored HA/S derivatives showed hydrogen bonds with the residuals that comprise the catalytic region of the enzyme. The molecular dynamics (MD) simulations, performed with different methods, revealed the strong stability of the compound obtained by the QSAR model of this study, in addition to a better binding free energy value than the HA/S obtained from literature.


Introduction
The etiological agent of Chagas disease (CD), a neglected tropical disease, is the parasite Trypanosoma cruzi (T. cruzi). It is estimated that about 10 million people are infected with T. cruzi and more than 25 million people are in areas at risk for vector contamination, mainly in Latin America, where the disease is endemic. 1 The disease is transmitted through insect bites (Triatoma infestans), transfusion of contaminated blood, organ transplantation, congenital transmission and consumption of contaminated food or drink. 2 There is no vaccine available to prevent the disease and the only two drugs on the market, benznidazole and nifurtimox, have serious side effects, low efficacy, and are inefficient during the chronic phase of the disease. Thus, there is an urgent need for the development of new drugs that are safer and more effective to meet the therapeutic needs of infected individuals. 3,4 The cruzain enzyme, the main cysteine protease of T. cruzi, is a cathepsin L-like cysteine protease responsible for the major proteolytic activity, and that is involved in the hydrolysis of peptides and proteins. 5 The enzyme is essential for parasite survival and growth and is among the most important targets for the development of candidates for new drugs for Chagas disease. 6 Animal model studies 7 have also validated this enzyme for the control and elimination of T. cruzi. Cruzain is expressed throughout the life cycle of the parasite and is important for the protozoan's development, survival and differentiation, responsible for its nutrition, immune system evasion and invasion of new tissues. 1 The validation of cruzain as a drug target relies on preclinical proof-of-concept studies, 1,8 which showed that cruzain inhibitors significantly reduce parasite burden in vivo. Consequently, its inhibition is an interesting strategy for treating the disease. 9 Papadopoulou et al. 10,11 synthesized and tested a series of derivatives of 3-nitro-H-1,2,4-triazole-based arylamide and arylsulfonamide (HA/S) class of compounds using structure-activity relationship (SAR) studies. The results demonstrated significant activity in in vitro tests against T. cruzi, and analogous compounds showed higher in vivo activity than benznidazole (reference compound).
Considering that CD is a neglected tropical disease, in this study, quantitative structure-activity relationship (QSAR) calculations were performed to obtain a regression model capable of predicting antichagasic activities, due to the molecular and structural information of the compounds, as well as to understand the action mechanism of HA/S derivatives against T. cruzi. In addition, the docking methods and MD, followed by energetic analyses of affinity between the compounds and the enzyme, were used to understand the molecular behavior of these ligands at the active site of the cruzain enzyme, highlighting the fundamental importance at all stages of the parasite's life cycle.

QSAR 2D
QSAR studies were performed using 22 selected derivatives (CP1-CP22) of HA/S selected on a previous study 11 and experimental biological activities were converted into their corresponding pIC 50 (or -log IC 50 , where IC 50 is the concentration in mol L −1 that inhibited 50% of biological activity) before QSAR analysis. Table 1 shows the HA/S analogues and the respective pIC 50 values.
Two-dimensional structures were built in MarvinSketch program 12 and all geometries were initially optimized by MOPAC 2016 13 at PM7 semi-empirical level. 14 Next, these obtained structures were exported to the software Gaussian 03 15 and new optimizations were carried out at Hartree-Fock level (HF/6-31G(d,p)). 16 Using the optimized structures the calculation of molecular descriptors, geometric, electronic and topological were made using different computer programs, such as Gaussian 03, 15 20 totaling 3986 calculated descriptors for each molecule. In this study, the whole data set was autoscaled along all the variables, i.e., normalized and centered on the mean, so that they could be compared to each other on the same scale.
Next, using QSAR modeling 21 software a matrix obtained with descriptors were reduced by eliminating those that presented the absolute Pearson correlation coefficient (|r|) lower than 0.3 with the logarithm of the biological activity (pIC 50 ). 22 After, the resulting matrix was subjected in the QSAR modeling to the ordered predictors selection (OPS). 21 Finally, we performed a new selection of variables by means of genetic algorithm with the QSARINS program. 23 These steps attempt to remove variables that were little related to biological activity and that did not have any relevant information to the construction of the regression equation by the partial least squares (PLS) method.

Molecular docking
The three-dimensional (3D) structure of the cruzain enzyme was obtained from the Protein Data Bank (PDB) database under code 1ME4 with a resolution of 1.2 Å. 24 The enzyme and crystallographic ligand were prepared in the Chimera program. 25 Five computational packages were selected for evaluation of molecular docking procedures.
Initially, the water molecules present in the enzyme were removed, hydrogen atoms were added to the structures and the charges were calculated according to each program. For calculations performed in the DOCK 6 program, 26 AM1-BCC (Austin Method 1 with Bond Charge Correction) charges were used, while for calculations in the AutoDock 4.2 (ADT) 27 and AutoDockVina 1.1.1 (VINA) 28 programs, Gasteiger charges 29 were used for the ligant and Kollman charges 30 for the target (enzyme). In the FITTED 2.4 (FT) 31 and Molegro Virtual Docker 5.5 (MVD) 32 programs, the standard program charges were used, such that after the preparation of each structure in their respective programs, they were saved separately for the docking study.

Redocking
To validate the packages used, the T10 crystallographic ligand was docked in the 1ME4 structure. To evaluate the convergence of the results, the procedure was performed in triplicate in five different programs (ADT, DOCK, VINA, MVD and FT), in which it was possible to determine the affinity energy, interactions between the receptor and the ligand, as well as the lowest root mean square deviation (RMSD) value using the fconv 1.24 program. 33 These parameters made it possible to select the program that best reproduced the interactions of the crystallographic ligand in the complex with the lowest RMSD value.

Molecular dynamics
The structures obtained on the molecular docking analysis were used as a starting point for classical molecular dynamics (MD) simulations by using two computational packages: Amber 16 34 and Q. 35 Initially, the protonation states of all amino acid residues of the enzyme were determined under pH 7.0 condition and analyzed by PDB2PQR/PROPKA server. 36 Then, molecular systems were built for each MD package as follows:

Amber system
On the Amber 16 package, 34 the all-atom force field Amber FF99SB 37 and general Amber force field (GAFF) 38 were selected as parameters set to describe protein and ligands in all MM (molecular mechanics) simulations, respectively. The restrained electrostatic potential (RESP) charges 39 of each ligand were calculated on the Gaussian 03 15 package using the Hartree-Fock method with 6-31G* basis. The systems were solvated in a cubic water-box with the explicit solvation model TIP3P. 40 We used a distance of 12.0 Å between the cell wall and the solvated atoms of the system. Counter-ions were appropriately added to neutralize the system.
Initially, each solvated complex was energy-minimized by performing out a minimization (25,000 and 10,000 steps of each steepest descent and conjugate gradient approach, respectively) and then gradually heated up to 298 K (ensemble NVT) of MD simulations with a constraint of 25 kcal mol −1 Å −2 on the solute atoms of systems. After, each system was relaxed for 250 ps of MD simulations for density equilibration at constant temperature (298 K). All stages simulations employed a nonbonded cutoff of 9 Å, where particle mesh Ewald (PME) approach computed the long-range electrostatic forces. In all the simulations, the bonds involving hydrogen atoms were restricted by SHAKE algorithm 41 and the motion equations were integrated at each 2 fs using Verlet algorithm. 42 Finally, 50 ns of MM MD simulations were performed out for each system by using NPT ensemble. The SANDER module 34 of AmberTools 16 was used for all MM calculations. The structural analysis of each system was performed out by using RMSD and root mean square fluctuation (RMSF) plots obtained by CPPTRAJ module 43 of AmberTools 16. The last 10 ns of MM MD simulations of each system were used for binding free energy calculations. The snapshots were carefully extracted by using the CPPTRAJ module and submitted to the MMPBSA.py module 44 of AmberTools 16. The binding free energy values were computed by using MM-PB(GB)/SA approaches 45 as implemented on the MMPBSA.py module. These methods have been used successfully on many biological systems 46,47 as well as described elsewhere. 48 We used a single-trajectory MM-PB(GB)/SA protocol where protein-ligand complex, unbound protein, and free ligand conformation states are identical. In the equation 1, the free energy of binding (ΔG bind ) is approximated through: Fom the equation 2 the term ∆E MM refers to protein-ligand interactions on the gas phase, which are computed by internal energy (ΔE int ), electrostatic (∆E ele ) and van der Waals (∆E vdw ) contributions. ΔH is the enthalpy. The solvation free energy and entropic contributions are computed by ΔG sol and TΔS terms, respectively. The ΔG sol (equation 3) is obtained computing polar (ΔG PB/GB ) and nonpolar (ΔG (nonpolar) ) electrostatic interactions by using implicit solvation models (PB or GB) and a linear relation to the solvent accessible surface area (SASA). 49

Q system
As described previously, molecular systems obtained from molecular docking calculations were used for classical MD simulations by using Q package, 35 where protein and ligands were described by all-atom force field OPLS (optimized potentials for liquid simulations). 50 Particularly, to compute the binding free energy, MD simulations were simulated in two different states: (i) ligand solvated only by water molecules and (ii) ligand bound to protein and solvated by water molecules. In each state, the systems were solvated in 18 Å radius water sphere with the explicit solvation model TIP3P 40 and subjected to polarization and radial constraints according to the surface-constrained all-atom solvent model 35,51 to mimic the properties of bulk water at the sphere surface. A time step of 2.0 fs was used, and no positional restraints were applied. Solvent bonds and angles were constrained using the SHAKE algorithm. 41 All classical MD simulations started with a heating and equilibration stage followed by subsequent data collection. All systems were thus gradually heated up to 298 K, and positional restraints on all solute heavy atoms were gradually released. Finally, each ligand was subjected to three MD simulations (for each state) during 3 ns. The trajectories and energy values were updated every 25 steps, and the same interval was used for the sampling of the ligand-surrounding interaction energies.

Linear interaction energy calculation
The binding free energy for each protein-ligand system was computed by using linear interaction energy (LIE) approach (ΔG LIE ). 52 On this strategy, ∆G LIE is calculated by the difference observed on (i) and (ii) states: where in the equation 4 the terms 〈V vdw 〉 and 〈V ele 〉 are the averages of the van der Waals and electrostatic interactions, respectively for each (i) and (ii) stated, scaled by the scaling factors α and β. The constant α has been empirically determined as 0.181, while β varies depending on the properties of the ligand. 52,53 Here, α and β values used are 0.181 and 0.430, respectively.

QSAR 2D
Using the study by Papadopoulou et al., 11 it was found that the biological activity values (pIC 50 ) varied from three logarithmic units, ranging from 3.98 (least potent compound) to 6.8 (most potent molecule). Figure 1 shows the histogram with the distribution of pIC 50 values.
Subsequently, after determining the most stable conformations obtained by the HF/6-31G(d,p) method, several descriptors were calculated to form a data matrix. Thus, after calculating all the descriptors, the next step was to select those that were relevant (statistically significant) for the biological activity in question.
The compounds were divided into two sets called training and testing via the hierarchical cluster analysis (HCA) method. The analysis was carried out on autoscaled data by the Minitab 14 program, 54 using the Ward linkage and Euclidean distance. For this, 15 compounds were randomly chosen for the training set (70% of the compounds), and the rest were used as a testing set (30% of the compounds) to verify the performance of the model. The following compounds were selected to comprise the testing set: 5, 9, 10, 14 and 21 and the other compounds were used for the training set.
After the cutoff by the correlation of ± 0.3 in relation to the biological activity, performed in the QSAR modeling program, 21 of the 3986 descriptors initially calculated, a total of 385 remained. Using the OPS algorithm, 55 we arrived at a matrix of 25 descriptors and three latent variables (LVs). The use of the QSARINS program 23 allowed a reduction to the three final descriptors for the construction of the model.
The model was selected with two LVs containing 81.49% of accumulated information, in which 64.01% referred to the first LV and 17.48% referred to the second LV, and no anomalous sample was identified. Table 2 shows that the model was able to explain 93.2% and predict 88.1% of the total variance. The validation statistical parameters for the regression model, coefficient of determination (R 2 ) and the cross-validation coefficient of determination (Q 2 LOO ), presented values above those recommended in the literature, 56 where R 2 > 0.6 and Q 2 LOO > 0.5. 56 When considering the values of R 2 and Q 2 LOO , the difference observed between them was 0.051, thus the model is within the limit suggested by Kiralj and Ferreira,57   which is an indication that the model does not present data overfitting. Another important factor was the result of the F test value calculation (82.24), which was much higher than its tabulated reference value F tab(2,12) = 3.89 with 95% significance (α = 0.05), wherein it is more than five times larger than the tabulated value, where F (2,12) /F tab(2,12) = 21.14, which indicates that the information calibrated by equation 5 is really significant. Table 2 presents the parameters for the results of the internal validation of the training set. In this step, the predictive residual sum of squares of validation (PRESSval) determines that the smaller the difference between the experimental value and the predicted pIC 50 , the greater the predictive capacity of the model. The QSAR model proposed provides internal quality and statistical significance, because the PRESSval value is lower than the sum of squares of the response values (SSy), as indicated in Table 2. 22 Similarly, the same can be observed for the sum of squares of the PRESScal calibration prediction residuals, which means the lower the value, the better the goodness of fit. 57,58 The root mean square error of calibration (RMSEC) and root mean square error of validation (RMSECV), respectively also show better statistical quality of internal validation as the values approach zero, indicating a small deviation from the experimental value. 59 The adjusted correlation coefficient (R 2 adjust ), which takes into account the number of compounds and latent variables in the QSAR model, shows that the regression equation has good explanatory capacity and since the difference between R 2 and R 2 adjust is less than 0.3, the number of descriptors present in the QSAR model is acceptable. 60 Among the three descriptors selected, two were obtained from the PaDEL program 20 Equation 6 demonstrates that the MATS7v descriptor is the most important in the regression equation with the self-scaled regression vectors. In addition, it also shows that the R6e+ and MATS7v descriptors have a negative contribution to biological activity; on the other hand, the MDEN-23 descriptor has a positive contribution.
One factor that can be assessed in the model is the coincidence between the algebraic signs of Pearson correlation coefficient values (r) for each descriptor with the pIC 50 values (Table 3) and the signs of the regression coefficients in the model (equation 6). Table 3 shows that the model has descriptors whose sign of their coefficients coincides with the information provided by the correlation of the biological activity, which demonstrates the selfconsistency of the model. 61 In the leave-N-out (LNO) validation test, it is recommended that about 20 to 30% of the number of samples be removed, and 5 samples (ca. 20%) were removed in this work. Figure 2 presents the LNO results, the mean and the standard deviation for the values of Q 2 LNO obtained for the three repetitions, removing from 1 to 5 samples (leave-5-out).
The LNO test shows that when removing up to five samples, the QSAR model built can be classified as robust, because it has high Q 2 LNO values, average Q 2 LNO values were close to the individual ones, and the difference of each Q 2 LNO value with respect to the Q 2 LOO value was not higher than 0.1, besides the standard deviations found were also below 0.1. 57   Figure 3 shows the result of the y randomization test, used to measure the possibility of chance correlation. An analysis of Figure 3 shows that the intercept of the regressions for R 2 and Q 2 LOO were below 0.3 and 0.05, respectively, indicating that the variance explained by the models is not due to chance correlation. Thus, the model can be considered robust. 57,62 In Figure 4, from the 50 randomizations, it can be observed that all values obtained for R 2 and Q 2 were below 0.6 and 0.5, respectively. This fact shows that the variance explained by the model was not due to chance correlation. 22 An analysis of the Figure 5 chart of leverage by Student residuals shows the absence of compounds with anomalous behavior (outliers). Note that only compound 22 has a high leverage value, but a low Student residual value, and thus cannot be classified as an anomalous sample, such that removing it would cause a decrease in the quality parameters of the model.
The quality of the model is also evaluated according to the statistical parameters presented in Table 4 for the external prediction set, in which activity values are predicted for a set, called a "testing" set, made up of known compounds that are not used for the construction of the regression model. Table 4 shows that the model has good external prediction capacity, as it presents the value of R 2 pred (coefficient of determination of external validation) above 0.6, showing a good fit for the data. 63 The standard error of prediction (SEP) and average relative error of prediction (ARE pred ) showed low values, indicating a low prediction error. 60 Table 4 demonstrates that the Golbaikh and Tropsha parameters, based on straight-line inclinations, which were obtained by regression between observed and predicted values (k) and between predicted and observed (k') must be between 0.85 and 1.15 and the modulus of the difference between R 2 0 (determination coefficient of the linear relation between the observed and predicted values without an intercept) and (R' 2 0 ) (predicted and observed values without an intercept) less than 0.3. 64 Finally, the fact that the last two parameters are less than 0.1 shows that the proposed QSAR model has excellent predictive ability. 65 3.136 × 10 −3 R 2 pred : coefficient of determination of external validation; SEP: standard error of prediction; ARE pred : average relative error of prediction; k, k': Golbraikh-Tropsha slopes of the linear regression lines between the observed and the predicted activities in the external validation; |R 2 0 -R' 2 0 |: Golbraikh-Tropsha absolute values of the difference of the determination coefficient of the linear relation between the observed and predicted values without an intercept (R 2 0 ), and the predicted and observed values without an intercept (R' 2 0 ); R 2 : coefficient of determination.   Thus, according to the internal validation parameters, it was found that the model can be classified as a good model, since according to the criteria used, the model did not present overfitting, had a small deviation from the experimental, is self-consistent and robust, does not suffer from chance correlation and shows good external prediction ability.
In a QSAR model, it is desirable to evaluate the selected molecular descriptors to assist in understanding the action mechanism for the development of new molecules that act as new therapeutic agents. As such, some conclusions on the variables can be drawn for the studied HA/S derivatives.
The R6e+ descriptor emphasizes the relevance of the electronic effects and takes into account Sanderson's electronegativity among a group of 6 atoms. 55 Equation 5 shows the inversely proportional relationship to biological activity, and the smaller the difference in electronegativity between atoms, the better the activity, explaining the activity of the CP1, CP2, CP4, CP5, CP7, CP10, CP13, CP18 and CP19 compounds.
The MATS7v two-dimensional autocorrelation descriptor expresses van der Waals volume values of a 7-atom group, thus determining the relationship between atomic volume and biological activity. 66 In the set of compounds, it is observed that the lower the value of the atomic volumes, the higher the biological activity when observed in equation 5. The inversely proportional relationship of the MATS7v descriptor and the biological activity can also be seen, proving that the most active CP4 and CP18 compounds present the lowest values of this property.
The MDEN-23 topological descriptor refers to the geometric mean of the molecular distances between all secondary and tertiary nitrogen, and this encodes the presence or absence of nitrogen containing substituents attached to a common skeleton. 67 The proportional relationship of the MDEN-23 descriptor and the biological activity is observed, proving that the most active compounds present the highest values of this property.
Through the results obtained in the QSAR study, equations 5 and 6 and the properties that can contribute to the biological activity, it was possible to propose five compounds derived from the HA/S class, named LMM1 to LMM5, to integrate into the set of molecules.
The LMM1 compound was built with only one chlorine atom attached to the heterocyclic ring and maintaining the structure of compound 3 proposed by Papadopoulo et al. 11 In contrast, the LMM2 to LMM4 compounds were constructed by substituting the nitro group with the amino to avoid possible toxic activities and increasing the number of carbon atoms between the heterocyclic rings with the intention of decreasing the value of the most influential descriptor obtained in the regression equation with the self-scaled regression vectors (MATS7v).
The LMM5 compound was designed to be of the lowest biological activity by replacing the chlorine atoms of the heterocyclic ring with the bromine atom in order to increase the values of the MATS7v and R6e+ descriptors. Next, it was possible to calculate their properties and predict their respective biological activity values using equation 5, as shown in Table 5 and Figure 6.  Thus, Table 5 demonstrates that the predicted biological  activity values for the LMM1 and LMM3 compounds  showed intermediate values to the pIC 50 values, whereas LMM4 shows a higher activity against T. cruzi than compound 4 elaborated by Papadopoulou et al. 11 (Table 1). This is mainly due to the contribution of the MAT7v descriptor to the construction of the model. In addition, it was possible to predict a derivative with a low pIC 50 value, since the carbon chain was increased, the distances between nitrogens were reduced and a bromine atom was added.

Molecular docking
The study performed by Papadopoulou et al. 11 verified the HA/S activity against the amastigote form of T. cruzi, and that cruzain, expressed in all evolutionary forms of the parasite, acts in the invasion, replication and differentiation stages, and is therefore a key enzyme in the search for inhibitors against Chagas disease. 6,7 Thus, due to their importance, the action of these compounds on the parasite's cruzain enzyme was investigated.
For the determination of the most suitable program and procedures for molecular docking calculations of the 27 HA/S derivatives (CP1-CP22 and LMM1-LMM5), the enzyme structure was recovered from the PDB under code 1ME4. 24 Molecular redocking of the crystallographic ligand (T10) for program selection was performed in five different programs, such as DOCK 6, 26 ADT, 27 VINA, 28 FT 31 and MVD. 32 RMSD values were evaluated using the fconv 1.24 program. 33 After the redocking of the co-crystallized ligand (T10) in triplicate in the five programs, it was observed that the FT program presented the lowest RMSD value (equal to 2.91), and thus was chosen to dock the 27 compounds in the 1ME4 enzyme.
The interactions of the 27 compounds were analyzed in the PoseView online server. 68 In order to select the compounds against the cruzain enzyme of T. cruzi, the structures were classified according to the interaction energy values obtained by the FT, as well as a visual inspection taking into account the hydrogen bonds, hydrophobic interactions, and the biological activity (pIC 50 ) values obtained. Table 6 and Figure 7 show that CP2 and LMM4 stood out for their high interaction energy and pIC 50 values, as well as their hydrogen bond interactions with the main active site residuals that are part of the catalytic triad, such as Cys25, Gly66, Asn69 and His159, as well as hydrophobic interactions with the Gly65, Leu67, Ala133, and Glu205 residuals.
In the conformation obtained by the molecular docking for CP2, this compound showed hydrogen bonds with the  residues of amino acid Cys25 and Gly66 (Figure 7). The Cys25 belongs to the active cruzain enzyme site, and this residue is located in the S1 subsite of the enzyme, while the Gly66 is located in the S3 subsite, highlighting that the Cys25 residual is part of the enzyme catalytic triad. 24,69 In the conformation obtained for the LMM4 ligand, it is possible to observe hydrogen interactions with the Gly66, Asn69 and His159 residues. The His159 is part of the catalytic triad and the amino acid residues Leu67, Asn69, Ala133 and Glu205 are located in the S2 subsite, and according to a study by Durrant et al., 70 it appears that these flexible residuals can be explored in the drug design process. In addition, a study by Gillmor et al. 71 showed that the Glu205 is important because it adjusts itself to restructure part of the active site, conferring a strong interaction with basic or hydrophobic inhibitors in the S2 subsite.
After docking the 27 compounds in the cruzain, the HA/S derivatives were docked to the homologous human cathepsin L enzyme recovered from the PDB under code 5MAE. 72 Thus, docking calculations showed that these compounds did not interact with the 5MAE enzyme when subjected to interaction analysis on the PoseView online server. 68

Molecular dynamics
Experimental and computational studies 73,74 have been performed to verify the specificity and characterize the cruzain enzyme site in order to facilitate directed drug design. Free energy calculations have been increasingly used to provide quantitative predictions in order to prioritize compounds based on their estimated activity.
After molecular docking simulation, CP2 and LMM4 were selected for MD calculations because these ligands present hydrogen bonds with key residues, in addition to lower energy values obtained by docking with the enzyme of T. cruzi and pIC 50 values predicted by the PLS model above 6 μM.
The structural stability of the systems was analyzed using RMSD values as a function of simulation time (Figure 8). For the systems evaluated in the Amber program, 34 it is observed that in all systems, the RMSD values are less than 2.2 Å, showing few conformational changes in the complex's structure. In addition, an analysis of Figure 8 shows that the most stable complex is that referring to the LMM4 ligand, which has a lower RMSD than CP2. Similar results were also found for the same systems in the Q program (Supplementary Information).
The flexibility of protein regions with respect to each ligand was observed by RMSF. Figure 9 shows that the regions corresponding to the amino acid ranges Val54-Leu67, Glu95-His106 and Thr148-Gln159, which correspond to the loop regions, showed greater fluctuations. It is also worth highlighting the region between the residuals Val54-Leu67, which showed significant deviations from the two ligands. Studies conducted by Wiggers et al. 69 show the importance of these residuals located in the S3 subsite of the cruzain enzyme, which demonstrates their importance in the molecular recognition process and, consequently, in inhibiting the enzyme.
MD simulations are an important tool for including receiver flexibility and calculating the binding free energy value, at which this value can be used to quantify the affinity between an enzyme-inhibitor type system. Thus, the calculation of binding free energy values is an important tool in the drug design process, and is considered a key factor in proposing powerful enzyme inhibitors. 47 In this context, to evaluate the affinity of the CP2 and LMM4 compounds in the cruzain enzyme of T. cruzi, the binding free energy values were determined by the MM-GB(PB)SA and LIE methods in the Amber 16 34 and Q programs, 35 respectively.
The calculation of the binding free energy for the complexes formed between CP2 and LMM4 was performed   (Table 7). Table 7 shows that the formation of the two complexes formed by the enzyme and the ligands is favorable. Furthermore, comparing the binding free energy values (MM-GB(PB)SA) shows that the LMM4 ligand presented the lowest energy values when compared to the CP2 proposed by Papadopoulou et al. 11 The values found in Table 7 show that the formation of the two complexes was also favorable by the LIE method, and once again the complex formed by the LMM4 ligand, proposed in the QSAR model, presented a lower energy value, that is, a higher affinity than the CP2 compound. Therefore, for the three free energy calculation methods (MM-GB(PB)SA and LIE), the results corroborate each other.
The energy contribution of each residue for the complexes formed by the CP2 and LMM4 compounds obtained by the MM-GBSA method are shown in Figure 10. This energy decomposition analysis shows the residues that contribute most significantly to the total interaction energy and, therefore, to the stabilization of the complex formed by ligand CP2 are Cys25, Trp26, Gly65, His162 and Gly163. Other favorable interactions were with residuals Ser24, Gly66, Leu67, Ala138 and Leu160. For the complex formed by LMM4, there is a greater number of residuals that contribute to the stabilization of the complex. In addition, to those previously mentioned,  Gln21, Cys22, Gly23, Cys63, Ser64, Met68 and Val139 residues were important for LMM4. It is important to highlight that the interactions with residuals Cys25, Gly65, Gly66 and Leu67 had already been observed in the molecular docking results. It is noteworthy that the catalytic triad formed by the Gly23, Cys25 and Gly65 residuals are responsible for the trypanosomide activity observed. 7 Furthermore, a greater interaction of the LMM4 ligand with the amino acid residues of the cruzain enzyme can be observed in the S1 (Gly23, Cys25 and Trp26), S3 (Ser64, Gly65, Gly66 and Gly67) and S1' (His162) subsites, which were important for fixation and stability of the inhibitor at the active site, results that corroborate studies by Bryant et al. 75 Figure 11 shows the energy contribution of each residue calculated by Q program, 35 and from these results is verified the stabilization of the system formed by the CP2 and LMM4 ligands with the enzyme (Figures 11a and 11b). Figures 11a-11b show that residuals Gly23, Ser24, Cys25, Trp26, Gly65, Gly66, Leu67, Met68, Leu160, Asp161, His162 and Gly163 contributed favorably. Thus, the results found by the LIE method corroborate those found by the MM-GBSA for compounds CP2 and LMM4. Interestingly, the LMM4 compound made an important interaction with the Glu208 residue (Figure 11b), located at the end of the S2 subsite, which contributes to the flexibility of this subsite to accommodate volumous inhibitors in this region. 71,76 Thus, despite the different approaches used to evaluate the proposed systems, the results indicated that the LMM4 obtained by the QSAR model does in fact present more effective interactions in relation to CP2, and thus can be indicated as a potential inhibitor of the cruzain enzyme of T. cruzi.
The QSAR model obtained made the proposition of the LMM4 ligand possible for the purpose of improving trypanocidal activity. Besides, the results of docking and MD showed that the LMM4 compound had low RMSD values, lower free energy value when compared to the structure obtained in the study by Papadopoulou et al. 11 Also, the LMM4 showed a greater number of residues, using different methods of MD, that contribute to the stabilization of the enzyme-inhibitor complex. Thus, it was possible to propose a promising molecule that may lead to an increase in the inhibitory capacity of the derivatives against T. cruzi. Figure 11. Graphical representation of the interaction energy per residue (left) for the complex formed (a) CP2 and (b) LMM4, obtained by Q program. 35

Conclusions
In this study, we performed out a molecular modeling study on derivatives of HA/S molecules as cruzain inhibitors using QSAR, molecular docking, MD simulations and binding free energy calculations. The QSAR method was employed by using PLS method, and the proposed molecular models were validated by using internal and external statistics parameters set, which show high quality and suitable prediction capacity. Besides, the final models were satisfactory according to the statistical validation results, where the PLS method leads to propose five new HA/S derivatives.
Molecular docking method was then applied to explain the results of QSAR and explore the interactions between new HA/S derivatives and catalytic site of cruzain enzyme, a potential target against Chagas disease. The molecular docking results suggest that amino acid residues such as Gln19, Cys25, Gly66, Leu157, Asp158, Glu205 and Ser208 were determined as the key residues for inhibitory activity.
The MD simulations and binding free energy calculations were computed on different packages and approaches, showed that the MD results were consistent with the findings obtained from QSAR and molecular docking results, and they suggested that LMM4 compound is more stable than inhibitor (CP2) when complexed to cruzain enzyme. In addition, the binding free energy decomposition pointed out the key residues involved in the binding of LMM4. Therefore, we could have a better understanding of the QSAR and the binding features of new HA/S derivatives using the molecular docking calculations and five new compounds with potential cruzain activity were obtained.
Finally, molecular modeling techniques such as QSAR, molecular docking, and MD simulations were successfully applied to design and to predict the biological activity of the new LMM4 compound against the cruzain enzyme. These results can be used on drug design field of new HA/S derivatives as potent inhibitor of cruzain from T. cruzi.

Supplementary Information
RMSD plots of MD simulations performed out by Amber and Q programs for CP2 and LMM4 systems are available free of charge at http://jbcs.sbq.org.br as PDF file.