A Combined Approach of Ligand-based and Structure-based Virtual Screening to Select Structures with Potential Antichagasic Activity from SISTEMATX Sesquiterpene Lactones Database

Chagas disease is an endemic disease caused by Trypanosoma cruzi, which affects more than eight million people, mostly in the Americas. A search for new treatments is necessary to control and eliminate this disease. Sesquiterpene lactones (SLs) are an interesting group of secondary metabolites characteristic of Asteraceae that have presented a wide range of biological activities. From the ChEMBL database, we selected a diverse set of 4,452, 1,635 and 1,322 structures with tested activity against the three T. cruzi parasitic forms, amastigote, trypomastigote and epimastigote, respectively, to create random forest (RF) models with an accuracy greater than 74% for the cross-validation and test sets. Afterwards, a ligand-based virtual screen of the entire database of Asteraceae SLs stored in SistematX (1,306 structures) was performed. In addition, a structure-based virtual screen of the same set of SLs was performed using molecular docking against T. cruzi cruzain. Finally, using an approach combining ligand-based and structure-based virtual screening, along with the equations proposed in this study to normalize the probability scores, we verified potentially active compounds and established a possible mechanism of action.


Introduction
Chagas' disease is an endemic disease caused by Trypanosoma cruzi, which affects more than seven million people, mostly in the Americas [1]. The search for new treatments is necessary for the control and elimination of this disease. Natural products have been an invaluable source of inspiration for the development of therapeutic agents [2,3]. Sesquiterpene lactones (SLs) are one such group of small molecules of interest in the search for new chemotherapies against infectious diseases [4,5].
Using a combined approach of ligand-based and structure-based virtual screening (VS) with the entire SLs databank stored in SistematX (http://sistematx.ufpb.br), we verified potentially active compounds against Trypanosoma cruzi and established a possible mechanism of action.

Ligand-based VS
The training set hit-rate values for the three RF models are close to or exactly 100%; for the cross-validation and test sets, however, the values range from 64.6% to 91.1%, with the epimastigote and trypomastigote models serving as better predictors of inactive molecules than the amastigote model. The specificity of the epimastigote model is better than that of the other two models, and its percentage of true negative compounds predicted in the test set (91.1%) was higher than in the cross-validation set (85.6%). The amastigote model is the most sensitive of the three, presenting true positive prediction rates of 76.7% and 79.1% for the cross-validation and test sets, respectively. In turn, the models for the other two parasitic forms were approximately 10% less sensitive than the amastigote model.
Using this machine learning algorithm, a virtual screen was performed on a set of 1,306 molecules obtained from SistematX. For amastigotes, 34 SLs were predicted to be antichagasic compounds, with probability values ranging from 0.50 to 0.58. Some common structural features are observed among the structures with higher probability values: SLs 1-2 (Figure 1) are acetylated germacranolides that contain an epoxide moiety in their structures.
In contrast, 17 SLs were predicted to be anti-T. cruzi compounds for the trypomastigote parasitic form, with probability values ranging from 0.50 to 0.64. Desacetyl-isotenulin (3, Figure 1) was the structure with the highest probability value. The structures of the active molecules are similar (guaianolides). Finally, the epimastigote model was less selective than the other two models, as 420 molecules were predicted to be active, with probability values ranging from 0.50 to 0.82. As in the amastigote model, structural similarity was observed between the SLs with higher probability values (5-6).
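The hit rates discussed in this section (true positive rate, or sensitivity, and true negative rate, or specificity) follow directly from comparing predicted and observed class labels. The sketch below shows one way to compute them; the function name `hit_rates` and the toy labels are hypothetical and are not data from this study.

```python
def hit_rates(y_true, y_pred):
    """Return sensitivity, specificity and accuracy for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actives predicted active
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # inactives predicted inactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # inactives predicted active
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actives predicted inactive
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # true negative rate
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
    }


# Hypothetical toy labels for illustration only.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
rates = hit_rates(observed, predicted)
```

With these toy labels, three of four actives and five of six inactives are recovered, so a model of this kind would be a better predictor of inactive than of active molecules.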

Structure-based VS
Initially, the molecular docking protocol was validated by redocking the original ligand into T. cruzi cruzain. The resulting score is listed in Table 1 together with the corresponding RMSD value.
Table 1. The docking energy (kJ/mol) of two of the best-ranked SLs from the structure-based approach for cruzain. Ligand = energy (kJ/mol) for the PDB ligand and the RMSD values obtained from the redocking procedure.
Afterwards, a virtual screen of the 1,306 SLs was performed. Based on the binding energy values, all tested molecules were ranked using the following probability calculation (ps, Equation 1), where ps = structure-based probability; Ei = docking energy of compound i, where i ranges from 1 to 1,306 (SLs dataset); Emin = the lowest energy value of the dataset; and Eligand = the energy of the ligand from protein crystallography.
For 753 SLs, probability values greater than 0.5 and binding energy values lower than that of the ligand were observed. Structures 7 and 8 (Figure 2A), two guaianolide SLs extracted from Lactuca georgica, presented the highest active probability values in the structure-based VS. Figure 2B
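The printed form of Equation 1 is not reproduced in this text, so the following is only a sketch of one normalization that is consistent with the symbols defined above: each docking energy Ei is divided by the lowest energy Emin, and a score is assigned only when Ei is lower than Eligand. The ratio Ei/Emin, the use of `None` for compounds that do not beat the ligand, and the example energies are all assumptions made for illustration, not the authors' confirmed equation or data.

```python
def structure_based_probabilities(energies, e_ligand):
    """Assumed normalization: ps_i = E_i / E_min when E_i < E_ligand, otherwise None.

    energies: docking energies (kJ/mol, more negative = better) for the screened set.
    e_ligand: redocking energy of the crystallographic ligand.
    """
    e_min = min(energies)  # lowest (best) energy in the dataset
    if e_min >= 0:
        raise ValueError("expected at least one negative docking energy")
    # Compounds that do not score better than the ligand receive no probability here;
    # this handling is an assumption of the sketch.
    return [e / e_min if e < e_ligand else None for e in energies]


# Hypothetical energies for three compounds and a hypothetical ligand energy.
energies = [-150.0, -120.0, -90.0]
ps = structure_based_probabilities(energies, e_ligand=-100.0)
```

Under this assumed form, the best-ranked compound receives a value of 1 and every other compound a value between 0 and 1, which is what allows the docking results to be compared with the RF probability values on a common scale.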

Ligand-based and Structure-based VS combined approach
Using Equation 3, an approach combining structure-based and ligand-based virtual screening was performed to verify potentially active molecules as well as their possible mechanism of action, facilitating the identification of potential multitarget compounds. In Equation 3, pc = combined probability; ps = structure-based probability; TN = true negative rate; and p = ligand-based probability.
Table 2 summarizes the results for the best-ranked SLs obtained using the combined approach. Some structures that previously displayed a high active probability value in the ligand-based virtual screen appear to be interesting potential structures for each T. cruzi parasitic form.
Mol2Net, 2017, 3(Section 13.), 1-5, paper, doi: xxx-xxxx
Table 2. The best-ranked structures for each parasitic form obtained using an approach combining ligand-based and structure-based virtual screening. p = active probability value in ligand-based VS; ps = active probability value in structure-based VS; pc = combined probability value.
Structures 1 and 4 have the highest pc values for the amastigote and trypomastigote parasitic forms; these two compounds also presented high probability scores in the ligand-based VS. Structure 9 (Figure 3) emerges as an interesting structure that may act on cruzain in epimastigotes, since it shows good results in both VS methodologies as well as in the combined approach.
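Equation 3 itself is not reproduced in this text. As an illustration only, the sketch below combines the three quantities defined for it (ps, p and TN) as a weighted mean in which the ligand-based probability is weighted by (1 + TN), so that a model with a higher true negative rate contributes more to the combined score. This weighting, the function name `combined_probability` and the example values are assumptions, not the authors' confirmed formula or results.

```python
def combined_probability(ps, p, tn):
    """Assumed combination: pc = (ps + (1 + TN) * p) / (2 + TN).

    ps: structure-based probability (0 to 1)
    p:  ligand-based (RF) probability (0 to 1)
    tn: true negative rate of the RF model (0 to 1)
    """
    for name, value in (("ps", ps), ("p", p), ("tn", tn)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    # Weighted mean: weight 1 for ps, weight (1 + tn) for p.
    return (ps + (1.0 + tn) * p) / (2.0 + tn)


# Hypothetical values for a single compound.
pc = combined_probability(ps=0.8, p=0.6, tn=0.9)
```

Because the expression is a weighted mean, pc always lies between ps and p, and a compound must score reasonably well in both screens to rank highly in the combined approach.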

Materials and Methods
From the ChEMBL database (https://www.ebi.ac.uk/chembl/), 4,452, 1,635 and 1,322 structures with activity against the three parasitic forms of T. cruzi, amastigotes, trypomastigotes and epimastigotes, respectively, were obtained. The compounds were classified using pIC50 values (-log IC50), which led us to divide them into active (pIC50 ≥ 5) and inactive compounds. [...] SLs were performed in these models (Figure 4).
The structure of the T. cruzi protein cruzain (PDB ID: 4XUI) in complex with its inhibitor (PDB ID: 2VC) was downloaded from the Protein Data Bank (PDB). The docking procedure was performed using Molegro Virtual Docker 6.0, with a grid of 15 Å radius and 0.30 Å resolution covering the ligand-binding site in the structure of cruzain (Figure 4).
Conclusions
In the present study, potential antichagasic SLs for the three parasitic forms, along with some of their structural features, were determined from RF models of T. cruzi. In addition, a structure-based virtual screen of the entire SL set using the PDB structure of T. cruzi cruzain allowed the selection of potential inhibitors of this enzyme. Finally, a combined approach of structure-based and ligand-based VS enabled the identification of promising multitarget antichagasic SLs.
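The activity labelling described in Materials and Methods (pIC50 = -log IC50, with compounds at pIC50 ≥ 5 treated as active) can be sketched as follows. The sketch assumes that IC50 values are given in nanomolar, a common reporting unit but not one stated in this text, and it treats everything below the threshold as inactive; the function names are hypothetical.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar (assumed unit) to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


def label_activity(pic50, threshold=5.0):
    """Label a compound as active when pIC50 >= threshold (5 in the text), else inactive."""
    return "active" if pic50 >= threshold else "inactive"


# Hypothetical IC50 values (nM) for illustration only.
labels = [label_activity(pic50_from_nm(v)) for v in (1000.0, 100000.0)]
```

A pIC50 of 5 corresponds to an IC50 of 10 µM, so the threshold separates compounds active at or below that concentration from the rest.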