Virtual Screening of Secondary Metabolites of the Family Velloziaceae J. Agardh with Potential Antimicrobial Activity

The objective of this work was to carry out a bibliographic survey of secondary metabolites isolated from the Velloziaceae family and to assemble them into a bank of compounds. Four prediction models for compounds potentially active against pathogenic microorganisms (Candida albicans, Escherichia coli, Pseudomonas aeruginosa and Salmonella sp.) were then built to identify which metabolites would be most active against these strains. To construct the models, four sets of compounds with known activity against these microorganisms were selected from the ChEMBL database. A second bank with 163 unique molecules isolated from the Velloziaceae family was built. Molecular descriptors were calculated with Volsurf+ v.1.0.7, and the in silico models were generated in Knime 3.5. The performance of the models in internal and external tests was also analyzed. Virtual screening of the metabolite bank selected several compounds with potential antimicrobial activity, notably the biflavonoid amentoflavone, which showed potential activity against all four strains.


Introduction
Natural products have been used historically for the treatment of various diseases, and medicinal plants are an important resource for the recovery, cure and prevention of numerous diseases. 1 Their use as a source for the discovery of new drugs, whether in whole form or as isolated compounds, is currently emphasized: more than 70% of a total of 1562 new drugs approved by the Food and Drug Administration (FDA, 1981-2014) are of natural origin. 2,3
The Velloziaceae family is native but not endemic to Brazil. It currently comprises five genera (Acanthochlamys, Barbacenia, Barbaceniopsis, Vellozia and Xerophyta) and about 274 species, 4,5 which inhabit arid, rocky and elevated places. 6 The vast majority of species are distributed in Neotropical America (Barbacenia, Barbaceniopsis and Vellozia), others occur in Africa, Madagascar and the Arabian Peninsula (Xerophyta and Vellozia) and one in China (Acanthochlamys). 7 As for ethnopharmacological use, the aerial parts of some species of the family are used as anti-inflammatory and antirheumatic agents, for the treatment of bruises and bone fractures (topical use) and against infections. 8,9 In addition, phytochemical studies of several species of the family are reported in the literature, along with a limited number of in silico and pharmacological studies of these compounds.
Bacterial resistance to traditional antimicrobials is one of the greatest obstacles to public health, where, according to the World Health Organization (WHO), Escherichia coli, Klebsiella
From this perspective of obtaining compounds and assessing their pharmacological potential, computational methods for the virtual screening of bioactive substances have been widely used. The search consists of selecting compounds, with the aid of a computer, from a database containing a large number of molecules, which contributes to advances in drug design and reduces the time, costs and number of animals used in research. 13-15 Thus, in this study a bibliographic survey of secondary metabolites isolated from the Velloziaceae family was carried out and used to create a bank of compounds. Four prediction models for compounds potentially active against pathogenic microorganisms (Candida albicans, Escherichia coli, Pseudomonas aeruginosa and Salmonella sp.) were then built to identify which metabolites would be most active against these strains.

Methodology
Computational chemistry
Database
From the ChEMBL database, four sets of chemical structures with known activity against the microorganisms Candida albicans, Escherichia coli, Pseudomonas aeruginosa and Salmonella sp. were selected for building the predictive models. The details of each set are described in Table 1. The compounds were classified by pMIC50 (−log MIC50), where pMIC50 is the negative logarithm of the planktonic minimum inhibitory concentration at 50% and MIC50 is the minimum concentration necessary for 50% inhibition of the studied microorganisms. Another database, of molecules isolated from the Velloziaceae family, was built from a literature review of this family, with a total of 196 botanical occurrences and 163 unique molecules.
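The classification by pMIC50 can be illustrated with a minimal sketch. It assumes that MIC50 values are expressed in mol/L and uses a hypothetical activity cutoff of pMIC50 ≥ 5; neither the unit nor the cutoff is stated in the text, and the actual criteria for each set are those of Table 1.

```python
import math


def pmic50(mic50_molar: float) -> float:
    """Return pMIC50 = -log10(MIC50). Assumes MIC50 is given in mol/L."""
    if mic50_molar <= 0:
        raise ValueError("MIC50 must be positive")
    return -math.log10(mic50_molar)


def label_activity(mic50_molar: float, cutoff: float = 5.0) -> str:
    """Label a compound as active when pMIC50 >= cutoff.

    The default cutoff of 5.0 is a hypothetical value for illustration only.
    """
    return "active" if pmic50(mic50_molar) >= cutoff else "inactive"


# Illustrative MIC50 values (mol/L), not taken from the study.
example_mic50 = [1e-6, 5e-5, 1e-4]
example_labels = [label_activity(value) for value in example_mic50]
```

Under these assumptions, a lower MIC50 gives a higher pMIC50, so more potent compounds fall above the cutoff.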
SMILES codes were used for all structures as input data for Marvin. 16 Standardizer software 17 was also used, which converts the various chemical structures into customized canonical representations. This standardization is essential for creating consistent compound libraries: it yields the structures in canonical form, adds hydrogens, aromatizes rings, generates 3D structures and saves the compounds in SDF format.

Volsurf descriptors
Molecular descriptors, calculated with Volsurf+ v.1.0.7, were used to predict biological and physicochemical properties of the molecules in the four databases. The descriptors are obtained by transforming each molecule into a molecular representation that allows mathematical treatment.

Prediction model
Knime 3.5 software 19 was used to perform the analyses and generate the in silico model. The banks of molecules with the calculated descriptors were imported from the Dragon software, 20 and each bank was divided with the "Partitioning" tool, using the "Stratified sample" option, into training and test sets representing 80 and 20% of all compounds, respectively. Compounds were selected at random while maintaining the same proportion of active and inactive substances in both sets.
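The stratified 80/20 partition can be sketched in plain Python as follows. This is not the Knime "Partitioning" node itself, only an illustration of the same idea; the function name, the seed and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx), keeping class proportions in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random selection within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data set: 40 active and 60 inactive compounds (illustrative only).
labels = ["active"] * 40 + ["inactive"] * 60
train_idx, test_idx = stratified_split(labels)
```

Because each class is sampled separately, the test set here contains 8 active and 12 inactive compounds, the same 40:60 ratio as the full set.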
For internal validation, 10-fold cross-validation was used: ten groups were selected at random but stratified according to the activity variable, so that every validation group preserved the distribution of active and inactive compounds. With the selected descriptors, the model was generated from the training set using random forest (RF), an algorithm that builds an ensemble of decision trees, 21 as implemented in WEKA. 22 The parameters selected to build the RF models were 100 trees and a random seed of 1.
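A minimal sketch of how ten stratified validation groups can be formed is given below. It illustrates the principle only and is not the procedure implemented in Knime or WEKA; the function name and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=1):
    """Assign indices to n_folds groups, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy training set: 30 active and 50 inactive compounds (illustrative only).
cv_labels = ["active"] * 30 + ["inactive"] * 50
folds = stratified_folds(cv_labels)
```

In each round of cross-validation one group is held out for validation and the model is trained on the remaining nine, so every compound is predicted exactly once.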
The performance of the models in the internal and external tests was analyzed in terms of sensitivity (true positive rate, that is, the rate of active compounds correctly classified), specificity (true negative rate, that is, the rate of inactive compounds correctly classified) and accuracy (overall predictability). In addition, the receiver operating characteristic (ROC) curve, which combines sensitivity and specificity, was used to describe the true performance of the model more clearly than accuracy alone.
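These three measures follow directly from the counts of the confusion matrix. The sketch below uses toy labels that are not data from the study and assumes that both classes are present, so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred, positive="active"):
    """Return (tp, fp, tn, fn) for a binary classification."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy example: 4 active and 6 inactive compounds (illustrative only).
y_true = ["active"] * 4 + ["inactive"] * 6
y_pred = ["active", "active", "active", "inactive",
          "inactive", "inactive", "inactive", "inactive", "active", "inactive"]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```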
The models were also evaluated with the Matthews correlation coefficient (MCC), 23 a global measure of model quality computed from the confusion matrix. The MCC is, in essence, a correlation coefficient between observed and predicted binary classifications. It takes a value between −1 and +1, where +1 represents a perfect prediction, 0 a prediction no better than random and −1 total disagreement between prediction and observation.
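The coefficient can be computed from the four confusion-matrix counts using the standard MCC definition, as in this minimal sketch. Returning 0 when a row or column of the matrix is empty is a common convention adopted here as an assumption; the counts in the example are illustrative.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumption): undefined cases are reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef(tp=5, fp=0, tn=5, fn=0)
inverted = matthews_corrcoef(tp=0, fp=5, tn=0, fn=5)
partial = matthews_corrcoef(tp=3, fp=1, tn=5, fn=1)
```

The three calls cover the perfect case (+1), total disagreement (−1) and an intermediate case between the two.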
Matthews' correlation coefficient can be calculated from the following formula:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix.
An applicability domain based on Euclidean distances was also used to flag compounds in the test set for which predictions may be unreliable. Similarity measurements define the model's applicability domain from the Euclidean distances between all training, test and virtual screening compounds. The distance from a test compound to its nearest neighbor in the training set is compared with the predefined applicability domain limit; if the distance exceeds that limit, the prediction is considered unreliable. 24
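The nearest-neighbor check for the applicability domain can be sketched as follows. The two-dimensional descriptor vectors, the compound names and the limit of 1.0 are hypothetical; the text does not state how the limit was defined, and real descriptor vectors have many more dimensions.

```python
import math


def nearest_training_distance(compound, training_set):
    """Euclidean distance from a compound to its closest training compound."""
    return min(math.dist(compound, reference) for reference in training_set)


def in_domain(compound, training_set, limit):
    """True when the nearest-neighbor distance does not exceed the limit."""
    return nearest_training_distance(compound, training_set) <= limit


# Hypothetical 2D descriptor vectors for illustration only.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = {"query_near": (0.5, 0.0), "query_far": (4.0, 3.0)}
limit = 1.0  # hypothetical applicability domain limit

reliability = {name: in_domain(vector, training, limit)
               for name, vector in queries.items()}
```

In this toy case `query_near` lies 0.5 from its nearest training compound and is kept, while `query_far` lies more than 4 units away and would be flagged as unreliable.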

Results and Discussion
The secondary metabolite data set was composed of a total of 196 botanical occurrences and 163 different chemical compounds, from 34 species of the Velloziaceae family (genera Vellozia, Acanthochlamys and Barbacenia). Although the family comprises many species, few have been the subject of phytochemical and/or pharmacological studies. The isolated compounds are predominantly diterpenes (109), followed by flavonoids (21), triterpenes (21), steroids/glycosylated steroids (3), biflavonoids (2) and other classes (7). This data set is available in SistematX. 25
The generated models performed well, with accuracy greater than 75%. The high MCC values corroborate these data, indicating the good prediction rate of the models (Table 2).
Looking at the ROC curves of the models, all have a high probability of selecting truly positive compounds, that is, a low probability of classifying inactive compounds as active. The area under the curve is greater than 0.83 for all models; a perfect model has an area under the curve equal to 1 (Figure 1).
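The area under the ROC curve can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The sketch below computes it by direct pairwise comparison, counting ties as one half; the labels and scores are toy values, not results from the study.

```python
def roc_auc(y_true, scores, positive="active"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities of activity (illustrative only).
toy_true = ["active", "active", "active", "inactive", "inactive", "inactive"]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(toy_true, toy_scores)
```

Here eight of the nine active/inactive pairs are ranked correctly, giving an area of 8/9; a model that ranks every active above every inactive reaches 1.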
For the models of C. albicans, E. coli and P. aeruginosa, only the flavonoid kaempferol 3-O-(3",6"-di-O-E-p-coumaroyl)-β-D-glucopyranoside was outside the applicability domain. Among the 162 molecules that remained within the domain, 86 were classified as likely to be active in the C. albicans model, with probabilities between 51 and 76%; 26 in the E. coli model, with probabilities between 50 and 78%; and only 10 in the P. aeruginosa model, with probabilities between 52 and 62%. The molecules with the greatest potential to be active in these models are described in Table 3.
Some studies have reported the use of quantitative structure-activity relationship (QSAR) models to select molecules with potential antimicrobial activity. Trush et al. 26 used three types of classification models (random forest (WEKA-RF), k-nearest neighbors and associative neural networks) to select potent inhibitors against C. albicans. In cross-validation, the models achieved predictive rates of 81-90%. The experimental results confirmed the predictive power of the models with the selection of the compound 1,3-oxazol-4-yl (triphenyl) phosphonium. The same predictive ability was also observed in the study by Hodyna et al., 27 who used the same types of models as the previous study. The results of the
Cho et al. 28 constructed six models using energy relationship descriptors against E. coli, S. aureus and C. albicans, based on the MIC and minimum bactericidal concentration (MBC) values for each species. The predictability of the models was estimated by the determination coefficient (R²): R² = 0.90 and 0.93 for MIC and MBC of E. coli, respectively, R² = 0.91 and 0.94 for MIC and MBC of S. aureus, and R² = 0.89 and 0.80 for C. albicans. According to the authors, 28 QSAR models support a reliable, economical, fast and safe evaluation as a supplement to experimental testing.
Although the number of studies using QSAR classification models is increasing, further studies applying these methodologies are needed to identify potential molecules and assist experimental tests.
The biflavonoid amentoflavone was predicted to be active by all models. Although it is not listed in the table for the C. albicans model, its probability of being active in this model was 0.65.
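Selecting compounds predicted as active by every model amounts to a simple consensus filter, sketched below. The amentoflavone probabilities are those reported in the Conclusions; the second entry is a hypothetical placeholder with invented values, and the 0.5 cutoff is an assumption based on the 50% probabilities quoted in the text.

```python
MODELS = ("C. albicans", "E. coli", "P. aeruginosa", "Salmonella sp.")


def active_in_all(probabilities, cutoff=0.5):
    """True when the probability of activity reaches the cutoff in every model."""
    return all(probabilities[model] >= cutoff for model in MODELS)


predictions = {
    # Probabilities reported for amentoflavone in this work.
    "amentoflavone": dict(zip(MODELS, (0.65, 0.74, 0.56, 0.70))),
    # Hypothetical placeholder compound; values invented for illustration.
    "placeholder_compound": dict(zip(MODELS, (0.61, 0.48, 0.30, 0.52))),
}

consensus = [name for name, probs in predictions.items() if active_in_all(probs)]
```

With these inputs only amentoflavone passes the filter, since the placeholder falls below the cutoff in two of the four models.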

Conclusions
Using the in silico tools employed in this work, it was possible to generate a bank of models for the virtual screening of compounds isolated from the Velloziaceae family with probable antimicrobial potential. The models for C. albicans and E. coli yielded the compounds with the highest probability of activity.
For C. albicans, the model selected thirty-one molecules with a probability of activity greater than 60%. The other models selected twenty-nine molecules with a probability greater than 50% for E. coli, eleven molecules with a probability greater than 52% for P. aeruginosa and nineteen molecules with a probability of 50% for Salmonella sp.
The biflavonoid amentoflavone was the only compound predicted to be active by all four models with a considerable probability: 65, 74, 56 and 70% for C. albicans, E. coli, P. aeruginosa and Salmonella sp., respectively.
The present study contributed, through the virtual screening of a bank of secondary metabolites, to the selection of several compounds with potential antimicrobial activity, especially the biflavonoid amentoflavone, and may in the future assist biological testing in the discovery of potential drug candidates.