A Novel Approach for Computer-Aided "Rational" Drug Design: Theoretical and Experimental Assessment of a Promising Method for Virtual Screening and in silico Design of New Antimalarial Compounds

Malaria is one of the most significant public health concerns in many tropical and subtropical regions of the world, with 40% of the world population exposed to malaria-causing parasites. Increasing resistance of Plasmodium spp. to existing therapies has heightened alarm about malaria in the international health community, and there is a pressing need to identify and develop new antimalarial drug therapies. To address this problem, the main aim of this study was to develop simple linear discriminant-based QSAR models for the classification and prediction of antimalarial activity using some of the TOMOCOMD-CARDD fingerprints, so as to enable computational screening of virtual combinatorial datasets. To this end, a database of 1562 organic chemicals with great structural variability (597 antimalarial agents and 965 compounds with other clinical uses) was analyzed and is presented as a helpful tool not only for theoretical chemists but also for other researchers in this area. This series of compounds was processed by k-means cluster analysis in order to design training and prediction sets. Afterward, two linear classification functions were derived to discriminate between antimalarial and non-antimalarial compounds. The models (based on non-stochastic and stochastic indices) correctly classify more than 93% of the compounds in both the training and external prediction datasets. They showed high Matthews correlation coefficients: 0.889 and 0.866 for the training set and 0.855 and 0.857 for the test set. Model predictivity was also assessed and validated by the random removal of 10% of the compounds to form a test set, for which predictions were made from the models. The overall mean correct classification for this process (leave-group-out 10% full cross-validation) for the equations obtained with non-stochastic and stochastic quadratic fingerprints was 93.93% and 92.77%, respectively.
The quadratic-maps-based TOMOCOMD-CARDD approach implemented in this work was successfully compared with four of the most useful models for antimalarial selection reported to date. The models developed with non-stochastic and stochastic quadratic indices were then used in a simulation of a virtual search for Ras FTase inhibitors with antimalarial activity; 70% and 100% of the 10 inhibitors used in this virtual search were correctly classified, showing the ability of the models to identify new lead antimalarials. Finally, these two QSAR models were used to identify previously unknown antimalarial compounds. To this end, three synthetic intermediates of quinolinic compounds were evaluated as active or inactive using the developed models. These chemicals were then synthesized and biologically evaluated against two malaria strains, using chloroquine as the reference. The experimental results agreed with 100% of the theoretical predictions. Compound 3 showed antimalarial activity, which is the first report of an arylaminomethylenemalonate having such activity. This result opens the door to a virtual study considering greater variability of the central core already evaluated, as well as other chemicals not included in this family. We conclude that the approach described here seems to be a promising QSAR tool for the molecular discovery of novel classes of antimalarial drugs, which may meet the dual challenges posed by drug-resistant parasites and the rapid progression of malaria illness.

BACKGROUND
Malaria remains one of the most serious health threats in the world, affecting 300-400 million people and claiming ca. 3 million lives each year.1,2,4,5 Given the complexity and cost of the drug discovery process, the use of "rational" search methodologies is recommended. Consequently, medicinal chemists are called on to develop more efficient strategies in the search for novel candidates to be assayed as antimalarial drugs.7-9 One of the major goals of such a design strategy is the identification, from large databases or libraries, of structural subsystems responsible for a specific biological activity.12-14

In this context, our research group has recently introduced a novel scheme to perform rational in silico molecular design (or selection/identification of lead drug-like chemicals) and QSAR/QSPR studies, known as TOMOCOMD-CARDD (acronym of TOpological MOlecular COMputer Design-Computer Aided "Rational" Drug Design).15 This method has been developed to generate molecular fingerprints based on the application of discrete mathematics and linear algebra theory to chemistry. In this sense, atom, atom-type and total quadratic and linear molecular fingerprints have been defined in analogy to the quadratic and linear mathematical maps.16-19 In addition, TOMOCOMD-CARDD has been extended to consider three-dimensional features of small/medium-sized molecules based on the trigonometric 3D-chirality correction factor approach.20 A later paper described the significance and interpretation of these indices and compared them with other molecular descriptors.17,18 The approach describes changes in the electronic distribution with time throughout the molecular backbone. Specifically, the features of the kth total and local quadratic and linear indices were illustrated by examples of various types of molecular structures, including chain length and branching as well as content of heteroatoms and multiple bonds.17,18 Additionally, the linear independence of the atom-type quadratic and linear fingerprints from 229 other 0D-3D "DRAGON" molecular descriptors was demonstrated. It was concluded that local TOMOCOMD-CARDD fingerprints are independent indices which contain important structural information to be used in QSPR/QSAR and drug design studies.17,18

The prediction of the pharmacokinetic properties of organic compounds is a problem that can also be addressed using this approach.23 The results obtained suggested that the TOMOCOMD-CARDD method was able to predict the permeability values, and it proved to be a good tool for studying the oral absorption of drug candidates during the drug development process. The TOMOCOMD-CARDD strategy has also been useful for the selection of novel subsystems of compounds having a desired property/activity. In this sense, it was successfully applied to the virtual (computational) screening of novel anthelmintic compounds, which were then synthesized and evaluated in vivo on F. hepatica.24,25,28,29 Later, promising results were found in the modeling of the interaction between drugs and the HIV Ψ-RNA packaging region, in the field of bioinformatics, using the TOMOCOMD-CANAR (Computed-Aided Nucleic Acid Research) approach.30,31 Finally, an alternative formulation of our approach for the structural characterization of proteins was carried out recently.32,33 This extended method [TOMOCOMD-CAMPS (Computed-Aided Modelling in Protein Science)] was used in protein stability studies, specifically on how alanine substitution mutations on the Arc repressor wild-type protein affect protein stability, by means of a combination of protein linear or quadratic indices (macromolecular fingerprints) and statistical (linear and non-linear model) methods.32,33

In the present work, the TOMOCOMD-CARDD strategy is used to find quantitative models that discriminate antimalarial compounds from inactive ones in a rational way using non-stochastic and stochastic quadratic indices. A virtual screening for new lead compounds with a novel mechanism of action is performed for the case of Ras FTase inhibitors with antimalarial activity. Finally, we present the design, synthesis and in vitro evaluation against two Plasmodium falciparum strains of synthetic intermediates of quinolinic compounds, as a starting point for the development of new inexpensive antimalarials.

THEORETICAL FRAMEWORK
The theoretical scaffold of the TOMOCOMD-CARDD family of molecular descriptors is presented in two parts: one describing the mathematical features of the non-stochastic fingerprints and the other covering the stochastic quadratic indices.

Non-Stochastic Quadratic Fingerprints
Implemented in the subprogram CARDD of the TOMOCOMD software, the atom, atom-type and total non-stochastic quadratic fingerprints can be calculated from both the atom adjacency matrix of the molecular pseudograph and the molecular vector of small-to-medium-sized organic compounds.23,25,26 Here, an overview of this approach is given. For a given molecule composed of n atoms, the "molecular vector" (X) is constructed and the kth total quadratic indices, $q_k(x)$, are calculated as quadratic forms, as shown in Eq. 1:

$$q_k(x) = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^{k}a_{ij}\, x_i x_j \qquad (1)$$

where n is the number of atoms in the molecule and $x_1, \ldots, x_n$ are the coordinates or components of the "molecular vector" (X) in a system of canonical basis vectors of $\mathbb{R}^n$. The components of the molecular vector are numerical values, which can be considered as weights (atom labels) for the vertices of the pseudograph. Certain atomic properties (electronegativity, atomic radii, etc.) can be used for this purpose. In this work, the Pauling electronegativities are selected as atom weights.34 The coefficients ${}^{k}a_{ij}$ are the elements of the kth power of the symmetrical square matrix M(G) of the molecular pseudograph (G), whose elements $a_{ij}$ are defined as follows:

$$a_{ij} = \begin{cases} P_{ij} & \text{if } i \neq j \text{ and } \exists\, e_k \in E(G) \\ L_{ii} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where E(G) represents the set of edges of G, $P_{ij}$ is the number of edges (bonds) between vertices (atoms) $v_i$ and $v_j$, and $L_{ii}$ is the number of loops in $v_i$. Equation (1) for $q_k(x)$ can be written as the single matrix equation:

$$q_k(x) = X^{t} M^{k} X \qquad (3)$$

where X is a column vector (an n×1 matrix), $X^{t}$ is the transpose of X (a 1×n matrix) and $M^{k}$ is the kth power of the matrix M of the molecular pseudograph G (the matrix of the mathematical quadratic form).

In addition to the total quadratic indices, computed for the whole molecule, a local-fragment (atom and atom-type) formalism can be developed.23,25,26 These descriptors are defined as follows:

$$q_{kL}(x) = \sum_{i=1}^{m}\sum_{j=1}^{m} {}^{k}a_{ijL}\, x_i x_j \qquad (4)$$

where m is the number of atoms of the fragment of interest and ${}^{k}a_{ijL}$ is the element of row i and column j of the matrix $M^{k}_{L}$. This matrix is extracted from the $M^{k}$ matrix and contains information on the vertices (atoms) of the specific molecular fragment and also on the molecular environment. The matrix $M^{k}_{L} = [{}^{k}a_{ijL}]$ is defined as follows:

$${}^{k}a_{ijL} = \begin{cases} {}^{k}a_{ij} & \text{if both } v_i \text{ and } v_j \text{ are atoms contained within the molecular fragment} \\ \tfrac{1}{2}\,{}^{k}a_{ij} & \text{if } v_i \text{ or } v_j \text{ is an atom contained within the molecular fragment, but not both} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

These local analogues can also be expressed in matrix form:

$$q_{kL}(x) = X^{t} M^{k}_{L} X \qquad (6)$$

Notice that the above scheme follows the spirit of a Mulliken population analysis.35 Also note that for every partitioning of a molecule into Z molecular fragments there will be Z local molecular fragment matrices. In this case, if a molecule is partitioned into Z molecular fragments, the matrix $M^{k}$ can be partitioned into Z local matrices $M^{k}_{L}$, L = 1, ..., Z, and the kth power of matrix M is exactly the sum of the Z local matrices:

$$M^{k} = \sum_{L=1}^{Z} M^{k}_{L} \qquad (7)$$

In this way, the total quadratic indices are the sum of the quadratic indices of the Z molecular fragments:

$$q_k(x) = \sum_{L=1}^{Z} q_{kL}(x) \qquad (8)$$

Atom and atom-type quadratic fingerprints are specific cases of local quadratic indices. The kth atom-type quadratic indices are calculated by adding the kth atom quadratic indices for all atoms of the same type in the molecule. In the atom-type quadratic indices formalism, each atom in the molecule is classified into an atom type (fragment), such as heteroatoms, hydrogens bonded to heteroatoms (O, N and S; H-bonding), halogen atoms, aliphatic carbon chains, aromatic atoms (aromatic rings), and so on. For all data sets, including those with a common molecular scaffold as well as those with diverse structures, the kth atom-type quadratic indices provide important information.
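As an illustration of the definitions above, the following is a minimal NumPy sketch of the total and local quadratic indices, not the TOMOCOMD-CARDD implementation. The function names and the toy C-C=O heavy-atom skeleton are assumptions made for this example; the local index is evaluated through the matrix form over all n atoms, so that the fragment indices add up to the total index.

```python
import numpy as np

def total_quadratic_index(M, x, k):
    """k-th total quadratic index q_k(x) = X^t M^k X."""
    M = np.asarray(M, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(x @ np.linalg.matrix_power(M, k) @ x)

def local_matrix(Mk, fragment):
    """Local matrix M^k_L: full weight when both atoms lie in the fragment,
    half weight when exactly one does, zero otherwise."""
    n = Mk.shape[0]
    inside = np.zeros(n, dtype=bool)
    inside[list(fragment)] = True
    both = np.outer(inside, inside)
    either = inside[:, None] | inside[None, :]
    weights = np.where(both, 1.0, np.where(either, 0.5, 0.0))
    return weights * Mk

def local_quadratic_index(M, x, k, fragment):
    """k-th local quadratic index q_kL(x) = X^t M^k_L X for a set of atom indices."""
    M = np.asarray(M, dtype=float)
    x = np.asarray(x, dtype=float)
    Mk = np.linalg.matrix_power(M, k)
    return float(x @ local_matrix(Mk, fragment) @ x)

# Hypothetical example: C-C=O skeleton without hydrogens.
# Off-diagonal entries count bonds (2 for the double bond); no loops.
M = [[0, 1, 0],
     [1, 0, 2],
     [0, 2, 0]]
x = [2.55, 2.55, 3.44]  # Pauling electronegativities of C, C, O

q1_total = total_quadratic_index(M, x, 1)
q1_carbons = local_quadratic_index(M, x, 1, fragment=[0, 1])
q1_oxygen = local_quadratic_index(M, x, 1, fragment=[2])
```

In this toy case the two fragment indices (carbons and oxygen) sum to the total index, as expected for any partition of the atoms into disjoint fragments.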

Atom, Atom-type, and Total Stochastic Quadratic Fingerprints
Notice that the matrices of the mathematical quadratic forms, $M^{k}$, are graph-theoretical electronic-structure models, like the "extended Hückel" model. The $M^{1}$ matrix considers all valence-bond electrons (σ- and π-networks) in one step, and its powers k (k = 0, 1, 2, 3…) can be considered as an interacting electronic chemical-network model in k steps. This model can be seen as an intermediate between the quantitative quantum-mechanical Schrödinger equation and classical chemical bonding ideas.38

Recently, our research group has also developed a new method based on Markov chain theory, which has been successfully employed in QSPR and QSAR studies.13,37,39 This approach also describes changes in the electron (stochastic) distribution and vibrational decay with time throughout the molecular backbone using the Markov chain formalism. The present approach is based on a simple model for the intramolecular (stochastic) movement of all valence-bond electrons. Let us consider a hypothetical situation in which a set of atoms is free in space at an arbitrary initial time ($t_0$). At this time, the electrons are distributed around the atomic nuclei. Alternatively, these electrons can be distributed around cores in discrete intervals of time $t_k$. In this sense, an electron at an arbitrary atom i can move to other atoms at different discrete time periods $t_k$ (k = 0, 1, 2, 3…) throughout the chemical-bonding network. The kth stochastic atom adjacency matrix of the molecular pseudograph, $S^{k}(G)$, can be obtained from $M^{k}$. Here, $S^{k}(G) = S^{k} = [{}^{k}s_{ij}]$ is a square table of order n (n = number of atoms), and the elements ${}^{k}s_{ij}$ are defined as follows:

$${}^{k}s_{ij} = \frac{{}^{k}a_{ij}}{{}^{k}SUM_{i}} = \frac{{}^{k}a_{ij}}{{}^{k}\delta_{i}} \qquad (9)$$

where ${}^{k}a_{ij}$ are the elements of the kth power of M, and the sum of the ith row of $M^{k}$, ${}^{k}SUM_{i}$, is named the k-order vertex degree of atom i, ${}^{k}\delta_{i}$. The ${}^{k}s_{ij}$ elements are the transition probabilities with which the electrons move from atom i to atom j in the discrete time period $t_k$ (step by step). Notice that the kth elements ${}^{k}s_{ij}$ take into consideration the information of the molecular topology in step k throughout the chemical-bonding (σ- and π-) network. For instance, the ${}^{2}s_{ij}$ values can distinguish between hybrid states of atoms in bonds. This is a logical result, as the electronegativity scale of these hybrid states is taken into account. The kth total and local stochastic quadratic indices, ${}^{s}q_k(x)$, are calculated in the same way as the non-stochastic quadratic indices, but using the kth stochastic atom adjacency matrix of the molecular pseudograph, $S^{k}(G)$, as the matrix of the mathematical quadratic form.
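The row normalization that turns $M^{k}$ into the stochastic matrix $S^{k}$ can be sketched as follows. This is a hedged illustration only: the function names and the toy C-C=O skeleton are assumptions for this example, and the guard for rows whose sum is zero is a choice of this sketch that the text does not discuss.

```python
import numpy as np

def stochastic_matrix(M, k):
    """k-th stochastic matrix S^k: each row of M^k divided by its row sum
    (the k-order vertex degree of the atom)."""
    Mk = np.linalg.matrix_power(np.asarray(M, dtype=float), k)
    delta = Mk.sum(axis=1, keepdims=True)
    # Rows with zero sum are left as zeros (assumption of this sketch).
    return np.divide(Mk, delta, out=np.zeros_like(Mk), where=delta != 0)

def stochastic_quadratic_index(M, x, k):
    """k-th total stochastic quadratic index, X^t S^k X."""
    x = np.asarray(x, dtype=float)
    return float(x @ stochastic_matrix(M, k) @ x)

# Hypothetical example: C-C=O skeleton, Pauling electronegativities as weights.
M = [[0, 1, 0],
     [1, 0, 2],
     [0, 2, 0]]
x = [2.55, 2.55, 3.44]

S1 = stochastic_matrix(M, 1)
sq1 = stochastic_quadratic_index(M, x, 1)
```

Each row of `S1` sums to one, so its entries can be read as the transition probabilities described above; for the central carbon, the step to oxygen (double bond) is twice as likely as the step to the other carbon.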

Computational Methods: TOMOCOMD-CARDD Approach
TOMOCOMD is an interactive program for molecular design and bioinformatic research.15 It consists of four subprograms: CARDD (Computed-Aided 'Rational' Drug Design), CAMPS (Computed-Aided Modeling in Protein Science), CANAR (Computed-Aided Nucleic Acid Research) and CABPD (Computed-Aided Bio-Polymers Docking). Each of them allows drawing the structures (drawing mode) and calculating molecular 2D/3D atom- and bond-based descriptors (calculation mode). In the present report, we outline salient features of the subprogram CARDD only. The main steps for the application of this method in QSAR/QSPR and drug design can be briefly summarized as follows:
1. Drawing the molecular pseudographs for each molecule of the data set, using the drawing mode. This procedure is performed by selecting the active atomic symbol belonging to the different groups in the periodic table of the elements.
2. Using appropriate weights in order to differentiate the atoms of the molecule.
3. Computing the total and local (atom and atom-type) quadratic indices of the molecular pseudograph's atom adjacency matrix. This is carried out in the software's calculation mode, where one can select the atomic properties and the descriptor family before calculating the molecular indices. The software generates a table in which the rows correspond to the compounds and the columns correspond to the total and local quadratic indices or to other families of molecular descriptors implemented in the program.
4. Developing a QSPR/QSAR equation by using several multivariate analytical techniques, such as multilinear regression analysis (MRA), neural networks (NN), linear discriminant analysis (LDA), and so on. In this sense, it is possible to find a quantitative relation between an activity A and the quadratic fingerprints having, for instance, the following appearance:

$$A = a_0 q_0(x) + a_1 q_1(x) + a_2 q_2(x) + \cdots + a_k q_k(x) + c \qquad (10)$$

where A is the measured activity, $q_k(x)$ are the kth total quadratic indices, and the $a_k$ are the coefficients obtained by the linear regression analysis.
5. Testing the robustness and predictive power of the QSPR/QSAR equation by using internal (leave-one-out and leave-group-out cross-validation) and external (using a test set and an external predicting set) validation techniques.
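Steps 3 and 4 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the three toy pseudographs stand in for a real data set, and an ordinary least-squares fit is used as a stand-in for the multivariate techniques listed above (it is not the STATISTICA procedure used in this work).

```python
import numpy as np

def quadratic_indices(M, x, k_max):
    """Return [q_0(x), ..., q_kmax(x)] with q_k(x) = X^t M^k X."""
    M = np.asarray(M, dtype=float)
    x = np.asarray(x, dtype=float)
    return [float(x @ np.linalg.matrix_power(M, k) @ x) for k in range(k_max + 1)]

def descriptor_table(molecules, k_max):
    """Rows are compounds, columns are the total quadratic indices q_0..q_kmax."""
    return np.array([quadratic_indices(M, x, k_max) for M, x in molecules])

def fit_equation_10(table, activity):
    """Least-squares estimate of the coefficients a_k and the intercept c
    of a linear relation A = sum_k a_k q_k(x) + c."""
    A = np.column_stack([table, np.ones(len(table))])
    coef, *_ = np.linalg.lstsq(A, np.asarray(activity, dtype=float), rcond=None)
    return coef[:-1], coef[-1]

# Hypothetical toy pseudographs (heavy atoms only) with Pauling electronegativities.
molecules = [
    ([[0, 1, 0], [1, 0, 2], [0, 2, 0]], [2.55, 2.55, 3.44]),  # C-C=O
    ([[0, 1], [1, 0]], [2.55, 3.44]),                         # C-O
    ([[0, 2], [2, 0]], [2.55, 2.55]),                         # C=C
]
table = descriptor_table(molecules, k_max=2)  # shape: (3 compounds, 3 indices)
```

With a real data set, `fit_equation_10(table, activity)` would be called with one measured activity per row of the table; at least as many compounds as fitted parameters are needed for the fit to be meaningful.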
The following descriptors were calculated in this work:
i) $q_k(x)$ and $q_k^{H}(x)$ are the kth total quadratic indices not considering and considering H-atoms in the molecular pseudograph (G), respectively.
ii) $q_{kL}(x_E)$ and $q_{kL}^{H}(x_E)$ are the kth local (atom-type = heteroatoms: S, N, O) quadratic indices not considering and considering H-atoms in the molecular pseudograph (G), respectively. These local descriptors are putative H-bonding acceptors.
iii) $q_{kL}^{H}(x_{E\text{-}H})$ are the kth local (atom-type = H-atoms bonded to heteroatoms: S, N, O) quadratic indices considering H-atoms in the molecular pseudograph (G). These local descriptors are putative H-bonding donors.
The kth stochastic total [${}^{s}q_k(x)$ and ${}^{s}q_k^{H}(x)$] and local [${}^{s}q_k(x_E)$, ${}^{s}q_k^{H}(x_E)$ and ${}^{s}q_k^{H}(x_{E\text{-}H})$] quadratic indices were also computed.

Data Set
It is well known that the quality of classification models is highly dependent on the quality of the selected data set. The most critical aspect of constructing the training set is to guarantee great molecular diversity in it. Taking this into account, we selected a large data set of 1562 organic chemicals having great structural variability; 597 of them are antimalarial agents 2,7-9,40-82 and the others are non-antimalarials 41,82 (965 compounds having other clinical uses, such as antivirals, sedative/hypnotics, diuretics, anticonvulsants, hemostatics, oral hypoglycemics, antihypertensives, anthelmintics, anticancer compounds and so on). It is clear that declaring these compounds "inactive" as antimalarials does not per se guarantee that none of these organic-chemical drugs has an antimalarial side effect that has gone undetected so far. This problem can be reflected in the classification results for the series of inactive chemicals.

On the other hand, the data set of active compounds was selected by considering representatives of most of the different structural patterns and modes of action for antimalarial activity. For instance, it includes: 1) alkaloidal and synthetic quinoline-based antimalarial drugs, which block the function of the food vacuole (4- and 8-aminoquinolines, 9,70 peptide derivatives, 52 dimeric quinolines 47,49 and other compounds such as indolo[3,2-c]quinolines 7 and methylene blue derivatives), 2) peptide (fluoromethyl ketone peptide derivatives) and nonpeptide (phenothiazines and chalcones) falcipain-cysteine protease inhibitors, 42,45 3) peptide and nonpeptide inhibitors of the malarial aspartyl protease plasmepsin II, 48 4) agents interfering with Plasmodium falciparum phospholipid metabolism (primary, secondary and tertiary amines and quaternary ammonium and bisammonium salts), 69 5) antimalarials able to inhibit electron transport processes and respiratory systems by acting as ubiquinone antagonists (hydroxynaphthoquinones such as atovaquone), 40 6) selective inhibitors of lactate dehydrogenase from the malaria parasite (some derivatives of the sesquiterpene 8-deoxyhemigossylic acid), 40 7) antimalarial chemicals which act by selectively inhibiting malarial dihydrofolate reductase-thymidylate synthase (pyrimethamine and its analogs), 8 8) antiparasitic agents affecting DNA topoisomerases (e.g., anticancer acridines) 53 and 9) artemisinin-type antimalarials and other simple bicyclic and tetracyclic endoperoxides (including lactone ring-opened analogs of the trioxane). 2-5,44,51,54-65,66-68,72 These antimalarial endoperoxides appear to have a two-step mode of action. In the first step, the artemisinin compounds are activated by heme or molecular iron to produce free radicals and electrophilic (alkylating) intermediates. In the second step, these reactive species react with and damage specific malarial membrane-associated proteins. Other compounds for which no specific mode of action has been found or defined, but which have been reported as antimalarial agents, were also included.41,50,82 Figure 1 shows a representative sample of such active compounds.

Later, two k-means cluster analyses (k-MCA) were performed, one for the active and one for the inactive series of compounds, which permitted us to split the dataset (1562 organic chemicals) into training and predicting series.83,84 That is, all cases were processed using k-MCA in order to design the training and predicting data series in a "rational" way. The main idea consists in partitioning either the active or the inactive series of chemicals into several statistically representative classes of compounds. One may then select the members of the training and predicting series from all of these classes. This procedure ensures that any chemical class (as determined by the clusters derived from k-MCA) will be represented in both series of compounds.
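The cluster-based design of the two series can be sketched as below. This is a rough illustration, not the procedure run in STATISTICA: the plain Lloyd-style k-means, the function names, the 25% test fraction and the random seeds are all assumptions of this sketch. The same routine would be applied separately to the active and to the inactive compounds.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means on a descriptor matrix X (rows = compounds).
    Returns one cluster label per compound."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance of every compound to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

def cluster_stratified_split(labels, test_fraction=0.25, seed=0):
    """Draw the test compounds at random from within each cluster, so that
    every sufficiently large cluster contributes to both series."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_test = int(round(test_fraction * len(idx)))
        test.extend(idx[:n_test].tolist())
        train.extend(idx[n_test:].tolist())
    return np.array(sorted(train), dtype=int), np.array(sorted(test), dtype=int)
```

Clusters with very few members may end up entirely in the training series with this rounding rule; a real design would need to check cluster sizes before splitting.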

Chemometric Methods
k-means cluster analysis (k-MCA). The statistical software package STATISTICA was used to develop the k-MCA.85 The number of members in each cluster and the standard deviation of the variables in the cluster (kept as low as possible) were taken into account in order to obtain an acceptable statistical quality of the data partition into clusters. We also inspected the standard deviation (SS) between and within clusters, the respective Fisher ratio and its p-level of significance, which was required to be lower than 0.05.83,84

Linear Discriminant Analysis. Although several chemometric techniques exist for finding good discriminant functions, such as SIMCA or neural networks, we selected linear discriminant analysis (LDA) to generate the classifier function because of the simplicity of the method. This statistical analysis permits the classification of new compounds as active or inactive from their molecular descriptors. LDA was carried out with the STATISTICA software.85 The tolerance parameter (the proportion of variance that is unique to the respective variable) was set to the default value for minimum acceptable tolerance, which is 0.01. Forward stepwise was fixed as the strategy for variable selection. The principle of parsimony (Occam's razor) was taken into account as the strategy for model selection. Accordingly, we selected the model with high statistical significance but with as few parameters ($a_k$) as possible, which maximizes the degrees of freedom. In Eq. 10, the $a_k$ are the coefficients of the classification function, determined by the least-squares method as implemented in the LDA module of STATISTICA.85 The quality of the models was determined by examining Wilks' λ parameter (U-statistic), the squared Mahalanobis distance ($D^2$), the Fisher ratio (F) and the corresponding p-level (p(F)), as well as the percentage of good classification in the training and test sets. Models with a ratio between the number of cases and the number of variables in the equation lower than 5 were rejected. The Wilks' λ statistic is helpful for evaluating the total discrimination, and can take values between zero (perfect discrimination) and one (no discrimination). $D^2$ indicates the separation of the respective groups. The biological activity (antimalarial in this case) was codified by a dummy variable "Class". This variable indicates the presence of either an active compound (Class = 1) or an inactive compound (Class = -1). The classification of cases was performed by means of the posterior classification probabilities. This is the probability that the respective case belongs to a particular group (active or inactive), and it is proportional to the Mahalanobis distance. In short, the posterior probability is the probability, based on our knowledge of the values of other variables, that the respective case belongs to a particular group. By using the models, a compound can then be classified as active if ∆P% > 0, where ∆P% = [P(Active) - P(Inactive)] × 100, or as inactive otherwise. P(Active) and P(Inactive) are the probabilities with which the equations classify a compound as active and inactive, respectively.

On the other hand, validation is a crucial aspect of any QSAR/QSPR modeling.86,87 One of the most popular validation criteria is the leave-one-out (LOO) cross-validation method (internal validation). This method systematically removes one data point at a time from the data set. A QSAR/QSPR model is then constructed based on this reduced data set and subsequently used to predict the removed data point. This procedure is repeated until a complete set of predictions is obtained. Good results in this experiment can be considered as proof of the high predictive ability of the models. However, this assumption is generally incorrect, and good LOO results may fail to correlate with high predictive ability of QSAR/QSPR models.86,87 Thus, good behavior of models in an LOO procedure appears to be a necessary but not a sufficient condition for the models to have high predictive power. In this sense, Golbraikh and Tropsha 87 emphasized that the predictive ability of a QSAR/QSPR model can be estimated only by using a test set (external validation) of compounds that were not used for building the model. For this reason, in order to assess the predictability of the obtained models, external validation procedures were carried out; that is, the statistical robustness and predictive power of the obtained models were assessed using a prediction (test) set. In the present work, a leave-group-out (LGO) cross-validation strategy was also carried out.86 In this case, 10% of the data set was used as the group size, i.e. groups including 10% of the training data set are left out and predicted by the model built on the remaining 90%. This process was carried out 10 times on 10 unique subsets. In this way, every observation was predicted once (in its group of left-out observations). The overall mean for this process (10% full leave-out cross-validation) was used as a good indication of the robustness and stability of the obtained models.

Finally, the calculation of the percentages of global good classification (accuracy), sensitivity (also known as 'hit rate'), specificity, false positive rate (also known as 'false alarm rate') and the Matthews correlation coefficient (MCC) in the training and test sets permits the assessment of the model.88 While the sensitivity is the probability of correctly predicting a positive example, the specificity is the probability that a positive prediction is correct. On the other hand, MCC quantifies the strength of the linear relation between the molecular descriptors and the classifications, and it may often provide a much more balanced evaluation of the prediction than, for instance, the percentages.88
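The ∆P% decision rule and the assessment statistics named above can be written compactly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, classes are coded 1 (active) and -1 (inactive) as in the text, and "specificity" follows the definition given here (the probability that a positive prediction is correct), which other authors call precision.

```python
import math

def delta_p_percent(p_active, p_inactive):
    """Delta-P% = [P(Active) - P(Inactive)] x 100."""
    return (p_active - p_inactive) * 100.0

def classify(p_active, p_inactive):
    """Class = 1 (active) if Delta-P% > 0, otherwise Class = -1 (inactive)."""
    return 1 if delta_p_percent(p_active, p_inactive) > 0 else -1

def assessment(y_true, y_pred):
    """Accuracy, sensitivity, specificity (as defined in the text),
    false positive rate and Matthews correlation coefficient."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)

    def ratio(num, den):
        # Undefined ratios are reported as NaN instead of raising an error.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),          # hit rate
        "specificity": ratio(tp, tp + fp),          # positive prediction is correct
        "false_positive_rate": ratio(fp, fp + tn),  # false alarm rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

A compound with equal posterior probabilities (∆P% = 0) falls in the "inactive otherwise" branch of the rule, which is how `classify` treats it.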
Orthogonalization of Descriptors.91-95 This process is an approach in which molecular descriptors are transformed in such a way that they do not mutually correlate. The main philosophy of this approach is to avoid the exclusion of descriptors on the basis of their collinearity with other variables previously included in the model. Both the non-orthogonal descriptors and the derived orthogonal descriptors contain the same information.91-95 It is known that the interrelatedness among the different descriptors can result in highly unstable regression coefficients, which makes it impossible to know the relative importance of an index and undermines the utility of the regression coefficients in a model. However, in some cases strongly interrelated descriptors can enhance the quality of a model, because the small fraction of a descriptor which is not reproduced by its strongly interrelated pair can provide positive contributions to the modeling. On the other hand, the coefficients of a QSAR model based on orthogonal descriptors are stable to the inclusion of novel descriptors, which permits the interpretation of the regression coefficients and the evaluation of the role of individual fingerprints in the QSAR model. The procedure has been described in detail elsewhere;91-95 thus, we give only a general overview here. The first step in orthogonalizing the molecular descriptors included in the models is to select the appropriate order of orthogonalization, which in this case is the order in which the variables were selected in the forward stepwise search procedure of the statistical analysis.95 The first variable ($V_1$) is taken as the first orthogonal descriptor, ${}^{1}O(V_1)$, and the second one ($V_2$) is orthogonalized with respect to it [${}^{2}O(V_2)$] by taking the residual of its correlation with ${}^{1}O(V_1)$, which is that part of the descriptor $V_2$ not reproduced by ${}^{1}O(V_1)$. Similarly, from the regression of $V_3$ versus ${}^{1}O(V_1)$, the residual is the part of $V_3$ that is not reproduced by ${}^{1}O(V_1)$, and it is labeled ${}^{1}O(V_3)$. The orthogonal descriptor ${}^{3}O(V_3)$ is obtained by repeating this process in order to also make it orthogonal to ${}^{2}O(V_2)$. The process is repeated until all variables are completely orthogonalized, and the orthogonal variables are then used to obtain the new model.
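The sequential procedure described above can be sketched with simple linear regressions. This is a minimal illustration, assuming that each residual comes from a one-variable least-squares regression with an intercept and that the columns of the input matrix are already in the chosen orthogonalization order; the function names are hypothetical.

```python
import numpy as np

def regression_residual(y, x):
    """Part of y not reproduced by a simple linear regression on x (with intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def orthogonalize(V):
    """Sequentially orthogonalize the columns of V (rows = compounds).
    Column 0 is kept as is; each later column is replaced by what remains
    after regressing it, in turn, on every earlier orthogonal descriptor."""
    V = np.asarray(V, dtype=float)
    O = np.empty_like(V)
    O[:, 0] = V[:, 0]
    for j in range(1, V.shape[1]):
        r = V[:, j].copy()
        for i in range(j):
            r = regression_residual(r, O[:, i])
        O[:, j] = r
    return O
```

For example, applying `orthogonalize` to a descriptor matrix with three correlated columns yields three columns whose pairwise correlation coefficients are zero up to rounding error, which is the property the regression coefficients rely on.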

Chemistry
IR spectra were recorded with an FTIR-BOMEM spectrometer using KBr disks for solids or a NaCl cell for liquids (ν in cm⁻¹). 1H NMR and 13C NMR spectra were recorded on a Bruker ADPX-300 (300 MHz) using CDCl3 as solvent. The spectra were calibrated on the TMS (internal, 1H) and CDCl3 (13C) signals: δ 1H (TMS) = 0; δ 13C (CDCl3) = 77.0. Chloroquine diphosphate was supplied by "Fundação para o Remédio Popular" (Brazil). All solvents were dried and purified before use, according to standard procedures established in the literature.96,97

Determination of in vitro Antiplasmodial Activity
In vitro antiplasmodial evaluation was performed using the susceptibility microtechnique.98 Two strains of Plasmodium falciparum, K1 (chloroquine resistant) and Palo Alto (chloroquine sensitive), kindly provided by the WHO Registry of Standard Strains of Malaria Parasites at the University of Edinburgh, were continuously maintained in culture and used in these assays.99 The freezing and thawing procedures for the parasites were based on those previously described.100 The parasites were cultivated at 5% hematocrit in RPMI 1640 medium with 25 mM HEPES, 21 mM sodium bicarbonate, 370 µM hypoxanthine, 40 µg/mL gentamicin, and 10% human A+ or O+ serum provided by Fundação Pró-Sangue/Hemocentro de São Paulo. Washed human O+ erythrocytes were added to the culture as necessary. Synchronization was obtained by treatment with D-sorbitol when the parasites were predominantly in the young trophozoite stage.101 Stock solutions of the compounds (1000 pmol/100 µL of ethanol) were used to prepare different concentrations (1, 2, 4, 6, 8, 16, 32 and 100 pmol/well) in aqueous solution. A stock solution of chloroquine diphosphate (1000 pmol/100 µL in water) was used to prepare a series of concentrations (1, 2, 4, 6, 8, 16 and 32 pmol/well) to check the sensitivity of the isolates. Flat-bottomed microtitre plates were dosed by adding 100 µL of each concentration per well. The plates were dried at 37 °C and stored at 4 °C. An aliquot of 100 µL of culture with a parasitemia of 0.5-1.0% and parasites in the young trophozoite stage was added to each well of the microtitre plates. A control without compound and a chloroquine sensitivity test were performed in parallel. Microplates were incubated in a candle jar with a gas mixture of 3% CO2, 5% O2 and 92% N2, and maintained at 37 °C for 24-36 h. Giemsa-stained thick blood smears were prepared from each well when the controls showed the presence of schizonts by optical microscopy. The number of schizonts was counted per 200 asexual parasites, and the tests were considered valid when this number was equal to or greater than 10%. The minimum inhibitory concentration (MIC) of each compound was defined as the lowest concentration that completely inhibited schizont maturation.
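The read-out rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the per-well schizont counts are hypothetical, and "complete inhibition" is taken to mean that no schizonts were counted in the well.

```python
from typing import Dict, Optional


def control_is_valid(schizonts: int, parasites_counted: int = 200,
                     min_fraction: float = 0.10) -> bool:
    """A test counts as valid when schizonts make up at least 10% of the
    asexual parasites counted in the drug-free control."""
    return schizonts / parasites_counted >= min_fraction


def minimum_inhibitory_concentration(
        schizonts_by_dose: Dict[float, int]) -> Optional[float]:
    """Lowest dose (pmol/well) with no schizonts; None if no dose inhibits."""
    inhibiting = [dose for dose, count in schizonts_by_dose.items() if count == 0]
    return min(inhibiting) if inhibiting else None


# Hypothetical schizont counts per 200 asexual parasites (not data from the paper),
# keyed by the dose series used for the compounds (pmol/well).
example_wells = {1: 41, 2: 38, 4: 30, 6: 19, 8: 7, 16: 0, 32: 0, 100: 0}
example_mic = minimum_inhibitory_concentration(example_wells)
print(control_is_valid(schizonts=46), example_mic)
```

A return value of None corresponds to the "MIC > 100 pmol/well" situation, in which no assayed concentration fully blocks schizont maturation.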

Training and test sets design through k-means cluster analysis
The first step in this study was the design of the training and prediction series so as to prevent a non-random distribution of chemicals between the two sets. This was achieved using k-means cluster analysis (k-MCA).83,84 This "rational" design allowed us to build two sets that are both representative of the entire "experimental universe". We first carried out a k-MCA with the active compounds and afterwards with the inactive ones. A first k-MCA (I) split the antimalarials into 20 clusters. Then, the training and prediction sets were selected by taking compounds at random from each cluster. From the 1562 compounds, 1120 were chosen at random to form the training set, 437 of them active and 683 inactive. The great structural variability of the selected training data set makes possible the discovery of lead compounds not only with specific mechanisms of antimalarial activity but also with novel modes of action. This is illustrated later in this paper in a virtual experiment for lead generation. The remaining subseries, composed of 160 antimalarials and 282 compounds with different biological properties, were set aside as test sets for the external validation of the models. These compounds were never used in the development of the classification models. Figure 2 graphically illustrates the procedure described above, in which two independent cluster analyses (one for active and the other for inactive chemicals) were performed to select a representative sample for the training and test sets. The kth total and atom-type non-stochastic quadratic indices were used, with all variables showing p-levels < 0.05 for the Fisher test. From the k-MCA, it can be concluded that the structural diversity of the antimalarials known to date (as codified by TOMOCOMD-CARDD descriptors) may be described by at least 20 statistically homogeneous clusters of chemicals.
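The selection step (random sampling within each cluster, so that every cluster contributes to both sets) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the cluster labels are assumed to come from a prior k-means run, and the function name, the 70% training fraction and the toy cluster sizes are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def split_by_cluster(cluster_of: Sequence[Hashable], train_fraction: float,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly assign the members of every cluster to training or test set.

    cluster_of[i] is the cluster label of compound i (e.g. from a k-MCA).
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_of):
        members[label].append(index)

    train, test = [], []
    for label in sorted(members, key=str):
        indices = members[label][:]
        rng.shuffle(indices)                      # random choice inside the cluster
        n_train = round(train_fraction * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Toy labels: three hypothetical clusters with 10, 20 and 30 compounds.
toy_labels = ["A"] * 10 + ["B"] * 20 + ["C"] * 30
train_idx, test_idx = split_by_cluster(toy_labels, train_fraction=0.7)
print(len(train_idx), len(test_idx))
```

In the setting of this study the procedure would be run twice, once on the clusters of active compounds and once on those of inactive compounds, and the two training (and test) subsets would then be pooled.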

Developing Classification Functions
As the key step of the present study, we developed two classification functions using topological descriptors computed with the TOMOCOMD-CARDD software.15 These linear models are given below together with their statistical parameters: $\mathrm{Class} = -10.059 - 0.08844\,q_0(x) + 0.07085\,q_1(x) + 0.18907\,q_0 \ldots$ In the training data set, the percentage of false actives for model 11 was only 3.66%, i.e., 25 of the 683 inactive compounds were classified as active. Conversely, 34 of the 437 active compounds were misclassified as inactive (7.78% misclassification). The statistical analysis of model 12 showed similar results. In this case, the overall accuracy of the model was 93.13%. Only 4.98% of the inactive group was misclassified (34 of the 683 inactive compounds were classified as active), and 43 of the 437 active compounds (9.84%) were false inactives. The classification of all compounds in the complete training data set provides some assessment of the goodness of fit of the models, but it does not provide a thorough criterion of how well the models can predict the biological properties of new compounds. To assess such predictive power, the use of an external test set is essential. Accordingly, the activity of the compounds in the test set was predicted with the two discriminant functions obtained. The overall accuracy for this group was above 93% for both models. It can be seen that the number of misclassified inactive compounds is relatively low for both models. This is desirable in an adequate model, because false actives are inactive compounds that would be sent to biological assays, resulting in a loss of time and resources.
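The percentages quoted for model 11 follow directly from its training-set confusion matrix (683 inactive compounds with 25 false actives, 437 active compounds with 34 false inactives). The short sketch below recomputes them, together with the Matthews correlation coefficient; the function name is ours and the formulas are the standard textbook definitions, not code from the TOMOCOMD-CARDD software.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary-classification statistics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),          # good classification of actives
        "specificity": tn / (tn + fp),          # good classification of inactives
        "false_positive_rate": fp / (tn + fp),  # false actives
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Model 11, training set: counts taken from the text above.
model_11 = classification_metrics(tp=437 - 34, fn=34, tn=683 - 25, fp=25)
for name, value in model_11.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 94.73% and the false-active rate 3.66%, as stated in the text, and the Matthews coefficient comes out at about 0.889.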
The results of the global classification of compounds, in both the training and external prediction sets, are shown in Table 1. This table also lists the parameters most commonly used in medical statistics (accuracy, sensitivity, specificity and false positive rate) and the Matthews correlation coefficient (MCC) for both models.88 These models, Eqs. 11 and 12, showed a high MCC of 0.89 (0.87) and 0.86 (0.86) in the training (test) sets, respectively. A second experiment, following a leave-group-out (LGO) strategy, was carried out for both models as an internal validation procedure.86 The overall mean correct classification in this process for Eq. 11 and Eq. 12 was 93.93% and 92.77%, respectively. For a 10% full leave-out cross-validation procedure, this level of cross-validated classification is a good indication of the robustness and stability of the obtained models. The results of the LGO procedure are shown in Table 2. In summary, the percentages of good classification in the training and external data sets, together with an internal cross-validation procedure, permitted us to assess the models. A close inspection of the molecular descriptors included in both LDA-based QSAR models showed that several of these fingerprints are strongly interrelated.
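A leave-group-out loop of the kind described here can be sketched as below. The sketch assumes that the ten groups are disjoint random subsets of roughly 10% of the data each; the text does not say whether the subsets were disjoint. The midpoint-threshold classifier is only a stand-in for the LDA models, and all names and toy data are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def leave_group_out_accuracy(X: Sequence[float], y: Sequence[int],
                             fit: Callable[[List[float], List[int]], Callable[[float], int]],
                             n_groups: int = 10, seed: int = 0) -> float:
    """Mean percentage of correct classification over held-out groups."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    groups = [indices[g::n_groups] for g in range(n_groups)]  # disjoint ~10% subsets

    accuracies = []
    for held_out in groups:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])  # refit without the group
        hits = sum(predict(X[i]) == y[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return 100.0 * sum(accuracies) / len(accuracies)


def fit_midpoint_threshold(X_train: List[float], y_train: List[int]) -> Callable[[float], int]:
    """Stand-in classifier (not LDA): cut halfway between the two class means."""
    actives = [x for x, label in zip(X_train, y_train) if label == 1]
    inactives = [x for x, label in zip(X_train, y_train) if label == 0]
    cut = (sum(actives) / len(actives) + sum(inactives) / len(inactives)) / 2
    return lambda x: 1 if x > cut else 0


# Toy, perfectly separable one-descriptor data set (hypothetical).
X_toy = [0.1 * i for i in range(50)] + [10 + 0.1 * i for i in range(50)]
y_toy = [0] * 50 + [1] * 50
lgo_score = leave_group_out_accuracy(X_toy, y_toy, fit_midpoint_threshold)
print(lgo_score)
```

The essential point is that the model is refitted for every group without the held-out compounds, so each prediction is made for compounds the model has not seen.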
Table 3 summarizes the results of the orthogonalization of the molecular descriptors included in both models. Equations 11a and 12a in that table correspond to the final models with the orthogonalized molecular indices. Here we use the symbols ${}^{m}O(q_k(x))$, where the superscript m expresses the order of importance of the variable $q_k(x)$ after a preliminary forward stepwise analysis and O means orthogonal.
Model derived with orthogonal atom, atom-type and total non-stochastic quadratic indices. Model derived with orthogonal atom, atom-type and total stochastic quadratic indices.
It must be highlighted that the orthogonal descriptor-based models coincide with the collinear (i.e., ordinary) TOMOCOMD-CARDD descriptor-based models in all statistical parameters. That is to say, the statistical coefficients of the LDA-QSARs (λ, F, MCC, accuracy, %(+) [good classifications in the active group] and %(-) [good classifications in the inactive group]) are the same whether we use a set of non-orthogonal descriptors or the corresponding set of orthogonal indices.90,91 Only the $D^2$ values differ between the two sets of equations. This is because all the variables were standardized before the orthogonalization process was carried out. In standardization, all values of the selected variables (molecular descriptors) are replaced by standardized values, computed as follows: Std. score = (raw score - mean)/Std. deviation. LDA algorithms at some point need to assess the distances between group centroids (or between cases and centroids) and, when computing $D^2$ distances, LDA needs to decide on a scale. Because the different molecular fingerprints included here use entirely different types of scales, the data were standardized so that each variable has a mean of 0 and a standard deviation of 1. This also makes it possible to interpret the coefficients in the LDA-QSAR equations. Accordingly, ${}^{m}O(q_k(x))$ may be classified according to the distance k into short- (0-5), mid- (6-10), and long-range non-stochastic and stochastic quadratic indices. The information in Table 3 clearly shows that the major contribution to antimalarial activity is provided by short-range TOMOCOMD-CARDD descriptors.
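The standardization step, Std. score = (raw score - mean)/Std. deviation, can be illustrated with a short sketch. The descriptor values below are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, stdev
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Replace raw descriptor values by z-scores: (raw score - mean) / std. deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (assumed estimator)
    return [(value - mu) / sigma for value in column]


# Two hypothetical descriptors measured on very different scales.
short_range_index = [12.0, 15.5, 9.8, 14.1, 11.6]
long_range_index = [10450.0, 18720.0, 8030.0, 15990.0, 9940.0]

z_short = standardize(short_range_index)
z_long = standardize(long_range_index)
print([round(z, 3) for z in z_short])
print([round(z, 3) for z in z_long])
```

After this transformation both columns have mean 0 and standard deviation 1, so neither descriptor dominates a distance calculation merely because of its units.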

Comparative Analysis of the Obtained Structure-Based Classification Models for Describing the Antimalarial Activity of a Heterogeneous Series of Compounds
In a previous paper, some of the present authors reported two classification models of antimalarial activity using the same training data set, but including non-stochastic and stochastic linear indices.27 With the aim of comparatively evaluating the ability of the non-stochastic and stochastic quadratic indices to encode chemical information, as well as the quality of the resulting LDA-based classification models, we examined several statistical parameters. Table 4 summarizes the main results achieved with both kinds of TOMOCOMD-CARDD descriptors (based on quadratic and linear maps).
a Equations 11 and 12 are reported in this work; models 13 and 14 were obtained previously by the present authors using non-stochastic and stochastic linear indices.27 Equations 15 and 16 were reported by Gozalbes et al.6 in two different studies: Eq. 15 for the classification of antimalarial drugs versus non-antiprotozoan drugs, and Eq. 16 for the discrimination between antimalarials and antiprotozoan drugs without antimalarial activity. b LDA refers to linear discriminant analysis. c Matthews correlation coefficient. d Only largely represented families were considered.
Using the models obtained here (Eqs. 11 and 12), which include non-stochastic and stochastic quadratic indices, 94.73% and 93.13% of the compounds in the training data set were correctly classified. As can be observed in Table 4, models 13 and 14, obtained with non-stochastic and stochastic linear indices,27 show lower values for this parameter: accuracies of 94.02% (93.42%) and 91.52% (90.50%) in the training (test) set, respectively. The models reported in this work also showed a higher MCC than the models obtained in our previous study. As can be seen, the models developed with quadratic map-based TOMOCOMD-CARDD descriptors (Eqs. 11 and 12) show better parameters in all cases than the models developed with linear map-based TOMOCOMD-CARDD indices (Eqs. 13 and 14; see also equations 10 and 11 in reference 27). We can therefore conclude that quadratic indices codify useful chemical information and yield classification models comparable to or even better than those obtained using analogous descriptors already reported. On the other hand, in the last decade another in silico method has also been used to develop two structure-based classification models (Eqs. 15 and 16 in Table 4) of antimalarial activity, which give a good discrimination of this activity in large and heterogeneous series of organic compounds.6
We also intend to compare both approaches in order to show the potential of our method. Because the experimental data used to build the QSARs differ in composition, a "strict" comparison between the method reported previously (ref 6) and the current approach is not feasible. However, a relative comparison can be based on the kind of method used to derive the QSAR and its statistical parameters, the number and diversity of chemical structural patterns contained in the data, the overall accuracy (%), the Matthews correlation coefficient, and the method used to validate the models. Table 4 also shows these chemometric coefficients for all approaches. The global good classification in the training set of the quadratic map-based TOMOCOMD-CARDD models was higher than that of the two reported LDA equations (see Table 4). It is remarkable that the TOMOCOMD-CARDD models were derived from training series 27.3 (1120/41) and 24.9 (1120/45) times larger than the series used by Gozalbes et al.6 Likewise, the overall accuracy in the test sets of the quadratic map-based TOMOCOMD-CARDD models was higher than that of the two reported LDA equations (see Table 4). Another remarkable aspect concerns the spectrum of structural patterns considered in the studies under comparison. Without doubt, a broader diversity of antimalarials was considered in the development of the TOMOCOMD-CARDD models reported here.

Virtual Screening of Ras FTase Inhibitors: An Experiment of Lead Generation
One of the most important aspects of any quantitative structure-activity relationship model is its ability to predict the desired activity for new compounds not included in the training data set.103,104 With the aim of testing the ability of our models to detect new lead compounds with "unknown" structures, we carried out a simulated virtual screening of inhibitors of farnesyltransferase (FTase) that showed potent antimalarial activity in cell assays.105 No compound with this kind of structure was included in the training data set, and in this sense the evaluation is equivalent to the discovery of new lead compounds using the developed models. In this simulation, 10 previously reported FTase inhibitors with potent antimalarial activity were classified as active or inactive with models 11 and 12. The results of the classification are shown in Table 5 and the molecular structures are illustrated in Scheme 1. As can be seen, both models correctly classify most of the 10 selected compounds. With model 11, only 3 FTase inhibitors were classified as false inactives (70% correct classification), while with model 12 the prediction has an overall accuracy of 100%. This result is in accordance with the character of the TOMOCOMD-CARDD approach, which implicitly considers, through the calculation of non-stochastic and stochastic quadratic molecular descriptors, the substructural and global features responsible for a specific activity. In this way, new lead compounds could be designed using the TOMOCOMD-CARDD method described in this paper.

Experimental Results: Discovery of Novel Quinolinic Intermediates as Antimalarial Compounds
The aim of the present work is the development of discriminant functions for the rational design (or selection/identification) of new antimalarial compounds. As shown above, we explored the ability of our classification models to find new active compounds by carrying out a lead-generation experiment with Ras FTase inhibitors. These results encouraged us to undertake a search for novel active compounds not yet described as antimalarials in the literature.
To this end, we also explored a large data set of organic chemicals through virtual screening in order to discover novel candidates for antimalarial drug-like compounds. A great number of the candidates detected with our models were sent to biological assays, and their presentation will be the objective of a forthcoming paper. Nevertheless, in this work we want to show some promising outcomes of this computational screening, which can represent an important starting point for the design of novel antimalarials. It is well known that the majority of compounds used in the treatment of malaria are quinolinic derivatives such as quinine, chloroquine, mefloquine, halofantrine and primaquine. Acyclic β-enamino esters and arylaminomethylenemalonates are synthetic intermediates of quinolinic compounds and can be obtained by economical and simple synthetic routes.106,107 On the other hand, few studies have addressed the biological activity of enamine compounds. Taking this into account, we explored in our search the behavior of some acyclic β-enamino esters and arylaminomethylenemalonates.
Three of these compounds were initially evaluated with models 11 and 12. In order to corroborate the predictions, they were then prepared in excellent yields by very economical and simple methods and evaluated against two strains of Plasmodium falciparum.
The acyclic β-enamino ester 1 was prepared by nucleophilic addition of the aromatic amine to the keto group of the corresponding β-keto ester, using a previously described methodology.108 Arylaminomethylenemalonates were synthesized by a "one pot" process, starting from equimolar quantities of the corresponding aniline, ethyl malonate and ethyl orthoformate in the presence of a catalytic amount of ZnCl2.109 Both general procedures are shown in Scheme 2. All the structures were confirmed by analysis of the spectroscopic data, which are given as Supporting Information.
Scheme 2. Synthetic Procedure for the Synthesis of Quinolinic Intermediates.
The results of the prediction process using models 11 and 12, as well as the minimum inhibitory concentration (MIC) of the three assayed compounds against the K1 and Palo Alto strains, are shown in Table 6. The sensitivity control of each strain was carried out with chloroquine diphosphate. The MIC of chloroquine for sensitive strains is 5.7 pmol/well, i.e., strains with a MIC above this value are considered resistant to this compound.110 In our study, the MIC determined for the K1 strain was 8 pmol/well (1.6 µmol/L) and for the Palo Alto strain 4 pmol/well (0.8 µmol/L), confirming the expected chloroquine susceptibility of the strains used (K1 resistant, Palo Alto sensitive). As expected, compound 1 did not show activity against the K1 and Palo Alto strains; inhibition of schizont maturation was observed only at 100 pmol/well. Compound 2 did not inhibit the growth of parasites at any of the assayed concentrations (MIC > 100 pmol/well). Conversely, and in accordance with the predictions, the best results were observed for compound 3, which showed a MIC of 32 pmol/well against the K1 strain and a MIC of 16 pmol/well against the Palo Alto strain. Taking into account that this is the first report of an arylaminomethylenemalonate with antimalarial activity, the result can be considered a very promising starting point for the future design and refinement of novel compounds with higher antimalarial activity. Admittedly, compound 3 was active only at higher doses than chloroquine diphosphate (the reference antimalarial drug), but this result leaves the door open to a virtual variation study of the structure of these compounds in order to improve their antimalarial activity. Other chemicals in the same family as compound 3, as well as chemicals outside this family, were also predicted to be antimalarials. The synthesis, characterization, and biological evaluation of these compounds are, however, beyond the scope of the present paper and will be discussed elsewhere. It is important to recall that the aim of this study is not to validate the model but to provide an experimental example of how the model can be used for potential drug discovery.
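The interpretation of the chloroquine controls and of compound 3 can be made explicit with a small sketch. The 5.7 pmol/well cut-off and the MIC values are those quoted in the text; the function and variable names are hypothetical.

```python
CHLOROQUINE_CUTOFF_PMOL = 5.7  # MIC of chloroquine for sensitive strains (pmol/well)


def chloroquine_phenotype(mic_pmol_per_well: float) -> str:
    """A strain whose chloroquine MIC exceeds the cut-off is taken as resistant."""
    return "resistant" if mic_pmol_per_well > CHLOROQUINE_CUTOFF_PMOL else "sensitive"


# MIC values (pmol/well) reported in the text.
chloroquine_mic = {"K1": 8, "Palo Alto": 4}
compound_3_mic = {"K1": 32, "Palo Alto": 16}

phenotypes = {strain: chloroquine_phenotype(mic) for strain, mic in chloroquine_mic.items()}
fold_vs_chloroquine = {strain: compound_3_mic[strain] / chloroquine_mic[strain]
                       for strain in chloroquine_mic}
print(phenotypes)
print(fold_vs_chloroquine)
```

On both strains the MIC of compound 3 is four times that of chloroquine, which quantifies the statement that the new compound is active only at higher doses than the reference drug.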

CONCLUDING REMARKS
The introduction and use of graph-theoretical descriptors for rational drug design has become an attractive tool for medicinal chemists. In this context, the fusion of high-throughput screening and classification-based QSAR models, in an attempt to minimize costs in terms of time and of financial, human, and animal resources, is becoming a viable alternative to massive screening. In this work, we have shown that the TOMOCOMD-CARDD approach can be applied to generate useful quantitative models for the classification of antimalarials. In a flexible way, this method permits the quick in silico discovery of new candidate lead compounds using a minimum of resources. By considering a training data set of compounds with considerable structural variability, we reduce the degree of uncertainty of this process. The simulated virtual screening of Ras FTase inhibitors with antimalarial properties has demonstrated the ability of our models to adequately discriminate new active compounds from inactive ones. The collected data on active compounds used in this study constitute an important tool not only for theoretical research but also for general scientific work in this area. Using the developed models, a new lead candidate has been identified as a promising starting point for the design of new arylaminomethylenemalonates with potent antimalarial activity. Work in this direction is currently in progress and will be published in a forthcoming paper. The interactive character of the TOMOCOMD-CARDD approach permits the future inclusion of new antimalarial drugs in the training data set and the generation of increasingly "intelligent" models. In this way, the newly considered structural patterns will be recognized by the models and a better discrimination of such compounds will be obtained. However, this point is outside the general scope of the present work.

Supporting Information Available:
The complete list of compounds used in the training and prediction sets, together with their structures and posterior classification according to models 11 and 12, and the chemistry and data analysis of the obtained chemicals, are available free of charge via the Internet at http://pubs.acs.org.

Figure 1. Random, but not exhaustive, sample of the molecular families of antimalarial agents studied here.
N = 1120, λ = 0.35, $D^2$ = 7.7, F(10, 1109) = 203.11, p < 0.0001, where N is the number of compounds, λ is Wilks' statistic, $D^2$ is the squared Mahalanobis distance, F is the Fisher ratio and p is the significance level. Model 11, which includes non-stochastic indices, correctly classified 94.73% of the compounds in the training data set, misclassifying only 59 of a total of 1120 compounds.

Table 1. Global Results of the Classification of Compounds in the Training and Test Sets. Eqs. 11 and 12 showed MCC values of 0.89 (0.87) and 0.86 (0.86) in the training (test) sets, respectively.

Table 2. Predictivity Based on the Use of Ten Randomly Selected Subsets (LGO Cross-Validation) of LDA Models.

Table 4. Comparative Analysis of the Obtained Structure-Based Classification Models for Describing the Antimalarial Activity of a Heterogeneous Series of Compounds.

Table 5. Results of the Virtual Screening Simulation of Peptidomimetic Inhibitors of Protein Farnesyltransferase (FTase) that Showed Potent Antimalarial Activity in Cell Assays.
a Compounds a-j were taken from Ohkanda et al., 2001 (Ref. 105). b Inhibition at 20 µM; RBC = red blood cell. c Results of the classification of compounds obtained from Eqs. 11 and 12, respectively.
Scheme 1. Molecular Structure of Peptidomimetic Inhibitors of Protein Farnesyltransferase (FTase) that Showed Potent Antimalarial Activity in Cell Assays.

Table 6. Synthetic Intermediates of Quinolinic Compounds Evaluated in the Present Study, their Classification (∆P%) According to the TOMOCOMD-CARDD Approach, their Antimalarial Activity against Two Malarial Strains, and the Antimalarial Activity of Chloroquine. Results of the classification of compounds obtained from Eqs. 11 and 12, respectively.