The Dragon Method in the Computational Identification of Novel Tyrosinase Inhibitors . Results Supported by Experimental Assays

QSAR (quantitative structure-activity relationship) studies of tyrosinase inhibitors employing Dragons descriptors and linear discriminant analysis (LDA) are presented here. A dataset of 653 compounds, 245 with tyrosinase inhibitory activity and 408 having other clinical uses were used. The active dataset was processed by k-means cluster analysis to design training and prediction series. Seven LDA-based QSAR models were obtained. The discriminant functions applied showed a globally good classification of 99.79% for the best model (Eq. 3) in the training set. External validation processes to assess the robustness and predictive power of the obtained model was carried out. This external prediction set had an accuracy of 99.44%. After that, the developed were used in ligand-based virtual screening of tyrosinase inhibitors from the literature and never considered in either training or predicting series. In this case, all screened chemicals were correctly classified by the LDA-based QSAR models. As a final point, these fitted models were used in the screening of new bipiperidines series as new tyrosinase inhibitors. The biosilico assays and in vitro results of inhibitory activity on mushroom tyrosinase showed a good correspondence. These results support the role of biosilico algorithm for the identification of new tyrosinase inhibitors compounds.


INTRODUCTION
Tyrosinase (EC.1.14.18.1) is a copper-containing enzyme widely distributed in nature including fungi, higher plants, and animals.This enzyme catalyzes two key reactions in the melanin biosynthesis pathway, the hydroxylation of monophenol to odiphenol (monophenolase activity) and conversion of an o-diphenol to the corresponding o-quinone (diphenolase activity), involving reactive oxygen species (ROS) [1,2].
Quinones are highly reactive compounds and can polymerize spontaneously to form highmolecular-weight compounds or brown pigments (melanins), or react with amino acids and proteins that enhance the brown color produced [3].
Alterations in melanin synthesis occur in many disease states like hyperpigmentation, melasma and age spots [4].Melanin pigments are also found in mammalian brain and tyrosinase may play a role in neuromelanin formation in the human brain.This mixedfunction oxidase could be central to dopamine neurotoxicity and may contribute to the neurodegeneration associated with Parkinson´s disease [5].Melanona specific anticarcinogenic activity is also known to be linked with tyrosinase activity [6].
The standard topical treatments for hyperpigmentation disorders include tyrosinase inhibitors, some compounds with the inhibitory activity are used in medicine, but the majority of them don't satisfy all requirements of clinical efficacy, or adverse effects are observed [4,7].As result of these clinical behaviors and other side effects, there has been a constant search to find new herbal or synthesized compounds with anti-tyrosinase activity [8][9][10].In this sense, one of our group's researches has been focused on finding new potent tyrosinase inhibitors through 'trial-and-error' techniques [11,12].By other way, the in silico techniques have proven its usefulness in the pharmaceutical research for the selection/identification and/or design/optimization of new chemical entities (NCE), to transform early stage drug discovery, particularly in terms of time-and cost-savings [13].QSAR approaches report a high incidence of the use of different molecular descriptors for the in silico drug screening [14][15][16][17].
The congeneric dataset used in SAR and QSAR studies of tyrosinase inhibitors [11,12,[18][19][20]. don't provide the enough tools for drug development, this kind of data only can be applied to structural lead optimization.Therefore database of heterogeneous compounds may be a successful tool in QSAR research of tyrosinase inhibitors and the discovery of novel lead compounds with different structural features and more effective activity [21][22][23].
In the present paper, we used the Dragon descriptors, extensively applied to describe biological activities [24,25].and linear discriminant analysis (LDA) strategy to find classification functions that allows discriminate tyrosinase inhibitors compounds from inactive ones.As a final point, the in silico selection (identification), isolation, and in vitro assays of a new series of compounds was carried out to show the applicability of Dragon descriptors in the biosilico drug discovery processes.

Chemical Dataset.
Selected data set of this study was constructed warranting enough molecular diversity on it.Taking this into account, we selected a data set of 653 organic-chemicals having a great structural variability, 245 of them having tyrosinase inhibitory activity reported and the rest inactive ones [26] (408 compounds having different clinical uses, such as antivirals, sedative/hypnotics, diuretics, anticonvulsivants, hemostatics, oral hypoglycemics, antihypertensives, antihelminthics, anticancer compounds, and so on) was employed.
The great structural variability of chemicals in training and prediction series can assure an adequate extrapolation power.In this sense, the selection process is not

Dragon Molecular Descriptors.
The molecular descriptors were calculated using the Dragon [37] software; these were the Constitutional, Topological, BCUT, Galvez topological charge, 2D autocorrelations, empirical and properties descriptors [38].Descriptors with constant values inside each group were discarded.For the remaining descriptors, a pairwise correlation analysis for all families of descriptors was carried out.The presented exclusion method was used to reduce, in a first step, the collinearity and correlation between descriptors.

Chemometric Techniques. k-Means Cluster Analysis(k-MCA).
The statistical software package STATISTICA was used to develop the k-MCA [39].The number of members in each cluster and the standard deviation of the variables in the cluster (kept as low as possible) were taken into account, to have an acceptable statistical quality of data partitions in the clusters.The values of the standard deviation (SS) between and within clusters, of the respective Fisher ratio and their p level of significance, were also examined [40,41].Finally, before carrying out the cluster processes, all the variables were standardized.In standardization, all values of selected variables (molecular descriptors) were replaced by standardized values, which are computed as follows: Std.score = (raw score -mean)/Std.deviation.
Linear Discriminant Analysis (LDA).LDA was carried out with the STATISTICA software [39].The considered tolerance parameter (proportion of variance that is unique to the respective variable) was the default value for minimum acceptable tolerance, which is 0.01.A forward stepwise search procedure was fixed as the strategy for variable selection.The principle of parsimony (Occam's razor) was taken into account as a strategy for model selection.In connection, we selected the model with a high statistical significance but having as few parameters (ak) as possible.The quality of the models was determined by examining Wilks' ì parameter (U statistic), the square Mahalanobis distance (D2), the Fisher ratio (F), and the corresponding p level [p(F)] as well as the percentage of good classification in the training and test sets.Models with a proportion between the number of cases and variables in the equation lower than 5 were rejected.
The biological activity was codified by a dummy variable "Class".This variable indicates the presence of either an active compound [(Class ) 1] or an inactive compound [Class )-1].The classification of cases was performed by means of the posterior classification probabilities.By using the models, one compound can then be classified as active, if ΔP% > 0, being ΔP% ) [P(Active) -P(Inactive)]100, or as inactive otherwise.P(Active) and P(Inactive) are the probabilities with which the equations classify a compound as active or inactive, respectively.
The statistical robustness and predictive power of the obtained model was assessed using a prediction (test) set.Finally, the calculation of percentages of global good classification (accuracy), sensibility, specificity (also known as "hit rate"), false positive rate (also known as "false alarm rate"), and Matthews' correlation coefficient (MCC) in the training and test sets permitted the assessment of the model [42].

Orthogonalization of Descriptors.
In this study, the Randić method of orthogonalization was used [43].This orthogonalization process of molecular descriptors was introduced by Randić several years ago as a way to improve the statistical interpretation of the models by using interrelated indices.This method has been described in detail in several publications [43][44][45][46][47][48][49].
Because the different molecular descriptors included here used entirely "different types of scales", the data were standardized so that each variable has a mean 0 and a standard deviation of 1.In standardization, all values of selected variables (molecular descriptors) were replaced by standardized values, which are computed as follows: Std.score = (raw score -mean)/Std.deviation.

Chemical Procedures.
The synthesis and structural characterization of the bipiperidine series and biological studies and cross references has been reported in some detail elsewhere by other of our research team [50].

Experimental Corroboration of Tyrosinase Inhibitory Activity.
Tyrosinase inhibition assay was performed with kojic acid and L-mimosine as standard inhibitors for the tyrosinase in a 96-well microplate format using a SpectraMax 340 micro-plate reader (Molecular Devices, CA, USA) according to the method developed by Hearing [51].Briefly, first the compounds were screened for the odiphenolase inhibitory activity of tyrosinase using L-DOPA as substrate.All the active inhibitors from the preliminary screening were subjected to IC 50 studies.Compounds were dissolved in methanol to a concentration of 2.5%.Thirty units of mushroom tyrosinase (28 nM from Sigma Chemical Co., USA) was first preincubated with the test compounds in 50 nM Na-phosphate buffer (pH 6. Here the B and S are the absorbances for the blank and samples, respectively.After screening of the compounds, median inhibitory concentrations (IC50) were also calculated.All the studies have been carried out at least in triplicates and the result represent the mean ± SEM (standard error of the mean).Kojic acid and L-mimosine were used as standard inhibitors for the tyrosinase and both of them were purchased from Sigma Chem.Co., USA.

Design of training and test set.
In first place the molecular diversity of active compounds should be assured , and in this sense a hierarchical cluster analysis (CA) is developed with the STATISCA software [39].Figure 2 show a dendogram, where can be observe a largely number of different subsets proving the structural diversity of the active data set (tyrosinase inhibitors).
Inactive compounds selected for both, training and test set, we chose at random 408 drugs, having a series of other clinical uses.The classifications of these compounds as 'inactive' (non-inhibitors of tyrosinase) does not assure that any inhibitory activity do not exist for those organic-chemicals that not have been detected.This problem can be reflected in the results of classification for the series of inactive chemicals [52].
Second, a k-MCA is carried out to ensure that any chemical subsystem selected will  Afterwards, the selection of the training and prediction sets for the active database was performed by taking, in random way, compounds belonging to each cluster.From these 653 chemicals, 474 were chosen at random to form the training set, being 182 of them actives and 292 inactive ones.The remaining subseries composed of 63 tyrosinase inhibitors and 116 compounds with different biological properties were prepared as test set for the external validation of the classification models (179 compounds).These chemicals were never used in the development of the classification models.Figure 3 illustrates graphically the above-described procedure where one independent cluster analysis for active compounds and a random selection for the inactive compounds, were performed to select a representative sample for the training and test sets.3.2.Finding Discriminant Models.

Classification Functions.
Although they could be used many different chemometric techniques to fit discriminant functions, such as SIMCA or neural networks, in our case, we select the linear discriminant analysis (LDA) given the simplicity of the method, in order to derive discriminant functions that permit the classification of compounds as tyrosinase inhibitors or inactive ones.The LDA has become in an important tool for the prediction of chemical bioactive properties [47][48][49][53][54][55][56][57].
In the present study, we developed discriminant functions, using Dragon descriptors as independent variables.Here were obtained 7 LDA-based QSAR models.The models used the Constitutional, Topological, BCUT, Galvez topological charge, 2D autocorrelations, empirical and properties as molecular descriptors in this order (Eqs.2-8), respectively.The classification models obtained are given in Table 2, and the meaning of the variables included in the models, are depicted in Table 3.
. MCC quantifies the strength of the linear relation between the molecular descriptors and the classifications, and it may often provide a much more balanced evaluation of the prediction than, for instance, the percentages (accuracy) [42].
Also, we list in Table 4 most of the parameters commonly used in medical statistics [sensitivity, specificity and false positive rate (also known as 'false alarm rate')] for the whole set of developed models.While the sensitivity is the probability of correctly predicting a positive example, the specificity (also known as 'hit rate') is the probability that a positive prediction is correct [42].These statistical parameters mentioned above, together with the linear discriminant canonical statistics: canonical regression coefficient (R can ) as well as chi-squared (χ 2 ) and its p-level [p(χ 2 )] were checked and results are depicted in the same Table 4.The canonical transformations of the LDA results with the Topological descriptors (Eq. 3) give rise to canonical roots with a good canonical correlation coefficient of 0.99.
Chi-square test permits us to asses the statistical signification of this analysis as having a p-level <0.0001.

Validation Test.
The statistical parameters in the complete training dataset provide some assessment of the goodness of the fit of the models, but it is not enough to assure the predictive power of the models.For that reason we carried out an external validation process using a test set [58,59].
In this sense, the activity of the compounds in the test set was predicted with the obtained discrimination functions.The equation 3 shows a 99.44% (C = 0.99) in the prediction series.The results of the classifications for all models in the test set are depicted in Table 5.A plot of the ΔP% (see "Materials and Methods") from model 3, for each compound in the training and test sets is illustrated in Figure 4, where the good classification results obtained with the current approach can be observed.The accuracy and other statistical parameters (sensitivity, specificity and false positive rate) of the test set are depicted in Table 5.These results validate the models for the use in the ligand-based virtual screening taking into consideration that 85.0% is considered as an acceptable threshold limit for this kind of analysis [60].

Descriptors Orthogonalization Process.
In other hand, a good method to eliminate the collinearity is the pairwise correlation analysis, but the correlation between variables can persist, how was observed after a close inspection of the molecular fingerprints included in the best LDA-based QSAR model.In Table 6 we give the correlation coefficient of the molecular descriptors in Equation 3.
It is well known that interrelation among the molecular descriptors makes difficult the interpretation of the QSAR model [44][45][46][47][48][49], and underestimates the utility of the correlation coefficient in a model.To overcome this difficulty, we used the Randić's orthogonalization process of the molecular descriptors.The main philosophy of this approach is to avoid the exclusion of descriptors on the basis of its collinearity with other variables included in the model.However, in some cases strongly interrelated descriptors can enhance the quality of a model because the small fraction of a descriptor which is not reproduced by its strongly interrelated pair can provide positive contributions to the modeling.This process is an approach in which molecular descriptors are transformed in such a way that they do not mutually correlate (see Section 2.3).Both, the non-orthogonal (original) descriptors and the derived orthogonal descriptors contain the same information.Therefore, the same statistical parameters of the QSAR models are obtained [44][45][46][47][48][49].
In It is important remark here that the orthogonal descriptor-based models coincide with the collinear (i.e.ordinary) topological descriptors-based models in all the statistical parameters.The statistical coefficients of LDA-QSARs λ, F, D 2 , C, accuracy, are the same whether we use either a set of non-orthogonal descriptors or the corresponding set of orthogonal indices.This is not surprising, because the latter models are derived as a linear combination of the former ones and cannot have more information content than them [44][45][46][47][48][49].
In the process of orthogonalization the data were standardized so that each variable has a mean of zero and a standard deviation of 1, because the different molecular descriptors used entirely "different types of scales".

Novel Tyrosinase Inhibitors Through Virtual Screening Identification.
One of the most common approaches reported recently in the area of drug discovery are the in silico methods, this tools permit the assay of virtual libraries of chemical, and can predict ahead of time, the likely result of many-year biological-property study.This process is associated to the great costs involve in the discovering of new drug-like compound by the pharmaceutical industries.Virtual essays can be considered in this case a novel paradigm inside the new automation and information technologies, and can provide to the pharmaceutical industry with platforms to translate clinical liabilities into simple, fast and cost-effective in vitro screening assays, applicable to the early phases of drug discovery [61].
In this context and with the aim to prove the possibilities of the present approach for the ligand-based virtual screening of tyrosinase inhibitors, we chose 81 compounds, whose names are depicted in Table 9.These organic-chemicals were reported active compounds from the medicinal chemistry literature (see the last column of Table 7:  In the first instance, a k-NNCA was realized to observe the molecular variability in the data set of the virtual screening.As can be seen in dendrogram of Figure 5 there are many different subsystems, showing the great molecular diversity of the selected chemicals in this set. The results for the classification of the compounds in the virtual screening (external set) are summarized in Table 7..All screened chemicals including in this "simulated" virtual screening experiment were well classified as active for the best LDA-based QSAR model developed with the topological descriptors (Eq.3).A plot of ΔP% values of the good classification for the models 3 is given in Figure 6.This figure is a pictorial representation of the accuracy of the best model where the 100% compounds were classified well by Equation 3.  Several drugs were identified by the discrimination models as possible tyrosinase inhibitors.This result is the most important validation for the models developed here, because we have demonstrated that they are able to detect a series of drugs as active and these chemicals have shown the predicted activity.The drugs with some pharmacological uses selected as new lead tyrosinase inhibitors have well-established methods of synthesis as well as toxicological, pharmacodynamical and pharmaceutical behaviours are also well known.

In Silico Novel Tyrosinase Inhibitors and Experimental Results.
In the following section, and taking into account all the above steps describe in past sections, we were conducted to explore the ability of our discriminant models to find novel compounds.Besides, the good results in the algorithm presented encouraged us to carry out an in-silico screening to search novel active compounds not described yet in the literature as tyrosinase inhibitors.
As previously indicated, one of our research teams has been focused mainly on trialerror searching for new tyrosinase inhibitors [9,11,12].At the same time, we are also identifying new drug candidates using computational screening (based on QSAR techniques).For that reason, we perform in silico essays for bipiperidines series isolated and characterized from natural sources (herbal plants), searching novel tyrosinase inhibitors by using the discriminant functions obtained through the Dragons descriptors and LDA technique.
The LDA-based QSAR models were used to evaluate seven compounds and in order to corroborate the predictions, were prepared with excellent yields by very economic and simple methods, and evaluated in vitro against tyrosinase enzyme.In Table 8 they are given the ΔP% values of the compounds in this series, as well as their canonical scores using all the developed models.From these results, we can conclude that the current approach is a suitable alternative for the selection/identification of novel tyrosinase inhibitors which may be used to prevent or treat pigmentation disorders.
A very good coincidence among the theoretical predictions and the observed activity for all the compounds is observed.In the study of the inhibitory activity all seven compounds showed effectiveness in mushroom tyrosinase inhibition (see Table 8).The compound BP1 (IC 50 = 110.79μM),showed mild inhibition against the enzyme, the  A k-NNCA for all the active compounds included in the training, test, virtual screening sets and the novel chemicals was carried out.This hierarchical cluster analysis was developed to compare similarities between new discovered active compounds and the complete active dataset.The dendrogram ilustrates the great diversity of subsystems in the complete data under investigation (see Figure 8).An exhaustive analysis of each cluster showed that these new compounds were included in many clusters.
The principal impact of these models developed here, is its capability to recognize new tyrosinase inhibitors.This is one of the major goals and can be considered as a very promising tool for the future design of new compounds with higher tyrosinase activity.In this sense, compound BP4 presented more potent effect in the inhibition against the enzyme than L-Mimose (reference drug) and is available consider this organical-chemical as a hit for drug-discovery.The identification of novel structural subsystems can be making in search of drug-like compounds with such activity, after examining the  10. a,b,c,d,e,f,g ΔP% = [P(Active) -P(Inactive)]x100 as well as canonical scores of each compound in this set: (i) Classification of each compound using the obtained models with the Dragon descriptors in the following order: Eqs. 2, 3, 4, 5, 6, 7, and 8. h IC 50 are the 50% inhibitory concentrations against the enzyme tyrosinase and SEM. is the standard error of the mean.characterized compounds were done to corroborate the in silico results.New seven chemical exhibited anti-tyrosinase activity, proving that the algorithm presented can constitute a step forward in the search of new structural features with the activity.In this way, and looking for more efficient ways to discover new potent-selective tyrosinase inhibitors which may be used to prevent or treat pigmentation disorders; can be said that, predictive in silico models could be used for drug target identification, accelerating the selection process of lead compounds [64].

Figure 1 . 3 O
Figure 1.Random, but not exhaustive, sample of the molecular families of tyrosinase inhibitors studied here.
8) for 10 min at 25 o C. Then the L-DOPA (0.5 mM) was added to the reaction mixture and the enzymatic reaction was monitored by measuring the change in absorbance at 475 nm (at 37 o C) due to the formation of the DOPAchrome for 10 min.The percent inhibition of the enzyme was calculated as follows, by using MS Excel ®TM 2000 (Microsoft Corp., USA) based program developed for this purpose:

Figure 2 .
Figure 2. A dendrogram illustrating the results of the hierarchical k-NNCA of the set of tyrosinase inhibitors used in the training and prediction set of the present work.

Figure 3 .
Figure 3. General algorithm used to design training and test sets.

Figure 4 .
Figure 4. Plot of the ΔP% from Eq. 3 (using the Dragon descriptors) for each compound in the training and test sets.Compounds 1-182 and 183-245 are active (tyrosinase inhibitors) in training and test sets, respectively; chemicals 246-537 and 538-653 are inactive (non-tyrosinase inhibitors) in both training and test sets, correspondingly.

Figure 5 .
Figure 5.A dendrogram illustrating the results of the hierarchical k-NNCA of the set of active chemicals used for evaluating the predictive ability of the QSAR models for ligand-based virtual screening.

Figure 6 .
Figure 6.Plot of the ΔP% from Eq. 3 (using the Dragon descriptors) for each compound selected in virtual screening protocols.

Figure 7 .
Figure 7. Molecular structure of the new bipiperidine series.

Figure 8 .
Figure 8.A dendrogram illustrating the results of the hierarchical k-NNCA of the set of all active chemicals (tyrosinase inhibitors) included in training, test, virtual screening and new active bipiperidine series discovery in the present work.

Table 1 .
Main Results of the k-MCAs, for Tyrosinase Inhibitors and Inactives Drug-like Compounds.
b Variability within groups.c Level of significance

Table 3 .
Symbols of the descriptors used in the QSAR models and their definitions.

Table 4
summarizes the prediction performances and the statistical parameters for LDA-based QSAR models with the training set.The equations showed to be statistically significant at p-level (p < 0.0001).Fitted model for the equation 3 showed the best result in these classification functions, this best model 3 have an appropriate overall accuracy of 99.79% in the training set.The equation showed a high Matthews correlation coefficients

Table 4 .
Prediction Performances and Statistical Parameters for LDA-based QSAR Models in the Training Set.
b Canonical correlation coefficient obtained from the linear discriminant canonical analysis

Table 5 .
Prediction performances for LDA-based QSAR models in the test set.

Table 6 .
Correlation matrix of the variables in the equation 3.

Table 7 :
Ref.).this good results, a next step should be do, the inclusion of these 'novel' compounds in the training set, and carry out new models to find novel discrimination functions.This new model can be significantly different from the previous one, due to the inclusion of a new structural pattern, but it should be able to recognize a greater number of such compounds as tyrosinase inhibitors.Therefore, this iterative process can improve the quality of the classification models in which a great quantity of compounds with novel structural features is evaluated against the activity of the enzyme.

Table 8 .
Results of Ligand-based in silico Screening and Tyrosinase Inhibitory Activities of New Bipiperidine Series.The molecular structures of these chemicals are shown in Figure *