Dataset of curcumin derivatives for QSAR modeling of anticancer activity against the P388 cell line

The dataset of curcumin derivatives consists of 45 compounds (Table 1) with their anticancer biological activity (IC50) against the P388 cell line. All 45 curcumin derivatives were used in model development: 30 compounds formed the training set and the remaining 15 compounds formed the test set. The QSAR model was developed using multiple linear regression analysis (MLRA). This method gave an r2 value of 0.81 and a cross-validated r2(CV) value of 0.67. The QSAR model was also employed to predict the biological activity of the compounds in the test set, for which a predictive correlation coefficient (r2) of 0.88 was obtained.


The data are provided with this article.

Value of the data
The data concern derivatives of curcumin, one of the most potent and multi-targeting phytochemicals against a variety of cancers, such as murine leukemia (P388).
The QSAR model was generated to confirm the anticancer activity of the 45 curcumin derivatives, which can be used in the search for new drug candidates against cancer (P388).
The MLRA model is able to predict the biological activity of the compounds in the test set.

Data
The data presented here provide information about curcumin derivatives and their IC50 values against the P388 cell line. The data also show how the QSAR model was generated and how well it predicts the inhibitory activity of the compounds in the test set.

Dataset preparation
The dataset consists of 45 curcumin derivatives, which were divided into a training set (30 compounds) for model development and a test set (15 compounds) for model validation. To select the training set, the compounds were first sorted by biological activity in increasing order. The sorted list was then divided into three groups: group I (compounds 1 to 15), group II (compounds 16 to 30) and group III (compounds 31 to 45). The compounds in groups I and II were assigned to the training set, and the compounds in group III were assigned to the test set. Table 1 presents the molecular structures of the curcumin derivatives with their IC50 values.
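The splitting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_by_activity` and the compound labels are hypothetical, and the IC50 values are made-up placeholders, not the values of Table 1.

```python
def split_by_activity(compounds):
    """Split (name, ic50) pairs into training and test sets.

    The list is sorted by increasing IC50 and cut into three equal groups;
    groups I and II form the training set, group III the test set.
    """
    ranked = sorted(compounds, key=lambda c: c[1])  # increasing IC50
    third = len(ranked) // 3
    group_1 = ranked[:third]
    group_2 = ranked[third:2 * third]
    group_3 = ranked[2 * third:]
    return group_1 + group_2, group_3


# Hypothetical placeholder data: 45 distinct made-up IC50 values.
compounds = [(f"cmpd_{i:02d}", 0.5 * ((i * 7) % 45 + 1)) for i in range(45)]

training_set, test_set = split_by_activity(compounds)
print(len(training_set), len(test_set))  # 30 15
```

With this scheme every compound in the test set has an IC50 at least as large as any compound in the training set, which follows directly from assigning group III to the test set.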

QSAR model development
The 2D molecular structures of the dataset compounds were sketched using ChemDraw 6.0, converted to 3D structures using ChemBio3D Ultra, and then energy-minimized using the MM2 force field [1].
Molecular descriptors were generated for each compound using the ChemDes software package [2]. These descriptors were then reduced to a set that is as small as possible yet rich in information. A correlation matrix was applied to select the best subset of descriptors for the model by eliminating descriptors that are highly correlated with each other [3]. The next step was scaling the descriptors, which is a very delicate procedure since there may be underlying relationships between the descriptors and it may not be possible to foresee the effects of these manipulations. The range scaling can be calculated as:

$$y_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $y_i$ is the scaled value, $x_i$ is the original value, $\min(x)$ is the minimum of the collection of x objects, and $\max(x)$ is the maximum of the collection of x objects.

The selected descriptors were then used to build the QSAR model, which was developed using the multiple linear regression analysis (MLRA) technique. In multiple regression, a selection algorithm is used to choose a subset of the input X variables [4]. Molecular structures and their corresponding properties were correlated through a linear combination of structural descriptors. Only the chosen descriptors were included in the model, which means that a variable which appears to be highly [...]

Table 1. Molecular structures of the 45 curcumin derivatives. The compounds were synthesized by base- or acid-catalyzed aldol condensation of the appropriately substituted benzaldehyde with the corresponding NH-4-piperidones, N-methyl-4-piperidones and N-benzyl-4-piperidones. The IC50 values were determined using the MTT assay.

The best QSAR model developed using the MLRA technique had an r2 value of 0.81 and an r2(CV) value of 0.67. The statistical output of this model is shown in Table 2, and the model equation is:

$$Y = -1.476495\,W + 0.57806629\,\mathrm{MREF} + 1.6221327\,\mathrm{nhyd} + 1.0599425\,\mathrm{LDI} - 0.83247823$$

A plot of experimental vs. predicted pIC50 of the compounds in the training set is presented in Fig. 1. This plot graphically demonstrates the predictive capability of the QSAR model. Residual (scatter) plots are used to detect outliers from a QSAR model [2], as depicted in Fig. 2.
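The descriptor handling and the final regression equation can be illustrated with a short Python sketch. Several points are assumptions rather than details from the source: range scaling is taken in its usual min-max form, the correlation cut-off of 0.9 is arbitrary, the helper names (`range_scale`, `drop_correlated`, `predict_activity`) are hypothetical, Y is read as the predicted pIC50, and the descriptor values are made up. Whether the equation expects scaled or raw descriptor values is not stated in the text.

```python
import math


def range_scale(values):
    """Min-max (range) scaling of one descriptor column to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(descriptors, threshold=0.9):
    """Keep a descriptor only if |r| with every already kept one is below threshold.

    The threshold value is an assumption; the source does not give one.
    """
    kept = {}
    for name, values in descriptors.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


def predict_activity(w, mref, nhyd, ldi):
    """Evaluate the reported MLRA equation (Y assumed to be pIC50)."""
    return (-1.476495 * w + 0.57806629 * mref
            + 1.6221327 * nhyd + 1.0599425 * ldi - 0.83247823)


# Made-up descriptor columns; d2 is nearly a multiple of d1 and is dropped.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.1, 3.9, 6.2, 8.0],
    "d3": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(descriptors)
scaled = {name: range_scale(values) for name, values in selected.items()}
print(sorted(scaled))                # ['d1', 'd3']
print(range_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

The greedy filter keeps the first descriptor of each highly correlated pair, so its result depends on the column order; other selection rules are equally possible.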

Model validation
Model validation was then applied to evaluate the robustness and the predictive capacity of the QSAR model. The inhibitory concentrations of the 15 compounds in the test set were predicted using the developed QSAR model (i.e., the model equation). The calculated pIC50 values of the compounds in the test set are shown in Table 2. The r2 between predicted and experimental values was also calculated. A predictive correlation coefficient r2 value (test set) of 0.88 was obtained for the developed QSAR model. This value indicated the usefulness of the QSAR model in predicting activities of molecules not included in its derivation [2] (Table 3).

Table 2. Statistical output of the QSAR model.
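As a rough illustration of the test-set statistic, the sketch below computes r2 between experimental and predicted values in two common ways. The text does not say which definition was used, so both are shown as assumptions; the function names are hypothetical and the pIC50 values are made up, not taken from the dataset.

```python
def squared_pearson(experimental, predicted):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(experimental)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    s_ep = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    s_ee = sum((e - mean_e) ** 2 for e in experimental)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_ep ** 2 / (s_ee * s_pp)


def coefficient_of_determination(experimental, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the experimental mean."""
    mean_e = sum(experimental) / len(experimental)
    ss_res = sum((e - p) ** 2 for e, p in zip(experimental, predicted))
    ss_tot = sum((e - mean_e) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot


# Made-up pIC50 values for illustration only.
experimental = [4.8, 5.1, 5.6, 6.0, 6.3]
predicted = [4.9, 5.0, 5.8, 5.9, 6.4]

print(round(squared_pearson(experimental, predicted), 3))
print(round(coefficient_of_determination(experimental, predicted), 3))
```

The two definitions coincide only when the predictions are unbiased and correctly scaled, so it is worth stating which one is reported alongside a test-set r2.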