Quantitative Structure-property Relationship Modelling of Distribution Coefficients ( logD 7 . 4 ) of Diverse Drug by Sub-structural Molecular Fragments Method

Quantitative structure-property relationship (QSPR) for estimating the distribution coefficients (logD7.4) of 300 diverse drugs developed using sub-structural molecular fragments (SMF) method. Forward and backwards stepwise regression variable selection and multi-linear regression (MLR) combined to describe the effect of molecular graph on the logD7.4 according to the QSPR method. Finally, a QSPR model is selected based on leave-one-out cross-validation and its prediction ability is further tested on 50 representative compounds excluded from model calibration. The prediction results from the MLR model are in good agreement with the experimental values. The predictive power and robustness of the QSPR model were characterized by the statistical validation.


INTRODUCTION
The most common physicochemical descriptor is the molecule's partition coefficient in an octanol/water system.As emphasized previously, the drug will go through a series of partitioning steps: (a) leaving the aqueous extracellular fluids, (b) passing through lipid membranes, and (c) entering other aqueous environments before reaching the receptor.In this sense, a drug is undergoing the same partitioning phenomenon that happens to any chemical in a separatory funnel containing water and a nonpolar solvent such as hexane, chloroform, or ether.The partition coefficient (P) is the ratio of the molar concentration of chemical in the non-aqueous phase (usually 1octanol) versus that in the aqueous phase (Eq.1).For reasons already discussed, it is more common to use the logarithmic expression (Eq.2).

= [ ℎ ]
[ ℎ ] ... (1) ... (2) The difference between the separatory funnel model and what actually occurs in the body, is that the partitioning in the funnel will reach an equilibrium at which the rate of chemical leaving the aqueous phase and entering the organic phase will equal the rate of the chemical moving from the organic phase to the aqueous phase.This is not the physiological situation.The absorption of drugs is very important in rational drug design.Indeed, drugs have to cross a series of barriers either by a passive diffusion or by a carrier-mediated uptake.The 1octanol/water partition coefficient, logP, is accepted as one of the principal parameters to evaluate lipophilicity of chemical compounds that, to a large degree, determines these pharmacokinetic properties of drugs.Note that dynamic changes are occurring to the drug, such as it being metabolized, bound to serum albumin, excreted from the body, and bound to receptors.The environment for the drug is not static.Upon administration, the drug will be pushed through the membranes because of the high concentration of drug in the extracellular fluids relative to the concentration in the intracellular compartments.In an attempt to maintain equilibrium ratios, the flow of the drug will be from systemic circulation through the membranes onto the receptors.As the drug is metabolized and excreted from the body, it will be pulled back across the membranes, and the concentration of drug at the receptors will decrease.Equations 1 and 2 assume that the drug is in the non-polar state.A large percentage of drugs are amines whose pKa is such that at physiological pH=7.4,a significant percentage of the drug will be in its protonated, ionized form.A similar statement can be made for the HA acids (carboxyl, sulfonamide, imide) in that at physiological pH, a significant percentage will be in their anionic forms.An assumption is made that the ionic form is water-soluble and will remain in the water phase of an octanol /water system.This reality has led to the use of log D, which is defined as the equilibrium ratio of both the ionized and unionized species of the molecule in an octanol/water system (Eq.5).The percent ionization of a drug is calculated by using equation 3 for HA acids and equation 4 for BH + acids.... (3) ... (4) The percent ionization of ionized HA acids and BH protonated amines can be estimated from Equations 3 and 4 and the logD from Equations 6 and 7, respectively.... (5) ... (6) ... (7) Because much of the time the drug's movement across membranes is a partitioning process, the partition coefficient has become the most common physicochemical property 1-3 .It is now realized that the n-octanol/water system is an excellent estimator of drug partitioning in biological systems.The distribution coefficients of organic compounds calculated using ALOGPS program by Igor V. Tetko 4,5 .
In this article, at first, a quantitative structure-property relationship model for estimation of distribution coefficients at pH=7.4 of diverse drug is developed.These quantitative structure property relationships (QSPR) are generally used to correlate the biological, chemical, or physical property of a compound with its physico-chemical characteristics 6,7 .
In our previous papers, we reported on the application of QSPR techniques in to develop a new, simplified approach to prediction of compounds properties 8-13 .For the first time, we applied the substructural molecular fragment (SMF) method for modeling the logD 7.4 of diverse drug.The goal of this study is to develop a SMF method and the related software tools, to model relationships between the structure of 300 drugs and their distribution coefficients.This method is based on to represent a molecule by its fragments and on to calculate their contributions to a given property.It uses two types of fragments: (i) the sequences of atoms and/or bonds (atom and/or bond paths up to specified maximal length) and (ii) "augmented" represented by a selected atom and/or bonds with its environment.In fact, it represents an extension of empirical methods used to calculate physical or chemical properties of molecules using atomic or bond increments.

Experimental data
All logD 7.4 data for all 300 drugs were taken from the literature 14 .The logD 7.4 of these drugs are deposited in journal log as supplementary data.The values were used as dependent variable in the following analyses and the values ranged from -6 to 6.1.

Molecular modeling and fragment generation
All calculations were run on a Dell Inspiron N5010 laptop computer with Intel® Core™ i7 processor with Windows 7 operating system.The molecular structures of all compounds were drawn into the HyperChem 8.0 program and pre-optimized using MM + molecular mechanics method.The final geometries of the minimum energy conformation were obtained by more precise optimization with the semi-empirical AM1 method, applying a root mean square gradient limit of 0.01 (Kcal.mol - .Å -1 ).Then, the resulted geometries were put in to ISIDA/ QSPR (version 5.76.003, 2010) to calculate substructural molecular fragments.The ISIDA/QSPR program realizes the sub-structural molecular fragments (SMF) method for QSPR/MLRA modeling.The SMF method is based on the splitting of molecular graph into fragments, and on the calculation of their contributions to a given property (logD 7.4 ).

Computational procedure Sub-structural molecular fragments
The ISIDA/QSPR program realizes the sub-structural molecular fragments (SMF) method 15- 21 .which is based on the splitting of a molecular graph on fragments (sub-graphs), and on the calculation of their contributions to a given property Y. Two classes of fragments are used: "sequences" (I) and "augmented" (II).Three sub-types AB, A and B are defined for each class.For the fragments I, they represent sequences of atoms and bonds (AB), of atoms only (A), or of bonds only (B).Shortest or all paths from one atom to the other are used.For each type of sequences, the minimal (n min ) and maximal (n max ) number of constituted atoms must be defined.Thus, for the partitioning I(AB, n min -n max ), I(A, n min -n max ) and I(B, n min -n max ), the program generates "intermediate" sequences involving ν atoms (n min ≤ n ≤ n max ).In the current version of ISIDA/QSPR, n min ≥ 2 and n max ≤ 15.An "augmented" represents a selected atom with its environment including both neighboring atoms and bonds (AB), or atoms only (A, without taking hybridization of neighbors into account, or Hy, where hybridization of neighbors is accounted for), or bonds only (B).

Variable selection procedures
Generally, generated pool of descriptors is much larger than the number of compounds in the training set; therefore procedures for selecting variables should be applied to build statistically significant multi-linear regressions.In ISIDA, a combination of forward and backward stepwise variable selection procedures is used.

Filtering stage
The program eliminates variables X i which have small correlation coefficient with the property, R y,i < R 0 y,i , and those highly correlated with other variables X j (R i,j > R 0 i,j ), which were already selected for the model.In this work, the values R 0 y,i = 0.001, … and R 0 i,j = 0.75, … were used.

Forward stepwise pre-selection stage
The suite of forward and backward stepwise algorithms has been used for variable preselection in ISIDA/QSPR studies by the variable selection suite (VSS) program.Three algorithms for forward stepwise variable selection are based on the calculations of correlation coefficients and subtractions.This is an iterative procedure, on each step of which the program selects one X i (two X i and X j or three variables X i , X j and X k ) maximizing the correlation coefficient R y,j (R y,ij or R y,ijk ) between X i (X i and X j or X i , X j and X k ) and dependent variable Y.At the first step (s = 1), the modeled property for each compound is taken as its experimental one Y s = Y.At each next step s, as the property value Y s is used residual Y s = Y s-1 -Y calc , where Y calc = c 0 + c i X i (Y calc = c 0 + c i X i + c j X j or Y calc = c 0 + c i X i + c j X j + c k X k ) is calculated property by the one-variable (two-or three-variables) model with selected variable X i (variables X i and X j or X i , X j and X k ).This loop is repeated until the number of variables k reaches a user-defined value; in this work, k was analyzed from 0.1n to 0.9n, where n is the number of the molecules in the training set.

Backward stepwise selection stage
The final selection is performed using backward stepwise variable selection procedure based on the t statistic criterion.Here, the program eliminates the variables with low t i = c i /s i values, where s i is standard deviation for the coefficient c i at the i-th variable in the model.First, the program selects the variable with the smallest t < t 0 , then it performs a new fitting excluding that variable.This procedure is repeated until t ³ t 0 for selected variables or if the number of variables reaches the user's defined value.Here, t 0 , the tabulated value of Student's criterion, is a function of the number of data points, the number of variables, and the significance level.Default value of the t 0 is 1.96, it can be analyzed from 1.96 to 3.9.

Multi-linear regression model
The modeled physical or chemical proper ty Y can be quantitatively calculated accounting for contributions of fragments using linear (3) fitting equations.
... (8) where a i is a fragment contributions, N i is the number of fragments of i type.The a o term is fragment independent and Γ term is external descriptors (e.g., topological, electronic, etc.) by default Γ=0.Contributions of a i are calculated by minimizing a functional U[a i ] =

Validation of QSPR Models
In ISIDA/QSPR calculations, each initial data set was split into two sub-sets: training (250 compounds) and test (50 compounds) sets.The QSPR models were built on the training set followed by "prediction" calculations for the test set.Before a QSPR model is used to predict the properties for new compounds, it should be validated both internally and externally to ensure that the built model is robust, reliable, stable and predictive.In the current work, several statistic terms such as squared correlation coefficient R 2 for the training set fitness and Q 2 ext for the external predictive ability, leave-one-out (LOO) cross-validated Q 2  LOO and root mean square error (RMSE) were used to assess the internal and external predictive ability of the proposed model.The corresponding statistical parameters were defined as: Where i represents i th molecule, y ie is the desired output (experimental property),y ip the actual output, y icv is the output of leave-one-out crossvalidation, y training mean and y test mean are the mean values of y ip for the training and test sets, respectively.N is the number of compounds in the training or test set.In addition, the built model was also validated externally using the test set compounds due to the fact that the best way to evaluate the predictive ability of a QSPR model is its validation using compounds not included in the training set with known properties.

RESULTS AND DISCUSSION
The octanol-water distribution coefficient is a physical property used extensively to describe a chemical's lipophilic or hydrophobic properties.It is the ratio of a chemical's concentration in the octanol-phase to its total concentration in the aqueous-phase of a two-phase system at equilibrium.The ISIDA program has been developed to establish structure-property    22 .The graphical interface of ISIDA allows to attribute data to the learning or to the validation sets and to set up the parameters of calculations (type of fragments, minimal and maximal number of atoms/bonds in the sequences, type of equation).A QSPR is a mathematical relationship between a property of a chemical, in this case logD 7.4 , and molecular fragments of the chemical.The fragments are obtained from the structure of the chemical structure.First, a training set of distribution data is used to statistically establish the relationship between logD 7.4 and the molecular fragments.The QSPR can then be used to predict the logD 7.4 of untested chemicals for which the fragments are known.Thus, the fragments selected to describe this process in a QSPR should be able to describe the relative affinities of drugs for distribution between n-octanol and water.To establish relationships between the structure of drugs and their distribution properties, we used the recently developed sub-structural molecular fragments (SMF) method, which is based on the representation of the molecular graph by fragments and on the calculation of their contributions to a given property.When a compound is split into constitutive fragments, the fragments contributions to the distribution coefficients (logD 7.4 ) or to any other physical or chemical property are calculated using linear fitting equation: Here, A i is contribution of fragment, and N i is the number of fragments of i type.The A o term is fragment independent.The fragments contributions as fitted coefficients in the equation ( 14) at the learning stage are used to predict logD 7.4 for the compounds from the validation set.Set of fragments, coefficients of the equation, standard deviations for coefficients and their t-test for equation (14) are shown in Table 1.This shows that the logD 7.4 increases with positive coefficients, and decreases logD 7.4 with negative coefficients, respectively.On the other hand, the signs of the coefficients were used in order to determine the influence of each variable, positive or negative, on the logD 7.4 .The experimental, predicted and residuals data for training set (250 compounds) and test set (50 compounds) are shown in supplementary data (see Figure 1 and 2).The statistical results of training and external validation of model are shown in Table 2.In general, molecular fragments correlate well with physical properties that are dependent on molecular volume, shape and size as each index incorporates a summation of terms representing fragments of the molecules.Thus in the QSPR here, one sees a general increase in absorbability as molecular size increase, reflected in increases in fragments.Thus in a diverse drugs, distribution increases with increasing chain length data and Van der Waals forces.In molecular structures that have ionized functional groups distribution coefficients decreases, because concentration of drug increases in water and decreases in octanol.

CONCLUSION
The applicability of ISIDA-QSPR for developing a logD7.4 was demonstrated using a set of sub-structural molecular fragments calculated from molecular structure.In this work, MLR modeling method was used to study the quantitative structureproperty relationship of logD 7.4 for drug set.We can conclude that: firstly, the prediction results indicate that the multi-linear regression modeling method can improve the prediction accuracy significantly for this large data set; secondly, the models developed in this work provide an accurate model that can be used to predict the logD 7.4 from the molecular structure only.In this case the distribution occurs between water and n-octanol.Distribution of drug between polar and nonpolar solvent mainly involves hydrophilic and hydrophobic interactions.In this paper, new QSPR model have been developed for predicting the logD 7.4 of a diverse set of drugs from the molecular structure alone.The obtained results show that MLR method could model the relationship between logD 7.4 and their sub-structural fragmental.By performing model validation, it can be concluded that the presented model is a valid model and can be effectively used to predict the logD 7.4 of drugs with an accuracy approximating the accuracy of experimental logD 7.4 determination.It can be reasonably concluded that the proposed model would be expected to predict logD 7.4 for new drug for which experimental values are unknown.The main advantages of fragment descriptors lie in the simplicity of their computation, the easiness of their interpretation as well as in efficiency of their applications in similarity searches and SAR/QSAR/QSPR modeling.
exp,i -Y calc,i ) 2 => min ...(9) where n is the number of the compounds in the training set, w i the weight accounting for the accuracy of the experimental data, Y exp and Y calc are, respectively, experimental and calculated according to (3) property values.The equation (3) represents calculation of property Y by using additive contributions of fragments.The coefficients of the equation (3) being optimized at the training stage are then used to estimate Y values of the compounds from the test set or to screen external databases of real or virtual compounds.