In silico ADME modeling: QSPR models for the binding of β - lactams to human serum proteins using genetic algorithms

Quantitative structure-property relationship (QSPR) models based on in vitro serum proteins binding data of 113 diverse drugs and drug-like compounds are reported. For this purpose, two Genetic Algorithm (GA) based approaches GA1 and GA2 along with multiple linear regression (MLR) are employed to exhaustively search and to select multivariate linear equations, starting from a large pool of molecular descriptors (molecular properties or variables). The reported QSPR models are based on combinations of 5 and 6 molecular properties calculated from the 2D chemical structures. Internal (leave-one-out) and external validation tests have demonstrated that these models have excellent predictive power and can be applied to the design new β -lactams class of antibiotics. As the models reported herein, are based on computed properties, they appear as valuable virtual screening tools, where selection and prioritisation of candidates is required.


Introduction
Binding affinity of new chemical entities (NCE's) to serum proteins is one of the important ADME 1-7 (Absorption, Distribution, Metabolism and Excretion) properties considered in drug discovery and development.Serum proteins are grossly separated into albumin and globulins.Albumin is the protein of highest concentration in the serum (plasma is serum plus clotting proteins).It is a carrier of many small molecules, 8 and is very important in maintaining the oncotic pressure of the blood (that is keeping the fluid from leaking out into the tissues).Binding of a drug to serum proteins in human plasma is a major determinant of its pharmacodynamic behavior (the action of a drug to the body) and the pharmacokinetics of the drug (the action of the body to the drug) and consequently, can affect the systemic distribution of the drug in several ways.Binding of a drug to plasma protein is a reversible process and is therefore in an equilibrium state.The unbound drug molecules contribute to the pharmacological efficacy and are also susceptible to metabolic reactions.
Acidic drugs are known to bind tightly to human serum albumin (HSA), the major constituent of plasma proteins. 9HSA has two ligand specific binding sites 10,11 namely, site-I and site-II.The ligand selectivity is comparatively broader for these two sites, allowing a range of drug molecules to bind at these sites.This broad selectivity is considered to be a result of the significant allosteric effects in HSA 12 and drug molecules can also interact nonspecifically with HSA.In addition, alpha 1-acid glycoproteins (AGP) and lipoproteins, constituents of plasma proteins, can also interact with drugs.Although the amount of AGP in the plasma is far smaller than that of HSA, it interacts strongly with basic and neutral drugs 9 in addition to some acidic drugs. 13Binding to lipoproteins is considered non-specific due to hydrophobic interactions. 9rediction of the serum protein binding percentage is more difficult than that of other ADME factors because, this is a composite parameter made up of the sum of interactions with multiple proteins, each with a different affinity.The prediction is further complicated by having to include both specific and nonspecific binding as well as significant allosteric interactions.Given the importance of drug binding to serum proteins 13 , it should be extremely useful to develop quantitative structure-property relationships to predict the binding affinity to serum proteins, applicable to the whole medicinal chemical space.Computational models of this type are useful because they rationalize a large number of experimental observations and therefore allow us to save time and money in the drug design process.However, there are very few published reports [14][15][16][17] on the prediction of binding affinity of drugs and drug-like compounds to serum proteins and further, these predictive models are based on molecular properties chosen out of experience; consequently, the whole space of molecular properties is not explored.This prompted us to perform a study aimed at (i) gaining further insights into the molecular properties that influence serum protein binding (ii) develop improved mathematical models for serum protein binding and (iii) demonstrate their utility in drug discovery and development.
In this paper, we describe the application of two (GA) based approaches GA1 and GA2 to derive novel QSPR models to predict the binding affinity of a specific class of drugs, β-lactams, to human serum protein, for the first time using a large set of molecular properties, to the best of our knowledge.The predictive performance of the QSPR models on external data sets taken from literature is illustrated.

Methodology
The methodology adopted in the data analysis is depicted in Fig. 1.The QSPR models reported in the literature on protein binding are principally based on manually selected descriptors, where experience and intuition are the key factors for success.However, in this approach, the whole molecular properties space is not explored and hence, it is highly desirable to explore all possible descriptor combinations in model generation.Our methodology provides many advantages and some of them are as follows: 1) it is based on a larger training set 2) it is also based on the largest number of molecular properties reported as of date and hence can provide an optimal solution 3) we have set the inter correlation coefficient between descriptors as 0.75 and thereby the selected descriptors are expected to be independent.

QSPR models using GA1
In approach GA1, 100 compounds of training set their 322 descriptors and I are used to select the best three of the 5 and 6 variable combinations, keeping the experimental percentage fraction bound (PFB) values, as the dependent variable.As many of the 322 descriptors may be correlated, it is desirable to find variables that are not correlated and construct models using only these variables.It is well known that QSPR models based on un-correlated variables will improve the predictive performance and hence, we performed variable selection, setting the inter-variable correlation below 0.75, as this is expected to discard descriptors with a intercorrelation coefficient above 0.75.The selected combinations are shown in Table 1 and the interdescriptor correlations are shown in Table 2.The best 5 descriptors model G1 is based on variables 158, 235, 268, 284 and 304 with a correlation coefficient, R of 0.8935 and cross-validated correlation coefficient, Q of 0.8723.The correlation coefficients of the other two models G2 and G3 are 0.8933 and 0.8926 respectively.
Significantly, the descriptors 284 (Atomic-Level-Based AI topological descriptor -AIsssCH) and 304 (AlogP98) are selected in all the best five variable models G1, G2 and G3.Descriptors 231 (Mean information content on the distance equality) in model G3 and 235 (Mean information content on the edge distance equality) in models G1 and G2 have high correlation (R = 0.9945), indicating that they provide the same information and significance to protein binding property of chemical compounds.Similarly, descriptor 247 (Atomic Type Electrotopological state index -SsNH2) of model G2 and 268 (Hydrogen Electrotopological state index -SHsNH2) of models G1 and G3 have inter-correlation coefficient R = 0.9992.It is interesting to note that descriptor 199 (Autocorrelation descriptor (Moran) weighted by atomic van der Waals radius -Order 1) of model G3 has low correlation with the rest of the descriptors in the five descriptors combinations.
The best six descriptors combination G4 is based on the variables 158, 193, 231, 247, 284 and 304 with a correlation coefficient of 0.9074, and cross-validated correlation coefficient, Q of 0.8888.Combinations of descriptors in models G5 and G6 are the same as those in models G1 and G2 with an additional new descriptor 193.Similarly, model G4, is the same as model G3, with an additional descriptor 158.As there is no significant improvement in R and Q values between the five and six descriptor combinations, we did not proceed further to select models with higher number of descriptors and this prompted us to believe that the six descriptors model is an optimal solution to the prediction of protein binding based on the 332 descriptors.

Model validation
The above models, G1 to G6 are validated using Leave-One-Out approach.The results of LOO cross-validations are given in Table 3.A plot of cross-validated PFB values versus the experimental PFB values of compounds in training set using models G1-G6 are shown in Fig. 2 -Fig.7 respectively.Based on the cross-validated results, we believe that the models G1 and G4 are the best five and six descriptors models, based on their overall predictive power and can be used for virtual screening of β-lactam analogs.

External-validations of models G1 to G6
The manually selected members of test set I are used to study the predictive power of the models G1 -G6.The predicted PFB values of these test compounds are given in Table 4.  31 , which selects compounds based on the maximin (maximizing the minimum distances) principle, thereby satisfies the diverse distribution of molecular descriptors.We have chosen to study the application of GA based automated training set selection for the first time to the best of our knowledge, in the development of prediction models for PFB.In this study, the selection of the compounds is based on the best correlation coefficient of MLR, as the models with high value of R are expected to possess good predictive power.
In this approach, same data set of 113 compounds and 322 molecular descriptors are used, as in the case of the approach GA1.However, the model generation involves the following two steps: 1) variable selection using the full data set of 113 compounds as compared to 100, used in the case of GA1 and 2) based on the best variable combinations of step 1, the best training set containing 100 members are selected by GA2.
The best three of the 5 and 6 variable combinations selected by GA2 are shown in Table 5 and the inter-descriptor correlations are shown in Table 6.The 5 variable combinations of V1, V2 and V3 are the same as that of G1, G2 and G3 respectively of GA1.The best six variable combination, V4, is based on the variables 133, 199, 235, 268, 284 and 304 with a correlation coefficient, R of 0.8996.It is a new descriptors combination and such a result is anticipated, as the training set size is different from that in approach GA1.Similarly, the second best six variable combination V5 also has a new descriptor 35 (mean square distance index) that contains the same information as descriptors 231 and 235.The third best six variable combination V6 is the same as G5 selected by GA1, but has different statistical numbers.The best 5 and 6 variable combinations having high correlation to PFB, V1 and V4 are considered for the selection of best training sets using GA2.The data set consisting of 113 compounds and the best 5 and 6 variable combinations are used as the input data for GA2 to select the training set of 100 compounds based on the correlation coefficient R. The two best training sets, II and III are presented in Table 7.The compounds that are not selected as training set members in the case of models T1 and T4 constitute test sets II and III respectively.The compounds selected as training sets II and III for the models T1 and T4 with their corresponding PFB values are used for model generation using MLR and the resulting regression equations are given below:

Model validation
The members of the training sets II and III, selected by GA2 (Table 7), are cross-validated by the LOO method.The LOO cross-validated PFB values of compounds in training sets II and III are given in Table 8.A plot of cross-validated PFB values versus the observed PFB values using T1 and T4 is shown in Fig. 8 and Fig. 9 respectively.

External validation
The compounds of the test sets II and III are used for external validation of the models T1 and T4 to assess the actual predictive power of the models.The results of the external validation of models T1 and T4 using the test sets I and II are given in Table 9.

Analysis of the models G1-G6, T1 & T4 and their implications in drug discovery
As mentioned earlier, the unbound drugs are susceptible to metabolic clearance and at the same time, the unbound drugs are also responsible for pharmacological efficacy.Thus, the higher the protein binding, the lesser is the metabolic clearance of the drug, especially, when the binding is restrictive 27 Thus, an increase in the lifetime of the drug is observed for strongly binding drugs.Hence, the greater accuracy of prediction of high binding percent of drugs is desirable, as even small errors in this range will have significant impact on the pharmacodynamic effect of the drug.For example, the difference in the predicted values of percent fraction bound of drugs from 96 to 92 corresponds to a doubling of unbound fraction.Given the significance of high binding behavior of drugs on the pharmacological efficacy, it is desirable for any model to perform well, especially, in the higher range of protein binding (75%-100%).Analyses of the predicted values of PFB of the β-lactam compounds in the range of 0-33%, 34-66% and 67-100% versus the observed PFB values were performed using G4.Among the 23 compounds in the range 0-33% of PFB, model, G4 predicted effectively in 63.63% (14/22) cases respectively keeping the acceptable standard error as 10%.Similarly its performance in the range of 34-66% is 77.27% (17/22).Significantly, the predictive performance of the model is excellent in the case of high binding drugs, i.e., of the 55 compounds in the range of 67-100%; the model G4 predicted PFB values successfully for 47/56 compounds.The performance of the models is at the upper limits of the expectations for a property that is measured by a variety of methods with results of modest accuracy.
Another significant aspect of the models is the finding that most of the variables are common among the datasets.This suggests that we have been able to identify a number of common structural features contributing to the protein binding of these drugs.It is important to keep in mind that there is more than one binding site on albumin and that there are multiple proteins involved in the measured percent of binding.However, it is possible from these results to infer that certain structural features enhance binding, whereas others are detrimental.
As mentioned earlier, the GA based approaches used in the present study, GA1 and GA2, generated models with various combinations of molecular properties.This provides an advantage when one is interested not only in prediction, but also in understanding the mechanisms behind the modeled phenomenon.In this respect, it is clear that hydrophobicity increases drug binding to serum proteins because all the models contain a term for hydrophobicity factor, AlogP98.This has also been observed previously in other models of limited set of compounds 28, 29 and further supported by X-ray structure of HSA, both alone and bound to different ligands 27.From a drug design point of view, an increase of hydrophobicity within a series of compounds is expected to result in an increased serum protein binding as long as the corresponding chemical modifications do not result in an opposing effect of other types of interactions that affect binding.In our models in addition to AlogP98, we uncovered new molecular properties that effect serum protein's binding and further studies are needed to understand the correlation between these molecular properties and serum protein binding.

Conclusions
In this paper, we have derived novel predictive models for serum protein binding affinity βlactam class of antibiotics based on genetic algorithms and in vitro data of percent fraction of βlactams bound to serum proteins.Further, the utility of automated variable and training set selection using genetic algorithms in the development of chemoinformatic models for protein binding affinity is demonstrated and this approach is expected to have significant impact in drug discovery and development of this class of compounds.The predictive performance of the models (both internal and external) is excellent and hence they are applicable to the design of new derivatives of β-lactams.Unlike other reported approaches, in this paper, models with various combinations of molecular properties are presented, which provides options to the end users.
Significantly, the models reported herein are based on molecular properties that are easy and fast to compute and hence can be applied for virtual screening of drug like compounds.Further, new molecular properties contributing to the binding of β-lactam analogs to serum proteins are uncovered.

Datasets
The quality of a QSPR model depends on: 1) size and quality of the data set used 2) the molecular propertied considered and 3) the mathematical methods employed.The serum protein binding data of 113 drugs and drug like compounds used in the present study is collected from literature [15][16][17][24][25][26] and is expressed as a percentage of drug bound to total serum proteins (percent fraction bound, PFB)". Thesevalues are in vitro measurements carried out in several concentrations and the mean value is referred as "PFB.The PFB values range from 2.00 (Morepenem) to 98.0 (Cefonicid) and the molecular weights range from 258 (Penicillin_2) to 646 (Cefoperazone).The names, compound numbers, structures and their corresponding PFB values of the dataset are given in the supplemental material.

Training and test sets
In approach GA1, 100 drugs and drug like compounds are selected manually, and they constitute "Training set I" (Table 3).It consists of compounds with a wide range of molecular size.The selection is based on the substituents on the β-lactam ring system and the diversity of the experimental PFB data.The rest of the 13 compounds constitute the Test set I (Table 4).
In approach GA2, the same numbers of compounds are selected as training sets II and III (Table 7), in an automated fashion using GA from the initial pool of 113 compounds.The remaining 13 compounds are used as Test sets II and III (Table 9).

Software
All the programs used in the present study are developed in-house, are part of the in-house product TATA-Biosuite 22 and are used to perform the following: 1) to draw the 2D structures of the compounds 2) to calculate the molecular descriptors 3) to perform variable selection and the training set selection and 4) multiple linear regression and validation.

Descriptor generation
For all the compounds used in the study, 322 descriptors 23 are calculated using the QSAR module of "Tata-Biosuite" and are stored as a text file.Among the 322 calculated descriptors, 7 are physicochemical descriptors (AlogP98, SklogP, calculated vapour pressure etc.,), 7 are geometrical descriptors (topological surface area, 2D-van der Waals surface area, 2D-van der Waals volume etc.), 11 are structural descriptors (molecular weight, number of rotatable bonds, number of aromatic rings, number of hydrogen bond donors, number of hydrogen bond acceptors etc.,) and the remaining 297 are topological descriptors.The topological descriptors include Weiner index, Balaban index, Kier and Hall molecular connectivity indices, Kappa shape indices, autocorrelation indices, information content descriptors, Electrotopological indices, atomic-level-based AI topological descriptors etc.,

Table 2 .
Correlation matrix of the selected variables

Table 3 .
Results of LOO cross validation

Table 4 .
Results of External Validation It is well known that manual selection of training and test sets followed by QSPR model generation is arduous and even impractical at times, particularly, when the data set is too large.Automated training set selection procedures are expected to be useful in such cases.Abraham et al 30 have employed automated training set selection successfully for the generating prediction models for absorption.They have employed Kennath and Stone algorithm

Table 6 .
Correlation matrix of the selected variables

Table 7 .
Training set selected using GA2

Table 8 .
Results of LOO cross-validation of compounds in training sets II and III

Table 9 .
Results of external validation of compounds in test sets II and III