QSAR modeling and molecular docking studies of 2-oxo-1, 2-dihydroquinoline-4- carboxylic acid derivatives as p-glycoprotein inhibitors for combating cancer multidrug resistance

Multidrug resistance (MDR) proteins related to the ATP-binding cassette family are found in a very wide range of human tumors and result in therapeutic failure. The overexpression of efflux pumps such as ABCB1 is one of the mechanisms of MDR. This paper aims to develop a reliable quantitative structure-activity relationship (QSAR) model that best describes the correlation between the activity and the molecular structures in order to predict the inhibitory biological activity towards ABCB1. In this regard, a series of quinoline derivatives of 18 compounds were analyzed using different linear and non-linear machine learning (ML) regression methods including k-nearest neighbors (KNN), decision tree (DT), back propagation neural networks (BPNN) and gradient boosting-based (GB) methods. Their aim is to explain the origin of the activity of these investigated compounds and therefore, design new quinoline derivatives with higher effect on ABCB1. A total of 16 ML predictive models were developed on different number of 2D and 3D descriptors and were evaluated using the coefficient of determination (R2) and the root mean squared error (RMSE) statistical metrics. Among all developed models, A GB-based model in particular catboost achieved the highest predictive quality, with one descriptor, expressed by R2 and RMSE of 95% and 0.283 respectively. Molecular docking studies against the target crystal structure of the outward-facing p-glycoprotein (6C0V) revealed significant binding affinities via both hydrophobic and H-bond interactions with the relevant compounds. The 17 has shown the highest binding energy of −9.22 kcal/mol. Therefore, it can suggest that 17 may prove to be a valuable potential lead structure for the design and synthesis of more potent P-glycoprotein inhibitors for combination used with anti-cancer drugs for cancer multidrug resistance management.


Introduction
All over the world, cancer ranks first among the causes of death [1]. It occurs because of disruption of the physiological functions of ☆ Declaration of interest's statement.

cells.
A major challenge in developing effective therapy is the resistance of cells to multiple chemotherapy drugs [2]. The ABC (adenosine triphosphate) transporter super family, which transports cytotoxic agents and targeted anticancer drugs using ATP energy, is one of the major mechanisms of drug resistance [3,4,5]. Transporter B1 of the ATP-binding cassette family (ABCB1) is the commonest ABC protein able to mediate multidrug resistance. It is within a system of complex tissue and cellular features that help to bring about drug resistance in cancer cells [6,7,8,9]. In many cases, ABCB1 overexpression is the very first mechanism of resistance, which comes before the development of other mechanisms such as increased drug metabolism, drug target transformation, activation of DNA repair mechanisms, apoptosis checkpoint, and induction of EMT through cell proliferation and the ability to adapt to drug regimens [10]. Pgp (P-glycoprotein) has an impact on multidrug-resistant (MDR) cancer and the relationship between Pgp overexpression and MDR cancer has been proven in the literature [11,12,13,14,15,16]. The ability of Pgp to channel such diverse chemical classes is due, in part, to the numerous transport pathways through the protein that have been visualized using molecular dynamics simulations [17]. Work shows that overexpression of Pgp in cancers can be either intrinsic or acquired following drug treatment, depending on the tissue of origin [18].
The biological properties of some quinoline derivatives were found to be interesting and their pharmacological profile advantageous. It was remarked that those compounds with antitumor activity that contain a quinoline moiety perform as cytostatic agents or inhibitors of the topoisomerase-II enzyme, interfering with DNA replication [19,20,21,22,23].
In recent years, machine learning (ML) has drawn the attention of many researchers and has widely been used to develop QSAR models which allow a reliable prediction making of a targeted activity. Several powerful ML algorithms have recently been developed and proven to outperform the most commonly used regression algorithm in QSAR. Many ML models have been developed as each of them inherits a distinct regression algorithm and will therefore provide probable models with different performance. The idea here consists of making a comparative study of these models and choosing the one that performs best and that will guarantee quality predictions. El Hassan El Assiri has used multiple linear regression (MLR) and artificial neural networks to predict corrosion inhibitory activity of pyridazine-derivatives [24,25]. H. El Ghalia has also used MLR to predict the anticancer activity of 5.6.7-trimethoxy-N-aryl-2-styrylquinolin-4-amines [26]. Other methods were also used in QSAR modeling such as partial least squares, principal components regression methods and so on.
The molecular docking approach represents a new computational strategy to assess the binding affinity of docked molecules to receptors based on the scoring functions of mathematical algorithms, while the QSAR model produces new compounds with more precise pharmacological efficacy that can serve as effective future drug candidates. Both approaches are considered very effective in silico drug design and can be used separately or simultaneously [27].
The synthesis of novel compounds as potent modulators of ABCB1-induced drug resistance in mouse T-cell lymphoma has been previously reported by Baba, Y. F [28]. in our laboratory. These compounds were assessed for their cytotoxic effect and ABCB1 modulating properties against parental and ABCB1 overexpressing mouse T lymphoma cells. The findings of the rhodamine 123-accumulation assay in multidrug-resistant (MDR) mouse T lymphoma cells overexpressing the ATP-binding cassette B transporter protein will be used to construct QSAR models.
The main objective of this work is to develop mathematical QSAR models describing and predicting the inhibitory activity of quinoline derivatives against ABCB1 from different 2D and 3D molecular descriptors using different new ML regression methods. This is done using the molecular docking study to observe the binding mode of quinoline derivatives with anticancer effect in vitro to the active site of 6C0V, with the aim of synthesizing molecules with the desired biological response before carrying out the experimental synthesis protocol.

Data set
In this study, 18 derivatives of 2-oxo-1, 2-dihydroquinoline-4-carboxylic acid were prepared by our Laboratory [28], each was synthetized by varying the halogen and the radical (Table 1) of the acid (Fig. 1), for their inhibitory activity against the ABCB1.

Calculation of descriptors
Molecular descriptors are a fundamental element of the QSAR theory, and are in a way the numerical description of a molecular structure. They are used to establish a relationship between the structure of a molecule and its biological activity. In this work, ChemDraw (v.18.2) and Molecular Operating Environment (MOE v.2009.10) were used to draw the molecules and generate 199 2D and 3D descriptors, some of which are shown in Table 2.

The inhibitory activity response
The rhodamine 123 accumulation assay is a fluorescence detection system that uses verapamil as a reference inhibitor of the ABCB1 efflux pump [29]. The fluorescence intensity of the selected cell population was measured with a Partec CyFlow cytometer (Partec, Munster, Germany). The average fluorescence intensity was calculated for MDR and T-lymphoma cell lines from treated parental mice compared to untreated cells 30,31. The fluorescence activity ratio (FAR) was calculated based on the following equation relating the measured fluorescence values:   Note: The full name of descriptors can found in annex immediately after the conclusion.

Data split
The data consisted of 18 molecules descriptors was randomly split into training and testing sets. The former consisted of 15 molecules that span the entire chemical space for all the data. While the consisted of three molecules within the Applicability Domain (i.e., the range of the training set).

Data exploration by means of PCA
Principal component analysis (PCA) permits us to verify redundancy and collinearity between the studied descriptors and to carry out a comparative statistical study between the proposed mathematical models such as partial least squares regression (PLS) and stepwise multiple linear regression (SMLR) with the aim of correlating activity with molecular structure [32].

Partial least squares regression
PLS leads to a robust statistical solution when the independent variables are strongly related to each other, or if the independent variables outnumber the observations. PLS is an alternating regression method, which generates its solutions based on the linear transformation of a large number of original descriptors to a small number of new orthogonal terms, called latent variables [33,34]. Therefore, this methodology is considered a standard statistical.

Stepwise multiple linear regression
This approach uses the MLR variant commonly, which generates a multi-term linear equation, although not all of the independent variables are used. This method is well suited to be used when the number of descriptors is large and the main descriptors are unknown [35]. MLR is based on the assumption that the dependent variable is linearly related to some independent variables according to the following relationship.
Whereas Y represents the dependent variable (biological activity to be predicted), X i representing the independent variables (molecular descriptors), n indicating the number of molecular descriptors, a 0 showing the constant in the previous equation, a i being the coefficient of descriptors.

Data modeling
In general, QSAR data consists of a large number of features that reflect physico-chemical properties of molecules and which not all correlate to the target. Moreover, it is often characterized by high redundancy, which frequently leads to instability and increases the variance of ML models. The primary goal of data modeling is to develop prediction models that can accurately describe the FAR target variation based on descriptors that have a high, moderate, or low correlation to it. This helps identify the key property of the molecule that affect its activity, suggest a structure or molecule with a specific activity as well as understand the interaction between functional groups in a molecule. Data modeling was performed, on centered and scaled data, with python core (v3.10) where several machine learning (ML) algorithms were used to develop predictive regression models including, decision tree (DT), k nearest neighbor (KNN), gradient boosting-based (GB) models, back propagation artificial neural network (BPNN) [36,37,38].

Back propagation neural network
Back propagation neural network is a commonly known method for developing predictive models from large datasets. It is a neural network (NN) that is primarily trained through using back-propagation (BP) algorithm. It consists of three sorts of layers (input, hidden, and output layers), each of which is composed of one or more neurons characterized by an activation function and bias (Fig. 2). Neurons in one layer are interconnected to those in the next layer by connections known as synaptic weights, resulting in a network that attempts to mimic the human brain. The BP learning algorithm typically comprises of two successive phases which work to minimize the difference between measured values and the Network output by tuning weights and biases in iterative manner: forwardpropagation of information and backpropagation of error. Proper tuning of the weights and biases allows reduce error and make the model reliable. The number of neurons in each layer, appropriate activation functions, the number of layers, the maximum number of iterations. Are all hyperparameters to be tuned in BPNN. In this study, the tuning was performed using a meta-heuristic optimization algorithm called Simulated annealing provided by hyper opt module [39].

K nearest neighbor and decision tree regressors
KNN and DT are two regression methods used in this work because of their simplicity and performance in providing efficient predictive models. They are non-parametric regression methods able to quickly identify the relationship between descriptors and the target. The implementation of DT can also be done without scaling the descriptors and is not largely influenced by the presence of outliers in the data [40].

Gradient boosting algorithms
Gradient boosting is based on the assumption that combining the best next model with the prior models lowers overall prediction errors. Gb algorithms are a family of open-source ensemble methods. They have been widely used for several purposes including, highprecision and adoptable recommender systems building, weather prediction, features selection in regression problems and so on. They allow development of performant and stable predictive models by training a sequence of weak models based on decision trees, each of which compensates for the errors of its predecessors, in contrast to many ML models that concentrate on high quality prediction achieved by a single model. In this work, three GB regression algorithms were used: eXtreme gradient boosting (XGBoost), light gradient boosting machine (LGBM) and categorical boosting (CatBoost) [41].

Predictive models validation and evaluation
After predictive model development, a validation process is required to ensure that the models are performing the way it was intended and that it accurately predict the target. In this work, the validation process was performed in two ways: internal and external. The former was performed using Leave-One-Out Cross Validation (LOOCV) to prevent the model under and over-fitting and perform the hyperparameters tuning. While the later was performed using unseen data and it aimed to test the model capability in making the right predictions in the future. Models evaluation was used to estimate the developed models performance in training, internal and external validation. It was performed using two common statistical metrics: the coefficient of determination (R 2 ) and the root mean squared error (RMSE). The best model is characterized with high R 2 , low RMSE and low variance between the train, internal and external validation. Computational formulas are provided in (1(1) and (2)(2): where yi, ȳ, ŷ, and N are respectively a measured value, the average value of all measured values, the predicted value, the total number of samples.

Molecular docking methodology
The docking process was further investigated between the studied molecules with the best FAR values and the crystal structure of human P-glycoprotein in the outward-facing ATP-bound conformation. All scores achieved during the process of molecular docking have been computed and presented using the Molecular Operating Environment (MOE) software. The structures of these compounds were constructed using ChemDraw 18.2 software.
The protein data bank (https://www.rcsb.org/pdb) was used to recover and generate the target crystal structure of outward-facing p-glycoprotein (PDB code = 6C0V). This multi-drug transporter permeability (P)-glycoprotein is a transporter of adenosine triphosphate (ATP) binding cassettes that accounts for clinical resistance to chemotherapy. As such, P-glycoprotein extrudes toxic molecules and drugs from cells through ATP-powered conformational changes. To accomplish the optimization, all water-bound cofactors and ligands were detached from the protein structure and the hydrogen atoms were finally attached. The active sites have been sequestered and taken as dummy atoms. The MMFF94x force field was adopted to assign all parameters and charges. Following the generation of the alpha site spheres using the MOE site search module, the structural model of the molecules was docked to the surface of the cancer protein interior through the MOE DOCK module. The London dG notation function was used to execute the dock notation in the MOE software and two unrelated refinement methods were then used for the upgrade. Auto-rotating links were then authorized for the top ten link poses that were targeted for analysis to obtain the highest possible score. The database browser was then utilized to match the docking poses to the ligand in the co-crystallized structure along with acquiring the RMSD of the docking pose. Then, the binding free energy and hydrogen bonds between the synthesized molecules and the amino acid residues of the receptors were computed to rank the binding affinity of the molecules to the protein molecules under study. The interaction types together with the RMSD of the (native) ligand in the receptor structure were assumed as the default-docking model.

Descriptors exploration
A heatmap of correlation shown in (Fig. 3) was used as a tool to provide an overview of the relationship between descriptors. It shows the presence of positive and negative high correlations (red and blue zones which correlation coefficient in absolute error is close to 1); this correlation means that there is a curse of redundancy of information in our data which often leads to the instability of ML models. To overcome this curse, two approaches were used: features extraction by means of partial least squares regression, and feature selection using embedded methods based on random selection. However, the figure also illustrates descriptors which have no significant correlation with any descriptor presented by the sky blue, green and yellow zones (correlation coefficient ranging between − 0.5 and 0.5).

Principal component analysis
The PCA was conducted on centered and scaled data in order to identify and select descriptors that correlate to FAR. For this, it was made on 199 descriptors and the sixteen principal components obtained are displayed in Fig. 4.
Ten descriptors which have high correlation (Fig. 5) with the component explaining much our target variation were selected to be used for development of SMLR and PLS models.

Partial least squares regression analysis
Partial least squares regression widely used in the case of a high redundancy in the data. It has the ability to distinguish highly informative features from redundant and uninformative ones and therefore helps construct reduced models that retains key features which have a relation with the response. The resulting PLS model expression, together with the statistical parameter values, is represented by the following equation: Based on the descriptors given in the PLS model equation, the importance of each individual descriptor in relation to the standardized regression coefficients is displayed in Fig. 6.
We observe in Fig. 6, that FAR importance based on molecular structure varies from descriptor to another. Nevertheless, since the descriptors in the PLS model have different units, these standardized coefficients are estimates without real scale, suggesting that these standardized coefficients are not useful for determining the true relative importance and significance of each descriptor in the regression analysis. Furthermore, their value is restricted to determining the positive or negative contribution of molecular indices to the property under study.
Returning to the statistical parameters and the result obtained in Fig. 7, we see that the model achieved a high performance in the training phase: a high R 2 value of 99% and a low RMSE value of 0.37. However, the model attained a performance much lower than that obtained in the training (R 2 = 40%, RMSE = 5.86) which means that it has overfitted the data. The weak statistical results of the PLS model led us to evaluate other statistical models, such as stepwise multiple linear regression and machine learning models.

Stepwise multiple linear regression (SMLR) analysis
Stepwise multiple linear regression is widely regarded as being one of the most fundamental modeling methods recognized in the QSAR field. The ten descriptors resulting from the PCA are an input file for a stepwise selection based on MLR analysis. SMLR consists of treating the links between the dependent quantitative variable to be explained (FAR) and the independent explanatory quantitative  The developed model achieved high quality performance during the training phase, expressed by high R 2 and low RMSE of 90% and 2.377 respectively. However, during the CV the model performance has decreased significantly (R 2 = 48%, RMSE = 5.325) while during the test the model has not explained any amount of the response variability (R 2 = 0%, RMSE = 32.546) ( Table 3 and Fig. 8). This means it is statistically not acceptable.

Machine learning models with feature selection
A summary of developed ML models on different numbers of descriptors (NVAR) is presented in Table 4. It shows that these models have achieved higher performance than SMLR and PLS. in general, most of the models have a good performance in all development and   However, a comparison between models with respect to the computational cost and the number of descriptors shows that M15 is not recommended because of its complexity, larger number of descriptors and a higher computational cost. Therefore, the M2, M4 and M8 are the most performant predictive models which can be used for the prediction of biological activity against ABCB1 from a low number of descriptors.

Docking study
With the intention of targeting more potent molecules, various molecular docking studies were undertaken using MOE software to virtual screen molecular binding modes of four prepared compounds, such as 10, 12, 13 and 17, in the P-glycoprotein pocket. Here, the ligands were docked to the encoded protein 6C0V loaded from the PDB. Ten different interaction positions were authorized for each molecule with the protein and the ranking poses were generated by the scoring functions which are given in Table 5. The 17 obtained the highest score, and the result was − 9.22 kcal/mol. The list of hydrogen bonds between the compounds and the selected protein coenzymes is given in Table 6. The best-fitting pose that was adopted by the enzyme-calmed compound 6C0V is displayed in Fig. 9.
MOE is one of the most important molecular docking mechanisms used to recognize a precise docking study between compounds and target proteins. The compound 17 exhibited a high docking score by hydrogen π-stacking with the 6-membered ring of Gln 882.
The interaction distance and stabilization energy were respectively equal to 4.18 A • and − 1.1 kcal/mol and by hydrogen bonding of  the oxygen atom contained in the ester function to the amino acid residue Asn 183. This hydrogen bond is at about 3.17 A • and the energy stabilization amounts to − 1.7 K cal/mol. Through these bonds obtained with the key amino acid residues of the binding pockets highlighted based on the above results, the target receptor structure could be stabilized. The docking pose showing the highest affinity is regarded as the best docking conformation. Similarly, in view of these crucial interactions at the molecular level, MOE was able to match the experimentally observed binding modes, thus identifying the particular conformation of the target and ligand. By comparing the results obtained with the others of biological activity carried out by our laboratory [42], we note that they are the same since the compound 17 that has the best biological activity obtained the best-docked conformation.

Conclusion
QSAR modeling was performed to develop models for the prediction of the inhibitory effect of quinoline-derivatives on one of the most studied human adenosine-triphosphate (ATP)-dependent efflux transporters that encodes a multidrug resistance protein, called ABCB1 gene. Through this study, several machine learning methods were tested to identify relevant descriptors and develop reliable models including partial least squares regression, stepwise multiple linear regression, back propagation neural networks. The results obtained showed that a catboost model statistically outperformed the other used methods, achieving a R 2 and RMSE of 95% and 0.28 respectively. Based on the computational study, compound 17 was found to possess the maximum binding affinity to the target protein of outward-facing p-glycoprotein with a binding energy equal to − 9.22 kcal/mol. According to the above fact of the laboratory results, this compound can be proposed as a lead structure for the design and synthesis of more potent P-glycoprotein inhibitors for combination used with anti-cancer drugs for cancer multidrug resistance management. Indeed, in this work, descriptors were identified by QSAR models to effectively predict the P-glycoprotein inhibitory effect as well as to guide the design of new of 2-oxo 1, 2-dihydroquinoline-4-carboxylic acid derivatives for potential applications in cancer multidrug resistance area. This will help to facilitate the drug development process and minimize the cost of synthesis in pharmaceutical chemistry laboratories. The mean square deviation after refinement E_place Score of the placement phase E_conf Energy conformer E_refine Score refinement E_scor1 Score the first step of notation