Generation of new inhibitors of selected cytochrome P450 subtypes– In silico study

Graphical abstract


Introduction
Various computational strategies are an indispensable part of the drug design process [1,2]. They support the development of new active compounds, as well as the optimization of their physicochemical and pharmacokinetic properties [3][4][5]. In silico methods used in the search for new active compounds can be divided into the ligand- [6] and structure-based [7] approaches. In the first case, the predictions are made based on the information of known ligands, both in terms of their activity and properties. On the other hand, in the case of the structure-based methods, it is the informa-tion about the target structure that is used to predict contacts between the potential ligands and target proteins.
To design an effective drug one must guarantee that after entering an organism it will have enough time to trigger a desired biological response. However, at the same time, the drug is constantly exposed to processes leading to its decomposition, which shorten its time of action and might also result in the formation of toxic products [8][9][10]. Unfortunately, biological processes occurring in the living organisms are very complex, and most often they are related to interactions with more than a single target. This makes metabolic stability one of the most difficult properties to evaluate using in silico methods. Nevertheless, both compound stability as well as other ADMET properties are extremely important for the potential compound success in drug design campaign, as even the most active compound will not pass to the subsequent stages of drug development pipeline if its ADMET properties are unfavourable.
Metabolic processes related to the final removal of xenobiotics from the organism can be divided into two main phases. In the first phase, which is the focal point of this study, the main role is played by cytochrome P450 (CYP) -a group of haemoprotein enzymes with monooxidase activity. There are almost 60 different CYP subtypes occurring in the human organism; however, some subtypes are involved to a much higher extent in the metabolic processes, such as CYP3A4, which is responsible for transformations of over 50 % of drugs [11][12][13].
When a compound comes into interaction with a particular CYP enzyme, it can slow down its transformation processes (inhibitors) or induce them (inducers). In this study, we concentrate on inhibition of selected CYP subtypes (CYP3A4, CYP2D6, CYP2C8, CYP2C9) and develop a methodology for generation of new inhibitors of these CYP proteins. As a starting point, we use known CYP inhibitors and modify their structure by adding selected chemical groups. The inhibition potency of the generated compounds is evaluated via docking and only the most potent compounds are finally returned. We use this database to perform a systematic analysis of the influence of particular substitutions on compound inhibition potency and provide the knowledge base for the design of new CYP inhibitors.
Furthermore, we use this database to develop a graph neural network (GNN) [14] that predicts the change in the compound CYP inhibition properties for all modifications of the starting molecule. Moreover, for each newly generated structure, we provide an explanation of the prediction of its inhibitory potency. It enables the indication of the structural moieties, which are most important for the particular prediction. Therefore, such explanations can be used to guide the further optimization of the compound structure in terms of its CYP inhibitory properties, especially when combined with the prediction of the inhibitory power given by the GNN model. In addition, baseline machine learning (ML) models for the binary prediction of the docking score change (either decrease or increase) upon particular substitution are developed (ML models for metabolic stability prediction expressed as half-lifetime have already been constructed by several research groups [15][16][17][18]). It is worth noting, that it is an innovative approach of GNN application in computer-aided drug design, as it is the first time, when a compound graph is not treated as a whole, but particular graph vertices constitute basis for docking score predictions and assist in compound optimization.
Finally, we prepared an on-line visualization platform, where users can manually compare the compound poses in the respective CYP binding site, examine interactions and propose their own structural modifications (https://gmum.github.io/cyp-inhibitors/). All experiments carried out in the study can be reproduced for any target using the provided scripts (https://github.com/gmum/ cyp-inhibitors). The library of newly generated potential CYP inhibitors is shared in the Supplementary Data. The visualization of the main aspects of the presented study is presented in Fig. 1.
A series of approaches to quantitative structure-property relationship (QSPR) tasks has already been proposed [19][20][21][22][23][24][25][26][27][28], which are continuously evolving together with the development of new algorithms. GNNs used in the study have also already entered the field of computer-aided drug design, and they have been utilised for example in QSPR-related tasks due to their input being suitable to represent molecules and because of their superior performance [29][30]. Wang et al. [31] trained GNNs to predict pIC50 values of JAK inhibitors, while classification of molecules as inhibitors or non-inhibitors of selected CYP450 isoforms was done by Wu et al. [32] who used ML and deep learning (DL) models or Li et al. [33] who utilised single-and multitask deep neural networks (DNNs).
In our study, we use GNNs to predict the exact value of docking score of a series of compound derivatives. The problem of docking score prediction in the literature was reported e.g. by Jastrzę bski et al. [34] who concentrated on several GPCRs and selected CYPs and utilised GNNs or Ton et al. [35] who focused on SARS-CoV-2 main protease (Mpro) and used DNNs. Here, we direct our attention to selected CYP450 isoforms and employ GNNs to predict docking score change for all modifications of the input compound. This is different from the discussed work by that we predict a change in the docking score instead of the score itself, and, more  importantly, because the prediction is made not for the input compound itself but for all of its modifications. We believe that such a construction of the predictive model is best suited for compound optimisation task because there is no need for manual definition of possible modifications and moreover, predictions for all of them are calculated and returned all at once.

Datasets preparation and compounds enumeration
Compounds with known inhibition potency on the considered CYP subtypes (CYP3A4, CYP2D6, CYP2C8, and CYP2C9) were fetched from the ChEMBL database, version 27 [36] (the intersection between particular datasets is presented in Fig. 2). We filter out all records which do not refer to the Standard Type: Inhibition and Standard Units: % and the following number of data points remain: CYP3A4: 1900; CYP2D6: 1326; CYP2C8: 101; CYP2C9: 1000.
Then, the set of new potential CYP inhibitors is formed by addition of a respective substituent to the initial compound structure. The list of chemical fragments added is as follows: F, Cl, Br, I, C, C (C)C, CC, C(=O)O, O, OC, COC, CO, C(=O)C, N, S.

Crystal structures characterization and docking
All compounds (both these initially downloaded from ChEMBL database and the newly formed ones) are docked to the available CYP crystal structures. In all cases, we use two types of crystal structures: free and with inhibitor co-crystallized ( Table 1).
The docking is carried out in Smina [44] using the default settings and the Vina scoring function. We validate the docking via examination of the docking poses obtained for the co-crystallized inhibitors (Fig. S1). The proteins are cleaned by removing all non-protein atoms, excluding heme (the cytochromes cofactor). The information provided by docking (the docking score value) is used to create a new dataset that describes the change in the docking score for modifications of the starting molecule. The dataset obtained for considered CYP subtypes is available in Supplementary Data, and the code for generation of derivatives of ligands of any target is available on the GitHub repository (https://github.co m/gmum/cyp-inhibitors).

GNN model
We use the dataset consisting of the information from docking (the docking score value) to train convolutional GNNs [14]. The task is defined as node regression, and the models are intended to predict an exact change in the docking score for each possible modification of the input compound. It is worth emphasizing that the change in the docking score for all possible modifications of the original compound is computed in a single pass. Such an approach is much more effective than making a separate calculation for each possible modification.
In the node regression task, the label for each node is a vector of changes in the docking score value for all substitutions used in this atom (Fig. 3). If an atom cannot be substituted, the corresponding position in the vector remains empty and is not used neither for training nor for testing.
The GNNs consist of 3 or 5 convolutional layers with the hidden representation size of 256. We use both classical convolutional layers proposed by Kipf et al. [14] and graph attention layers proposed by Veličković et al. [45] The convolutional layers are followed by one or two linear layers. All models use skip connections, Batch-Norm [46] and dropout of 0.2 or 0.5. [47] All models are trained for 200 epochs with Adam, a learning rate of 0.01 or 0.001, batch size 256, and ReduceLROnPlateau scheduler with patience equal to 10. All models use weight decay of 0.0005 or no weight decay at all. As a training objective, we use masked MSE loss, that is MSE loss which ignores errors for substitutions that are not present in the training data. For each CYP subtype we train 64 different architectures using fivefold cross-validation to choose the best hyperparameters. The final model is evaluated using a held-out test set.
The molecules are represented using a graph molecular representation with the following atom features: atom type, the number of implicit hydrogens, the number of heavy-atom neighbours, formal charge, ring inclusion, and aromaticity. The resulting length of atom representation is 42. The information about bond features is not included.

ML reference models
As a reference, we develop models for the prediction of the direction of docking score change (increase or decrease). To this end, we use: a baseline approach, Random Forest (RF) [48], a GNN [14] and a GNN with a dummy node [49]. The summary of these models is presented in Table 2.
As a baseline approach we use a model that assigns the most prevalent label in the dataset to each compound. This baseline shows the ratio between positive and negative classes in the dataset.
RF makes separate predictions of a docking score of the original and the modified compound. These predictions are compared to assign the change in the docking score. The molecules are represented with Morgan fingerprints (Morgan FP) [50,51] with radius 2 and 1024 bit-length.  Both GNN models use graph molecular representation with the same atom features as previously. Here, the problem is again formulated as a node regression task -the docking score change is predicted for each atom of the original compound and for each possible modification in a single pass.
GNN is a classical graph convolutional model introduced by Kipf et al. [14], while GNN-dummy is an extension of the GNN model, which includes additional nodes, called dummy nodes, in the molecular graph [49]. These nodes are connected to all the other nodes, and their purpose is the aggregation of the signal from the whole molecular graph in each graph convolutional layer. This way the perception field of the convolution is artificially extended beyond the atom neighbours.

Explainability
Explainability is a quickly growing field of ML [52][53][54]. Its techniques aim to elucidate inner workings of black box models. In this work, we use an explainability technique, called saliency maps, in order to provide information about the influence of particular atoms on the predictions.
Saliency maps are a visualisation technique introduced by Simonyan et al. [55] They explain the predictions of the model by an analysis of its partial derivatives and can be seen as a sensitivity analysis technique [56]. The main idea behind this approach is that derivatives can be seen as a measure of how sensitive is the function's output with respect to its input. Formally, a saliency map is calculated by measuring a length of a vector of positive partial In the classical approach, only the positive gradients contribute to the final result. Apart from this, we also investigate the influence of negative partial derivatives (kReLU Àdy dx À Á k). This allows us to compare the influence of positive and negative partial derivatives which we illustrate by calculating a difference between classical and negative saliency maps and call this approach positive-negative saliency maps.

Visualization platform
In order to enable visual comparison of the obtained docking poses, we prepared an on-line visualization platform (https://gm um.github.io/cyp-inhibitors/). It enables manual confrontation of the docking poses of original compound and its derivatives, constituting a great support during interpretation of the docking score changes occurring upon substitution. In the platform, we incorporate results only for the top 100 compounds (in terms of their docking score value), while the docking poses for all derivatives obtained in the study are available at https://github.com/gmum/cyp-inhibitors/tree/main/data/poses.

ChEMBL data
The distribution of data used in the study is presented in Fig. 4. It shows that the distribution of the percentage of inhibition is similar for all examined CYP subtypes. The highest number of compounds fall in the range of subtle CYP inhibition (between 0 and 20 %), and the number of compounds from the subsequent inhibition ranges (20-40 % and 40-60 %) gradually decreases. In each case, there is also a small number of compounds which appear not to have the ability to inhibit CYPs (with inhibition percentage between À20 -0%) and for CYP3A4, CYP2C9, and CYP2D6, there The overview of the process of assigning labels to graph nodes. On the left, an exemplary compound is shown, and the substitutions are made in the position marked with the question mark symbol. After substituting this atom, the modified compounds are docked, and the difference DDS between the docking score of the modified and original compound is calculated. Next, the differences for each substitution are assigned to the vector label for the given node, as depicted on the right. The process is repeated for all ring atoms in the original compound and for all the defined substitution groups. If any of the substitutions is not possible (e.g. due to valency constraints), a null value is assigned in the label vector, and this value is omitted in the training. also exist several compounds that appear to be inducers with the reported CYP inhibition ability between À40 -À20 %. Additionally, we examine the influence of particular substitutions on the CYP inhibition present in the ChEMBL data. For this particular analysis we do not use the docking results and narrow down to only these modifications that are present in the ChEMBL database. Selected examples are shown in Fig. 5.
In the case of compounds CHEMBL470118 and CHEMBL518629 (first row), where the carboxyl group is exchanged into the primary amine, the significant change in the CYP3A4 inhibitory potency is

Analysis of the influence of particular substituents on docking score -monosubstitution
We dock all compounds (both original and the derivatives) to the respective CYP crystal structures and determine the docking scores of the obtained ligand-protein complexes. The differences in the docking scores between the original compounds and their derivatives are presented in Fig. 6.
The first observation is that the tendencies for each substituent are similar for free and for inhibited crystal structures -blue and orange parts of charts in Fig. 6 are similarly distributed. However, there is a variation between different substituents. In general, the addition of a halogen leads to an improvement of the docking score -this refers mainly to the fluorine and chlorine substitutions, which lead to more effective CYP inhibitors in comparison to the starting compounds (when docking score-based evaluation is taken into account). On the other hand, bromine and iodine substitutions are much less effective in improving the CYP inhibitors docking scores, and for CYP2C8 and CYP2D6 they even worsen the docking score values. Likewise, the addition of sulphur, OC or COC substituents always leads to the worsening of docking scores.
Despite similar distribution of general tendencies of docking score differences observed for free and inhibited crystal structures, examination of the results from the perspective of particular compound reveals that the outcome is in fact influenced by the type of the crystal structure (detailed analysis is present in Fig. S2).
In Fig. 7, we present histograms of changes in the docking scores (detailed information is available in the Supplementary Data). For a great majority of cases, changes in the docking score are below 1, although higher values are also observed. The fraction of compound poses with docking score difference between 1 and 2 is similar for all of the considered CYP subtypes. For CYP2C8, there are no poses with docking score change higher than 4 upon substitution. On the other hand, for all other CYP subtypes, there are vari- ations in the docking score up to over 10, although they refer to less than 1 % of the total number of docking poses. To sum up, in some cases even a single substitution can lead to a huge difference in the docking score value; however, in most cases this difference is negligible.

Di-substitution experiments
To examine the possibilities offered by modifying an input molecule in more than a single place, we carry out an analysis of di-substituted compounds. A visualization of differences between  the double and single substitutions is shown in Fig. 8. It is visible that in each case, the single substitution is preferred over the double one. The analysis is based on the averaged difference in docking scores regardless of the position of substitution.

Predictive performance of GNNs and ML models
To put the predictive performance of GNNs in context, we develop ML models with the aim of predicting the direction of the docking score change, which is a simpler task then predicting its exact value. The performance of the models is thoroughly examined using accuracy and mean squared error (MSE), which are calculated on both validation and held-out test sets. The evaluation on the held-out test set can be considered as a simulation of the real application of the constructed protocol to evaluate novel compounds (e.g. during the virtual screening procedure where compounds covering broad chemical space undergo evaluation), as it constitutes a fully external dataset. We use two approaches to divide data into train and held-out test sets -random selection and time-based selection (on the basis of the record list from the ChEMBL database).
The prediction accuracies and MSE values are gathered in Tables 3-6 (baseline assigns label of the majority class to all samples, RF accuracies are determined only for time split). The values shown in Table 3 and Table 4 indicate high variations in the prediction accuracy depending on the model. In almost all cases, GNN appears to Table 3 Accuracy of the prediction of the docking score change (increase or decrease) after compound modification for crystal structures with CYP inhibitor co-crystallized for validation and held-out test sets. The best predictions for the particular crystal are depicted in bold. be the most effective model with prediction accuracy between $0.58 to 0.67. The performance of both GNN and GNN-dummy models in the prediction of correct direction of change in the docking score upon substitution is better than the baseline by up to $0.15, which justifies the application of the developed approach. The performance of RF is close to the baseline, indicating its inability to correctly learn the patterns in the data and justifying the necessity of developing more sophisticated models, such as GNNs.
In addition, we observe no significant improvement in the performance of the GNN-dummy model over its classical GNN counterpart, which may indicate either ineffective aggregation of the molecular graph by the dummy node or a strong correlation between the docking score change and local features of the chemical structures in the GNN model. The performance dependencies are similar for both types of crystal structures -free (Table 3) and co-crystallized with inhibitor (Table 4). GNNs and GNN-dummy models are also the most effective methods when the compounds are ranked on the basis of the predicted docking score values, which is indicated by the lowest MSE out of all of the compared approaches ( Table 5, Table 6).
In addition, we compare the GNN models using the sum of ranking differences (SRD) [57][58][59] -GNN and GNN-dummy achieve comparable performance in compound ranking.
As an additional evaluation, we determine the prediction accuracy of GNN models when cases with very small changes in the docking score (defined as margin) are neglected. We calculate accuracy on held-out test sets with several different values of the margin and present the results in Fig. 9 (the complete data for validation and held-out test sets is present in the Supplementary  Data, Fig. S3).
The data presented in Fig. 9 indicates, that negligence of compounds with very small changes in the docking score leads to higher values of accuracy. It is an intuitive outcome, as occurrence of small docking score changes might be to some extent a result of randomness in the docking process and therefore the assignment of a substitution as leading to increase or decrease in the docking score might be biased. For CYP3A4, CYP2C8, and CYP2C9, for crystal structures co-crystallized with respective inhibitors, the accuracy values continuously increase as the margin is getting higher, up to approximately 1, and then, the accuracy drops again. However, it is worth noticing, that at the margin of 2, the accuracy still adopts higher values than when no data is neglected. On the other hand, for 3QM4 (CYP2D6 crystal structure), the accuracy values are constantly rising with the margin increase, reaching a plateau when the margin is equal to approximately 1.7. For free crystal structures, the situation is similar for CYP3A4, CYP2C9 and CYP2D6 (in comparison to inhibited proteins); however, when it comes to the CYP2C8, the accuracy adopts values between 0.60 and 0.62 until the margin of 1.0, whereas further margin increase leads to significant accuracy rise (with values approaching 0.85). It has to be however pointed out that the accuracy values on CYP2C8 might vary more than in the case of other targets, as there is a relatively low number of records in the dataset referring to this cytochrome subtype. This further entails the relatively small held-out test for CYP2C8, which therefore can be biased and can lead to higher variation in the results. Overall, this results suggest, that GNN models more often predict the docking score change correctly if its value is high enough. This indicates that, in some cases, incorrect prediction of the docking score change can be attributed to randomness of the docking score acquisition.

Saliency maps help designing new CYP inhibitors
In this section, we show how saliency maps can be used to enable guidance for the process of designing new CYP inhibitors. Saliency maps show which structural features of the input compound influence the obtained predictions to the highest extent.
In Fig. 10, we present saliency maps obtained for CHEMBL2407331 (top row) and CHEMBL2407336 (bottom row). In both cases the explanations are given for prediction of a docking score change of compounds modified by attaching fluorine to the atom marked with a red dot in leftmost pictures. The presented Fig. 10. Outcome of the explainability analysis for the selected compounds, a) indication of the atom in the CHEMBL2407331 structure for which the predictions are analyzed, b) saliency map for CHEMBL2407331, c) positive -negative saliency map for CHEMBL2407331 (difference between the classical saliency map and ''inverse" saliency map calculated on the negative partial derivatives), d) indication of the atom in the CHEMBL2407336 structure for which the predictions are analyzed, e) saliency map for CHEMBL2407336, f) positive -negative saliency map for CHEMBL2407336. molecules differ by only one atom (fluorine in CHEMBL2407331 is substituted with chlorine in CHEMBL2407336) and their explanations are similar. However, a few differences can be spotted. In the presented saliency maps saturation represents the value, with lighter shades marking values closer to zero, and color represents the sign -red for positive and blue for negative values.
At the first glance, one can observe that the only atoms identified as important are these in the close proximity of the atom for which the prediction is made. This stems directly from the definition of saliency maps and the fact that the model being explained has 5 convolutional layers. The number of convolutional layers restricts the neighbourhood from which information can be utilised to make a prediction. As a result, the outcome is insensitive to information from outside of this neighbourhood, and thus the respective partial derivatives are equal to zero.
Another similarity is the relative importance of the atoms, which can be partly attributed to the fact that the closer an atom is to the one for which a prediction is made, the more times its information is used to calculate this prediction. Furthermore, in case of positive-negative saliency maps the sign of the calculated importance is an another similarity as blue dots and red dots are similarly distributed.
For both molecules classical saliency maps indicate that the halogen atom that is closer to the substituted atom is more important than the other one. On the other side, positive-negative saliency map for CHEMBL2407331 (Fig. 10c) assigns higher importance to the halogen further away. This indicates that the negative saliency map assigns a higher value to this atom then the classical saliency map, whereas the values for the other halogen are similar for both classical and negative saliency maps. Intuitively, this means that the value of the prediction is sensitive to information encoded in both halogens; however, in the case of chlorine negative partial derivatives dominate while in the case of fluorine the positive and negative partial derivatives cancel each other out. In the case of CHEMBL2407336 the situation is reversed. These findings might suggest that the modified molecules take slightly different poses in the binding pocket, and thus different atoms are more important for the ligand-CYP interactions.

Visualization tool
We developed a visualization platform in order to enable a manual inspection of the docking poses obtained for the generated compounds and their comparison with the ligand-protein complexes for the existing inhibitors. The platform is available at https://gmum.github.io/cyp-inhibitors/. It enables instant confrontation of the obtained changes in the docking scores (original compound vs derivative) with the actual compound docking poses (Fig. 11).

Conclusions
In the study, we develop a protocol for generation of new inhibitors of selected CYPs. Although, the procedure is optimized for this particular target, it can be applied to any protein in a similar manner, as the code is available and includes all scripts required to reproduce the results (https://github.com/gmum/cypinhibitors).
The subsequent stages of the proposed methodology are composed of generation of new derivatives of known CYP inhibitors, docking and evaluation of the compound possible inhibition on the basis of the obtained ligand-protein complexes. Moreover, an innovative GNN model for prediction of the docking score change upon particular substitution is proposed. This model makes predictions for particular graph vertices, not a compound graph as a whole. The activity predictions obtained for the generated compounds can be analyzed in detail using saliency maps to detect structural features, which influence to the highest extent the predictions of the inhibitory potency of the newly formed molecules. Our data suggests that even a single substitution can lead to a huge difference in the docking score value; however, in most cases this difference is negligible. Furthermore, using more than a single substitution does not seem to further improve the docking score.
Moreover, we prepared a visualization platform (https://gmum. github.io/cyp-inhibitors/), where the docking poses of the newly formed inhibitors can be manually inspected and confronted with the docking outcome of the original compounds. The results not only provide a library of new potential CYP inhibitors (all generated compounds are shared in the Supplementary Data), but can also guide the process of designing new compounds with CYP inhibitory properties. Availability of all scripts used in the study (at https://github.com/gmum/cyp-inhibitors) makes the developed tools general and enable their application for any target.

Funding
The study was supported by the grant OPUS 2018/31/B/ NZ2/00165 financed by the National Science Centre, Poland (www.ncn.gov.pl). The work of TD was supported by the grant PRELUDIUM 2020/37/N/ST6/02728 financed by the National Science Centre, Poland (www.ncn.gov.pl). This research was supported in part by PL-Grid Infrastructure.
CRediT authorship contribution statement

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.