D-CyPre: a machine learning-based tool for accurate prediction of human CYP450 enzyme metabolic sites

The advancement of graph neural networks (GNNs) has made it possible to accurately predict metabolic sites. Despite the combination of GNNs with XGBOOST showing impressive performance, this technology has not yet been applied in the realm of metabolic site prediction. Previous metabolic site prediction tools focused on bonds and atoms, regardless of the overall molecular skeleton. This study introduces a novel tool, named D-CyPre, that amalgamates atom, bond, and molecular skeleton information via two directed message-passing neural networks (D-MPNN) to predict the metabolic sites of the nine cytochrome P450 enzymes using XGBOOST. In D-CyPre Precision Mode, the model produces fewer, but more accurate results (Jaccard score: 0.497, F1: 0.660, and precision: 0.737 in the test set). In D-CyPre Recall Mode, the model produces less accurate, but more comprehensive results (Jaccard score: 0.506, F1: 0.669, and recall: 0.720 in the test set). In the test set of 68 reactants, D-CyPre outperformed BioTransformer on all isoenzymes and CyProduct on most isoenzymes (5/9). For the subtypes where D-CyPre outperformed CyProducts, the Jaccard score and F1 scores increased by 24% and 16% in Precision Mode (4/9) and 19% and 12% in Recall Mode (5/9), respectively, relative to the second-best CyProduct. Overall, D-CyPre provides more accurate prediction results for human CYP450 enzyme metabolic sites.


INTRODUCTION
Drug metabolism is closely linked to the bioavailability, bioactivity, and toxicology of those drugs.Cytochrome P450 (CYP450) enzymes are responsible for the metabolism of approximately 90% of FDA-approved medicines and play a vital role in the Phase I metabolism of drugs (Nebert & Russell, 2002).Humans have more than 50 CYP450 isozymes, of which CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9, 2C19, CYP2D6, CYP2E1, and CYP3A4 are responsible for the Phase I metabolism of most known drugs (Nebert & Russell, 2002;Furge & Guengerich, 2006), so predicting drug metabolism by CYP450 isoforms is crucial for drug design and discovery (Jianing et al., 2011). Schwab, 2013).As shown in Fig. 1, D-CyPre can be divided into two parts.The first part generates the features using D-MPNN, and the second part predicts metabolic sites using these features.Finally, D-CyPre visually displays the predicted results (Fig. 1).The darker the red in the figure, the higher the probability of metabolism at this site.Additionally, the probability value is written on the target atom or bonding atom of the target bond.D-CyPre only displays sites that have a probability of metabolism greater than 50%.
EBoMD2 was used as a test data set to evaluate D-CyPre's performance and compare it with CyProduct's performance.EBoMD2 comes from CyProduct and contains 68 extracted reactants and 30 known non-CYP450 reactants (Tian et al., 2021).D-CyPre's accuracy in metabolite prediction was then compared to BioTransformer and CyProduct (Djoumbou-Feunang et al., 2019;Tian et al., 2021), which are predictive models for metabolites.

Atoms and bonds of metabolism
The BOM used in this study was determined by CyProduct.Tian et al. (2021) argue that BOM is more clearly defined and classified more systematically than SOM.Because D-CyPre is structured differently than other models, a new definition of SOM was created based on the BOM.In D-Cypre, the features of atoms and bonds both descend to the same dimensions by neural networks, so common features of atoms and bonds can be identified.This study still refers to these defined atoms and bonds as SOMs.The specific rules of SOMs that D-CyPre should recognize are described, as follows: (1) i-j: i and j represent any two non-H atoms currently connected by an existing chemical bond.The bond formed by these two atoms is a SOM that D-CyPre should recognize.
(2) i-H: i represents any non-H atom, and hydrogen atoms on i are replaced with heteroatoms.Atom i and the bond formed between i and H are both SOMs because this reaction involves both i and its bonds with H.
(3) SPN: When new bonds are generated on S, P, or N by sharing their lone pair of electrons, these atoms are defined as SOMs because this reaction only involves atoms (S, P, or N).
Instead of creating a model for each type of bond, as CyProduct does (Tian et al., 2021), only one model was used to identify all the types of SOMs of one CYP450 isoform.D-CyPre does not treat atoms and bonds separately, but uses the same discriminator for both to determine whether they are SOMs.Atoms and bonds are determined using the same model because their associated information is combined during the message-passing process of D-MPNN, and the model is likely to learn more positive information without distinguishing between them.The distribution of SOMs for nine CYP450 isoforms is shown in Table 1 and Fig. 2. In both the EBoMD and the EBoMD2, 3A4 had the largest number of SOMs and non-SOMs, and there were large differences in the number of SOMs and non-SOMs for each isozyme.

Generation of metabolites based on SOMs
Metabolites were generated based on whether the SOM was an atom or a bond.The detailed SMIRKS Reaction can be found in the Reaction rules of the CyProduct project (https://bitbucket.org/wishartlab/cyproduct/src/master/;Tian et al., 2021).The generation rules are as follows: (1) Atom: When the SOM is an atom, oxidation occurs if the atom is S, N, or P. When the atom is C in the benzene ring and is connected with H, hydroxylation should occur.If the C is not on the benzene ring and there are N or O atoms, all the bonds connected to the N and O atoms are broken.If the molecule is broken into two products, the C of the product will further oxidize.If the molecule is still one product, the metabolism is not considered.
(2) Bond:   2).Details of these descriptors are available in Table S1.The data used for training and testing D-CyPre included C, H, O, N, S, and P. To avoid producing a large number of dimensions that could be learned, the same value was assigned to all other types of atoms when calculating the atomic number.

D-CyPre
D-CyPre consists of D-MPNN and XGBOOST, where D-MPNN outputs the features of atoms and bonds, and XGBOOST identifies SOMs based on these features.The details of these structures are as follows: D-MPNN D-MPNN was built based on ComboNet's MPN, which originally came from the opensource Chemprop Software (Yang et al., 2019;Jin et al., 2021) available at https://github.com/chemprop/chemprop.First, the information about the directed bonds and their starting atoms are fused: where W a 2 R hÂh a is a learned matrix, x v ; e vw ½ 2R h a splice together e vw , the feature of a directed bond, and x v , the feature of the initial atom of the bond.To avoid a message loop through a directed bond, the features of directed bonds vw starting from atom v are characterized by concatenating the features of atom v and undirected bonds vw, and then updated through W a Eq.(2).s is the LeakyReLU activation function (Xu et al., 2015).Then, the message-passing begins: where W m 2 R hÂh is a learned matrix, drop is the Dropout layer (Srivastava et al.), and N v ð Þ represents all the atoms connected to v. The features surrounding the directed bond kv are first aggregated through Eq. (3).The features aggregated in Eq. (3) are then updated through W m in Eq. ( 4) and are combined with the original features of the directed bond kv.This is because the initial features are the most important information directly related to the directed bond kv.Finally, the features of the directed bond kv are obtained after one message-passing action.The message-passing action is repeated n times, which represents the depth of the message-pass; the greater the n, the farther the message will pass.Then, the features of the bonds and atoms are calculated from the message: where B is the Batch Normalization (Ioffe & Szegedy, 2015), W o 2 R hÂh b is a learned matrix.Note that the same Batch Normalization layer is used for both atoms and bonds.
When calculating the features of the bond formed by atoms v and w, the mean of the bond features in both directions, after multiple message-passing actions, is calculated and normalized by Batch Normalization Eq. ( 5).To calculate the features of atom v, all the features of the directed bond starting from the atom v are gathered and spliced with the initial features of the atom.This is because the initial features of the atom are the closest features to the atom's own state.The features are then updated based on W o , and the final features are obtained through activation functions Dropout and Batch Normalization Eq. ( 6).Then, F v and F vw are fed into a single-layer neural network, and two values are obtained for each bond and atom: the positive probability and the negative probability.The cross-entropy is then used to calculate the loss of the model: where loss p and loss n are the loss of atoms and bonds that are truly labeled positive and negative, respectively; and a and b are two self-defined parameters, which respectively represent the importance attached to loss p and loss n .These two parameters are adjusted when training models of the different CYP450 isoforms.This approach is adopted because the data set is unbalanced, with a large number of negative data and only a small number of positive data (Fig. 2).Because of this imbalance, the model may classify all samples as negative resulting in an extremely low loss.Therefore, by setting the loss value in this manner, the weight of positive and negative data can be flexibly adjusted while treating them equally.This ensures that the recall rate of the model is not too low.The updated feature generator generates features that are passed to XGBOOST to train the real discriminator.The objective and Feval for XGBOOST are set to "binary: logistic" and Jaccard score, respectively.Other parameters for XGBOOST such as "n estimators", "reg lambda", "max depth" and "colsample bytree" are tuned for different isoforms.

Molecular features
Molecular features play a crucial role in identifying SOMs.For instance, two atoms or bonds in similar conditions may react differently with CYP450 due to their molecular structure: one may react, while the other may not.Such bonds or atoms are hard to identify without including molecular features.This study considered two types of molecular features: the first type was generated by a new D-MPNN (Yang et al., 2019), and the second type was directly calculated according to specific rules (MolWt; NumHAcceptors; NumHDonors; MolLogP; TPSA; LabuteASA).Details of these descriptors can be found in Table S1.The molecular features in this study were directly concatenated with the features of the atoms and bonds contained in that molecule.Although molecular features are important, their introduction did not necessarily improve the Jaccard Score of all models in this study.There are two main reasons for this.First, the model used in this study is already complex, so introducing molecular features might not have been able to further improve it and could have even caused more severe overfitting.Second, because the data set was not large enough, the model could only learn a small amount of molecular information, which may have been a disturbance for some isoforms of CYP450.Figure 3 illustrates the structure of D-CyPre that incorporates molecular features.
Precision mode and recall mode D-CyPre has two modes: high precision and high recall.The difference between the two is that in Precision Mode, XGBOOST's "scale pos weight" is set to the default, while in Recall Mode, this parameter is set to (c Â Positive/Negative), where c is a parameter that can be adjusted.

Training model
For any CYP450 isoform, the EBoMD was divided into a training set and a validation set, in a ratio of 8:2 (since the features of SOMs are affected by the entire molecular structure, molecules were used, rather than SOMs, as the minimum unit when dividing the data set).Based on these data, the model parameters were adjusted to obtain those with a high Jaccard score in both the training and validation sets.
During this process, D-MPNN was trained using data from all isoforms and then XGBOOST was trained using data from only the target isoform (Fig. 4).This improved both the Jaccard score and the generalization ability of the model because the metabolism information of nine isoforms had some overlap.Although learning more knowledge from other isoforms may introduce some noise into the model, this knowledge and moderate noise enhance the model's generalization ability (File S1).The same method, based on parameters with a high Jaccard score in both training and validation sets, was used to train the final D-CyPre and test it with the test set.

The results of training based on SOMs
The result of CypBOM is a cross-validation result of CyProduct in predicting metabolic sites, which is only used as a simple reference because of slight differences in the definition of metabolic sites.For metabolite prediction, after optimizing the SOMs prediction model, it is more valuable to convert the SOMs predicted by the model into metabolites and compare them with CyProduct and BioTransformer (the results of testing based on metabolites).

Training results of precision mode
Training results are shown in Table S2.The Jaccard score of D-CyPre-val for nine CYP450 enzymes was higher than the CypBoM Jaccard Score for these enzymes.Similarly, D-CyPre-val showed higher precision and F1 values for all CYP450 enzymes except 2C8.

Training results of recall mode
According to results shown in Table S3, D-CyPre-val produced a higher Jaccard score, recall, and F1 for nine CYP450 enzymes.However, D-CyPre-val failed to achieve better precision in the data of 1A2, 2B6, and 2C8 under Recall Mode.The weighted average of all metrics of D-CyPre-val was higher than that of CypBoM.This indicates that D-CyPre has good predictive power.In Recall Mode, D-CyPre produces comprehensive results while retaining high precision.

The results of testing based on SOMs
Testing results of precision mode Using Precision Mode, D-CyPre enhanced precision (WAvg) by 1,136% on the test set compared to Random Predictor (Table S4).Compared to CyProduct, D-CyPre also produced higher Jaccard score (WAvg), recall (WAvg), and F1 (WAvg) values, with increases of 800%, 21%, and 535%, respectively.The results from the training set are displayed in Table S2, with D-CyPre exhibiting exceptionally high precision values for several of the nine CYP450 enzymes in both the training set and test set.For example, the precision values for 2A6 and 2E1 in the training and test sets surpassed 0.8 and 0.9, respectively.
Unfortunately, D-CyPre in Precision Mode did not exhibit strong performance across all enzymes.Despite the fact that D-CyPre performed well for 2B6 and 2C8 in the validation set (Table S2), their results in the test set indicated severe overfitting (Table S4).CyProduct also encountered this issue (Tian et al., 2021), with the models for 2B6 and 2C8 performing well in the validation set but performing poorly in the test set.To further investigate the underlying causes of this problem, t-SNE was used to visualize the SOMs based on features generated by D-MPNN (van der Maaten & Hinton, 2008).The SOMs in the training set, validation set, and test set of these models were all visualized.The green box in Fig. 5A represents potential false negatives in the test set that reduced Recall for the 2B6 model.Similarly, Figs.5C and 5D illustrate possible sources of error for 2C8.From these results, it can be inferred that there may be two reasons for the poor generalization ability of these models on the test set.First, an insufficiently sized training set might have led to some bonds or atoms in the test set that are similar in structure to SOMs being incorrectly classified as positive (Figs.5A and 5C), as shown in the part of the graph where blue + coincides with red +.Some unfamiliar actual SOMs are incorrectly classified as negative (Figs. 5B and 5D), as shown in the part of the graph where red + coincides with blue +.The second possible source of error is that 2B6 and 2C8 had almost the largest ratios of Non-SOMs/SOMs (Table S5) in their respective test sets, which may have made Precision more sensitive to errors.On larger test sets, the test results of the model might perform more closely to the validation set.Furthermore, neither the test nor validation sets were distributed within regions lacking training data, indicating that there were no atoms or bonds present in either set that had not been previously encountered by the models.This indicates that D-CyPre's chemical space based on its training data was sufficiently large.

Testing results of recall mode
As shown in Table S6, compared to Random Predictor, D-CyPre exhibited an 813% increase in Jaccard score (WAvg), a 981% increase in Precision (WAvg), a 44% increase in Recall (WAvg), and a 545% increase in F1 (WAvg).Additionally, the Jaccard scores for 2B6 and 2C8 also improved under Recall Mode.Overall, D-CyPre successfully maintained high Jaccard scores while achieving high recall.The results of training D-CyPre with the molecular features on SOMs The effects of two molecular features on 1A2 and 2B6 (Table S7) were compared and the results showed that the molecular features calculated by D-MPNN exhibited some advantages over those calculated using fixed rules.As a result, the same methodology was applied to construct a version of D-CyPre that incorporated molecular features calculated by D-MPNN.However, this version of D-CyPre did not exhibit better performance across all isoforms when compared with the original version of D-CyPre (Tables S8 and S9).Subsequently, optimal models from both versions of D-CyPre (with or without molecular features) were synthesized to obtain a new Precision Mode (Table 3) and Recall Mode (Table 4); 1A2, 2A6, 2B6, 2C8, 2C9, and 2C19 enzymes were all adopted by models incorporating molecular features under both modes, which suggests that molecular structure may be an important factor affecting metabolic reactions for these enzymes.The parameters for loss function and XGBOOST for all models can be found in Table S10.

Training results of precision mode
In Precision Mode (Table S8), the results of the training set showed that the D-CyPre-val of 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2E1, and 3A4 enzymes all improved after the  introduction of molecular features.The Jaccard score of subtype 3A4 improved the most, increasing by 5.4% after the introduction of molecular features.The Jaccard score of seven subtypes increased by an average of 3.1% after the introduction of molecular features.

Training results of recall mode
In Recall Mode (Table S9), the D-CyPre-val of 1A2, 2A6, 2B6, 2C8, 2C9, and 2C19 enzymes improved after the introduction of molecular features.The Jaccard Score of subtype 1A2 exhibited the greatest improvement, with a notable increase of 9.2% upon the introduction of molecular features.On average, the Jaccard score of the four subtypes showed an increase of 6.5% following the incorporation of molecular features.

DISCUSSION
In Precision Mode and Recall Mode, D-CyPre produced good Jaccard scores while maintaining precision and recall values greater than 0.7.In both modes, D-CyPre exhibits significantly higher F1 scores across all subtypes compared to BioTransformer.Additionally, the Jaccard score for all subtypes except 2B6 are significantly higher for D-CyPre compared to BioTransformer in both modes.The 2A6, 2C8, 2C9, and 3A4 subtypes of D-CyPre in Precision Mode and the 1A2, 2A6, 2C8, 2C9, and 3A4 subtypes in Recall Mode all had a higher Jaccard score and F1 than with CyProduct.However, D-CyPre performed worse than CyProduct in other subtypes.Overall, D-CyPre has higher precision than existing models, allowing it to provide more accurate positive results.The results showed that the running time of D-CyPre decreased by 63% and 58% compared with CyProduct and Biotransformer, respectively, when predicting the metabolites of nine CYP450 subtypes of a single compound (Table S14).Additionally, the results indicate that molecular features are necessary to consider in in silico metabolism prediction.
The two modes enable this tool to be more flexibly applied in various fields such as drug metabolism research, metabolomics, food, and nutrition studies.When analyzing large amounts of data, using Precision Mode may be more appropriate because it will obtain a small but reliable amount of data from a large amount of data, greatly reducing the workload in subsequent experiments.Conversely, using Recall Mode may be more appropriate for analyzing a small amount of data, as it obtains as much information as possible.For example, a high-throughput study may demand more accurate results, whereas making predictions for several drugs and comparing their corresponding metabolites' mass spectra may require the inclusion of all possibilities.
Feature generators based on two D-MPNN are anticipated to exhibit greater potential compared to those relying on fixed rules as the volume of data increases.This is attributed to the flexibility and specificity of features generated by D-MPNN, which integrate information from molecules, atoms, and bonds.Additionally, when trained on larger-scale datasets, models reliant on fixed rules for feature generation may necessitate manual design of novel features.However, D-CyPre can autonomously extract the requisite features from the datasets.D-CyPre can learn both SOMs based on bonds and SOMs based on atoms at the same time, which makes it fully use existing databases.
Although this study explores the method and effect of a metabolic site prediction model based on D-MPNN and XGBoost, it does not establish a model that can directly produce metabolites, requiring an extra step and more time to further analyze and speculate based on the results of this tool.In addition, the effects of D-MPNN combined with other models are not discussed in this study, so there may be better combinations.In the future, the effectiveness of combining D-MPNN with other models should be further examined based on the results of this study, and a model that directly generates metabolic products should be established.

CONCLUSIONS
The present study introduces a novel SOMs identification tool termed D-CyPre.Unlike existing models that generate features for atoms or bonds based on fixed rules, D-CyPre utilizes D-MPNN to generate features for atoms and bonds, thereby enabling better extraction of metabolic-related information.D-CyPre transforms the features of atoms and bonds into a unified dimension and employs a single discriminator for classification, making it adaptable to SOMs defined under various criteria.Additionally, D-CyPre integrates molecular information extracted by D-MPNN with information on atoms and bonds extracted by another D-MPNN to discriminate metabolic sites, thereby further enhancing model accuracy.This model represents the first application of D-MPNN in metabolism prediction and yielded satisfactory in silico outcomes, demonstrating high precision, recall, and Jaccard score.D-CyPre comprises a feature generator and SOMs discriminators and is divided into Precision Mode and Recall Mode.To use the D-CyPre software (as detailed in File S1), a table containing the simplified molecular-input lineentry system (SMILES) strings of all the targeted compounds needs to be input.
Jie Liu conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.Kui Chen performed the experiments, prepared figures and/or tables, and approved the final draft.Shiyu Cong performed the experiments, prepared figures and/or tables, and approved the final draft.Shengnan Cai performed the experiments, prepared figures and/or tables, and approved the final draft.Yueting Li conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.Zhixin Jia conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.Hao Wu analyzed the data, authored or reviewed drafts of the article, and approved the final draft.Tianyu Lou analyzed the data, authored or reviewed drafts of the article, and approved the final draft.Zuying Wei analyzed the data, prepared figures and/or tables, and approved the final draft.Xiaoqin Yang analyzed the data, prepared figures and/or tables, and approved the final draft.Hongbin Xiao conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

20 Figure 3 Figure 4
Figure 3 Illustration of the proposed model, D-CyPre.D-CyPre employs two independent messagepassing processes to capture features of two kinds of directed bonds from a molecule.It then fuses the features of the two kinds of directed bonds to derive features of atoms, chemical bonds, and molecules.The features of atoms and bonds are separately combined with those of the molecule and input into a feed forward layer to generate prediction probabilities, which then update the network.The concatenated features of atoms and bonds are then fed into the XGBOOST model to obtain the actual prediction probabilities.Full-size  DOI: 10.7717/peerj-cs.2040/fig-3 Yang et al. (2024), PeerJ Comput.Sci., DOI 10.7717/peerj-cs.204010/20

Figure 5
Figure 5 Visualization (by t-SNE) of the SOMs of 2B6 and 2C8.Visualization (by t-SNE) of the SOMs of 2B6 (ignore training set; negative) (A), 2B6 (ignore training set; positive) (B), 2C8 (ignore training set; negative) (C) and 2C8 (ignore training set; positive) (D).The green box part is some data that the model may misjudge.Full-size  DOI: 10.7717/peerj-cs.2040/fig-5 of D-CyPre on the training set.b The results of D-CyPre on the validation set.
of D-CyPre on train set.b The results of D-CyPre on validation set.c The results of D-CyPre on EBoMD.d The results of D-CyPre on EBoMD2.e The microaverage (weighted average, weighted by the number of SOMs) over the nine CYP450 enzymes.Yang et al. (2024), PeerJ Comput.Sci., DOI 10.7717/peerj-cs.204013/20 If the bond is C-H and C is connected to O or N, the metabolism is based on the C atom as the SOM.Otherwise, the C-H is hydroxylated.If the atoms at both ends of the bond are C and the bond is a double bond, epoxidation occurs.If it is C-O, oxidation occurs.If it is i-O, where i is S, N, or P, the i-O is broken.If there are multiple O and i is S or P, one of the i-O is broken, and if there are multiple O and i is N, i-O is reduced to a primary amine group.If all the bonds in a ring are SOMs, then the aromatic and non-aromatic ring interconversion occurs.

Table 1
Distribution of SOMs for nine CYP450 isoforms in data sets.

Table 2
Descriptors of atoms and bonds.

Table 3
Training results (Precision Mode) for nine CYP450 enzymes in EBoMD and EBoMD2.

Table 4
Training results (Recall Mode) for nine CYP450 enzymes in EBoMD and EBoMD2.

Table 5
Results (Precision Mode) for nine CYP450 enzymes compared with CyProduct and BioTransformer on 68 reactants of EBoMD2.The microaverage (weighted average, weighted by the number of metabolites) over the nine CYP450 enzymes. Note:a