A mutation-induced drug resistance database (MdrDB)

Mutation-induced drug resistance is a significant challenge in the clinical treatment of many diseases, as structural changes in proteins can diminish drug efficacy. Understanding how mutations affect protein-ligand binding affinities is crucial for developing new drugs and therapies. However, the lack of a large-scale, high-quality database has hindered research progress in this area. To address this issue, we developed MdrDB, the largest database of its kind, which integrates data from seven publicly available datasets. By incorporating information on drug sensitivity and cell line mutations from Genomics of Drug Sensitivity in Cancer and DepMap, MdrDB substantially expands the existing drug resistance data. MdrDB comprises 100,537 samples covering 240 proteins (spanning 5,119 PDB structures), 2,503 mutations, and 440 drugs. Each sample brings together the 3D structures of the wild-type and mutant protein-ligand complexes, the binding affinity change upon mutation (ΔΔG), and biochemical features. Experiments with MdrDB demonstrate that it significantly enhances the performance of commonly used machine learning models when predicting ΔΔG in three standard benchmarking scenarios. In conclusion, MdrDB is a comprehensive database that can advance the understanding of mutation-induced drug resistance and accelerate the discovery of novel chemicals.

(Figure caption, beginning truncated) … for the mutant site of amino acids with different predicted secondary structures: helix, loop, and sheet. The x-axis represents the ΔΔG values (kcal mol⁻¹), and the y-axis represents the protein's secondary structure.

Figure 8. Scatter plots of the experimental versus calculated ΔΔG values in Scenario 1. The x-axis denotes the experimental ΔΔG values (kcal mol⁻¹); the y-axis denotes the calculated ΔΔG values (kcal mol⁻¹). Each ΔΔG estimate is color-coded according to its absolute error with respect to the experimental ΔΔG value; at 300 K, a 1.4 kcal mol⁻¹ error corresponds to a 10-fold error in the Kd change, and a 2.8 kcal mol⁻¹ error corresponds to a 100-fold error in the Kd change.
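The error-to-fold-change correspondence quoted in the figure caption follows from ΔΔG = RT ln(fold change). A quick numerical check in plain Python (R is the gas constant in kcal mol⁻¹ K⁻¹):

```python
import math

R = 1.987e-3  # gas constant in kcal mol^-1 K^-1

def ddg_for_fold_change(fold_change: float, temperature: float = 300.0) -> float:
    """Binding free energy change (kcal/mol) equivalent to a fold-change in Kd."""
    return R * temperature * math.log(fold_change)

print(round(ddg_for_fold_change(10.0), 2))   # 1.37, quoted as ~1.4 kcal/mol
print(round(ddg_for_fold_change(100.0), 2))  # 2.75, quoted as ~2.8 kcal/mol
```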

Supplementary Note 1: Model performance evaluation
In this section, we provide a comprehensive evaluation of 10 common machine learning models in several scenarios, and provide baseline prediction results on the MdrDB database. The corresponding code is available at https://github.com/tencent-quantumlab/MdrDB.
The following are the details of the four main experimental scenarios:

• Scenario 1: Evaluate the prediction performance of the machine learning methods on the MdrDB_CoreSet, with samples corresponding to single substitutions.

◼ Scenario 1.1: Random split. Approximately 80% of the data is used as training samples, and the remaining 20% as test samples.

◼ Scenario 1.2: 5-fold cross-validation. All data is randomly split into 5 folds; the machine learning method is trained on 4 folds, while the remaining fold is used to test the model. This process is repeated 5 times to obtain predictions for all data.

◼ Scenario 1.3: Group 5-fold cross-validation (Uniprot ID). Samples are grouped by Uniprot ID, and all data is randomly divided into 5 folds such that the same group does not appear in two different folds. The machine learning method is trained on 4 folds, while the remaining fold is used to test the method. This process is repeated 5 times to obtain predictions for all data.

◼ Scenario 1.4: 5-fold nested cross-validation (protein sequence). Protein sequences are obtained according to Uniprot ID and one-hot encoded. The sequences are then divided into 5 groups by k-nearest neighbors (KNN) clustering, and 5-fold nested cross-validation is applied: at each iteration, the machine learning method is trained on the 4 folds covering 4 groups, while the remaining fold is used to test the method. This process is repeated 5 times to obtain predictions for all data.

◼ Scenario 1.5: Group 5-fold cross-validation (drug name). Samples are grouped by drug name, and all data is randomly divided into 5 folds such that the same group does not appear in two different folds. The machine learning method is trained on 4 folds, while the remaining fold is used to test the method. This process is repeated 5 times to obtain predictions for all data.

◼ Scenario 1.6: 5-fold nested cross-validation (SMILES). SMILES strings are first converted into molecular fingerprints, i.e., binary vectors encoding the presence or absence of certain substructures in the molecule. Pairwise Tanimoto similarity between fingerprints is then calculated as a measure of similarity between molecules. Finally, the KNN clustering algorithm groups similar molecules into 5 clusters based on their pairwise similarities, and 5-fold nested cross-validation is applied: at each iteration, the machine learning method is trained on the 4 folds covering 4 groups, while the remaining fold is used to test the method. This process is repeated 5 times to obtain predictions for all data.

◼ Scenario 1.7: 25-fold nested cross-validation (amino acid type). Amino acid type changes (from wild type to mutant) are extracted from the mutation information of the dataset. For instance, for a sample with mutation "A256R", amino acid A belongs to the Hydrophobic group and amino acid R belongs to the Positive group, so the amino acid type change of this sample is "Hydrophobic_Positive". In the manuscript, the 20 amino acids are divided into five groups (positive, negative, polar, special cases, and hydrophobic), giving 5 × 5 = 25 possible groups of amino acid changes before and after mutation. Samples are grouped by amino acid type change; the machine learning method is trained on 24 folds, while the remaining fold is used to test the method. This process is repeated 25 times to obtain predictions for all data.

◼ Scenario 1.8: 237-fold nested cross-validation (amino acid). Amino acid changes (from wild type to mutant) are extracted from the mutation information of the dataset. For instance, for a sample with mutation "A256R", the amino acid change is "A_R". Samples are grouped by amino acid change (237 groups); the machine learning method is trained on 236 folds, while the remaining fold is used to test the method. This process is repeated 237 times to obtain predictions for all data.
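The amino-acid-type grouping used in Scenario 1.7 can be sketched as follows. The class membership below is the standard physicochemical grouping and is an assumption, since the exact definitions of the five groups are not listed in this note:

```python
import re

# Assumed grouping of the 20 amino acids into the five classes named above
# (positive, negative, polar, special cases, hydrophobic).
AA_GROUP = {
    **{aa: "Positive" for aa in "RHK"},
    **{aa: "Negative" for aa in "DE"},
    **{aa: "Polar" for aa in "STNQ"},
    **{aa: "Special" for aa in "CGP"},
    **{aa: "Hydrophobic" for aa in "AVILMFYW"},
}

def amino_acid_type_change(mutation: str) -> str:
    """Map a mutation string like 'A256R' to its group-change label."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if m is None:
        raise ValueError(f"Unrecognized mutation format: {mutation}")
    wt, _, mut = m.groups()
    return f"{AA_GROUP[wt]}_{AA_GROUP[mut]}"

print(amino_acid_type_change("A256R"))  # Hydrophobic_Positive
```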
• Scenario 2: Evaluate the prediction performance of the machine learning methods on the MdrDB_CoreSet, with samples corresponding to multiple substitutions.

◼ Scenario 2.1: Random split. Approximately 80% of the data is used as training samples, with the remaining 20% as test samples.

◼ Scenario 2.2: 5-fold cross-validation. All data is randomly split into 5 folds; the machine learning method is trained on 4 folds, while the remaining fold is used to test the model. This process is repeated 5 times to obtain predictions for all data.

Evaluation metrics: Root Mean Square Error (RMSE), Pearson correlation coefficient (Pearson), and the area under the precision-recall curve (AUPRC) were used to evaluate model performance. Consistent with previous work, resistant mutations are defined as those changing the affinity by at least 10-fold, i.e., ΔΔGexp > 1.36 kcal mol⁻¹.
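The three metrics above can be computed in a few lines; a minimal sketch with NumPy and scikit-learn (the `evaluate` helper name is ours, not from the MdrDB codebase):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

RESISTANCE_CUTOFF = 1.36  # kcal/mol, i.e. at least a 10-fold affinity change

def evaluate(ddg_true, ddg_pred):
    """Return (RMSE, Pearson, AUPRC) for a set of ΔΔG predictions."""
    ddg_true = np.asarray(ddg_true, dtype=float)
    ddg_pred = np.asarray(ddg_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((ddg_true - ddg_pred) ** 2)))
    pears = float(np.corrcoef(ddg_true, ddg_pred)[0, 1])
    # Binarize true values at the resistance cutoff; the predicted ΔΔG
    # itself serves as the classification score.
    labels = (ddg_true > RESISTANCE_CUTOFF).astype(int)
    precision, recall, _ = precision_recall_curve(labels, ddg_pred)
    auprc = float(auc(recall, precision))
    return rmse, pears, auprc
```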

Scenario 1.1: Randomly split the samples (single substitution).
In Scenario 1.1, we report the RMSE, Pearson, and AUPRC averaged over 5 repetitions for each machine learning method in Supplementary Table 10. Supplementary Figure 10 shows the scatter plots of the experimental versus calculated ΔΔG values in one experiment. It can be observed that ExtraTrees outperforms the other machine learning methods on MdrDB_CoreSet (single substitution). Furthermore, most of the tree-based and ensemble-based methods obtain better prediction performance than the linear and neural network-based methods in this scenario.

Supplementary Table 10. Test prediction performance on MdrDB_CoreSet (single substitution).
Mean performance (± std) over 5 repetitions is reported. The best is highlighted in bold.

Scenario 1.2: 5-fold cross-validation (single substitution).
Supplementary Table 11 shows the mean and standard deviation of performance on MdrDB_CoreSet (single substitution) under 5-fold cross-validation, and Supplementary Figure 11 shows the scatter plots of the experimental versus calculated ΔΔG values in Scenario 1.2. It can be observed that ExtraTrees outperforms the other machine learning methods on MdrDB_CoreSet (single substitution). Furthermore, consistent with Scenario 1.1, most of the tree-based and ensemble-based methods obtain better prediction performance than the linear and neural network-based methods in this scenario.

Scenario 1.3: Group 5-fold cross-validation (Uniprot ID).
In Supplementary Table 12, we present the mean and standard deviation of performance on MdrDB_CoreSet (single substitution) using group 5-fold cross-validation based on Uniprot ID, and Supplementary Figure 12 displays the scatter plots of the experimental versus calculated ΔΔG values. The machine learning methods perform relatively well in terms of RMSE, but their correlation and classification performance is significantly worse than in Scenario 1.2. For example, SVR demonstrates a weak correlation (Pearson = 0.062 ± 0.04) and poor classification performance (AUPRC = 0.218 ± 0.038). A possible explanation for this outcome is that, in this scenario, the same protein group does not appear in two different folds, and the performance of machine learning methods may degrade when they are tested on protein groups unseen during training (i.e., trained only on unrelated data).
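The grouped split used here maps directly onto scikit-learn's GroupKFold; a minimal sketch with synthetic features and made-up group labels (X, y, and the ID strings are placeholders, not MdrDB data):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))   # placeholder feature matrix
y = rng.normal(size=100)        # placeholder ΔΔG labels
ids = rng.choice(["ID_A", "ID_B", "ID_C", "ID_D", "ID_E"], size=100)  # stand-ins for Uniprot IDs

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=ids):
    # No protein group appears in both the training and the test fold.
    assert set(ids[train_idx]).isdisjoint(ids[test_idx])
    model = ExtraTreesRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
```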

Supplementary Table 12. Prediction performance obtained with group 5-fold cross-validation based on Uniprot ID on MdrDB_CoreSet (single substitution).
Mean and standard deviation are reported. The best is highlighted in bold.

Supplementary Figure 12. Scatter plots of the experimental versus calculated ΔΔG values in Scenario 1.3. Each ΔΔG estimate is color-coded according to its absolute error with respect to the experimental ΔΔG value; at 300 K, a 1.4 kcal mol⁻¹ error corresponds to a 10-fold error in the Kd change, and a 2.8 kcal mol⁻¹ error corresponds to a 100-fold error in the Kd change.

Scenario 1.4: 5-fold nested cross-validation (protein sequence).
To further evaluate the ability of machine learning to predict drug resistance more broadly across protein classes, we calculated the similarity of the protein sequences in MdrDB_CoreSet, clustered them into five groups, and performed 5-fold nested cross-validation. Supplementary Table 13 shows the prediction performance of the machine learning methods, and Supplementary Figure 13 displays the scatter plots of the experimental versus calculated ΔΔG values. Although SVR achieves the best results among all competing methods, it still shows a relatively high RMSE (1.408 ± 0.291), weak correlation (Pearson = 0.072 ± 0.034), and poor classification performance (AUPRC = 0.24 ± 0.05) in this scenario. These results indicate that current machine learning methods generalize poorly when predicting drug-resistant mutations in protein families not seen in the training set.
Supplementary Table 13. Prediction performance obtained with 5-fold nested cross-validation (samples are clustered according to the similarity of amino acid sequences of proteins and are divided into 5 groups). Mean and standard deviation are reported. The best is highlighted in bold.
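The sequence-based grouping can be sketched as follows. This uses k-means as an illustrative stand-in for the clustering step, and the short sequences are made-up strings, not MdrDB proteins:

```python
import numpy as np
from sklearn.cluster import KMeans

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(seq: str, length: int) -> np.ndarray:
    """One-hot encode a sequence, zero-padded/truncated to a fixed length."""
    vec = np.zeros((length, len(AA)))
    for i, aa in enumerate(seq[:length]):
        vec[i, AA.index(aa)] = 1.0
    return vec.ravel()

# Hypothetical short sequences standing in for the MdrDB proteins.
seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGSSGGGSS", "GGGSSGGGSA", "WWPFLVMILA",
        "WWPFLVMILV", "MKTAYIAKQK", "GGGSSGGGST", "WWPFLVMILC", "ACDEFGHIKL"]
L = max(len(s) for s in seqs)
Xseq = np.stack([one_hot(s, L) for s in seqs])

# Cluster the encoded sequences into 5 groups for the nested CV folds.
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Xseq)
```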

Scenario 1.5: Group 5-fold cross-validation (drug name).
To explore the ability of machine learning methods to predict drug resistance from a ligand perspective, we conduct group 5-fold cross-validation based on drug name on MdrDB_CoreSet. As shown in Supplementary Table 14, ExtraTrees obtains the best prediction performance; however, it still performs poorly in terms of correlation (Pearson = 0.161 ± 0.075) and classification ability (AUPRC = 0.3 ± 0.06). In this scenario, the same ligand group does not appear in two different folds, which may explain why the prediction ability of the machine learning methods degrades when they are tested on unseen ligand groups.

Supplementary Table 14. Prediction performance obtained with group 5-fold cross-validation based on drug name on MdrDB_CoreSet (single substitution).
Mean and standard deviation are reported. The best is highlighted in bold.

Scenario 1.6: 5-fold nested cross-validation (SMILES).
To further evaluate the ability of the machine learning methods to predict drug resistance more broadly across ligands, we calculated the similarity of the ligand SMILES in MdrDB_CoreSet, clustered them into five groups, and performed 5-fold nested cross-validation. Supplementary Table 15 shows the prediction performance of the machine learning methods, and Supplementary Figure 15 displays the scatter plots of the experimental versus calculated ΔΔG values. Although RandomForest obtains the best results in terms of RMSE (1.181 ± 0.244), it shows weak correlation (Pearson = 0.016 ± 0.056) and poor classification performance (AUPRC = 0.191 ± 0.133). This result indicates that current machine learning methods generalize poorly when predicting drug-resistant mutations in ligand classes not seen in the training set.
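The Tanimoto measure underlying this scenario's grouping is simple to state; a minimal pure-Python sketch over binary fingerprints (in practice the fingerprints would come from a cheminformatics toolkit such as RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints (0/1 sequences)."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = on_a | on_b
    if not union:
        return 0.0  # convention: two empty fingerprints are dissimilar
    return len(on_a & on_b) / len(union)

print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 shared bits / 3 set bits = 2/3
```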

Scenario 1.7: 25-fold nested cross-validation (amino acid type).
To evaluate the capability of machine learning in predicting drug resistance from a mutation perspective, we performed 25-fold cross-validation based on the type of amino acid change from wild type to mutant. Supplementary Table 16 shows the prediction performance of the machine learning methods, and Supplementary Figure 16 displays the scatter plots of the experimental versus calculated ΔΔG values in this scenario. ExtraTrees, RandomForest, and Bagging achieve similar prediction results, with relatively good RMSE and correlation but poor classification performance.

Scenario 3.1: Training on single substitutions, testing on multiple substitutions.
In Scenario 3.1, the machine learning methods are trained on MdrDB_CoreSet (single substitution) and then tested on MdrDB_CoreSet (multiple substitutions). We aimed to assess whether models trained on single substitutions can be extrapolated to multiple-substitution mutations. Supplementary Table 20 shows the test prediction ability of the machine learning methods in this scenario, and the corresponding scatter plot of the experimental versus calculated ΔΔG values of the multiple-substitution samples is displayed in Supplementary Figure 20. ExtraTrees obtains the best performance in this scenario, although it only achieves a relatively weak correlation (Pearson = 0.272) and poor classification ability (AUPRC = 0.337). This indicates that the robustness of current machine learning methods in extrapolating from single substitution mutations to multiple point mutations still needs to be improved.

Scenario 4.1: Training on single substitutions, fine-tuning on multiple substitutions (fine-tune : test = 8 : 2).
To further explore the prediction ability of the current machine learning methods for drug-resistance prediction problems, we conduct experiments on MdrDB_CoreSet with a fine-tuning protocol. Specifically, the machine learning methods are trained on MdrDB_CoreSet (single substitution), the model parameters are then fine-tuned on approximately 80% of the samples in MdrDB_CoreSet (multiple substitutions), and the models are finally tested on the remaining multiple-substitution samples. Supplementary Table 22 shows the average prediction performance of all the competing machine learning methods over 5 repetitions, and the scatter plots of the experimental versus calculated ΔΔG values in one experiment are displayed in Supplementary Figure 22. ExtraTrees achieves the best performance in this scenario. Compared with the results obtained in Scenario 2.1, pre-training the model on MdrDB_CoreSet (single substitution) helps to improve its resistance prediction ability on multiple substitutions.
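Tree ensembles such as ExtraTrees are simply refit in this protocol, but for gradient-trained models the fine-tuning step can literally continue optimization from the pre-trained weights. A hedged sketch of the pre-train/fine-tune/test pipeline with scikit-learn's MLPRegressor on synthetic placeholder data (the exact procedure used in the paper may differ):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic placeholders for MdrDB features and ΔΔG labels.
X_single, y_single = rng.normal(size=(200, 8)), rng.normal(size=200)
X_multi, y_multi = rng.normal(size=(50, 8)), rng.normal(size=50)

# 80/20 split of the multiple-substitution samples (fine-tune : test = 8 : 2).
cut = int(0.8 * len(X_multi))
X_ft, y_ft = X_multi[:cut], y_multi[:cut]
X_test, y_test = X_multi[cut:], y_multi[cut:]

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=200, random_state=0)
model.fit(X_single, y_single)   # pre-train on single substitutions
for _ in range(50):             # fine-tune: continue training on multiples
    model.partial_fit(X_ft, y_ft)
preds = model.predict(X_test)
```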

Scenario 4.2: Training on single substitution mutations, fine-tuning on (deletion+indel+insertion+complex) mutations (fine-tune : test = 8 : 2).
In Scenario 4.2, the machine learning methods are trained on MdrDB_CoreSet (single substitution), the model parameters are then fine-tuned on approximately 80% of the samples in MdrDB_CoreSet (complex mutations, i.e., deletion, indel, insertion, and complex), and the models are finally tested on the remaining complex mutation samples. Supplementary Table 23 shows the average prediction performance of all the competing machine learning methods over 5 repetitions, and the scatter plots of the experimental versus calculated ΔΔG values in one experiment are displayed in Supplementary Figure 23. Compared with the results of Scenario 3.2, fine-tuning on a portion of the complex mutation samples improves the prediction performance to a certain extent.