MMPatho: Leveraging Multilevel Consensus and Evolutionary Information for Enhanced Missense Mutation Pathogenic Prediction

Understanding the pathogenicity of missense mutations (MMs) is essential for shedding light on genetic diseases, gene functions, and individual variations. In this study, we propose a novel computational approach, called MMPatho, for enhancing missense mutation pathogenicity prediction. First, we established a large-scale nonredundant MM benchmark data set based on the entire Ensembl database, complemented by a focused blind test set specifically for pathogenic GOF/LOF MMs. Based on this data set, for each mutation, we utilized Ensembl VEP v104 and dbNSFP v4.1a to extract variant-level, amino acid-level, individual predictors' outputs, and genome-level features. Additionally, protein sequences were generated using ENSP identifiers with the Ensembl API and then encoded. The mutant sites' ESM-1b and ProtTrans-T5 embeddings were subsequently extracted. Our model group (MMPatho), comprising ConsMM and EvoIndMM, was then developed by building on these efforts. Specifically, ConsMM employs individual predictors' outputs together with XGBoost and SHAP explanation analysis, while EvoIndMM investigates the potential enhancement of predictive capability by incorporating evolutionary information from ESM-1b and ProtT5-XL-U50, two large protein language models. Through rigorous comparative experiments, both ConsMM and EvoIndMM achieved remarkable AUROC (0.9836 and 0.9854) and AUPR (0.9852 and 0.9902) values on the blind test set, which contained no variations or proteins overlapping with the training data, highlighting the superiority of our computational approach for predicting MM pathogenicity. Our web server, available at http://csbio.njust.edu.cn/bioinf/mmpatho/, allows researchers to predict the pathogenicity (alongside a reliability index score) of MMs using the ConsMM and EvoIndMM models and provides extensive annotations for user input. Additionally, the newly constructed benchmark data set and blind test set can be accessed via the data page of our web server.


Supplementary Text S1
The versions of the databases used in Ensembl VEP v104.

Supplementary Text S2
Detailed information on the plugins installed in Ensembl VEP v104.
(5) Downstream predicts the downstream effects of frameshift variants on transcript protein sequences (Downstream.pm).
For detailed information about the installation and usage of the other plugins, please refer to http://may2021.archive.ensembl.org/info/docs/tools/vep/script/vep_plugins.html.

Supplementary Text S3
Powerful classification using XGBoost and SHAP-based feature importance analysis.
XGBoost constructs a robust ensemble of weak base models by iteratively adding new trees to the ensemble while minimizing the logistic loss for binary classification. The XGBoost algorithm utilizes gradient boosting, fitting each new tree to the residual errors/gradients of the previous trees. The objective function for XGBoost is given below:

L = Σᵢ l(yᵢ, ŷᵢ) + Σⱼ Ω(fⱼ)    (1)

where yᵢ denotes the training data labels, ŷᵢ represents the predicted values from the ensemble of trees, and fⱼ is one individual tree. The first summation term captures the logistic loss l, which measures the difference between the true and predicted values. Ω(fⱼ) serves as a regularization component that regulates the complexity of individual trees to prevent overfitting. Through iterative optimization of equation (1), XGBoost effectively combines the strengths of gradient boosting with optimized tree-based models, resulting in a robust and accurate classification prediction model.
SHAP 18 (https://github.com/slundberg/shap) offers explanations for model predictions by leveraging the concept of the Shapley value derived from game theory. It provides localized explanations for each feature and quantifies the contribution of each feature to the prediction results. The definition is as follows:

φᵢ(x) = Σ_{S ⊆ N\{i}} [|S|!(|N| − |S| − 1)!/|N|!] × (f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S))    (2)

where φᵢ(x) represents the SHAP value of feature i for sample x, N is the feature set, f_{S∪{i}}(x_{S∪{i}}) is the model's prediction result after adding feature i to the sample x with feature subset S, and f_S(x_S) is the model's prediction result using only the sample x with the feature subset S. |S| denotes the number of features in the feature subset S, and |N| denotes the total number of features.
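For a tiny model, equation (2) can be evaluated exactly by brute force. The sketch below is illustrative only: the real SHAP library uses efficient algorithms such as TreeSHAP, and replacing "missing" features with a fixed baseline value is a simplifying assumption made here.

```python
import itertools
import math

def shapley_values(model, x, baseline, n_features):
    """Brute-force Shapley values: for each feature i, average the marginal
    contribution f(S ∪ {i}) − f(S) over all subsets S of the other features,
    weighted by |S|!(|N|−|S|−1)!/|N|!."""
    phi = [0.0] * n_features
    N = list(range(n_features))
    for i in N:
        others = [j for j in N if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                w = (math.factorial(len(S)) * math.factorial(n_features - len(S) - 1)
                     / math.factorial(n_features))

                def eval_with(features):
                    # "Missing" features are replaced by the baseline value.
                    z = [x[j] if j in features else baseline[j] for j in N]
                    return model(z)

                phi[i] += w * (eval_with(set(S) | {i}) - eval_with(set(S)))
    return phi

# Toy linear model: each SHAP value equals coefficient × (x − baseline).
model = lambda z: 2 * z[0] + 3 * z[1] - z[2]
phi = shapley_values(model, x=[1, 1, 1], baseline=[0, 0, 0], n_features=3)
```

For a linear model the exact Shapley values recover the coefficients, which makes the toy case easy to verify.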

Supplementary Text S4
Performance evaluation measures.
To assess the pathogenicity prediction of MM, we defined pathogenic or likely pathogenic variants as positive samples, and benign or likely benign variants as negative samples. Meanwhile, we utilized several performance measures based on the gold standard (confusion matrix) for evaluation and comparison, including precision (Pre), specificity (Spe), negative predictive value (NPV), recall/sensitivity (Sen, equal to recall), F1-score (F1), accuracy (ACC), Matthews correlation coefficient (MCC), false-positive rate (FPR), false-negative rate (FNR), and error rate (ER), as well as the area under the receiver operating characteristic (ROC) curve (AUROC) and the precision-recall (PR) curve (AUPR), defined below:

Pre = TP/(TP + FP), Recall/Sen = TP/(TP + FN), Spe = TN/(TN + FP), NPV = TN/(TN + FN),
ACC = (TP + TN)/(TP + TN + FP + FN), F1 = 2 × Pre × Recall/(Pre + Recall),
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)),
FPR = 1 − Spe, FNR = 1 − Sen, ER = 1 − ACC,

where TP (true positive) and TN (true negative) refer to correctly classified pathogenic and benign variants, respectively. Conversely, FP (false positive) and FN (false negative) denote misclassified benign and pathogenic variants. To compute the reliability index (RI) on a scale of 0 (random prediction) to 10 (certain prediction) based on the probability of pathogenicity (P_{P/LP}), we employ the following formula:

RI = ⌊20 × |P_{P/LP} − 0.5|⌋
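As a minimal sketch, the confusion-matrix measures and the RI formula (as reconstructed above) can be computed as follows; the counts and probabilities are illustrative, not taken from our experiments.

```python
import math

def metrics(tp, tn, fp, fn):
    """Confusion-matrix measures for binary pathogenicity prediction."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)                     # Recall / Sen
    spe = tn / (tn + fp)
    npv = tn / (tn + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Pre": pre, "Recall": rec, "Spe": spe, "NPV": npv, "ACC": acc,
            "F1": f1, "MCC": mcc, "FPR": 1 - spe, "FNR": 1 - rec, "ER": 1 - acc}

def reliability_index(p):
    """RI on a 0 (random) .. 10 (certain) scale from P(pathogenic)."""
    return int(20 * abs(p - 0.5))

m = metrics(tp=90, tn=85, fp=15, fn=10)   # illustrative counts
```

With these counts, ACC = (90 + 85)/200 = 0.875, and a predicted probability of 0.5 maps to RI = 0 while 1.0 (or 0.0) maps to RI = 10.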

Supplementary Text S7
The detailed introduction and parameter settings of CatBoost, XGBoost, and LightGBM. To optimize the parameter settings and experimental design, the following steps were taken. (1) Data splitting: the benchmark dataset was divided into training data (80%) and a test set (20%) using "train_test_split" from the sklearn [47] module. (2) Grid search cross-validation and model selection: GridSearchCV with 5-fold cross-validation was used to extensively search a predefined grid of hyperparameters and identify the best-performing model. (3) Parameter settings: for XGBoost, the chosen settings were a learning rate of 0.01, a binary logistic regression objective, a subsample ratio of 0.5, the base score as the mean of "y_train", and log_loss as the evaluation metric. CatBoost and LightGBM underwent extensive grid searches over iterations, learning rates, and tree depths. The best parameter configurations were identified as follows: CatBoost (iterations=300, learning_rate=0.1, depth=8) and LightGBM (iterations=400, learning_rate=0.1, depth=6). The CatBoost, XGBoost, and LightGBM models were meticulously fine-tuned and evaluated, and the comparison results are documented in Table S2 and Figure S1.
When comparing the performance of CatBoost, XGBoost, and LightGBM, we evaluated multiple metrics (Table S2 and Figure S1).
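The split-plus-grid-search protocol above can be sketched as follows. sklearn's GradientBoostingClassifier stands in for CatBoost/XGBoost/LightGBM (which may not be installed everywhere), and the grid values and data are illustrative rather than the exact settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the benchmark dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# (1) 80/20 split of the data into training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# (2) 5-fold GridSearchCV over an illustrative hyperparameter grid.
grid = {"n_estimators": [50, 100], "learning_rate": [0.1], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42), grid, cv=5)
search.fit(X_tr, y_tr)

# (3) Evaluate the best model on the held-out 20% test set.
test_acc = search.best_estimator_.score(X_te, y_te)
```

GridSearchCV refits the best configuration on the full training split by default, so `best_estimator_` is ready for test-set evaluation.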

Supplementary Text S9
The importance of the eighty features in ConsMM.
During the training of the ConsMM model using 87 individual outputs (predictions and annotations), it was observed that only 80 individual outputs received importance scores (as listed in Table S2) when using the model.get_score(importance_type='weight') function. This suggests that the model deems these 80 features sufficiently significant to incorporate them in the computation of importance scores.
XGBoost computes feature importance based on the frequency of feature usage during the construction of decision trees in the training phase. Features with higher usage frequencies are regarded as more important.
If a feature possesses a low importance score or is excluded from the importance scores entirely, it indicates that the feature contributes less to the model's predictive performance. In other words, the exclusion of a feature from the importance scores reflects its limited impact on the model's predictions. Here are a few potential explanations as to why certain features may be absent from the importance scores: (1) Feature selection: XGBoost automatically identifies and retains the most informative features during training, discarding less relevant ones based on their contribution to reducing impurity. Consequently, features with lower importance may be pruned. (2) Collinearity: if certain features exhibit high correlation, XGBoost may favor selecting a representative feature and disregarding redundant ones. As a result, only the most important feature within a correlated group is typically included in the importance scores.
(3) Weak predictive power: features with limited predictive capability have minimal impact on the overall model performance. Hence, they often receive lower or no importance scores. Including them may not provide sufficient information to enhance the model's predictive accuracy.
The importance scores assigned to each feature provide a relative measure of their significance within the XGBoost model. These scores enable the evaluation of feature contributions to the model's predictive performance.
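The "weight"-style importance described above (counting how often a feature is used to split) can be illustrated with a small stand-in ensemble. This hypothetical sketch uses sklearn rather than XGBoost's model.get_score, and the data are synthetic; note that the noise features may receive few or no split counts, mirroring the absent-feature behavior discussed above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: only features 0 and 1 carry signal; 2-4 are pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=30, max_depth=2, random_state=1)
model.fit(X, y)

# 'weight'-style importance: count how many internal nodes split on each
# feature across all trees (leaves are marked with negative feature ids).
split_counts = {}
for tree in model.estimators_.ravel():
    for f in tree.tree_.feature:
        if f >= 0:
            split_counts[int(f)] = split_counts.get(int(f), 0) + 1
```

Informative features accumulate many split counts, while uninformative ones receive few or none, which is why they can be absent from the importance dictionary entirely.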

Supplementary Text S10
Assessing the contribution of different feature sets (not using protein ID as the criterion for training and test data splitting).
To assess the impact of different feature sets on EvoIndMM predictions, ablation experiments were conducted. The experimental results (Table S4 and Figure S3) provide insights into feature contributions. EvoIndMM with "VEP" demonstrates the lowest FPR (0.0372), revealing its proficiency in accurately identifying benign variants. Conversely, EvoIndMM with the "ESM-1b" feature set exhibits the highest FNR (0.1716), indicating challenges in correctly identifying pathogenic variants.
Analyzing the results presented in Table S4 and Figure S3, it is evident that EvoIndMM with the "Combined" feature set achieves outstanding performance, as reflected by its highest AUROC (0.9909) and AUPR (0.9923) values. Additionally, it demonstrates robust agreement between predicted and true classes, with the highest MCC (0.9218) and ACC (0.9611) values, indicating accurate classification. Besides, EvoIndMM with "VEP" exhibits superior Recall/Sen (0.9356), showcasing its ability to identify positive variants effectively. In contrast, EvoIndMM with "ESM-1b" shows a lower Spe (0.8022), suggesting a higher rate of false positives. Notably, EvoIndMM with "Combined" achieves the highest Pre (0.9681) and F1 (0.9639) scores, indicating a favorable balance between precision and recall.
In summary, the "Combined" feature set consistently outperforms the other feature sets across various metrics. The limited effectiveness of the ESM-1b and ProtT5-XL-U50 embeddings alone in predicting MM pathogenicity can be attributed to their lack of individual pathogenicity predictions and annotations, limited coverage of mutation features, and inadequate representation of functional impacts. In contrast, the "Combined" feature set outperforms each single set, indicating that the combination provides complementary information that enhances predictive performance. The inclusion of the "VEP" feature set in "Combined", with its extensive individual-level predictions and annotations, likely compensates for the limitations of the ESM-1b and ProtT5-XL-U50 embeddings, resulting in improved accuracy in predicting MM pathogenicity.

Supplementary Text S11
The importance of the one hundred and sixty features in EvoIndMM.
Due to space limitations, Table S3 lists only the top 160 features and their corresponding importance scores.
(1) The VEST4 25 score ranges from 0 to 1. A higher score suggests a higher probability of the mutation causing a functional change (threshold=0.5).
(3) LIST-S2 27 predicts the deleteriousness of human coding mutations, focusing on two features: local identity and shared taxa. The output scores range from 0 to 1. A higher score suggests that the query mutation is more likely to have a deleterious effect (threshold=0.5).
(4) gMVP 28 is a pathogenicity prediction score for missense variants using a graph attention neural network model. The gMVP score ranges from 0 to 1 (threshold=0.5). The larger the score, the more likely the variant is pathogenic.
(7) M-CAP 35 is a hybrid ensemble method with an output score ranging from 0 to 1. A larger score predicts a higher likelihood of the SNP having a deleterious effect (threshold=0.5).
(8) DEOGEN2 36 is a predictor for missense SNVs in human proteins. The output score indicates the probability of the variant being deleterious (0 is benign, 1 is deleterious) (threshold=0.5).
(9) InMeRF 37 (2020, https://www.med.nagoya-u.ac.jp/neurogenetics/InMeRF/) is a tool to predict the pathogenicity of non-synonymous SNVs (nsSNVs) using 150 independent models that are individually generated for all possible amino acid (AA) substitutions. InMeRF adopted Random Forest and the rank scores of 34 tools from dbNSFP to develop an accurate pathogenicity prediction model. A larger InMeRF score means the SNV is more likely to be pathogenic. Scores range from 0 to 1 (threshold=0.5).
(10) VARITY 38 applies Gradient Boosted Trees to predict pathogenicity scores for rare human missense variants. The score ranges from 0 to 1 (threshold=0.5). The larger the score, the more likely the variant is pathogenic.
(12) MVP 46 utilizes deep residual networks and correlated predictors trained on large datasets. The output scores range from 0 to 1. A higher score indicates a greater possibility of the variant being pathogenic (threshold=0.5).
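All of the predictors listed above output a score in [0, 1] and use 0.5 as the pathogenicity cutoff, so label assignment is uniform across tools. A trivial, hypothetical helper (names and scores are illustrative, not from any real predictor output):

```python
def call_pathogenic(score, threshold=0.5):
    """Map a [0, 1] predictor score to a label using the common 0.5 cutoff."""
    return "pathogenic" if score > threshold else "benign"

# Illustrative scores; a score exactly at the threshold is called benign here.
labels = [call_pathogenic(s) for s in (0.12, 0.5, 0.93)]
```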

Supplementary Text S13
The changes in variant numbers during the data processing procedure shown in Figure 4A of the manuscript.
As depicted in Figure 4A of the manuscript, in order to generate embeddings from ESM-1b and ProtT5-XL-U50, we first needed to obtain the encoded protein sequence and verify that the wildtype amino acid was correct. For the benchmark dataset and the blind test set, the following steps were performed:
(1) The benchmark dataset:
Step 1: The encoded protein sequence for each variation (a total of 73,078 variations) was retrieved using the ENSP identifier and the GET sequence endpoint (from https://rest.ensembl.org) through the execution of an auto-run Python script.
Step 2: Variations with a '-' mark in the "Sequence" column were filtered out, resulting in 68,396 variants.
Step 3: The length of each obtained protein sequence was compared with the corresponding protein length provided by Ensembl VEP, and variations with mismatching lengths were excluded.
Step 4: The wildtype amino acid at the mutant site in the protein sequence was compared with the corresponding REF amino acid provided by Ensembl VEP, and variations with mismatched amino acids were removed.
After performing steps 3 and 4, a total of 65,914 variants were obtained.
(2) The blind test set:
For the blind test data, the same processing procedure as for the benchmark dataset was followed, consisting of the following four steps:
Step 1: Using the ENSP identifier and the GET sequence endpoint (from https://rest.ensembl.org), we retrieved the encoded protein sequence for each variation in the blind test set (a total of 5,958 variants) through the execution of an auto-run Python script.
Step 2: Variations with a '-' mark in the "Sequence" column were filtered out, resulting in 5,297 variants.
Step 3: The length of each obtained protein sequence was compared with the corresponding protein length provided by Ensembl VEP, and variations with mismatching lengths were excluded.
Step 4: The wildtype amino acid at the mutant site in the protein sequence was compared with the corresponding REF amino acid provided by Ensembl VEP, and variations with mismatched amino acids were removed.
After performing steps 3 and 4, a total of 4,969 variants were ultimately obtained.
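The filtering in steps 2-4 can be sketched on toy records; the field names below are hypothetical stand-ins for the real Ensembl/VEP columns, and the sequences are invented for illustration.

```python
# Toy variant records: sequence from the Ensembl REST API, plus the protein
# length, 1-based mutant position, and REF amino acid reported by VEP.
variants = [
    {"seq": "MKTAYIA", "vep_len": 7, "pos": 2, "ref": "K"},  # consistent record
    {"seq": "-",       "vep_len": 7, "pos": 2, "ref": "K"},  # step 2: no sequence
    {"seq": "MKTAY",   "vep_len": 7, "pos": 2, "ref": "K"},  # step 3: length mismatch
    {"seq": "MKTAYIA", "vep_len": 7, "pos": 2, "ref": "R"},  # step 4: wildtype mismatch
]

kept = [
    v for v in variants
    if v["seq"] != "-"                        # step 2: drop '-' sequences
    and len(v["seq"]) == v["vep_len"]         # step 3: length must match VEP
    and v["seq"][v["pos"] - 1] == v["ref"]    # step 4: REF amino acid must match
]
```

Only the first record survives all three checks, mirroring how the variant counts shrink at each step above.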

GERP++_RS
The larger the GERP++ RS score, the more conserved the site. Scores range from -12.3 to 6.17.

BayesDel_addAF_rankscore
The rankscore is the ratio of the rank of the score over the total number of BayesDel_addAF scores in dbNSFP.
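A rankscore, as defined above, is simply the score's rank divided by the total number of scores. The toy values below are illustrative, and dbNSFP's exact tie-handling conventions are ignored in this sketch.

```python
def rankscore(score, all_scores):
    """Rank of `score` among `all_scores` divided by the total count."""
    sorted_scores = sorted(all_scores)
    rank = sorted_scores.index(score) + 1   # 1-based rank (ties take lowest)
    return rank / len(all_scores)

# 0.7 is the 3rd smallest of 4 scores, so its rankscore is 3/4.
rs = rankscore(0.7, [0.1, 0.3, 0.7, 0.9])
```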

gnomAD_OTH_AF
Frequency of existing variant in gnomAD exomes other combined populations.

Polyphen2_HVAR_rankscore
Polyphen2 HVAR scores were first ranked among all HVAR scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of scores in dbNSFP. If there are multiple scores, only the most damaging (largest) rankscore is presented. The rankscores range from 0.01493 to 0.97581.

Figure S3A
This panel highlights that EvoIndMM with the "Combined" feature set achieves the highest TP (6839, 51.88%) and TN (5831, 44.23%), indicating superior classification performance compared to EvoIndMM with the other feature sets. Additionally, the "Combined" feature set exhibits the lowest ER (0.0171) among all feature sets.

Supplementary Text S8
Performance of ConsMM on the test set of the benchmark dataset (not using protein ID as the criterion for training and test data splitting).
CatBoost 22 (Categorical Boosting) is a gradient boosting framework specifically designed to handle categorical features effectively in machine learning tasks; it employs highly optimized algorithms to handle large-scale datasets and accelerate the training process. CatBoost also incorporates techniques to prevent model overfitting and construct robust decision trees.
gnomAD_genomes_AFR_AF: Alternative allele frequency in the African/African American gnomAD genome samples v3.1. For mtDNA, this is the sum of AF_hom ("Allele frequency restricted to variants with a heteroplasmy level >= 0.95") and AF_het ("Allele frequency restricted to variants with a heteroplasmy level >= 0.10 and < 0.95").
(11) gnomAD_genomes_ASJ_AF: Alternative allele frequency in the Ashkenazi Jewish gnomAD genome samples v3.1. For mtDNA, this is the sum of AF_hom ("Allele frequency restricted to variants with a heteroplasmy level >= 0.95") and AF_het ("Allele frequency restricted to variants with a heteroplasmy level >= 0.10 and < 0.95").
ClinPred_rankscore: ClinPred scores were ranked among all ClinPred scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of ClinPred scores in dbNSFP.
(16) DANN_rankscore: DANN scores were ranked among all DANN scores in dbNSFP. The rankscore is the ratio of the rank of the score over the total number of DANN scores in dbNSFP.
(17) ExAC_NFE_AF: Frequency of existing variant in the ExAC Non-Finnish European population.
(18) ALSPAC_AF: Alternative allele frequency in called genotypes in the UK10K ALSPAC cohort.

Table S2.
Confusion matrix and three types of errors of CatBoost, XGBoost, and LightGBM on the test set of the benchmark dataset (not using protein ID as the criterion for training and test data splitting).

Table S6.
Detailed descriptions of the compared predictors.

Table S8.
Performance comparison of ConsMM, EvoIndMM, and existing predictors on the blind test set (without splitting the blind test data into "repeated protein" and "non-repeated protein" groups).

Table S10.
Detailed information on the twelve variants for the case study.