Improved pathogenicity prediction for rare human missense variants

The success of personalized genomic medicine depends on our ability to assess the pathogenicity of rare human variants, including the important class of missense variation. There are many challenges in training accurate computational systems, e.g., in finding the balance between quantity, quality, and bias in the variant sets used as training examples, and in avoiding predictive features that can accentuate the effects of bias. Here, we describe VARITY, which judiciously exploits a larger reservoir of training examples with uncertain accuracy and representativity. To limit circularity and bias, VARITY excludes features informed by variant annotation and protein identity. To provide a rationale for each prediction, we quantified the contribution of features and feature combinations to the pathogenicity inference of each variant. VARITY outperformed all previous computational methods evaluated, identifying at least 10% more pathogenic variants at thresholds achieving high (90% precision) stringency.

10-fold nested cross validation. To measure VARITY performance, we used 10-fold nested cross validation, with the first fold of the outer loop ("Fold1") shown here for illustration. In each outer-loop fold, training sets (weighted according to hyperparameters learned in the inner loop) were used to train VARITY models, which were then evaluated on held-out test data. Three metrics (AUROC, AUBPRC and R90BP) were used for performance evaluation (see Methods).

Figure S3: Moving window analysis for VARITY_R. Each panel illustrates assessment of a candidate informative property as a basis for weighting a single add-on set or combination of add-on sets. For Negative variants from HumsaVar 2 and gnomAD 3 (Panels c and e), the common (MAF ≥ 0.5%) and rare (MAF < 0.5%) add-on sets were combined for moving window analysis. Variants in the add-on set(s) were ordered by the candidate informative property, and 100 moving windows, each capturing the same number of examples, were examined. To evaluate the predictive utility of each window, model performance on the core set was estimated using 10-fold cross validation in which the training examples in each fold were supplemented by the examples in that moving window. Solid and dashed black lines indicate the mean ± standard error of VARITY_R performance measures over all moving windows. A Z-score was calculated to estimate the direction and significance of the observed trend (see Methods).

Figure S5: Weighted training sets for VARITY_R. Each plot illustrates the optimized weight, with a color that varies between 0 (yellow; lowest weight) and 1 (green; highest weight), for a single core/add-on set or multiple core or add-on sets. For compactness, weighting for the common and rare add-on sets of Negative training examples is shown together for HumsaVar 2 and gnomAD 3 (Panels e and g, respectively), while other panels correspond to single core or add-on sets. For each plot, the x-axis indicates the rank of each variant ordered by the associated informative property (y-axis). Discontinuities in weight along the x-axis are possible where variant weights were based on multiple quality-informative properties (in which case the overall variant weight is the product of the individual weights; see Methods and Table S3).

Figure S6: Moving window analysis for VARITY_ER. Each panel illustrates assessment of a candidate informative property as a basis for weighting a single add-on set or combination of add-on sets. For Negative variants from HumsaVar 2 and gnomAD 3 (Panels c and e), the common (MAF ≥ 0.5%) and rare (MAF < 0.5%) add-on sets were combined for moving window analysis. Variants in the add-on set(s) were ordered by the candidate informative property, and 100 moving windows, each capturing the same number of examples, were examined. To evaluate the predictive utility of each window, model performance on the core set was estimated using 10-fold cross validation in which the training examples in each fold were supplemented by the examples in that moving window. Solid and dashed black lines indicate the mean ± standard error of VARITY_ER performance measures over all moving windows. A Z-score was calculated to estimate the direction and significance of the observed trend (see Methods).
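To make the moving window procedure concrete, the following minimal Python sketch supplements each core-set cross-validation fold with one window of add-on examples and records the resulting performance. It is not the authors' implementation: the window size, the generic scikit-learn gradient-boosted classifier standing in for the VARITY model, and the use of AUROC as the performance measure are illustrative assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def moving_window_auroc(X_core, y_core, X_addon, y_addon, prop,
                        n_windows=100, window_frac=0.1):
    """For each moving window over the add-on set (ordered by the candidate
    informative property `prop`), return mean core-set AUROC when core
    training folds are supplemented by that window's examples."""
    order = np.argsort(prop)                        # rank add-on variants by the property
    win = max(1, int(window_frac * len(order)))     # hypothetical window size
    starts = np.linspace(0, len(order) - win, n_windows).astype(int)
    scores = []
    for s in starts:
        idx = order[s:s + win]
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        fold_auc = []
        for tr, te in cv.split(X_core, y_core):
            X_tr = np.vstack([X_core[tr], X_addon[idx]])
            y_tr = np.concatenate([y_core[tr], y_addon[idx]])
            clf = GradientBoostingClassifier().fit(X_tr, y_tr)
            fold_auc.append(roc_auc_score(y_core[te],
                                          clf.predict_proba(X_core[te])[:, 1]))
        scores.append(np.mean(fold_auc))
    return np.array(scores)   # the trend of score vs. window rank is then summarized by a Z-score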

Figure S8: Weighted training sets for VARITY_ER. Each plot illustrates the optimized weight, with a color that varies between 0 (yellow; lowest weight) and 1 (green; highest weight), for a single core/add-on set or multiple core or add-on sets. For compactness, weighting for the common and rare add-on sets of Negative training examples is shown together for HumsaVar 2 and gnomAD 3 (Panels e and g, respectively), while other panels correspond to single core or add-on sets. For each plot, the x-axis indicates the rank of each variant ordered by the associated informative property (y-axis). Discontinuities in weight along the x-axis are possible where variant weights were based on multiple quality-informative properties (in which case the overall variant weight is the product of the individual weights; see Methods and Table S3).
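As a concrete illustration of how weights from multiple quality-informative properties are combined, the short sketch below multiplies per-property weights into one overall training weight per variant, as stated in the captions above; the numeric values in the example are hypothetical.

import numpy as np

def combine_variant_weights(per_property_weights):
    """Combine per-variant weights derived from several quality-informative
    properties into a single training weight by taking their product."""
    W = np.asarray(per_property_weights, dtype=float)  # shape: (n_properties, n_variants)
    return W.prod(axis=0)

# Hypothetical example: a variant down-weighted to 0.8 by one property and 0.5 by
# another receives an overall weight of 0.4; a fully trusted variant stays at 1.0.
print(combine_variant_weights([[0.8, 1.0],
                               [0.5, 1.0]]))           # -> [0.4 1. ]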

Figure S9: Individual feature contributions to VARITY_R model performance. The contribution of each feature to VARITY_R model performance was combined by weighted averaging across all training examples, using the weight of each training example that was optimized during hyperparameter tuning. The first column (left) indicates the total contribution to model performance of each feature, which consists of a feature-independent contribution (the matrix cell on the diagonal of the corresponding row) and pair-wise differential feature contributions (the non-diagonal matrix cells on the corresponding row). Red and blue indicate positive and negative contributions to model performance, respectively. A blue cell for a pair-wise differential contribution indicates some redundancy between the two features (e.g., between different conservation scores). The description of each feature can be found in Table S1.
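The weighted aggregation described here can be sketched as follows. The sketch assumes that per-example contribution matrices (feature-independent terms on the diagonal, pair-wise differential terms off the diagonal) have already been computed as described in Methods; only the weighted averaging over training examples and the row totals shown in the figure's first column are reproduced.

import numpy as np

def aggregate_feature_contributions(contrib, example_weights):
    """contrib: array of shape (n_examples, n_features, n_features) holding, for each
    training example, the feature-independent contribution on the diagonal and the
    pair-wise differential contributions off the diagonal. Returns the weighted-average
    matrix and the per-feature row totals (the figure's first column)."""
    w = np.asarray(example_weights, dtype=float)
    w = w / w.sum()                                 # normalise the optimized example weights
    mean_matrix = np.tensordot(w, contrib, axes=1)  # weighted average over the example axis
    total_per_feature = mean_matrix.sum(axis=1)     # total contribution of each feature
    return mean_matrix, total_per_feature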
Figure S10: Feature group contributions to VARITY_ER model performance. The contribution of each feature group to VARITY_ER model performance was averaged (weighted) across all training examples, using the weight of each training example as optimized during hyperparameter tuning. The first column (left) indicates the total contribution to model performance of each feature group. For each feature group, the total contribution can be decomposed into the individual contribution of that feature group (the matrix cell with a star symbol on the corresponding row) and the differential contribution of that feature group when it is combined with each other feature group (the matrix cells without a star symbol on the corresponding row). Red and blue indicate positive and negative contributions to model performance, respectively. A blue cell for a pair-wise differential feature contribution indicates some redundancy between the two feature groups (e.g., between Conservation Scores and IN/OUT Pfam domain).
Figure S11: Individual feature contributions to VARITY_ER model performance. The contribution of each feature to VARITY_ER model performance was combined by weighted averaging across all training examples, using the weight of each training example that was optimized during hyperparameter tuning. The first column (left) indicates the total contribution to model performance of each feature, which consists of a feature-independent contribution (the matrix cell on the diagonal of the corresponding row) and pair-wise differential feature contributions (the non-diagonal matrix cells on the corresponding row). Red and blue indicate positive and negative contributions to model performance, respectively. A blue cell for a pair-wise differential contribution indicates some redundancy between the two features (e.g., between different conservation scores). The description of each feature can be found in Table S1.

Recall was averaged over 2,000 bootstrapped test sets with the standard error indicated by the surrounding grey region. As overall performance measures, AUROC values and their standard errors are shown. Predictors designed specifically for nucleotide variants are indicated with a '(•)'. Statistical significance relative to VARITY_ER was assessed using a one-sided Z test applied to 2,000 bootstrapped test sets (P-values are shown in brackets, with '*' indicating where P < 0.05). Other test statistics such as 95% confidence intervals and effect sizes can be found in Table S6.

Figure S13: Comparing ROC performance of VARITY_R with other predictors for a high-quality 'core' variant set (MAF < 0.5%). We compare ROC performance for VARITY_R (using nested cross-validation) with 23 other variant pathogenicity predictors. For compactness, a predictor with AUROC smaller than 0.6 (see Table S11) is not shown. Predictors designed specifically for nucleotide variants are indicated with a '(•)'. The test set was 9,719 variants (5,912 positive and 3,807 negative examples) from the core training set, after removing variants annotated by HGMD 6 and retaining only variants that had been scored by all methods. At any given false positive rate, the true positive rate is averaged over all 10 outer-loop folds and the standard error is indicated by the surrounding grey region. As an overall performance measure, AUROC values and their standard errors are shown. Statistical significance of performance relative to VARITY_R used a one-sided paired t-test with 9 degrees of freedom (P-values are shown in brackets, with '*' indicating P < 0.05). Other test statistics such as 95% confidence intervals and effect sizes are in Table S11.
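A minimal sketch of the bootstrap comparison described above follows. It assumes that per-variant scores are available for both predictors on the same test set and that the bootstrap standard deviation of the AUROC difference serves as the standard error in the one-sided Z test; the exact construction used in the paper is given in Methods.

import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ztest(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """One-sided Z test of whether predictor A outperforms predictor B in AUROC,
    estimated over bootstrapped test sets."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                     # resample the test set with replacement
        if len(np.unique(y_true[idx])) < 2:             # skip degenerate one-class resamples
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx]) -
                     roc_auc_score(y_true[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    z = diffs.mean() / diffs.std(ddof=1)                # bootstrap SD used as the standard error
    return z, norm.sf(z)                                # one-sided P-value for A > B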
Figure S14: Comparing balanced precision-recall performance of VARITY_ER with other predictors in predicting a high-quality 'core' variant set (MAF < 10⁻⁶). We compare balanced precision-recall performance for VARITY_ER (using nested cross-validation) with 23 other variant pathogenicity predictors. For compactness, one predictor with AUBPRC < 0.6 is not shown (see Table S13). Predictors designed specifically for nucleotide variants are indicated with a '(•)'. The test set was 5,160 variants (4,675 positive and 485 negative examples) from the core training set, after removing variants annotated by HGMD 6 and retaining only variants that had been scored by all methods. Recall was averaged over all 10 outer-loop folds and the standard error is indicated by the surrounding grey region. As overall performance measures, AUBPRC and R90BP (the black dotted line) and their standard errors are shown. Statistical significance of performance relative to VARITY_ER used a one-sided paired t-test with 9 degrees of freedom (P-values are shown in brackets, with '*' indicating P < 0.05). Other test statistics such as 95% confidence intervals and effect sizes are in Table S13.
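For reference, a short sketch of R90BP (recall at 90% balanced precision) follows. It uses the common definition of balanced precision as TPR / (TPR + FPR), i.e. precision recomputed under an assumed 1:1 class balance; the paper's Methods give the exact formulation used for AUBPRC and R90BP.

import numpy as np
from sklearn.metrics import roc_curve

def r90bp(y_true, scores):
    """Recall at 90% balanced precision, where balanced precision is
    TPR / (TPR + FPR), i.e. precision assuming equal class frequencies."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    with np.errstate(divide="ignore", invalid="ignore"):
        bal_prec = np.where(tpr + fpr > 0, tpr / (tpr + fpr), 1.0)
    ok = bal_prec >= 0.9
    return tpr[ok].max() if ok.any() else 0.0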

Figure S15: Comparing ROC performance of VARITY_ER with other predictors in predicting a high-quality 'core' variant set (MAF < 10⁻⁶). We compare ROC performance for VARITY_ER (using nested cross-validation) with 23 other variant pathogenicity predictors. For compactness, one predictor with AUROC < 0.6 is not shown (see Table S14). Predictors designed specifically for nucleotide variants are indicated with a '(•)'. The test set was 5,160 variants (4,675 positive and 485 negative examples) from the core training set, after removing variants annotated by HGMD 6 and retaining only variants that had been scored by all methods. At any given false positive rate, the true positive rate is averaged over all 10 outer-loop folds and the standard error is indicated by the surrounding grey region. As an overall performance measure, AUROC values and their standard errors are shown. Statistical significance of performance relative to VARITY_ER used a one-sided paired t-test with 9 degrees of freedom (P-values are shown in brackets, with '*' indicating P < 0.05). Other test statistics such as 95% confidence intervals and effect sizes are in Table S14.

Figure S16: The relationship between VARITY_R score and probability of pathogenicity. VARITY_R scores on the 38,047 labelled core set variants (MAF < 0.5%) were used for the plot. All core set variants were sorted into 20 bins, each representing a different VARITY score range (x-axis). For each bin, the probability of pathogenicity (y-axis; the fraction of variants annotated as either likely pathogenic or pathogenic in ClinVar) was calculated as the number of variants labelled as 'putatively pathogenic' divided by the total number of variants in the bin.
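The per-bin calculation behind Figure S16 can be sketched as follows. Equal-width score ranges are assumed here, since the caption specifies only that 20 score ranges were used; the labels in the example are whatever pathogenic/benign annotations are available for the plotted variants.

import numpy as np

def pathogenicity_by_score_bin(scores, is_pathogenic, n_bins=20):
    """Bin predictor scores into n_bins equal-width score ranges (an assumption) and
    compute the fraction of pathogenic variants within each bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_pathogenic, dtype=bool)
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bins = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    frac = np.array([labels[bins == b].mean() if np.any(bins == b) else np.nan
                     for b in range(n_bins)])
    return edges, frac          # probability of pathogenicity per score range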