The capacity of CNVoyant to classify the clinical significance of CNVs was tested in 21,574 CNVs curated from DECIPHER after training on 52,176 CNVs published in ClinVar (Fig. 1). This approach is consistent with previously reported comparisons of pathogenicity, where a set of CNVs are examined in a general context rather than focusing on individual patients. This also aligns with the previously cited ACMG technical guidelines, that recommend uncoupling CNV pathogenicity classification from the implications for a specific patient (Riggs et al. 2020). Features were generated to capture information related to genomic position, variant composition, overlapping functional annotation, population frequency, conservation, and dosage sensitivity.
Training Dataset Curation
CNVoyant is trained on CNVs included in the January 2023 XML release of ClinVar (Landrum et al. 2018). This XML file was parsed, and extracted variants were limited to CNVs (variant type of “copy number gain” or “copy number loss”) that did not have duplicated variant positions. The Reference ClinVar Accession Number (RCV) entry was chosen to represent each CNV to avoid training on duplicates that can arise in cases of multiple submitters. 40,837 of the extracted CNVs were aligned to the GRCh37 reference genome, all of which required a genomic coordinate mapping via the UCSC liftOver command line tool (Hinrichs 2006) to be combined with the 12,641 CNVs that were aligned to GRCh38. Following liftOver, 1,126 variant entries were identified as duplicates of entries originally aligned to GRCh38, and they were omitted. 850 CNVs were omitted due to ambiguous clinical significance labeling or conflicting clinical significance annotation in entries with matching genomic coordinates. 20 variants were removed for having a size of less than 50 bp. The remaining ClinVar CNVs with at least one pathogenic or likely pathogenic designation were labeled as pathogenic for training purposes. Non-pathogenic variants with at least one VUS designation were labeled as VUS, and remaining variants containing only benign or likely benign classifications were labeled as benign. Altogether, 52,176 CNVs were included in model training (Fig. 2 (a), Table 1; DEL: 6,886 pathogenic, 7,191 VUS, 13,134 benign; DUP: 3,028 pathogenic, 10,792 VUS, 11,145 benign).
Table 1
Distribution of Variant Type and Clinical Significance in Training and Test Sets. 52,176 total CNVs were included in the ClinVar training set, and 21,574 total CNVs were included in the DECIPHER test set. The training set generally favored variants of benign significance, with pathogenic significance encompassing the fewest number of variants. This trend was reversed in the test set, which heavily favored VUS and pathogenic CNVs over those with benign significance. Clinical significance class distribution was generally consistent between duplication and deletion events except for more pathogenic variants being present in deletions.
| | Benign | VUS | Pathogenic | Total |
Deletions | Train | 13,134 (48.3%) | 7,191 (26.4%) | 6,886 (25.3%) | 27,211 |
Test | 1,097 (9.9%) | 3,785 (34.2%) | 6,183 (55.9%) | 11,065 |
Duplications | Train | 11,145 (44.6%) | 10,792 (43.2%) | 3,028 (12.2%) | 24,965 |
Test | 1,554 (14.8%) | 5,595 (53.2%) | 3,360 (32.0%) | 10,509 |
Total | Train | 24,279 (46.5%) | 17,983 (34.5%) | 9,914 (19.0%) | 52,176 |
Test | 2,651 (12.3%) | 9,380 (43.5%) | 9,543 (44.2%) | 21,574 |
Testing Dataset Curation
A test set of CNVs was curated from the web interface of the v11.18 release of the DECIPHER database (Firth et al. 2009). All 29,453 DECIPHER CNVs are aligned to the GRCh38 reference genome and have been assigned clinical significance following manual review. 712 variants had variant types other than CNV deletions or duplications and as such were removed from the test set. The remaining 28,132 variants were lifted to GRCh37 to accommodate the expected input for comparator algorithms; 800 failed to map to an autosomal or sex chromosome contig and were omitted. 1,003 entries were removed for having shared genomic coordinates and conflicting clinical significance annotation, while 5,138 entries were removed for having duplicated genomic coordinates and clinical significance annotation. To guard against data leakage between train and test sets, 118 variants whose coordinates matched a ClinVar submission were omitted from the test set. 38 variants were removed for having a size of less than 50 bp. 21,574 CNVs were included in the final test set, which was used to benchmark CNVoyant against comparator algorithms (Fig. 2 (b); DEL: 6,183 pathogenic, 3,785 VUS, 1,097 benign; DUP: 3,360 pathogenic, 5,595 VUS, 1,554 benign).
CNVoyant Feature Selection
CNVoyant implements 17 features to classify the clinical significance of a candidate CNV. Several feature distributions are right-skewed in the training data (counts of overlapping genes, exons, promoter regions, diseases, pathogenic ClinVar SNVs/indels, and bp length). These features are log-transformed to reflect more normal distributions. All features are normalized via the sklearn preprocessing.MinMaxScaler Python class prior to training or prediction (Pedregosa et al. 2011).
Genomic position (2 features)
-
Centromere distance: The number of bp separating the centromere from the candidate CNV. Distance from the centromere to the CNV is determined by selecting the CNV boundary closest to the centromere: the end coordinate on the P arm or the start coordinate on the Q arm.
-
Telomere distance: The number of bp separating the telomere from the candidate CNV. Distance from the telomere to the CNV is calculated by selecting the CNV boundary furthest from the centromere: the start coordinate on the P arm or the end coordinate on the Q arm.
CNV composition (2 features)
Functional annotation (5 features)
-
Count of genes: The gene count is defined as the total number of genes that overlap the candidate CNV. Genes overlap the candidate CNV if at least one bp is shared between the candidate CNV’s location and the gene’s genomic coordinates drawn from the RefSeq database (O’Leary et al. 2016). All overlap calculations were made via the Bedtools intersect function (Quinlan and Hall 2010).
-
Count of diseases: The disease count is defined as the total number of diseases associated with the gene (s) that overlap the candidate CNV. This allows genes with more disease associations to be distinguished from genes with only a single or no disease association. Disease-gene associations are referenced from the curated annotations provided by Online Mendelian Inheritance in Man (OMIM) (Amberger et al. 2015).
-
Count of exons: Overlapping exon count is calculated by summing exons, across all genes, that overlap the candidate CNV. The exonic boundaries were padded by 10 base pairs to account for canonical splice regions.
-
Count of promoter regions: CNVs may overlap the promoter region of a gene rather than the gene itself. To address this, we include a count of promoter regions, defined as the interval between the transcription start site (TSS) and 1,000 base pairs upstream of the TSS.
-
Count of ClinVar pathogenic SNVs and indels: A sum of overlapping pathogenic SNV/indels is included to capture potentially relevant pathogenicity interpretation. To obtain a set of overlapping pathogenic SNV/indels, ClinVar (Landrum et al. 2014) is intersected with the candidate CNV and limited to variants interpreted as having “Pathogenic” or “Likely Pathogenic” significance.
Population frequency (1 feature)
-
GnomAD SV popmax: To estimate population frequency, we identify the highest frequency across all gnomAD SV (V4) (Gudmundsson et al. 2022) entries that match the CNV’s variant type (deletion or duplication) and exhibit at least 50% reciprocal overlap in genomic coordinates. This means that the candidate CNV and the gnomAD SV entry it overlaps with must share at least half of their span, ensuring a significant genomic coverage overlap between the observed CNV and those described in gnomAD SV.
Conservation (2 features)
-
PhyloP: To estimate the conservation of a candidate CNV, PhyloP scores are referenced. PhyloP scores are available at single nucleotide resolution. Single nucleotide scores are highly variable and are thus correlated with the size of the candidate CNV. To mitigate this correlation, we employ a centered moving average that considers all scores within a specified reading frame (Supplemental Fig. 1). This effectively smooths the otherwise volatile conservation score curve. The maximum value of this smoothed curve within the genomic coordinates of the candidate CNV is returned as the PhyloP feature. A maximum was chosen considering that higher PhyloP scores indicate higher estimated conservation.
-
phastCons: The same procedure used to calculate the PhyloP feature was used to estimate conservation according to the phastCons score. Similar to PhyloP, the maximum was chosen as the aggregate function considering that higher phastCons scores indicate higher estimated conservation.
Dosage sensitivity (5 features)
-
HI Score: Dosage sensitivity was estimated by overlapping the candidate CNV with a curated set of dosage sensitive regions described by ClinGen (Rehm et al. 2015). Manually curated HI scores are available for all curated regions. HI scores were one-hot encoded to handle each unique value and split into binary features. In the case of multiple overlapping regions, HI score features were summed.
-
TS Score: As with HI Score, one-hot encoded binary features were generated from the curated TS scores of annotated dosage sensitive regions. The sum is also taken across all binary features when multiple dosage-sensitive regions overlap with the candidate CNV.
-
HI Index: The minimum HI Index score (Huang et al. 2010) observed in overlapping dosage-sensitive regions is obtained to represent the HI Index value for the candidate CNV. HI Index estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
-
pLI: The maximum pLI score (Exome Aggregation Consortium et al. 2016) observed in overlapping dosage-sensitive regions is obtained from gnomAD to represent the pLI value for the candidate CNV. pLI estimates the probability of loss intolerance, where higher values predict haploinsufficiency.
-
LOEUF: The minimum LOEUF score (Karczewski et al. 2020) observed in overlapping dosage-sensitive regions is obtained to represent the LOEUF value for the candidate CNV. LOEUF estimates the probability of loss intolerance, where lower values predict haploinsufficiency.
Training Procedure and Model Selection
For model training, features were generated for CNVs included in the ClinVar training set before partitioning into deletion and duplication sets. Separate multi-class models were trained for duplication and deletions, each predicting whether a candidate CNV has benign, uncertain, or pathogenic clinical significance. 29 common ML architectures were tested via 5-fold cross-validation for each CNV type before selecting the top performers according to multi-class F1 score. Following architecture selection, 5-fold cross-validation was again employed to hyperparameter tune and find the most accurate model. 10,000 sets of hyperparameter values were tested for each CNV type via the sklearn.model_selection.RandomizedSearchCV class. The top-performing models were calibrated to class distributions in the training set via the isotonic method implemented in the sklearn.calibration.CalibratedClassifierCV class.
Comparator Algorithm Selection
CNVoyant predictions were compared to five published ML-based CNV pathogenicity classifiers, X-CNV (Zhang et al. 2021), TADA (Hertzberg et al. 2022), dbCNV (Lv et al. 2023), StrVCTVRE (Sharo et al. 2022), and ISV (Gažiová et al. 2022), as well as the algorithmic implementation of the ACMG technical standards for CNV interpretation, ClassifyCNV (Gurbich and Ilinsky 2020). The test set was passed to CNVoyant and all comparator algorithms to test for generalizability and generate accuracy metrics for benchmarking. Given that X-CNV, TADA, dbCNV, and ClassifyCNV take GRCh37-aligned variants as input, UCSC liftOver was again called to lift DECIPHER variants from GRCh38 to GRCh37. X-CNV pathogenic probabilities yielded only three unique values across the entirety of the test set, which was unexpected given the generation of a relatively heterogeneous feature set. Rather than attempting to amend the source code to produce more continuous output, X-CNV was omitted from benchmarking.
Measuring Performance
To generate granular performance data in the test cohort, accuracy metrics are reported for each of the three CNV prediction classes: benign, VUS, and pathogenic. The classification probability for each class was utilized to sort the list of CNVs and generate precision-recall (PR) and receiver operating characteristic (ROC) curves. The area under these curves (PR AUC, ROC AUC) is referenced to measure model accuracy, in addition to the average F1 score and overall accuracy for multi-class predictions. F1 and overall accuracy are only reported for CNVoyant, dbCNV, and ClassifyCNV, as these algorithms are the only three that provide multi-class output. dbCNV provides likely pathogenic and likely benign classification designations in addition to pathogenic, VUS, and benign designations. Likely pathogenic predictions were mapped to pathogenic classification and likely benign predictions were mapped to benign classification to ensure a fair comparison to CNVoyant and ClassifyCNV.
TADA, ISV, and StrVCTVRE all output a single score to estimate CNV pathogenic probability. The complement of the pathogenic probability score (1-Pr (pathogenic)) was calculated to estimate benign significance scores for these comparator algorithms. The ClassifyCNV output is a score rather than a probability, but the complement was still chosen to represent benign significance, as higher scores represent a higher pathogenicity. CNVoyant is the only algorithm that provides benign and VUS probabilities; these values were used in plotting corresponding benign and VUS classification curves. dbCNV does not provide probabilities or a continuous confidence score for classification, so there is no value to reference in plotting ROC and PR curves. As such, dbCNV was not included in the ROC and PR curve comparisons.
Deciphering Feature Influence on CNV Classification with SHAP Analysis
SHAP (SHapley Additive exPlanations) values offer a qualitative analysis tool for understanding how each feature influences the clinical significance prediction for distinct CNV classes. For CNVoyant, we generated SHAP beeswarm plots across all classes to visualize the effect of training features on model prediction (Lundberg and Lee 2017). These plots rank features by their importance and use color coding to depict the direction of their influence on the model's output. Each point on a plot represents a feature's SHAP value for an individual observation, quantifying its contribution to moving the model's prediction from the base value—the dataset's average prediction—toward the actual prediction.