Interpretation of Maturity-Onset Diabetes of the Young Genetic Variants Based on American College of Medical Genetics and Genomics Criteria: Machine-Learning Model Development

Background: Maturity-onset diabetes of the young (MODY) is a group of dominantly inherited forms of monogenic diabetes, with HNF4A-MODY, GCK-MODY, and HNF1A-MODY as the three most common forms based on the causal genes. Molecular diagnosis of MODY is important for precise treatment. Although a DNA variant causing MODY can be assessed based on the criteria of the American College of Medical Genetics and Genomics (ACMG) guidelines, gene-specific assessment of disease-causing mutations is important to differentiate among MODY subtypes. As the ACMG criteria were not originally designed for machine-learning algorithms, they are not true independent variables.
Objective: The aim of this study was to develop machine-learning models for interpretation of DNA variants and MODY diagnosis using the ACMG criteria.
Methods: We applied machine-learning models to the interpretation of DNA variants in MODY genes, with variants characterized by the ACMG criteria and drawn from the Human Gene Mutation Database (HGMD) and the ClinVar database.
Results: With a machine-learning procedure, we found that the weight matrix of the ACMG criteria was significantly different among the three MODY genes HNF1A, HNF4A, and GCK. The models showed high predictive abilities with accuracy over 95%.
Conclusions: Our results highlight the need for applying different weights of the ACMG criteria in relation to different MODY genes for accurate functional classification. As proof of principle, we applied the ACMG criteria as feature vectors in a machine-learning model and obtained a precision-based result.
(JMIR Biomed Eng 2020;5(1):e20506) doi: 10.2196/20506


Introduction
Monogenic diabetes results from DNA mutations in a single gene and accounts for about 1%-4% of all cases of diabetes in the United States [1]. The most common form of monogenic diabetes is maturity-onset diabetes of the young (MODY), an autosomal dominant disease that most commonly occurs in adolescence or early adulthood [2]. Genetic sequencing is needed to identify the causal mutations and to diagnose the different subtypes of MODY [3]. A DNA variant causing MODY can be assessed using the criteria established by the American College of Medical Genetics and Genomics (ACMG), as published in their guidelines [4]. Although the ACMG guidelines can be universally applied to all human DNA variants, our previous study suggested that a gene-specific assessment is important for identifying disease-causing mutations in different MODY genes [5]. In addition, contradictory evidence is commonly seen in the functional classification of genetic variants when using the ACMG guidelines [6]. In such cases, the ACMG guidelines may classify a variant as being of uncertain significance, even though some variants with contradictory evidence may turn out to have a reliable, definite classification.
Machine learning has been advocated as an important tool for both clinical and research purposes in human diseases [7,8]. In this study, we aimed to develop machine-learning models for interpretation of DNA variants using the ACMG criteria, with a focus on DNA variants of three MODY genes (HNF1A, HNF4A, and GCK) underlying the three most common types of MODYs [9].

Methods
Data Collection for Machine-Learning Procedures
Known DNA variants of the three MODY genes HNF1A, HNF4A, and GCK were acquired from dbSNP [10], the ClinVar database [11], and the Human Gene Mutation Database (HGMD) 2019 professional version [12]. Among the variants reported in these genes, approximately half are classified as pathogenic/likely pathogenic (P/LP) according to the annotation in ClinVar or HGMD. According to the HGMD, the three genes were curated by Professor Andrew Hattersley, a leading expert in the genetics of MODY [13]. Variants were classified as benign/likely benign (B/LB) according to the annotation in ClinVar or dbSNP, and this classification varies between the databases. Overall, for the three genes, there are 899 unique variants reported in HNF1A, including 569 P/LP sites and 330 B/LB sites; 1037 unique variants in HNF4A, including 182 P/LP sites and 855 B/LB sites; and 1664 unique variants in GCK, including 1065 P/LP sites and 599 B/LB sites. However, several of these variants are annotated differently across the databases.

Feature Vector Generation
The feature vectors used for machine-learning modeling were the criteria based on the ACMG guidelines [14]. The criteria terms were generated with InterVar [15], a computational tool that takes a preannotated file or a variant call format (VCF) file as input and generates an automated interpretation based on the ACMG criteria. It should be noted that not all 33 ACMG criteria can be computationally scored. For example, the PS3 criterion requires well-established in vitro or in vivo functional studies supportive of a damaging effect on the gene or gene product, which cannot be derived from sequence annotation alone. As a result, the following 15 ACMG criteria were used, which was also the length of the feature vectors for the three MODY genes: PVS1, PS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BA1, BS1, BP4, BP6, and BP7.
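As an illustration of the 15-term feature vector described above, the following minimal sketch encodes the criteria triggered for one variant as a 0/1 vector. The function name `encode_criteria` and the input format (a plain list of criterion names) are assumptions made for this example; they are not InterVar's actual output format, which would need to be parsed into this form first.

```python
# The 15 computationally scorable ACMG criteria listed in the text, in order.
ACMG_CRITERIA = [
    "PVS1", "PS1", "PS4", "PM1", "PM2", "PM4", "PM5",
    "PP2", "PP3", "PP5", "BA1", "BS1", "BP4", "BP6", "BP7",
]


def encode_criteria(hits):
    """Return a 15-element 0/1 list marking which criteria were triggered.

    `hits` is an iterable of criterion names (hypothetical input format).
    """
    hit_set = set(hits)
    unknown = hit_set - set(ACMG_CRITERIA)
    if unknown:
        raise ValueError(f"criteria outside the 15-term feature set: {sorted(unknown)}")
    return [1 if name in hit_set else 0 for name in ACMG_CRITERIA]


# Hypothetical variant that triggers PM1, PM2, and PP3.
example_vector = encode_criteria(["PM1", "PM2", "PP3"])
print(example_vector)
```

Stacking one such vector per variant gives a matrix with 15 columns per gene, which is the shape a regression model would expect.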
Using machine-learning regression procedures, we normalized the weights for the evidence of different categories in accordance with the ACMG guidelines, assuming that the weight coefficient of PVS1 is 1, that of PS is 1/2, that of PM is 1/6, and that of PP is 1/12. We additionally assumed that the weight coefficients of BA1, BS, and BP are -1, -1/2, and -1/4, respectively. As the ACMG criteria were not originally designed for machine learning, these criteria are not true independent variables. Multicollinearity among feature vectors is commonly seen within each gene, as is the case for the PM1 and PP2 criteria. By definition, a PM1 hit means that the variant is located in a mutational hotspot or in a critical and well-established functional domain without benign variation, and a PP2 hit means that there is a missense variant in a gene that has a low rate of benign missense variation and in which missense variants are a common mechanism of disease. In many situations, PM1 and PP2 are triggered by the same variants, which increases the risk of inappropriate weighting of the two criteria because of multicollinearity. To detect collinearity among feature vectors, we calculated the variance inflation factor (VIF) and the pairwise correlation coefficient for the ACMG criteria. Feature vectors with a VIF greater than 10 or a correlation coefficient larger than 0.8 were removed before the learning procedures.
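The collinearity screen described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the study's code: the VIF for a feature is computed as 1/(1 - R^2) from regressing it on the other features, the function names are hypothetical, the toy data are made up, and dropping the later member of a highly correlated pair is an arbitrary choice, since the text does not say which member was removed.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the
    remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        design = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        if ss_tot == 0.0:
            vifs[j] = np.nan      # constant column: VIF is undefined
        elif ss_res <= 1e-12 * ss_tot:
            vifs[j] = np.inf      # column is an exact linear combination
        else:
            vifs[j] = ss_tot / ss_res   # equals 1 / (1 - R^2)
    return vifs


def flag_collinear(X, names, vif_cutoff=10.0, corr_cutoff=0.8):
    """Names of features exceeding either cutoff (assumed tie-breaking:
    the later column of a correlated pair is flagged)."""
    X = np.asarray(X, dtype=float)
    flagged = {n for n, v in zip(names, variance_inflation_factors(X)) if v > vif_cutoff}
    varying = [j for j in range(X.shape[1]) if X[:, j].std() > 0]
    if len(varying) >= 2:
        corr = np.corrcoef(X[:, varying], rowvar=False)
        for a in range(len(varying)):
            for b in range(a + 1, len(varying)):
                if abs(corr[a, b]) > corr_cutoff:
                    flagged.add(names[varying[b]])
    return flagged


# Toy data (not study data): PP2 nearly duplicates PM1, PP3 is unrelated.
pm1 = np.array([0, 1] * 100)
pp2 = pm1.copy()
pp2[:4] = 1 - pp2[:4]
pp3 = np.array([0, 0, 1, 1] * 50)
X_toy = np.column_stack([pm1, pp2, pp3])
names = ["PM1", "PP2", "PP3"]
vifs = variance_inflation_factors(X_toy)
flagged = flag_collinear(X_toy, names)
print(dict(zip(names, np.round(vifs, 2))), flagged)
```

Constant columns (a criterion never triggered in a gene) are reported as undefined rather than dropped silently, so the caller can decide how to handle them.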

Learning Procedures and Predictive Modeling
The machine-learning procedure used in this study was a standard logistic regression implemented with the Scikit-learn package in Python [16]. To estimate the weight matrix of the ACMG criteria, all variants, including P/LP and B/LB variants, were taken into account. For predictive modeling, we split the data using a 2-fold random shuffle process. In other words, the P/LP and B/LB variants were split randomly into two equally sized sets, with one set serving as training data and the other set serving as testing data, to determine the predictive capabilities of the model. This process was repeated 20 times to obtain the mean and standard deviation of the accuracy measures, including sensitivity and specificity.
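The evaluation loop described above might look roughly like the sketch below, which uses scikit-learn's `LogisticRegression` with 20 random 50/50 splits. Several details are assumptions rather than facts from the text: the split is stratified by class (`StratifiedShuffleSplit`), default regularization is used, the function name `repeated_half_split_eval` is hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit


def repeated_half_split_eval(X, y, n_repeats=20, seed=0):
    """Train on a random half and test on the other half, n_repeats times.

    Returns the mean and standard deviation of sensitivity and specificity.
    """
    splitter = StratifiedShuffleSplit(n_splits=n_repeats, test_size=0.5, random_state=seed)
    sensitivities, specificities = [], []
    for train_idx, test_idx in splitter.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        predicted = model.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], predicted, labels=[0, 1]).ravel()
        sensitivities.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))
    return {
        "n_repeats": len(sensitivities),
        "sensitivity_mean": float(np.mean(sensitivities)),
        "sensitivity_sd": float(np.std(sensitivities)),
        "specificity_mean": float(np.mean(specificities)),
        "specificity_sd": float(np.std(specificities)),
    }


# Synthetic data (not study data): label 1 = P/LP, label 0 = B/LB.
rng = np.random.default_rng(0)
y_toy = np.array([0, 1] * 100)
flip = rng.random(200) < 0.1                      # 10% label noise on one feature
informative = np.where(flip, 1 - y_toy, y_toy)
X_toy = np.column_stack([informative, rng.integers(0, 2, size=(200, 4))])
summary = repeated_half_split_eval(X_toy, y_toy)
print(summary)
```

In this setup, fitting one model per gene on all of its variants and reading `model.coef_` would give per-criterion weights that can be compared across genes.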

Results
Variation in the Weight Matrix of ACMG Criteria Among the Three MODY Genes
Based on the machine-learning procedure, we found that the weight matrix of the ACMG criteria was significantly different among the three MODY genes HNF1A, HNF4A, and GCK (Table 1, Figure 1). The differences are nontrivial and must be taken into consideration in clinical interpretation of DNA variants for genetic diagnosis.
Evidence for PS is rarely observed for the MODY variants. By contrast, evidence for PS4 (ie, the prevalence of the variant in affected individuals is significantly increased compared with the prevalence in controls) is commonly observed but is often misclassified. As an example, the HNF1A variant 12:121420807-G-A (rs1183910) was reported to be associated with C-reactive protein, a marker of inflammation, in a genome-wide association study [17]. However, as a common single nucleotide polymorphism with a minor allele frequency of 0.292 in European populations, it cannot be a variant causing the rare and dominantly inherited HNF1A-MODY.
With respect to evidence for PM criteria, PM1 (variant located in a mutational hotspot or in a critical and well-established functional domain, eg, the active site of an enzyme, without benign variation) and PM2 (absent from controls, or at extremely low frequency if recessive, in the Exome Sequencing Project, 1000 Genomes Project, or Exome Aggregation Consortium) are both commonly observed, in support of pathogenic variants in the three MODY genes. However, PM2 is also commonly seen among B/LB variants in these three genes, and thus lacks specificity for functional classification. In this study, PM2 showed a VIF of 79.0 in HNF1A and a VIF of 247 in GCK. Therefore, although PM2 is much more common than PM1 for the three MODY genes, the weight of PM2 in HNF1A is lower than that of PM1.
With respect to the evidence for PP criteria, PP2 (missense variant in a gene that has a low rate of benign missense variation and in which missense variants are a common mechanism of disease) is absent in HNF1A and GCK, but is commonly seen in HNF4A. However, PP2 showed a correlation coefficient of 0.932 with PM1, and therefore does not add substantial weight to the classification of P/LP variants in HNF4A.

Highly Accurate Predictive Ability for MODY Gene Pathogenicity
HNF4A-MODY (MODY1), GCK-MODY (MODY2), and HNF1A-MODY (MODY3) are the three most common types of MODY, accounting for ~70% of all MODY cases [18]. Therefore, a predictive model that can accurately recognize pathogenic variants would be useful for the diagnosis of novel mutations in these genes. As described in the Methods section, we used 2-fold random shuffle testing with 50% of the 3600 variants as training data and the other 50% as testing data, and repeated the analysis 20 times. The logistic regression machine-learning model showed overall accuracy above 95% (1676/1786) for MODY gene mutations (Figure 2). Both HNF1A (true negatives=163, false positives=2) and HNF4A (true negatives=428, false positives=0) had a specificity close to 100%, and the specificity in GCK was also above 95% (true negatives=289, false positives=10). The lower specificity in GCK is consistent with the benign phenotype and mild clinical expression of GCK-MODY. These results support the principle that ACMG criteria can be applied as meaningful feature vectors in a machine-learning model, and that such a model could provide accurate pathogenicity classification for other Mendelian disease genes in a gene-specific manner.
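For reference, the per-gene specificities implied by the counts above can be reproduced with a few lines of arithmetic, using specificity = TN / (TN + FP). Only the true-negative and false-positive counts come from the text; the variable and function names are illustrative.

```python
# (true negatives, false positives) per gene, as reported in the text.
reported_counts = {
    "HNF1A": (163, 2),
    "HNF4A": (428, 0),
    "GCK": (289, 10),
}


def specificity(true_negatives, false_positives):
    """Fraction of B/LB test variants correctly classified as B/LB."""
    return true_negatives / (true_negatives + false_positives)


specificities = {gene: specificity(tn, fp) for gene, (tn, fp) in reported_counts.items()}
for gene, value in specificities.items():
    print(f"{gene}: {value:.3f}")
# HNF1A: 0.988
# HNF4A: 1.000
# GCK: 0.967
```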

Discussion
Our results highlight the need for applying different weights of the ACMG criteria in the functional classification of DNA variants of different MODY genes. In the past decade, sequencing technologies have evolved rapidly with the advance of high-throughput next-generation sequencing (NGS). By adopting NGS, clinical laboratories are now performing an ever-increasing volume of genetic testing for genetic disorders. However, increased complexity in genetic testing has been accompanied by new challenges in sequence interpretation, and multiple new standards have been implemented for physicians and genetic counselors regarding the interpretation and reporting of sequence variants at different levels of pathogenicity.
Currently, multiple computational tools based on different algorithms and databases are used to predict the pathogenicity of DNA variants, such as SIFT [19], MutationTaster [20], the likelihood ratio test [21], FATHMM (a supervised machine-learning model) [22], and GERP++ (maximum-likelihood evolutionary rate estimation) [23] for coding variants, and DANN, which uses a deep neural network, for both coding and noncoding variants [24]. However, all of these computational tools apply a common rule to every gene, irrespective of gene-specific biology, whereas this study proposes that a gene-specific assessment of pathogenicity is required, at least for MODY genes [5].
The evolutionary selection pressures on MODYs vary across different genes, and are considered to be the lowest in the case of GCK-MODY [25]. Similar issues exist with functional classification based on the ACMG criteria, which are globally applied to all human genes. The ACMG criteria, one of the most commonly used standards, contain 33 terms that lead to five categories of variants ("pathogenic," "likely pathogenic," "uncertain significance," "likely benign," and "benign").
MODY represents a group of dominantly inherited forms of monogenic diabetes, and HNF4A-MODY (MODY1), GCK-MODY (MODY2), and HNF1A-MODY (MODY3) are the three most common subtypes of MODY. These MODY genes are involved in different molecular pathways. MODY variants of different genes show different clinical features and thus require different treatments. For example, HNF1A-MODY is characterized by a reduced beta cell mass or impaired function, and has been treated with sulfonylureas for decades with excellent results [26]. Patients with HNF1A-MODY are highly sensitive to sulfonylurea treatment and may be susceptible to developing hypoglycemia during the treatment [26]. HNF4A-MODY has clinical features similar to those of HNF1A-MODY, and the affected transcription network plays a role in the early development of the pancreas. The pancreatic beta cells produce adequate insulin in infancy but the capacity for insulin production declines thereafter [27]. The beta cells in GCK-MODY have a normal capacity to make and secrete insulin, but do so only above an abnormally high glucose threshold, which results in a chronic, mild increase in blood sugar that is usually asymptomatic [25]. Accordingly, GCK-MODY can be managed with a healthy diet and exercise, whereas oral hypoglycemic agents and insulin are of no benefit for these patients [25]. Therefore, accurate molecular diagnosis of these MODYs is important for precise treatment.
In conclusion, we applied a computational machine-learning method together with the ACMG criteria for functional classification of genetic variants of the three most common MODY genes, HNF1A, HNF4A, and GCK. Our results show that a standard machine-learning model using 15 computationally scorable ACMG criteria as the feature vector is highly accurate (>95% accuracy) for hundreds of annotated variants in three MODY genes. Therefore, this model could serve as a fast, gene-specific method for physicians or genetic counselors assisting with diagnosis and reporting, especially when confronted by contradictory ACMG criteria. Moreover, we show that the weights of the ACMG criteria are gene specific, which supports applying machine-learning methods with the ACMG criteria to capture the most relevant information for each disease-related variant.