KMT2A‐D pathogenicity, prevalence, and variation according to a population database

Abstract
Introduction: The KMT2 family of genes are essential epigenetic regulators promoting gene expression. The gene family contains three subgroups, each with two paralogues: KMT2A and KMT2B; KMT2C and KMT2D; KMT2F and KMT2G. KMT2A‐D are among the most frequently somatically altered genes in several different cancer types. Somatic KMT2A rearrangements are well‐characterized in infant leukemia (IL), and growing evidence supports the role of additional family members (KMT2B, KMT2C, and KMT2D) in leukemogenesis. Enrichment of rare heterozygous frameshift variants in KMT2A and KMT2C has been reported in acute myeloid leukemia (AML), IL, and solid tumors. Currently, the non‐synonymous variation, prevalence, and penetrance of these four genes are unknown.
Methods: This study determined the prevalence of pathogenic/likely pathogenic (P/LP) germline KMT2A‐D variants in a cancer‐free adult population from the Genome Aggregation Database (gnomAD). Two methods of variant interpretation were utilized: a manual genomic variant interpretation and an automated ACMG pipeline.
Results: The ACMG pipeline identified considerably fewer P/LP variants (n = 89) than the manual method (n = 660) across all four genes. Consequently, the total P/LP prevalence and allele frequency (AF) were higher with the manual method (1:112, AF = 4.46E‐03) than with ACMG (1:832, AF = 6.01E‐04). Multiple ancestry‐exclusive P/LP variants were identified, along with an increased frequency in males compared to females. Many of the variants identified in this population database are also associated with severe juvenile conditions.
Conclusion: These data demonstrate that putatively functional germline variation in these developmentally important genes is more common than previously appreciated, and identification in cancer‐free adults may indicate incomplete penetrance for many of these variants. Future research should examine a genetic predisposing role in IL and other pediatric cancers.


| INTRODUCTION
The histone-lysine N-methyltransferase 2 (KMT2) family of genes, previously known as mixed lineage leukemia (MLL), encodes H3K4 methyltransferases that play an essential role in epigenetic regulation of gene expression by altering chromatin structure to promote DNA accessibility. 1 KMT2A-D are highly conserved among eukaryotes given their importance in cellular function and mammalian gene expression during early development. 1,2 These four genes encode large proteins that form part of complexes crucial for transcription, known as COMPASS (COMplex of Proteins ASsociated with SET1). 3 Beyond the well-characterized somatic dysregulation in KMT2A-rearranged pediatric leukemia, recent studies have shown that KMT2A-D variants are among the most frequent genomic alterations in a variety of human malignant neoplasms, with associations to common blood and solid tumor cancers. 2,3 Genotype-phenotype correlations have been established between KMT2A (11q23.3), KMT2B (19q13.12), KMT2C (7q36.1), and KMT2D (12q13.12) somatic variants and different cancer types. 2-4 KMT2A-related cancers are strongly associated with large somatic chromosomal rearrangements, with over 100 translocation partners creating fusion proteins linked to particularly aggressive adult and infant leukemias. 2,5 Emerging data on KMT2B have suggested potential associations between somatic short pathogenic variants and hepatocellular carcinoma, as well as between translocations and undifferentiated spindle cell sarcomas. 6,7 However, KMT2B has yet to be as confidently identified as a driver of oncogenesis. 8 Advances in cancer exome sequencing studies have supported somatic KMT2C haploinsufficiency in acute myeloid leukemia, pancreatic ductal carcinoma, bile duct carcinoma, cutaneous squamous cell carcinoma, gastric adenocarcinoma, and hepatocellular carcinoma. 9 Similar studies have shown somatic KMT2D variants to be associated with non-Hodgkin lymphoma, renal carcinoma, squamous cell carcinomas, mantle cell lymphoma, and pediatric tumors more generally. 9 All these associations are additionally complicated by potential polygenic combinatorial effects causing a statistically significant co-occurrence rate, suggesting that variants in several KMT2 genes may be required for the development of various cancer types. 3 Beyond the malignant neoplasm associations, there are additional correlations with autosomal dominant congenital disorders resulting from de novo heterozygous germline variants 3 12,13 and KMT2D with Kabuki syndrome (KABUK1 [MIM: 147920]). 11,14
Of particular interest is the relation of the KMT2 genes to pediatric cancers, such as IL. IL describes children diagnosed with either acute lymphocytic leukemia (ALL) or acute myeloid leukemia (AML) up to one year of age. 15,16 This sporadic, rare cancer (approximately 150 cases diagnosed in the United States annually) has a poor prognosis, with a 5-year event-free survival rate of approximately 50%. 16,17 Additionally, those who survive the initial cancer frequently experience lifelong late effects such as developmental and cognitive deficits. 18,19 Such lack of progress likely results from an inadequate understanding of IL, given the unique natural history and epidemiology of this disease compared to the other pediatric leukemia types, suggestive of novel biology. 20 Further, IL has rapid onset in utero and is therefore understood to result from improper hematopoietic development. 21 One of the defining features of IL is the disproportionate presence of KMT2A chromosomal rearrangements. 22,23 Such translocations are seen in ~34%-50% of infant AML cases 24 and approximately 80% of ALL cases; 25 however, new evidence suggests additional factors are needed to fully explain leukemogenesis. 3,26 Because IL has one of the lowest known rates of somatic variants beyond KMT2A rearrangements of any cancer type, it is now hypothesized that inherited germline variation (notably through an excess of rare, non-synonymous variants) may lead to the development of IL. 26 One study, building on an observed enrichment of germline missense variants in KMT2C, found that complete loss of this epigenetic regulator hinders mesoderm to hematopoietic specification in human pluripotent stem cells (hPSCs) in vitro. 26,27 Thus, germline variants and alterations in other KMT2 genes, independent of KMT2A translocations, must be further explored to gain a more comprehensive view of IL predisposition.
The recent availability of large-scale genomic data from publicly available population datasets, such as the Genome Aggregation Database (gnomAD), allows an estimation of pathogenic germline variation within specific genes to be obtained. 28 Particularly, gnomAD provides insight on individuals screened for no history of childhood disease (i.e., cancer or cardiology condition, etc.) among participants or in any of their first-degree relatives, encompassing a wide range of ages (18-85 years), ancestry, and sex assigned at birth information. Understanding the non-synonymous variation, P/LP prevalence, and frequencies of germline heterozygous KMT2 variants in a cancer-free adult population could supplement the broader understanding of how different genetic variants contribute to the phenotype of certain conditions. Germline heterozygous variants in these four genes are associated with severe congenital and pediatric conditions. 10,26,29,30 Hence, this analysis may provide insight into the penetrance of these variants seen in cancer-free adults. To our knowledge, this is the first description of the variation of pathogenic heterozygous germline KMT2A-D variants in a large population database.

| Large, publicly available datasets in gnomAD
Version 3 (v3.1.2) of gnomAD was used to examine KMT2A-D non-synonymous germline sequence variation in individuals within the non-cancer datasets, because pathogenic KMT2 variants confer an overall cancer predisposition. Version 3 was chosen over version 2 because of its larger sample size of African American individuals and its inclusion of Amish and Middle Eastern populations. These populations span both whole genomes and exomes from unrelated individuals mapped to the GRCh38 build of the human reference genome. Participant genomes and exomes were sequenced as part of both population-wide and disease-specific genetic studies, and participants had no personal history of common disease or known cardiac, cancer, or neurologic diagnosis. After gnomAD's internal sequencing QC filtering 28 was applied, each gene had between 74,019 and 74,032 individuals included in this analysis (Table S1). The use of non-cancer datasets ensured that the variants in this analysis were not from individuals ascertained to have cancer in cancer studies, allowing a closer look at the baseline frequency of variation in a cancer-free general population.
All exonic, splice-site region, and intronic variants from the non-canonical KMT2A (NM_001197104.2), KMT2B (NM_014727.3), KMT2C (NM_170606.3), and KMT2D (NM_003482.4) transcripts were utilized in this analysis, including nonsense, frameshift, missense, synonymous, deep intronic, and UTR variants. Multi-allelic variants (homozygotes and hemizygotes) and structural variants (large deletions, duplications, insertions, or other DNA rearrangements) were excluded since they were not available in gnomAD v3.1 at the time of this analysis. Version 2.1 does include structural variant data; however, all such variants were small deep intronic deletions, duplications, or insertions and therefore did not seem relevant to this analysis. The gnomAD datasets provided sequence ontology, ClinVar significance, allele counts, allele frequencies, and chromosomal positions important for variant interpretation. ANNOVAR22 was utilized to merge bioinformatic pathogenicity predictions and provided American College of Medical Genetics and Genomics and Association for Molecular Pathology (ACMG-AMP) criteria to further differentiate hotspot locations using the pathogenic moderate criterion (PM1). The UCSC Genome Browser (GRCh38/hg38) was used to determine the genomic location of pathogenic and likely pathogenic variants when necessary.

| Manual variant classification system
A schematic (Figure 1) to classify all variants within KMT2A-D as pathogenic (P), likely pathogenic (LP), variant of uncertain significance (VUS), or likely benign (LB) was modeled after the guidelines proposed by ACMG and a corresponding study that performed a similar analysis in DICER1. 31 In brief, a variant was classified as pathogenic if it was a loss-of-function (LOF) variant (nonsense or frameshift), a splice donor or acceptor variant, a missense variant located in a missense mutational hotspot, reported in at least one publication, or had a Pathogenic/Likely Pathogenic call in ClinVar. Non-synonymous missense variants located in a non-hotspot region and having a bioinformatic pathogenicity prediction of CADD ≥30 or REVEL ≥0.75 were classified as likely pathogenic. Our scheme deviates from Kim et al.'s 31 classification system most notably in the thresholds for bioinformatic pathogenicity predictor scores. Variants with CADD scores over 30 or REVEL scores over 0.75 are predicted to be among the 0.1% most deleterious possible substitutions, while variants with an intermediate CADD score of 20-29 and REVEL score of 0.5-0.74 are predicted to be among the 1% most deleterious. 32,33 Thus, where Kim et al. 31 enforced more conservative thresholds, we employed intermediate limits to further differentiate between missense variant classifications. However, a larger emphasis of their analysis was to look for solo variants with large effect sizes, while ours was more liberal in allowing for moderate effect size variants that could potentially result in a phenotype in conjunction with certain somatic variants. MetaSVM was not included in this paper because it significantly inflated the number of missense variants classified as likely pathogenic across all four genes (n = 1446) compared to REVEL (n = 117) and CADD (n = 110), indicating a possible artifact of the analysis, and because recent publications have also excluded MetaSVM from their analyses while using similar CADD and REVEL thresholds. 34

FIGURE 1 Manual genomic variant interpretation schematic. *Missense mutational hotspot locations were identified from the agreement of two conservation in silico models and a prior classification under the ACMG PM1 criterion. This variant interpretation schematic was adapted from Kim et al. 31

| Identification of missense mutational hotspots
To date, there are no reported missense mutational hotspots within the KMT2 genes except for the first PHD finger domain in KMT2C. 3 To identify remaining hotspots, the ACMG PM1 criterion and mammalian conservation in silico models (GERP and PhyloP) were utilized. PM1 is a functional data criterion that is invoked when a variant is in a mutational hotspot and/or a critical and well-established functional domain (e.g., the active site of an enzyme) without benign variation. Therefore, for a missense variant to be considered located in a hotspot, it needed both to be found within a region conserved in mammals and to have the PM1 criterion invoked.
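The decision rules described above can be sketched in a few lines of Python. This is an illustrative reading of the schematic, not the authors' implementation: the `Variant` fields and the function names are hypothetical, conservation is passed in as pre-computed booleans because the text gives no GERP or PhyloP cut-offs, and assigning VUS to the intermediate score band (CADD 20-29 or REVEL 0.5-0.74) with everything else falling to LB is an assumption, since the text does not state those outcomes explicitly.

```python
from dataclasses import dataclass
from typing import Optional

# Consequence labels below are hypothetical shorthand, not gnomAD's exact terms.
LOF_OR_SPLICE = {"nonsense", "frameshift", "splice_donor", "splice_acceptor"}


@dataclass
class Variant:
    consequence: str
    pm1_invoked: bool = False       # ACMG PM1 criterion was invoked
    gerp_conserved: bool = False    # conserved per GERP (threshold not given in the text)
    phylop_conserved: bool = False  # conserved per PhyloP (threshold not given in the text)
    in_literature: bool = False     # reported in at least one publication
    clinvar_plp: bool = False       # ClinVar Pathogenic/Likely Pathogenic call
    cadd: Optional[float] = None
    revel: Optional[float] = None


def in_missense_hotspot(v: Variant) -> bool:
    """Hotspot = missense + PM1 invoked + agreement of both conservation models."""
    return (v.consequence == "missense" and v.pm1_invoked
            and v.gerp_conserved and v.phylop_conserved)


def classify(v: Variant) -> str:
    """Return 'P', 'LP', 'VUS', or 'LB' following the manual scheme as read here."""
    if v.consequence in LOF_OR_SPLICE:
        return "P"
    if in_missense_hotspot(v) or v.in_literature or v.clinvar_plp:
        return "P"
    if v.consequence == "missense":
        cadd = v.cadd if v.cadd is not None else float("-inf")
        revel = v.revel if v.revel is not None else float("-inf")
        if cadd >= 30 or revel >= 0.75:
            return "LP"
        # Assumption: the intermediate band maps to VUS.
        if cadd >= 20 or revel >= 0.5:
            return "VUS"
    # Assumption: everything else is treated as likely benign.
    return "LB"
```

For example, `classify(Variant("missense", cadd=31.0))` falls in the LP branch, whereas the same variant with PM1 invoked and both conservation flags set would be classified as P.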

| ACMG classification
ACMG has previously developed guidance for the interpretation of clinical sequence variants. 35 These recommendations primarily apply to the breadth of genetic tests used in clinical laboratories, including genotyping of single genes, panels, exomes, and genomes. For clinical relevance, the variants identified in gnomAD were also run through an ACMG auto-classification system that interprets the KMT2 variants in accordance with these criteria, using Golden Helix VarSeq 2.2.0 (Golden Helix, Inc., Bozeman, MT) to annotate the datasets via the VSClinical pipeline. We note that, because parental samples were unavailable for the gnomAD population, we did not include the ACMG PS2 or PM6 criteria for evidence of pathogenicity as part of our ACMG classification. To allow for further comparison with the manual variant classification system, the ACMG interpretations were grouped into the same four categories: LB, VUS, LP, and P. All variants that ACMG classified as benign or likely benign were combined into LB; all conflicting calls or variants of unknown significance were categorized as VUS, including VUS/weak benign and VUS/weak pathogenic; all likely pathogenic variants were pooled as LP; and all pathogenic variants were classified as P. Sex information was also collected from this pipeline and merged with our gnomAD dataset. Sex information was missing for 6423 variants (1225 for KMT2A, 1258 for KMT2B, 2036 for KMT2C, and 1904 for KMT2D) and was manually added for any P/LP variants for which it was missing. Total chromosomes/individuals and the number of P/LP variants were used for prevalence calculations.
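The pooling of ACMG calls into the four comparison categories amounts to a lookup table, sketched below. The label strings are hypothetical stand-ins taken from the wording above; they are not claimed to match the exact strings emitted by VarSeq or VSClinical.

```python
# Hypothetical label strings based on the description in the text,
# not the actual output vocabulary of any specific tool.
ACMG_TO_FOUR_TIER = {
    "benign": "LB",
    "likely benign": "LB",
    "vus": "VUS",
    "conflicting": "VUS",
    "vus/weak benign": "VUS",
    "vus/weak pathogenic": "VUS",
    "likely pathogenic": "LP",
    "pathogenic": "P",
}


def normalize(label: str) -> str:
    """Lower-case the label and tidy whitespace, including around '/'."""
    parts = [" ".join(part.split()) for part in label.lower().split("/")]
    return "/".join(parts)


def collapse_acmg(label: str) -> str:
    """Map an ACMG-style call onto LB, VUS, LP, or P."""
    key = normalize(label)
    if key not in ACMG_TO_FOUR_TIER:
        raise ValueError(f"Unrecognized ACMG label: {label!r}")
    return ACMG_TO_FOUR_TIER[key]
```

Raising an error on an unrecognized label, rather than silently defaulting to VUS, is a design choice of this sketch so that unexpected pipeline output is noticed before prevalence is calculated.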

| Data visualization and statistical analysis
Lollipop plots were created to display the spectrum of pathogenic and likely pathogenic variants across both interpretation schemes using the cBioPortal Mutation Mapper. 36 Variants classified as either pathogenic or likely pathogenic from the manual approach and pathogenic, likely pathogenic, and VUS/weak pathogenic from the ACMG interpretation were visualized in the lollipop plots for all four genes. Only protein-coding variants were included, removing eight canonical splice variants (KMT2C = 5, KMT2D = 3). Logistic regression was used to analyze ancestry and sex differences in allele frequency across variant type and interpretation. All statistical analyses were performed in R (4.1.2) using the package epiDisplay. The overall allele frequency for each gene was calculated by dividing the total allele count by the total allele number; ancestry-specific frequencies were calculated in the same way. Variant prevalence was calculated by dividing the number of pathogenic or likely pathogenic variants by the total number of individuals (Table S1). P-values were calculated only for the comparison of male and female mean allele frequencies, for which a two-sample t-test was used to determine significance.
(Tables S2 and S3). Variants classified as pathogenic or likely pathogenic (including missense, nonsense, frameshift, and canonical splice site variants only) were further utilized for downstream analysis and stratified by ancestry. KMT2C (AF = 0.011, 0.001) and KMT2D (AF = 0.003, 0.0003) both had higher pathogenic and likely pathogenic allele frequencies across both interpretations (manual approach and ACMG, respectively) than KMT2A (AF = 0.002, 0.00005) and KMT2B (AF = 0.003, 0.00002). However, the manual approach consistently yielded higher frequencies than the ACMG method (Table 1; Tables S3 and S4). All ACMG and manual pathogenic and likely pathogenic variants were plotted to highlight the amino acid and exonic locations of these variants (Figure 2A-G).  (Table S2).
There were additionally 3837 variants of uncertain significance identified and 9816 likely benign (Table S2). In total, 461 missense variants, 38 frameshift variants, 17 nonsense, and 18 canonical splice site variants were identified as pathogenic (Table S2). All 126 likely pathogenic variants are non-synonymous missense per the schematic (Figure 1).
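As a worked check of the allele frequency and prevalence definitions given in this section, the short Python sketch below reproduces the summary figures reported in the abstract. It rests on stated assumptions rather than the authors' actual code: the denominator is taken as 74,019 individuals (the lower bound reported per gene), the allele number is assumed to be twice the number of individuals, and each of the 660 (manual) or 89 (ACMG) P/LP variants is treated as one allele carried by a distinct person. The function names are illustrative only.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """AF = total allele count / total allele number."""
    return allele_count / allele_number


def prevalence_one_in(n_plp: int, n_individuals: int) -> int:
    """Express prevalence (P/LP count / individuals) as '1 in N', rounded."""
    return round(n_individuals / n_plp)


# Assumptions: 74,019 individuals and a diploid allele number of 2 x 74,019.
N_INDIVIDUALS = 74_019
ALLELE_NUMBER = 2 * N_INDIVIDUALS

for method, n_plp in (("manual", 660), ("ACMG", 89)):
    af = allele_frequency(n_plp, ALLELE_NUMBER)
    one_in = prevalence_one_in(n_plp, N_INDIVIDUALS)
    print(f"{method}: prevalence 1:{one_in}, AF = {af:.2E}")
# manual: prevalence 1:112, AF = 4.46E-03
# ACMG: prevalence 1:832, AF = 6.01E-04
```

Under these assumptions the output agrees with the reported 1:112 (AF = 4.46E-03) and 1:832 (AF = 6.01E-04); the per-gene denominators in Table S1 may differ slightly.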

| Intronic and UTR variants
In the manual variant interpretation scheme, a total of 8835 likely benign intronic variants were excluded from our analysis bringing the total sample variants to 4791 utilized. Likewise, in the ACMG criteria schematic, a total of 8840 likely benign intronic variants were excluded. A comparison of interpretation methods reveals minor differences between them for both intronic and UTR variants. In KMT2A and KMT2C, the manual approach identified no intronic variants as being variants of unknown significance; however, 5 for KMT2A and 14 for KMT2C were noted in the ACMG method. KMT2B and KMT2D had intronic variants identified as VUS in the manual approach (n = 5 and n = 8) which was reflected in the ACMG criteria by identifying 11 and 9, respectively. In terms of UTR variants, the manual approach had 15 denoted as VUS for KMT2A due to their designation as ncRNA while the ACMG criteria retained only 5 of those with the other 10 reclassified as likely benign. For KMT2B, KMT2C, and KMT2D, none were classified above likely benign in the manual classification; however, in the ACMG

| Ancestry and sex-based differences in pathogenic and likely pathogenic variants
There were multiple ancestry-specific variants identified in this analysis ( For those with data on sex assigned at birth, according to the manual classification system, males tended to have higher average allele frequencies than females, except for KMT2B (Table 1). Sex information was missing for nearly all ACMG P/LP variants. The exceptions were likely pathogenic variants in KMT2C, with an average AF of 2.64E-06 for males and 2.06E-06 for females, and one pathogenic KMT2D frameshift variant (c.4168dup, p.Ala1390GlyfsTer42), with one male allele (AF = 6.01E-09) and none in females (Table 1).

| DISCUSSION
Since the relationship between KMT2A and both myeloid and lymphoid leukemias was first discovered, its crucial role as an epigenetic regulator has been further characterized, along with the importance of other gene family members. 2,37 Of recent interest has been their relationship with IL. While structural rearrangements of KMT2A have long been recognized as a key feature of this rare cancer, many aspects of the etiology of IL remain unknown. Given the inability to fully account for the incidence of IL under somatic variant mechanisms, studies now suggest this complex trait results from many rare inherited variants that, in aggregate, influence disease phenotypes. 22,26,38 The picture has been further complicated by the failure of murine models (with KMT2A rearrangements) to develop leukemia in infancy, suggesting other factors are necessary for rapid-onset leukemia in utero. 39 Therefore, while individual parents may not have significantly increased cancer risk, random combinations of alleles during offspring development can lead to a greater chance of leukemogenesis in infancy from an enhanced germline genetic predisposition. 26 Indeed, one analysis reported a statistically significant increase in adjusted odds ratio when a second-degree relative was confirmed to have cancer for infants with ALL (adj. OR = 1.66, 95% CI = 1.12-2.45) and a near-significant increase for AML (adj. OR = 1.54, 95% CI = 0.80-2.98). 40,41 Multiple KMT2 genes have also now been shown to potentially play a role in the onset of leukemogenesis, with the presence of excess congenital, non-synonymous germline variation in MLL-3 (KMT2C) having been identified specifically. 26 This study, therefore, examined the prevalence of germline pathogenic variation in KMT2A-D in the general population. An overall total of 14,313 variants were identified across all four KMT2 genes (pathogenic, likely pathogenic, VUS, and likely benign).
The manual approach classified variants as pathogenic or likely pathogenic 4.61% of the time compared to 0.43% according to ACMG. Five variants identified as pathogenic by both approaches included one KMT2A frameshift variant (c.134del, p.Pro45ArgfsTer105) as well as three KMT2D frameshifts (c.15953_15956del, p.Leu5318SerfsTer14; c.4168dup, p.Ala1390GlyfsTer42; and c.2994del, p.Met999Ter) and one nonsense (c.7411C > T, p.Arg2471Ter). Both c.7411C > T and c.2994del along with two other KMT2D variants identified as pathogenic and likely pathogenic by ACMG criteria (c.4168dup, p.A1390GfsTer42 and c.10180C > T, p.Gln3394Ter) have been reported to be de novo in individuals affected with Kabuki syndrome. 29,[42][43][44][45][46] Given the autosomal dominant inheritance pattern, this brings into question why damaging germline variants would be present among individuals or any first-degree relative screened for severe pediatric disease within the gnomAD database. 46 Penetrance is considered nearly complete for KMT2D; therefore, it is likely these individuals may have mosaic Kabuki, which has been found in two studies to result in individuals that are mildly/minimally affected. 29,47,48 An increased number of damaging variant classifications from the manual approach largely resulted from its automatic classification of missense variants located in a missense mutational hotspot as pathogenic and broad categorizations based on variant type more generally (Figure 1). Further, there was an enrichment of likely pathogenic missense variants given their classification based solely on bioinformatic in silico predictor scores (CADD ≥30 and REVEL ≥0.75). Besides missense variants, those downgraded from pathogenic by ACMG to VUS or likely benign included: canonical splice site variants (KMT2A = 2, KMT2C = 2, KMT2D = 6), nonsense variants (KMT2B = 2, KMT2C = 2, KMT2D = 3), and one frameshift variant for KMT2B. 
No variants classified as likely benign or VUS in the manual variant classification system were reclassified as pathogenic or likely pathogenic by ACMG. Overall, the ACMG method was significantly more conservative across all four genes. Such results were expected given ACMG is the current clinical standard of genetic variant interpretation, thus requiring additional criteria to classify a variant as pathogenic or likely pathogenic. Our findings indicate that approaches created to aid in population genetic research may lead to higher estimates of damaging variants, particularly for novel genes and complex diseases.
KMT2-family variants are among the most frequent somatic alterations in human cancer. 2 KMT2A rearrangements represent recurrent somatic events comprising 35%-50% of infant AML cases 24,25 and 50%-80% of infant ALL cases. 24,49 Four germline variants identified in our analysis of KMT2C (c.1173C > A and c.2976 + 1G > C) and KMT2D (c.2560dup and c.4168dup) were also found in COSMIC (the Catalogue Of Somatic Mutations In Cancer) in hematopoietic and lymphoid tumors. All four of these variants were classified as likely pathogenic or pathogenic in both the manual and ACMG interpretations. Overall, the frequency of germline pathogenic variants that were found in the somatic literature was low. This illustrates the growing importance of understanding the background frequency of KMT2 germline variants in large, cancer-free populations.
Overall, this approach allowed for an examination of germline variation within the KMT2 genes and adds to the body of research on the significance of inherited germline heterozygous variants. However, there were several limitations. The manual interpretation relies heavily on variant type, in silico models, and published literature for determining variant classification. Therefore, it has limited usage in a clinical setting and is better suited for large-scale epidemiological variant interpretation to reveal potential variants of interest for future investigation, as we have done herein. ClinVar is driven by submissions of data, and its scope is limited to variants that have been interpreted for clinical or functional significance, which is laborious for each variant. This is reflected here, as ClinVar was considerably more conservative in identifying P/LP variants than the manual method. gnomAD does not provide individual genotype data; consequently, there is a possibility of one person possessing multiple P/LP KMT2A-D variants. Additionally, the sample size of each ancestry population was not equivalent and varied greatly between ancestries. This complicated our ability to discern whether the lack of variants identified in the smaller (<5000 adults) ancestry populations (Middle Eastern, Amish, Other, Ashkenazi Jewish, South Asian, and East Asian) is due to a true absence or to a lack of statistical power to detect such variants. We were unable to control for digenic or polygenic inheritance of the variants, meaning the measures of frequency used here assumed that each variant was found in a unique person. We did not have parental genetic information; therefore, we could not differentiate between inherited and de novo germline variants, and ACMG criterion PS2 could not be invoked to determine paternal or maternal inheritance.
Lastly, gnomAD captures a picture of the cancer-free adult population (18-85 years) with no personal or first-degree family history of severe pediatric disease. However, it still may include individuals who possess other risk variants for polygenic, multifactorial, or adult-onset conditions. Publicly available population databases include whole genome and exome sequence information from large populations, making them useful in the establishment of a comprehensive resource on human genetic variation at the population level. They can be used to estimate the full spectrum of natural and human disease variation and are utilized in the ACMG criteria for classifying clinical variants. 35 We presented a comprehensive characterization of KMT2A, KMT2B, KMT2C, and KMT2D pathogenic variation in a large cancer-free adult population, with ancestry- and sex-specific frequencies. Epidemiological genomic variant interpretation is often a bottleneck in population genomics analysis and is needed to understand gene penetrance, clinical genotype-phenotype relationships, and the prevalence of non-synonymous variation. The findings from our study, if validated in other large, population-based datasets (e.g., LOVD, UK BioBank), suggest that pathogenic heterozygous germline KMT2A-D variants are more common than previously expected and provide better insight into the penetrance of these variants.