Genome sequencing reveals molecular subgroups in oral epithelial dysplasia

Abstract This study aimed to analyze the molecular characteristics of oral epithelial dysplasia (OED), highlighting the pathways and variants of genes that are frequently mutated in oral squamous cell carcinoma (OSCC) and other cancers. Ten archival OED cases were retrieved for retrospective clinicopathological analysis and exome sequencing. Comparative genomic analysis was performed between high-grade dysplasia (HGD) and low-grade dysplasia (LGD), focusing on 57 well-known cancer genes, of which 10 were previously described as the most mutated in OSCC. HGD cases had significantly more variants; however, a similar mutational landscape to OSCC was observed in both groups. CASP8+FAT1/HRAS, TP53, and miscellaneous molecular signatures were also present. FAT1 is the gene that is most affected by pathogenic variants. Hierarchical divisive clustering showed division between the two groups: “HGD-like cluster” with 4HGD and 2LGD and “LGD-like cluster” with 4 LGD. MLL4 pathogenic variants were exclusively in the “LGD-like cluster”. TP53 was affected in one case of HGD; however, its pathway was usually altered. We describe new insights into the genetic basis of epithelial malignant transformation by genomic analysis, highlighting those associated with FAT1 and TP53. Some LGDs presented a similar mutational landscape to HGD after cluster analysis. Perhaps molecular alterations have not yet been reflected in histomorphology. The relative risk of malignant transformation in this molecular subgroup should be addressed in future studies.


Introduction
The incidence of oral cancer in the United States (U.S.) is estimated at 35,130 new cases per year, most of which correspond to oral squamous cell carcinoma (OSCC). 1 Well-known risk factors for OSCC development include tobacco and alcohol consumption and immune system impairment. [2][3][4][5] Human papillomavirus (HPV) has a driving role in the oropharynx and cervical cancer. However, it is still a controversial risk factor for OSCC. 6,7 Approximately 14% of OSCCs evolve from oral leukoplakia, a potentially premalignant oral lesion (PPOL) described as a "white plaque Declaration of Interests: The authors certify that they have no commercial or associative interest that represents a conflict of interest in connection with the manuscript. OSCC) to characterize its mutational landscapes. 7,19 The first OSCC NGS study was conducted in 2013 by Maitra et al. in an Indian population, reporting three mutational signatures: one characterized by TP53 mutations, a second composed of CASP8±FAT1, and a third miscellaneous signature. The mutations mainly correspond to single nucleotide polymorphisms (SNP) and indels. 24 Only a few NGS studies have assessed PPOL, and none have mutations encountered across grades of dysplasia and cancer. 10,25 Hence, the present study compared the mutational landscape of ora epithelial displasya OED grades and known driver genes in OSCC and other cancers. Unique genomic characteristics may help to understand the progression from dysplasia to cancer and provide novel insights for future therapies.

Methodology Samples
Ten patients from the study group with OED were retrieved. The histological slides were examined by two independent oral pathologists for a second diagnosis. A binary grading system for dysplasia was used for the histological diagnosis of the samples. 16 When no agreement was reached between the two pathologists, the diagnosis was settled by consensus. The patient information was codified to maintain anonymity. The subjects' rights were protected by the Institutional Review Board with ethical approval (FONDECYT #11140281), which complied with the Helsinki Declaration, and each subject signed a detailed informed consent form.

Exome sequencing and selection of focus gene list
Total DNA was obtained by an external laboratory (ALAMAK®) from formalin-fixed paraffinembedded tissue blocks using the DNA extraction kit RecoverAll™ (Thermofisher™, Massachusetts, USA) following the manufacturer's instructions. The DNA yield ranged from 0.2-2.0 μg. After processing each sample, 20 μl of the solution was obtained, and an aliquot of 1.0 µl complemented with 99 µL Milli-Qâ water was analyzed using a spectrophotometer (DU-640, Beckman, Palo Alto, USA) to verify the quantity and purity of each DNA sample. Genomic DNA was stored at -80 °C. DNA was prepared for whole-exome sequencing.
Whole exome capture was performed with 500 ng of DNA using Agilent SureSelect Human all Exon kit™ (Agilent Technologies™, Santa Clara, USA), following the manufacturer's instructions. Briefly, high-quality genomic DNA samples were sheared and randomly fragmented using the M220 Focused-ultrasonicator™ (Covaris®, Massachusetts, USA) with a base pared of 150-200bp. The ends were repaired, and 'A' bases were added at the 3′ end. Adapters were ligated to both ends of the fragments. The adaptor-ligated library was amplified using ligation-mediated PCR, purified, and hybridized to the SureSelect Biotinylated DNA Library for enrichment. The captured library was amplified, purified, and tested using an adaptor-ligated library, amplified by ligation-mediated PCR, purified, and hybridized to the SureSelect Biotinylated DNA Library for enrichment. The captured library was amplified, purified, and tested using an Agilent 2100 Bioanalyzer™ (Agilent Technologies, Santa Clara, USA). The samples were analyzed by Whole Exome Sequencing, combining the most robust and efficient DNA capture with enrichment protocols.
A variant calling workflow for 57 genes was performed (47 genes of known impact for several cancers and 10 most mutated genes in OSCC) ( Table 1).

Molecular analysis workflow
The first variant calling within the 57 genes region of interest was performed without restrictions based on their Minor Allele Frequency.

Variant classification
Variants were classified according to the impact ranking assigned by Variant Effect Predictor 26 from Ensembl version 91. The prediction was performed based on the type of genetic consequence of the mutation, classifying them as MODIFIER, LOW, MODERATE, and HIGH IMPACT.
Specific variant calling of HPV DNA sequences was performed for more than 170 HPV subtypes, including high-risk HPV strains. 27 .

Variant characterization
The MODERATE and HIGH IMPACT variants were selected for further investigation of their pathogenicity. Divisive hierarchical clustering was performed based on SNPs presentation on samples to evaluate variant traits between samples. R programming environment v.3.6.0 was used for this purpose. 28 In a pool of 10 patients, we performed a cluster analysis to investigate whether certain variants occurred more frequently than others. Nonaffected variants, heterozygosis, and homozygosis (0, 1, or 2) in 97 SNPs from each patient were used for clustering. We used the Manhattan distance formula to calculate the dissimilarity matrix of mutational profiles between SNPs and the hierarchical divisive clustering algorithm to identify broad clusters. We then performed a second cluster analysis to investigate whether groups of patients (n = 10) had similar mutational profiles in the 97 SNPs identified. We further evaluated whether these corresponded with certain physiological traits observed in patients. All cluster analyses were performed in R v.3.6.0, using the library cluster. 28   Panther Pathway analysis of genes affected by pathogenic variants with VAF≤0.1 was done at the functional and biological process levels to investigate molecular significance. 29 An overrepresentation test of the pathway analysis was performed for an extended review.
Variants from each sample were combined into a single gvcf file for comparative analysis between groups. The Case-Control routine included in SnpSift 30 was used to detect variants with differential occurrence between High-and Low-grade samples. P-values were obtained for different genetic models: dominant, recessive, allelic, genotypic/codominant, and Cochran-Armitage trend. Statistical significance was set at p < 0.05.
Pooled variants and their cumulative impacts were compared between LGD and HGD using a nonpaired Student's t-test.

Clinical outcomes
Six of the ten included cases were diagnosed as LGD and four as HGD ( Figure 1). M: W ratio was 1:1, mean age was 58.8 years old. Four of the cases presented clinically as homogeneous leukoplakia (two HGD and two LGD), five as erythroleukoplakia (two HGD and three LGD), and one as verrucous leukoplakia (LGD). The sites of presentation were the tongue (four cases), buccal mucosa (two cases), gingival ridge (two cases), palatal mucosa (one case), and floor of the mouth (one case). A history of tobacco consumption was positive for all except one patient. No HPV infection was detected using sequencing.

Molecular outcomes
Variant calling for the coding exons of the genes of interest was performed. The mean depth coverage of SNPs in the coding area was 47x. The total number of variants was 7,283. Of these, 6,993 (96.23%) were noncoding variants grouped as MODIFIERS, and 274 (3.77%) were LOW/MODERATE/HIGH IMPACT, according to the ENSEMBL classification. Overall 177/274 (64.59%) were LOW IMPACT, and 136/274 (50%) were synonymous variants. Of the 87/274 (31.75%) variants of MODERATE IMPACT, 83 (95.4%) were missense, and 4 (4.6%) were in-frame indels. Finally, 10/274 (3.64%) variants were HIGH impact nonsense, stop gain/lost, and splice site variants. The variant counts and workflow are shown in Figure 2.
Approximately 1,552 variants were detected in each HGD case. In comparison, about 1,254 variants per case were detected in LGD cases ( Figure  3b). Unpaired Student's t-test of the mutational count revealed statistical significance between the groups (p = 0.0347). The normality assumption for each group was verified using the Shapiro-Wilk test (p-value = 0.4914). The results are shown in Figure 3.

Qualitative analysis of molecular alterations
After excluding MODIFIER and LOW IMPACT variants, the 97 remaining MODERATE and HIGH IMPACT variants were analyzed. The genomic variant burden for the OED cases is presented in Table 2. The mean number of SNPs for LGD and HGD cases was  approximately 40 and 39, respectively, without significant differences between the groups.
Variants of the studied genes were examined in all cases, as shown in Table 2. The genes with the most significant number of variants were FAT1 (X ~ 11.2), HNF1A (X ~ 4.4), ALK (X ~ 4.4), and TRPM3 (X ~ 4.4). Interestingly, only two variants were identified for TP53: rs1042522 (a relatively common variant associated with hereditary cancer syndrome in Clinvar), which was present in all cases of homozygosis, and rs28934575, a heterozygous variant present in only one HGD case. Seven new variants were assessed, and 3 of them had HIGH IMPACT (frameshift/splice donor/start loss). The number of FLT3 gene variants was significantly higher in HGD than in LGD (p = 0.0112) (Figure 3c).

Hierarchical clustering
Two non-supervised dendrograms were obtained using "R" software, as shown in Figure 4 (97 variants distributed across cases are shown). The first dendrogram is shown at the top and comprises two broad clusters: LGD-like clusters ( green, including 4/6 LGD) and HGD-like clusters ( brown, including 2/6 LGD and 4/4 HGD). The second dendrogram is shown on the right and is divided into two main clusters: the "common variants between cases" (shown in pink) and the "rare variants between cases" (shown in blue). The orange frame depicts three FAT1 missense variants. Interestingly, they appear together in homozygosity in the HGD-like cluster, but in the LGD-like cluster, they appear in heterozygosity or are not mutated (rs11939575; rs1877731; rs367863). The purple frame depicts a heterogeneous group of variants that tend to appear more in the HGD-like cluster and is composed of missense variants of FAT1, PIK3CA, KDR, HNF1A, and GNAS. Variants with a VAF≤ 0.1 were part of the "rare variants between cases" cluster, except for ALK variant rs1569156, a HIGH IMPACT deletion with a stop gain consequence (identified in homozygosity in 9/10 cases).

Filtered variants analysis
To describe probable somatic variants among the 97 already selected variants, we retained variants with VAF≤0.1 in the general population. 51 MODERATE and HIGH impact variants were identified as probable somatic mutations (Table 3 accompanied by clinical information for every case. The FAT1 gene was affected by different variants in out of 9/10 cases. Most FAT1+ cases were CASP8+ (6/9 cases), NOTCH1+ (3/9 cases) or MLL4+ (4/9 cases). FAT1 mutations in the same cases were accompanied by a specific nonsense variant in UNC13C (3/9 cases) and infrequently by a TRPM3 heterozygotic variant (1/9). In the LGD cluster, case #5 did not show the same mutational signature and had PIK3CA/ATM/FGFR2/JAK3/MLH1+ variants without FAT1/NOTCH1/CASP8 signature or TP53+. Only one HGD case had a FAT1 mutation accompanied by a TP53 mutation (10%). Furthermore, the ALK SNP rs1569156 was present in all cases, and 9 out of 10 cases presented with homozygosis, predicting a STOP GAIN mutation. The mean number of probably pathogenic somatic variants per case was  .9, and no significant differences between groups were observed. Angiogenesis and TP53 pathways were mainly affected by pathogenic variants in the Panther analysis, while cellular and biological regulatory processes were the most affected in the biological process analysis (Table 4 and 5). Over-representation analysis showed that TP53 feedback loop 2 and     LGD and HGD cases are depicted in the first line, and red numbers correspond to the HGD-like cluster cases. The first section presents the clinical diagnosis and history of alcohol and tobacco consumption. The second section describes gene alterations based on mutations with a VAR ≤0.1, accompanied by a percentage compared to OSCC 24 . RED/pink box: nonsense/frameshift and splice-site variants. Blue/Light blue box: "Missense/in-frame indel variants". "1<" define a gene affected more than once by different variants. The total mutation accounts for the lowest number of genes affected.

Discussion
We described the mutational landscape of OED based on OSCC-affected genes and common cancers. This is the first study to compare the OED genetic landscape with the histopathological diagnosis of dysplasia using a binary grading system. It is evident that the LGD and HGD exhibit similar mutational landscapes. A significantly higher total number of variants in HGD has been described already 12 . The exclusion of passenger variants provides insights into the molecular signature of these cases. The ≈7.9 affected genes per sample were less than those reported by Villa et al. 10 (13 genes per patient). They characterized histopathological "Keratosis of Unknown Significance" (KUS) and moderate to high-grade dysplasia, demonstrating similar landscapes but different risks and times of progression to OSCC.
OED is strongly associated with tobacco consumption, which induces mutations associated with nucleotide transversions, as revealed in a series of OSCC. 24 In our study, the variants tended to accumulate in CHR 2, 9,4,7,and 15. Zhang et al. showed that these CHR presented a loss of heterozygosity and microsatellite instability when evaluating OED and HNSCC. 19,31 In the present study, a significantly higher number of variants in the FLT3 gene were observed in the HGD group. FLT3 is the most commonly mutated gene in acute myeloid leukemia and represents an important potential target in acute myeloid leukemia. 32 Mutations in FLT3 have been found in OSCC 33 ; however, there is little evidence. FLT3 is a transmembrane ligand-activated receptor tyrosine kinase that plays an important role in the expansion of multipotent progenitor cells in the bone marrow. 32 Hierarchical clustering showed well-delineated groups of "HGD-like cluster" and "LGD-like cluster" LGD-like clusters by divisive clustering of the imputed variants. As we can observe, there are 2 LGD cases classified in the "HGD-like cluster". This might be explained by the fact that the two cases already had molecular alterations consistent with HGD. However, these molecular characteristics do not affect the morphological architecture of dysplasia. This observation needs to be confirmed in a larger case series.
Specific variant distribution could not be demonstrated between "common variants between cases" and "rare variants between cases" clusters due to a lack of a more representative sample size. Most of the homozygote variants in the "common variants between cases" cluster were missense variants that might probably be germline demographic polymorphisms unrelated to the development of OED. For example, most of the HNF1A gene variants (seven variants detected) are associated with diabetes mellitus (rs1169288, rs56348580, rs2464196, rs2464195, rs55834942, rs1169304, and rs1169305), but are still annotated in COSMIC related to cancer. This study identified that FAT1 was frequently mutated in OED, and TP53 was relatively unaltered, although its pathway was significantly affected. FAT1 is a known tumor suppressor gene that plays an essential role in cellular differentiation and cell adhesion. 34 Accordingly, FAT1 had been described to be frequently mutated in an OSCC series by Maitra et al. 24 Our study found that FAT1 was greatly affected by several mutations (9 of 10 cases), in contrast to earlier studies that found lower mutation frequencies in OED 10 . Regarding OSCC, it has been observed that FAT1 mutation is present in 44% of all patients 24 and is considered a prognostic factor for OSCC progression and invasion capabilities. 35 The prognosis of PPOL for malig nant transformation. 34 In addition, FAT1 mutations are usually associated with other OSCC mutations. 24 . In the present study, positive FAT1 cases also had CASP8 (6 of 9), MLL4 (4 of 9), NOTCH1 (3 of 9), and UNC13C (3 of 9) mutations. Maitra et al. also described a specific subgroup of OSCC characterized by the presence of mutated CASP8 with or without FAT1, representing 34% of OSCC cases, and had a better disease-free survival. 24 It seems that FAT1 alone is insufficient to explain the progression to OSCC and might justify differences in signature percentage between studies.
TP53 has an average somatic mutation count of 62-84%, depending on the study. 19,24 Villa et al. 10 reported that TP53 was mutated in 35% of KUS/severe dysplastic cases. Our series showed that TP53 was mutated in 1 of 10 cases (10%). These differences could be explained by the fact that 6 of 10 cases were LGD, which is known to have a low malignant transformation rate compared to KUS and moderate/severe dysplasia. According to the results of a previous study, C ASP8/ FAT1 positive samples might be less likely to transform into OSCC than those with TP53 mutations. 24 These findings open a field of opportunities in diagnostic, prognostic, and treatment investigations, raising alarms for those cases that are TP53+. The few cases found to be TP53+ in this study might be explained by the fact that TP53 could be a late driver mutation, only present when the risk of malignant transformation is high. In addition, TP53 might be indirectly affected by this pathway, as demonstrated by the Panther analysis.
ALK is another relevant gene that was found to be mutated in the whole series (90% homozygosis, 10% heterozygosis), presenting a specific stop-gain variant. This gene has not been reported in any OED study. However, it has been reported as a hypomethylated gene in advanced OSCC, and it has been described that its co-inhibition enhances the anti-tumor action of EGFR inhibitors. 36 Future studies should consider that most of the limitations here derive from a lack of germline sequencing, and the inclusion of the latter is recommended. The manual filtering of SNPs may be an important source of bias. No copy number variation (CNV) analysis was performed, which limited the deeper characterization of the samples. RNA sequencing may be useful for better understanding protein translation patterns during malignant transformation.