Whole exome sequencing of oral epithelial dysplasias reveals an association with new genes

Background: The genetic basis of oral epithelial dysplasias is unknown and there is no reliable method of evaluating malignant transformation risk. We understand that somatic mutations are responsible for the transformation of the dysplastic mucosa to cancer. In addition, these genomic variations could represent objective markers to determine potential for malign transformation. Material and Methods: We performed whole exome sequencing in ten samples of oral epithelial dysplasia of Brazilian and Chilean patients. Results: Based on public genetic repositories, we identied 41 deleterious variants able to produce high impact changes in amino acid structures of 38 genes. Also, variants were ltered based on normal skin and native American genome proles. Finally, a total of 13 genes harboring 15 variants were found to be exclusively related to oral epithelial dysplasias. Samples of high-grade epithelial dysplasia showed a tendency to accumulate more deleterious variants. we observed that 62% of 13 OED genes pinpointed in our study were also found in HNSCC. Among the shared genes, eight were not identied in OSCC. Conclusions: We have described for the rst time in a Latin America population 13 genes which are found in oral epithelial dysplasia where 5 genes have already been observed in oral squamous cell carcinoma. This study allowed us to identify genes that may be related to basal biological functions in oral epithelial dysplasias.


Introduction
The transition of normal epithelium to oral epithelial dysplasia (OED) and oral squamous cell carcinoma (OSCC) is the result of the accumulation of genetic and epigenetic alterations [1][2][3]. This complex relationship has not yet been clari ed at the molecular level, a fact that possibly explains many failures related to these diseases. An excellent tool for the diagnosis of other benign lesions and malignant tumors is molecular strati cation. This characterization uses last-generation technology based on sequencing and the identi cation of typical mutations, i.e., those repeatedly found in the same lesions 2,3 .
In contrast to the extensive research involving different neoplasms in advanced stages, few studies have comprehensively described the genomic changes found in precancerous lesions [4][5][6] . However, the correct characterization of the molecular alterations in potentially malignant oral disorders (PMOD) and the corresponding changes in the microenvironment associated with progression would contribute to the development of biomarkers for early detection and risk strati cation, in addition to suggesting preventive interventions to reverse or delay the development of cancer. Considering the complexity and diversity of the changes to be determined, more comprehensive methods such as next-generation sequencing (NGS) are necessary [7][8][9] .
Whole genome sequencing of OED lesions was rst performed in 2009. That study demonstrated genomic imbalances in lesions with a higher risk of malignant transformation. The study also showed that the genomic pro le of low-grade OED lesions that progressed to OSCC more closely resembled that of high-grade OED than that of lesions with the same histopathological diagnosis that did not progress to OSCC 10 . Another study suggested that most genomic alterations that lead to oral cancer occur in prior stages and are the result of gradual random accumulation rather than of a single event 11 .
In view of the paucity of publications on the application of large-scale sequencing in PMODs and the lack of use of this technology in the Latin American population, the aim of the present study was to identify genomic alterations by whole exome sequencing in 10 low-and high-grade OED samples obtained from Brazilian and Chilean patients with a clinical diagnosis of leukoplakia. Knowledge of DNA variations in these samples may indicate new genes associated with malignant transformation and new therapeutic targets for OED lesions.

Patient samples
The study was approved by the Ethics and Biosafety Committees of the School of Dentistry, University of Chile (FONDECYT No. 11140281), and was conducted in full accordance with local ethical guidelines and with the Declaration of Helsinki. The patients weren´t direct involved in this study. Ten samples of OED, 6 classi ed as low-grade dysplasia (LGD) and 4 as high-grade dysplasia (HGD), were selected from the databases of the Pathological Anatomy Service of University of Chile, Federal University of Pelotas and Federal University of Bahia. The histopathological diagnosis of OED was made by a specialist using the binary system described by Kujan et al 12 . The selected samples had a clinical diagnosis of leukoplakia according to the criteria of Van der Waal 13 . Oral medicine specialists performed the clinical and biopsy assessments. Genomic DNA extraction Genomic DNA was extracted from the para n-embedded tissue samples following the instructions of the kit's manufacturer (Puregene® DNA Puri cation Tissue Kit, Gentra Systems, Inc., Minneapolis, MN, USA). The DNA yield ranged from 0.2 to 2.0 µg. After processing each sample, 20 µl of the solution was obtained and an aliquot of 1.0 µl complemented with 99 µl Milli-Q→ water was analyzed in a spectrophotometer (DU-640, Beckman, Palo Alto, CA, USA) to verify the quantity and purity of each DNA sample. The genomic DNA was stored at -80 ºC. Library preparation and sequencing Whole exome sequencing of the 10 samples was carried out by Macrogen Inc., Seoul, South Korea. The SureSelect XT Library Prep Kit (Agilent Technologies, CA, USA) was used for library preparation and exome sequence capture of the DNA samples. The sequencing library was prepared by random fragmentation in each DNA sample, followed by 5' and 3' adapter ligation. The adapter-ligated fragments were then ampli ed by PCR and puri ed on gel. A post-capture classi cation protocol, SureSelectXT Target Enrichment System for Illumina Version B.2, was used to ensure high e ciency and coverage. The exome libraries were sequenced using 101-bp paired-end reads in a Hiseq-2500 sequencer (Illumina→), with a target sequencing depth of at least 100x. The total number of bases, reads, GC (%), Q20 (%), and Q30 (%) was calculated for the 10 samples. The GC content was 48.78% and Q30 was 95.59%.

Data analysis
The data were analyzed in collaboration with the BioinfoGP group (Spanish National Biotechnology Centre, CNB-CSIC, Spain). The GATK work ow was used for variant calling 7,14 , followed by quanti cation of the variants detected, comparative statistical analysis by group based on the presence or absence of each variant, functional annotation, variant ltering, and format of the nal data in order to generate human readable information for the end-user.
The quality of the raw sequences was analyzed using the FastQC software 8 . The raw sequences were then aligned with BWA-MEM 9 against the human reference genome (Ensembl release GRCh38.91 15 ) using parameters per default. MarkDuplicates, BaseRecalibrator and ApplyBQSR routines from GATK were applied to detect read duplicates and to recalibrate alignment qualities, respectively. Recalibration was based on the 1000 genomes gold standard provided by GATK.
The HaplotypeCaller and GenotypeGVCF GATK functions were used for SNP/indel calling and genotyping. Annotations of recalibration and variant ltering were added. Recalibration with the GATK routines VariantRecalibrator and ApplyVQSR was based on HapMap 16 , 1000 genomes high con dence omni SNPs 17 and dbSNP 18 . Variants from each sample were then combined into a single le for comparative analysis. The Case Control routine included in SnpSift 19 was used to detect variants with differential occurrence between high-and low-risk samples. P-values were obtained for different genetic models. To determine if the detected variants are germ-line, genomic sequences of unrelated subjects were analyzed. The exome sequences of induced pluripotent stem cells from skin biopsies of healthy volunteers (PRJEB11751 32 ) were ltered employing the variant coordinates after transformation to equivalents in the human genome reference GRCh37 (https://genome.ucsc.edu/cgi-bin/hgLiftOver). In contrast, four Native American ancestral individuals (PRJEB24629 33 ) were analyzed with the GATK software and the obtained variants were contrasted with the coordinates of the variants found in the present study. The pipeline employed GATK functions MarkDuplicates, HaplotypeCaller, SelectVariants, and VariantFiltration, similarly as described previously. Finally, VariantRecalibrator and ApplyVQSR were employed with known variants from the dbSNP version 138, 1000 Genomes phase 1, 1000 Genomes OMNI 2.5, HapMap 3.3, Mills gold-standard, and Axiom Exome Plus. Data obtained from https://console.cloud.google.com/storage/browser/genomicspublicdata/resources/broad/hg38/v0.
In addition, the effects of the sequence variants were evaluated using the following computer programs: PANTHER 34 , STITCH 35 , and PMut 36 .
Speci c variant calling of HPV DNA sequences was performed for more than 170 subtypes, including high-risk HPV strains 37 . Data was uploaded at European Nucleotide Archive, https://www.ebi.ac.uk/ena under the accession number PRJEB42475.
The relationship between the number of variants per sample and per group was analyzed using the chi-squared (X 2 ) and Fisher's exact tests. The correlation between the total number of variants and the degree of dysplasia was determined by Spearman's test. All statistical calculations were performed using GraphPad Prism 6.03 (San Diego, CA, USA) and signi cance was established at p < 0.05.

Results
The clinical and histopathological diagnostic data of the samples are systematized in Table 1 and Fig. 1. A total of 3.055.651 variants were identi ed and analyzed in the 10 OED samples; of these, 90.4% (2.761.210) were single nucleotide variants (SNV). There were 1.069 variants that showed the highest probability of causing changes with a low, moderate and high impact on amino acid structures. These variants were responsible for changes in 773 genes (Table S1), with the impact on amino acid structures being low in 416, moderate in 319, and high in 38.
Among the 773 altered genes in the OED samples, the molecular functions clustering the largest number of genes evaluated are binding and catalytic activities, with 196 (25%) and 171 (22%) genes, respectively. These genes participate in more than 60 signaling pathways; however, the pathways clustering a larger number of genes are the Wnt and integrin signaling pathways, involving 15 and 14 genes, respectively.
It was estimated that 41 variants had a high impact on the amino acid structure of 38 genes. A list of these variants with detailed information is given in Table S2, highlighting six variants that have not been described. Among the variants with a high impact, 20 (48,9%) were SNV and 19 (46%) include a frameshift variant as the calculated functional consequence. Table 2 summarizes all high-impact variants per sample. The mean number of variants was higher in HGD samples than in the LGD group (p > 0,05). Figure S1 shows the genes harboring variants with a high impact on amino acids, chromosome location, samples affected, and genotype. Among all genes identi ed, 18,4% were detected exclusively in LGD samples, 44,7% only in HGD samples and 36,8% altered genes were identi ed both samples ( Figure S2).
In order to exclude other putative germ-line variants, the normal skin series (HIPSCI) database was accessed 32 , remaining 22 genes harboring 24 variants (Table S2). In addition to the HIPSCI database, PRJEB24629 series 33 was also analyzed remaining 13 genes harboring 15 variants exclusive from our samples of OED (Table 3).
Finally, taking into account described mutated genes in head and neck malignant tumors, Table 4 shows the shared genes identi ed between the present study with mutated genes in all types of head and neck cancer, head and neck squamous cell carcinoma (HNSSC), and those reported in OSCC.

Discussion
Considering the paucity of publications in the literature applying NGS to OED samples, the present study contributes to the description of genomic alterations in the exome of 10 samples from Chilean and Brazilian patients with leukoplakia associated with low-and high-grade OED. In view of the di culty in obtaining viable tissues and of the small size of PMOD samples for whole genome or exome analysis 10,11,38 , previous studies employing this technology used OSCC samples and selected OED areas adjacent to the tumor 11,38 . Since the genomic pro le of PMODs is unknown, it is possible that tissues containing OED areas extracted from OSCC samples may exhibit genetic alterations that are not typical of dysplasia lesions. Within this context, despite the di culties in obtaining samples with su cient quality and quantity to perform NGS, the present study included samples from patients clinically diagnosed with different types of leukoplakia that were compatible with low-and high-grade OED.
The advantage of the present study is the good representativeness in the correlation of the genomic data found in OED with clinical features.
Most variants identi ed in the present study were SNV, in agreement with the literature which describes this alteration as the most common in whole genome or exome analysis 39 . Although indel variants were less frequent than SNV, the former are in general extremely important in NGS since they are implicated in many constitutional and oncological diseases 40 . The SNV and indel data for each sample were ltered using databases of already described human genetic variants in order to remove all known germinal variants. The highest signi cant average of SNV was observed in the HGD group and this result is supported by studies showing that pre-cancerous lesions are characterized by progressive changes in the DNA sequence, gene expression and protein structures, as well as by microscopic rearrangements 4,5,41,42 . In addition, a previous studies also reported a smaller number of mutations in LGD sample when compared to HGD and OSCC 11 .
Most of the genes identi ed in the present study, with a high impact on the amino acid structures, are related to metabolic functions such as binding and catalytic activities and participate in the Wnt and integrin signaling pathways. Functional dysregulation of the Wnt signaling pathway has been shown to promote the development and progression of oral cancer. It is therefore an interesting target to elaborate treatment strategies for this cancer 43,44 . Integrins, the main components of cell adhesion, have been implicated in almost all stages of cancer progression from the development of the primary tumor to metastasis 45 .
We identi ed six new variants, including three SNV with functional consequences at splice acceptor and splice donor sites and three deletions that lead to changes in the reading frame. Harboring these variants, the GAREM1, GIPC1 and LRRC37A2 genes are associated with mutations described in HNSCC and OSCC 46-49 .
Like HGD samples that accumulated a larger number of total variants, the same trend was observed when only high-impact variants were considered; however, this association was not signi cant. This nding agrees with a previous study that reported a smaller number of mutations in LGD samples compared to HGD 11 . Regarding the clinical characteristics of the patients, it was not possible to establish a relationship with the variants found since the sample size was too small for this type of correlation. The correlation between clinical characteristics and genomic variations has not yet been established in the literature.
Regarding the CELA1 gene, it is important to note that this study detected three variants with a high impact on this gene, which were identi ed in the same samples, with the same type of inheritance. The CELA1 gene, also known as ELA1, encodes elastase-1 and is localized on chromosome 12q13, near the locus for diffuse non-epidermolytic palmoplantar keratoderma. Expression of this gene was observed in cultured human primary keratinocytes 50 .
It has long been known that the distribution of mutations in the genome is not completely random. In the present study, the observation of variants that affect CELA1 in the same group of samples may be explained by the phenomenon of mutation showers that is not yet fully understood. This phenomenon is characterized by the simultaneous presence of multiple mutations in the same gene or in small regions of the chromosomes 51 . There are still not many studies that explain or associate these alterations with cancer; however, analysis of available mutation catalogs revealed clustered mutagenesis in multiple myeloma and prostate and head and neck tumors 51 .
Comparing the most severely affected genes identi ed here with the mutated genes reported in the study of Wood et al. 11 , although they are different variants, a match was found with the mutated WNK1 gene only in OED samples, with the mutated MCF2L gene in OED and OSCC samples, and with the mutated LAMA5, FARP1 and SHANK2 genes exclusively in OSCC samples 11 . The observation of this coincidence in only two mutated genes in OED might be explained by the fact that, contrary to the present study that used clinically and histopathologically representative samples, Wood et al. 11 extracted OED areas from OSCC samples, which may increase the probability of molecular differences. In fact, although reporting fewer mutations in OED samples than in OSCC samples, in that study most of the mutations detected in OED were also observed in OSCC 11 .
In the present study, no normal paired controls were available for WES but we used bioinformatics methods to remove false-positive variants. In order to address similarities and differences with normal epithelial tissue, the HIPSCI genome database 32 was analyzed and we could remove variants found to be germ-line that were not ltered within other public genetic repository. It is hard to explain how this variants now found in normal skin were not previously found after ltering using tools such as 1000 genomes, cosmic, dbSNP, etc. O´Huallachain et al 52 con rmed the presence of a high number of variation in somatic tissue. This can be, in part explained by the theory that choosing the relevant tissue for comparison of genomic pro les might in uence data analysis.
It is also important to understand the history and diversi cation of human populations in the southern tip of the Americas. South American population has an unique genetic conformation composed by pre-Columbian and post colonization. This heterogeneity could play an important role in explaining the number of variants found in this study after performing variant calling based on international databases, including the HIPSCI normal skin genomes 32 . To address this point we have used the only available genome pro les of native Americans representing the pre-Columbian southerners. Interestingly, 38% of variants not described either in public genetic repositories or in the normal skin database were found in the native American genomes and those could be considered, thus, germ-line. It also important to note that these "southern variants" are localized within the same mutated genes referred in previous studies of OED, such as Wood et al 11 . Also, a few genes are considered as harboring a high malignant potential 11,38 .
It was also important to evaluate the role of this 13 OED genes in malignization. Like The Cancer Genome Atlas (TCGA) that was established for consultations on the genomic diversity of various types of cancer, the Pre-Cancer Genome Atlas (PCGA) project was started in 2016. However, mutated genes for any of the lesions that precede different types of cancer, including OED, are not yet available on platforms 4,5,42 . Based on TCGA 46-49 , we observed that 62% of 13 OED genes pinpointed in our study were also found in HNSCC. Among the shared genes, eight were not identi ed in OSCC.
It is important to mention that 9 of the 11 altered genes identi ed in more than half of the samples of this study are also mutated in HNSCC and 6 in OSCC. The study of Wood et al 11 adds SHANK2 and FARP1, mutated in OSCC, and MFC2L, mutated in OSCC and OED, which were also altered in the present samples and in other studies on HNSCC 46,47,49 and OSCC [46][47][48][49] . Similarities in the genomic pro le of OED and cancer have been described for intestinal, breast, brain, kidney, lung and skin epithelium, showing that the mutational process can cause the clonal evolution from normal to neoplastic cells 4,5,41 . However, Wood et al. 38 observed subclonal heterogeneity with OSCC in ve OED samples and suggested that mutational changes in stages prior to cancer do not predict the onset of invasion 38 .
Already in 2009, a study demonstrated completely different genomic pro les of OED that progress to OSCC compared to other OED that do not progress to cancer despite histological similarities 10 . Despite this observation, 10 years after these discoveries, the histopathological diagnosis, including the identi cation of different stages of OED, continues to be the standard complementary test for outlining the risk of progression and treatment decisions for PMODs. However, this method remains subjective and diagnostic agreement between pathologists is low. In addition, regardless of their degree, not all OED progress to OSCC and this information cannot be obtained by histopathological analysis. Given this current scenario, our study describes 13 genes hourboring 15 variants, providing relevant information on the genomic characterization of OED. Despite the small number of samples, the use of a sample comprising a heterogeneous population and of an in-depth genomic evaluation method, which currently has an extremely low error rate, allowed us to establish highly reliable results. On the other hand, it is important to complement the main data found in prospective multicenter studies using a larger number of samples and including validations and healthy controls.

Declarations
Con ict of interest statement: Con ict of Interest: The authors declare that they have no competing interests.
Ethical approval: All procedures performed in this study were in accordance with the ethical standards of the University of Chile (Approval N o. 2014/29) committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. The patients weren´t direct involved in this study.