Identification of susceptibility genes in non-syndromic cleft lip with or without cleft palate using whole-exome sequencing

Background Non-syndromic cleft lip with or without cleft palate (NSCL/P) is among the most common congenital malformations. The etiology of NSCL/P remains poorly characterized owing to its complex genetic heterogeneity. The objective of this study was to identify genetic variants that increase susceptibility to NSCL/P. Material and Methods Whole-exome sequencing (WES) was performed in 8 fetuses with NSCL/P in China. Bioinformatics analysis was performed using commercially available software. Variants detected by WES were validated by Sanger sequencing. Results By filtering out synonymous variants in exons, we identified average 8575 nonsynonymous single nucleotide variants (SNVs). We subsequently compared the SNVs against public databases including NCBI dbSNP build 135 and 1000 Genomes Project and obtained an average of 203 SNVs. Total 12 reported candidate genes were verified by Sanger sequencing. Sanger sequencing also confirmed 16 novel SNVs shared by two or more samples. Conclusions We have found and confirmed 16 susceptibility genes responsible for NSCL/P, which may play important role in the etiology of NSCL/P. The susceptibility genes identified in this study will not only be useful in revealing the etiology of NSCL/P but also in diagnosis and treatment of the patients with NSCL/P. Key words:Non-syndromic cleft lip with or without cleft palate, whole-exome sequencing, sanger sequencing, susceptibility gene, single nucleotide variants (SNVs).


Introduction
Non-syndromic cleft lip with or without cleft palate (NSCL/P) is one of the most common human birth defects in all populations worldwide. The average birth prevalence of NSCL/P is estimated to be 1/700 ranging from 1/500 to 1/1,000, depending on the economic, geographical, and genetic back-ground variation (1). Individuals with NSCL/P may experience problems with feeding, digesting, speaking, hearing and social engagement which can be corrected to varying degrees by multiple rounds of surgical repair along with dental treatment, speech therapy and psychosocial intervention from birth until adulthood. Furthermore, affected children have higher morbidity and mortality throughout life than do unaffected individuals (2). Therefore, NSCL/P will bring substantial psychosocial and financial burdens to individuals, their families, and society, which creates a major social problem (3). The etiology of NSCL/P has been a focus area of research for centuries. The etiology of NSCL/P is complex and is currently considered to be due to a combination of both environmental and genetic factors (4), which is distinct from that of syndromic CL/P (SCL/P). SCL/P has been shown to associate with single-gene mutations, chromosomopathies, and exposure to teratogens (5). However, the etiology of NSCL/P remains unknown because of its etiologic complexity (6). Furthermore, NSCL/P is the most prevalent type of CL/P and approximately 70% of cases are considered to be NSCL/P (7,8). Therefore, more studies are necessary to elucidate genetic risk factors related to NSCL/P. To date, approaches to uncover genetic determinants controlling risk of NSCL/P include linkage analysis, association studies, identification of chromosomal anomalies or microdeletions in cases, and direct sequencing of DNA samples from affected individuals (9). Various loci could have a causal role in NSCL/P including certain regions of chromosomes 1, 2, 4, 6, 14, 17, and 19 in previous human studies (10,11). Variants of candidate genes linked to NSCL/P including TGF-α, TGF-β3, MSX1, IRF6, TBX22, CYP1A1, MTHFR, and RARA have been identified in population-based studies (1,4,(12)(13)(14). Although many genes and regions of chromosomes have positive results in one or more of the studies, few of those findings were consistently positive across all studies. The methods to genotype all single nucleotide polymorphisms (SNPs) using genotyping arrays greatly facilitate the development and progress of genome-wide association studies (GWAS) (15) . GWAS have become a powerful and ideal technique in studying complex human diseases. To date, there have been a number of GWAS for CL/P (16)(17)(18)(19). The results of the GWAS are encouraging. GWAS are novel and rapidly advancing approach in the study of molecular genetics underlying human diseases and have led to many exciting discoveries. However, traditional array-based GWAS primarily test the common disease-common variant hypothesis (15). Therefore, GWAS are limited to known and common variants. Only a fraction of the genetic basis of common diseases can currently be explained by associations with common variants (20). Uncommon and rare variants may play a major role in many phenotypes. Recently, next-generation sequencing (NGS) technology has developed rapidly and has been suggested as a reliable method for finding variations associated with various disease states. NGS technology enables researchers to interrogate the complete human genome or exome for the detection of both common and rare variants, hence improving the chance of finding disease causal variants (21). Whole-exome sequencing (WES) is both a costeffective and a time-efficient alternative to whole-genome sequencing, so the method is becoming a popular NGS strategy by capturing and sequencing the ~1% of the human genome coding for protein sequences. Ng SB and colleagues first showed that WES can be used to identify disease genes in 2009 (20). Now, WES has been successfully utilized to identify the causative genetic variants responsible for Mendelian disorders and complex hereditary diseases (22,23). One recent study has conducted a WES study to search for potentially causal variants using affected relatives drawn from multiplex cleft families (24). The oral cleft families include two of Chinese origin (one each from Taiwan and Shanghai) with four subjects. In this study, we conducted WES in eight fetuses with NSCL/P in China. The purpose of this study was to illuminate the susceptibility genes related to NSCL/P.

Material and Methods
-Subjects and samples Eight fetuses with NSCL/P were recruited from Maternal and Child Health Hospital of Xuzhou, Jiangsu, China. The cases of NSCL/P were diagnosed by prenatal B-ultrasonography and confirmed by autopsy. The parents of the fetuses were informed of the purpose of the study. The signed informed consent sheets were obtained from the parents of all study participants. After obtaining informed consent, the lip tissue specimens of the fetuses were obtained by sectioning the lip tissue around the cleft of the lip. The study was approved by Ethical Committee for Human Research of Xuzhou Medical College.
-DNA preparation and library construction Genomic DNA was extracted from the tissues using QIAamp DNA Mini Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocols. The quality of the DNA sample was ascertained by electrophoresis and determined to be of high molecular weight with no visible degradation. Quantity was determined using standard PicoGreen assays. A high quality DNA was used as the starting material. The DNA was fragmented by Bioruptor Sonicator (Diagenode, USA). High-quality DNA library was constructed from genomic DNA using the TruSeq DNA Sample Preparation Kit for end repair, dA tailing, adaptor ligation and DNA fragment enrichment.
-Exome capture and Illumina exome sequencing The TruSeq Exome Enrichment Kit was used to capture exome sequences from the prepared human DNA library. After two rounds hybridization and washing, the DNA exome library was captured and enriched. The size was checked using a DNA specific chip, the Agilent DNA 1000 on Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). The libraries were quantified using Qubit (Invitrogen, CA, USA) according to the manufacturer's instructions. All sequencing of post-enrichment shotgun libraries was carried out on the Illumina HiSeq 2000 Analyzers for 90 cycles per read following the manufacturer's protocols and using the standard sequencing primer.
-Data filtering and bioinformatics analysis After the entire run was completed, image analyses, error estimation and base calling were performed using the Illumina Pipeline (version 1.3.4) to generate primary data. Indexed primers were used to identify the different reads from different samples in the primary data. Only reads that were perfectly matched to the theoretical adapter indexed sequences and reads that matched the theoretical primer indexed sequences with a maximum of three mismatches were considered to be acceptable reads. The local dynamic programming algorithm was used to remove a few unqualified sequences from the primary data using. After filtering, for aligning the sequences to the human genome, the Burrows-Wheeler Aligner (BWA) was used to map short-reads from fastq files against the human reference genome. Then, the Genome Analysis Toolkit 1.6 (GATK) was applied to carry out regional realignment and quality score recalibration and Picardwas applied to mark duplicates. Variants were called using the GATK variant calling program (Fig. 1). The ANNOVAR (Annotate Variation) was used to annotate possible mutation candidates. The variants were compared with NCBI dbSNP 135 and 1000 Genomes Project reported SNP to remove the published variants.
-Sanger sequencing The Sanger DNA sequencing method has been consid- Fig. 1. Variants analysis strategy for NSCL/P using GATK tools. Data analysis process was carried out according to step 1-4 including initial mapping, refinement of the initial reads, multi-sample indel and SNP calling and quality score recalibration in phase 1-3 including NGS data processing, variants discovery and genetyping and integrative analysis. The SNPs in the exomes were identified. e766 ered to represent the gold standard for screening variants of genes of interest in many scientific fields. In this study, to verify the DNA sequence variants detected by WES, the variants of the affected individuals were sequenced using Sanger sequencing. The primers were designed for the target region using Primer 6.0. All exons of selected genes were amplified. The PCR products were sequenced on an ABI 3730 DNA analyzer following standard procedures (Life Technologies, USA). The sequence reads were analyzed using the Sequencher software package (GeneCodes Inc, USA). All sequence traces were manually reviewed to ensure the reliability of the genotype calls. The sequencing traces were visually inspected in Finch TV v1.4 (Geospiza Inc, USA).

Results
A strategy of WES by hybrid capture was employed (Table 1). The average total effective data were 12. An average of 138766 high-confidence single nucleotide variants (SNVs) per sample was found after the application of initial quality filtering criteria (coverage≥5X and SNP quality score≥40) following the manufactur-er's protocols. Then, we used the analytical procedures to significantly reduce the candidate list and focus on variants that were most likely to be relevant to NSCL/P. On average, a total 18759 SNVs per sample remained after filtering the SNVs not in exons. By filtering out synonymous variants in the coding region unlikely to be disease-causing variants, average 8575 exon nonsynonymous SNVs remained predicted to change an amino acid or a consensus splice site. The variant of IRF6, a known gene causing NSCL/P (12,14,25) , was included in these variants. Several other reported candidate genes including CRISPLD2, MYH9, PDGF-C and ABCA4 were also identified (16,(26)(27)(28). NCBI dbSNP 135 and 1000 Genomes Project were utilized for filtering variants to remove presumably common variants. Each sequenced individual harbored an average 203 SNVs ranging form 167 to 231 ( Table 2). In order to verify the DNA sequence variants detected by WES, the SNVs including some reported and novel genes through conventional Sanger sequencing method. In total, 12 reported candidate genes including IRF6, CRISPLD2, MYH9, PDGF-C and ABCA4 were detected by Sanger sequencing (Table 3). Except for the reported genes, we found 16 novel SNVs were concordant with the gold standard method Sanger sequencing. The 16 SNVs were present in two or more samples (Table 4, Fig. 2).  shared by two or more affected samples which might play an important etiological role in NSCL/P. The study uncovered a list of susceptibility genes of NSCL/P and highlighted the important role of WES in identifying the etiology of NSCL/P. Owing to its genetic heterogeneity and departure from Mendelian inheritance patterns, the identification of causative variants of NSCL/P has remained elusive.

Discussion
With the recent development of innovative approaches, whole-genome and WES projects become available now. WES has recently been successful in identifying causative genetic variants for Mendelian traits including Miller syndrome and Kabuki syndrome (22,23). In one recent study, the WES was used to study potentially causal variants in affected relatives drawn from multiplex cleft families (24). In our present study, we applied WES to identify genetic variants in Chinese subjects. We have shown WES may serve as a cost-effective and reproducible strategy for the identification of variants causing protein-coding changes in human genomes of NSCL/P. The average mean coverage sequencing depth on official target was 101.1-fold and with the average coverage of target region was 95.8%. Therefore, the coverage should have been adequate to reliably detect DNA variants within the majority of the targeted regions. After filtering the SNVs not included in exons, the average number of SNVs per sample was 18,759 and the number of nonsynonymous SNVs was 8,575. This study indicates that the current WES protocol can dissect the genetic etiology of NSCL/P. We speculate WES will be used in diagnosis of NSCL/P. The susceptibility genes of NSCL/P are enormously diverse and complex screened by WES. The key challenge is how to identify the candidate genes from the large number of variants. In the present study, the average SNVs per sample were 18759 after filtering the SNVs not in exon. Even excluded the synonymous SNVs in the coding region, total 8575 exon nonsynonymous SNVs remained. We then applied NCBI dbSNP 135 and 1000 Genomes Project for filtering of variants to remove common variants. The availability of NCBI dbSNP 135 and 1000 Genomes Project is clearly helpful in generating a catalogue of common variation and necessary for filtering out them. Combining the two catalogues, the number of candidates reduced considerably and the susceptibility genetic variants list could be narrowed. Each sample carried 167 to 231 novel genetic variants with the average of 203.
In this study, the reported candidate genes were validated using Sanger sequencing. The results were consistent with some previous findings. The IRF6 variant known to be associated with NSCLP was also identified in these NSCLP-affected fetuses. In the studies of candidate genes for NSCL/P, many genes had positive results in one or more of the studies (5,11,13). However, only the positive association between IRF6 variants and NSCL/P has been confirmed in multiple populations of different ancestry (12,14,25). IRF6 is a key determinant of the keratinocyte proliferation and differentiation switch and also plays a key role in the formation of oral periderm, spatio-temporal regulation of which is essential in ensuring appropriate palatal adhesion (29). The studies confirmed IRF6 gene was one of the main candidate genes associated with NSCLP. With the exception of IRF6, many other candidate genes were present in the susceptibility genes identified by WES. CRISPLD2 is expressed in the craniofacial region during critical time points of palatal fusion. Genetic variation in CRISPLD2 has been shown to have a role in the etiology of NSCLP (27). MYH9, the gene coding for the heavy chain of non-muscle myosin II, has been considered as a good candidate gene in NSCL/P on the basis of its expression profile during craniofacial morphogenesis (26). The PDGF-C is essential for palatogenesis and is associated with NSCL/P (28). ABCA4 encodes an ATP-binding cassette transporter. A genome-wide association study of NSCL/P identifies risk variants near ABCA4 with stronger signals in Asian compared to European populations (16). The candidate genes associated with NSCL/P reported previously were identified in this study. The study suggests that direct sequencing of exons of small numbers of individuals can serve as a useful approach for identifying genetic variants of NSCL/P. As the costs of WES become more reasonable, we believe the project can be useful in clinical diagnosis of NSCL/P. However, this study failed to identify some reported candidate genes including FOXE1, NOG and VAX1. The gene FOXE1 variants involve all combinations of primary and secondary palatal clefts and have a significant role in the etiology of NSCL/P (30). Mangold and colleagues identified two candidate genes NOG and VAX1 were associated with NSCL/P in populations from European ancestry (19). The failure to detect the association of the genes FOXE1, NOG and VAX1 with NSCL/P may due to small size of the study or genetic heterogeneity. In this study, Sanger sequencing was further applied to validate some novel genes. In total, 16 SNVs were present in two or more samples. The 16 SNVs may be predicted to lead to a significantly changed protein product. Then, the 16 nonsynonymous variants shared by two or more samples are highly likely to be causative for NSCL/P. Therefore, WES and Sanger sequencing confirmed 16 different types of prioritized SNVs associated with NSCL/P. Our study presented a preliminary overview of the genetic etiology of NSCL/P. The SNVs identified in the study may play important roles in susceptibility to NSCL/P. Although we were unable to confirm rigorously whether any of these genes in-deed contributed to NSCL/P, our results provided a prioritized shortlist for further association and validation studies. We speculate the susceptibility genes identified by WES may also contribute in some way to NSCL/P. These new candidates are needed be evaluated in future studies to establish their true relevance to NSCL/P susceptibility. Although the exact biological mechanism of the susceptibility genes involved in NSCL/P pathogenesis still needs to be clarified, genetic findings in this study will provide helps in diagnosis, counseling and treatment of NSCL/P. In this study, we have demonstrated WES represents a cost-effective, reproducible and robust approach for revealing the etiology of NSCL/P. We have successfully applied WES for genetic variants screening of NSCL/P. Our study has generated a list of susceptibility genes of NSCL/P to lead to a full understanding of NSCL/P and improved information and clinical applications for affected individuals. The susceptibility genes identified in fetuses affected by NSCLP may lead to improved diagnosis, counseling and treatment.