Exome-based Variant Detection in Core Promoters

Core promoter controls the initiation of transcription. Core promoter sequence change can disrupt transcriptional regulation, lead to impairment of gene expression and ultimately diseases. Therefore, comprehensive characterization of core promoters is essential to understand normal and abnormal gene expression in biomedical studies. Here we report the development of EVDC (Exome-based Variant Detection in Core promoters) method for genome-scale analysis of core-promoter sequence variation. This method is based on the fact that exome sequences contain the sequences not only from coding exons but also from non-coding region including core promoters generated by random fragmentation in exome sequencing process. Using exome data from three cell types of CD4+ T cells, CD19+ B cells and neutrophils of a single individual, we characterized the features of core promoter-mapped exome sequences, and analysed core-promoter variation in this individual genome. We also compared the core promoters between YRI (Yoruba in Ibadan, Nigeria) and the CEU (Utah residents of European decedent) populations using the exome data generated by the 1000 Genome project, and observed much higher variation in YRI population than in CEU population. Our study demonstrates that the EVDC method provides a simple but powerful means for genome-wile de novo characterization of core promoter sequence variation.


Results
Characterization of promoter-mapped exome sequences. We analysed the relationship between core-promoters and exome sequences. As a model for the study, we used three sets of exome sequences generated independently from human CD4+ T cells, CD19+ B cells and neutrophils of a healthy Caucasian male 19 . The use of these three sets of exome sequence data from the same individual allowed us to measure the reproducibility of results. We performed the following analyses: 1. Portion of exome sequences from promoter region. We mapped the three sets of exome sequences to the human reference genome (hg19) to determine the origins of the exome sequences (Table 1). Many genes have multiple TSS sites, and the region downstream TSS sites is also involved in transcription initiation 20 . Therefore, we included the region between −500 to + 100 for the measure. The results showed that on average, an exome data set contained over six million bases (1.63% of the total exome data) originated from promoter region between − 500 to + 100 (Table 1); 2. Distribution of promoter-mapped exome sequences. Between − 500 to + 100, the promoter-mapped sequences had the highest abundance at + 100 and then decreased towards more upstream. At − 100 upstream TSS, there were on average around 2 million bases from promoter-mapped exome sequences in each exome data set (Fig. 2); 3. Number of promoters detected by exome sequences. The highest number of promoters detected was at + 100, covering 85.1% promoters for the 20,794 genes targeted by the exome kit. The rates decreased towards upstream. At − 100 upstream TSS, on average 78.7% of the promoters were detected in each exome data set ( Table 2); 4. Variant call and sequence coverage. We called the variants from promoter-mapped sequences from high It shows that random fragmentation of genomic DNA generates the DNA templates connecting core-promoters and coding exons. Exon-targeting probes will isolate such templates. Sequencing the isolated templates will result in the sequences originated from core-promoters.

Average (%)
Promoter (− 500 to + 100) * coverage till single sequence coverage. The number of called variants increased smoothly from >= 10 to 3 sequences. Thereafter, the number of called variants increased substantially. However, the reliability of such increased variants is questionable as they could be originated from sequence errors in the 2 to 1 mapped sequences (Fig. 3).
Based on these results, we conclude that the promoter-mapped exome sequences are most suitable to analyze core promoters between − 100 to + 100, and a minimum of four-sequence coverage should ensure the reliability of the variants called from the promoter-mapped sequences.
Variants called from the core promoters in the individual genome. Using the conditions set above, we analyzed the core promoter variation in this individual genome using hg19 as the reference. We identified 291 variants distributed at different frequencies within the core-promoter region (Fig. 4A   and T), but substantial differences existed for the SNP sites, with the highest A to G transition (Ts) for 35 times, and the lowest A to T transversion (Tv) for only once. Ts/Tv ratio was 2.82, which is higher than the 2-2.1 across the entire human genome (Fig. 4B). 3. Genes with variable core promoters. The 291 variants were distributed in the core promoters of 241 genes.
The core-promoters in multiple genes, including CRIP1, HLA-F-AS1, ISG15, OR1N2, and TREML4, were highly variable with three to six variants.  Table 1). For example, two variants of A to G and A to C were at the third and fourth positions of a putative INR motif (CCAATTC) in the core promoter of C12orf10 (Fig. 5). We did not find any variants in TATA boxes, suggesting that TATA box sequences in this individual genome are stable. Searched for variants located in conventional transcriptional factor binding motifs (TFBS) within core promoters identified the variants located in two TFBS of two genes, including a variant in JUN binding motif at + 40 of the UMPS core promoter and a variant in E2F4 binding motif at + 37 of the HSP90B1 core promoter. Each of the variants was located at the intolerant position of its TFBS with the Position Frequency Matrix (PFM) index of 2 ( Fig. 5).

Core-promoter variation between Caucasian and African genomes.
It is well known that African genomes are more divergent than those of other ethnicities are 22,23 . Using the EVDC method, we compared the similarity and difference of core promoters between two ethnical groups, the YRI (Yoruba in Ibadan, Nigeria) and the CEU (Utah residents of European decedent), using 27 exome data sets from each group collected by the 1000 Genome Project 24 for the analysis (Supplementary Table 3). The results showed that 1. Core promoters were more variable in YRI than in CEU. There were a total of 14,372 variants in YRI comparing to 11,380 in CEU (p = 1.51E −30 ), and 4,823 variants in single individuals in YRI group comparing to 3,202 in CEU (p = 9.71 × 10 −21 ). Variants shared in multiple individuals within each group were more common in CEU than in YRI, as exemplified by the 230 variants common in all 27 CEU individuals comparing to 137 in YRI (p = 6.97E10 −13 ) ( Table 3, Supplementary Tables 4A-C and 5A-C). The fluctuated P-values among the variants between YRI and CEU subgroups were likely caused by both technical and biological factors, such as the limited sample size used in the study and non-random distribution of variants in the core-promoters; 2. Features of the variants in YRI and CEU. There were 19 indels in the variants shared within YRI and CEU groups, but the numbers increased to 816 in the variants only present in single individuals in YRI and CEU groups. In the shared variants, CEU had more Indels than in YRI (12 to 3); but in the individualized variants, YRI had more than in CEU (475 to 320). Similar patterns were present for SNPs: there were 268 SNPs in the variants shared within YRI and CEU groups but the numbers increased to 7,021 in the variants only present in single individuals in YRI and CEU groups, and the number in individualized YRI group was much larger than in CEU group (4,159 to 2,693). The highest SNP change was G to A in individualized variants in YRI (940) and the lowest SNP change was T to A in individualized variants shared between YRI and CEU groups (1 only). Ts/Tv ratios were all over 3 in each subgroup except 2.45 in the shared variants between the variants not shared within YRI and CEU groups (    dimethylated H3K4 levels to repress transcription involved in mesoderm formation during embryonic development. In the individualized variants in YRI, five variants of 3 deletions and two SNPs were present in SNAI1 core promoter between − 48 to − 38 in four individuals, the SNP at − 38 (C > G) converted the wild type sequence CCACCCC into a BRE site CCACGCC. In comparison, SNAI1 promoter in CEU group maintained wild type sequences (Fig. 6B).   6. Core promoter motifs in YRI were more affected than these in CEU. Of the variants common in 27 cases, 17 motifs were affected by the variants in YRI only compared to 8 only in CEU; of the individualized variants, 402 motifs were affected by the variants in YRI only compared to 259 in CEU only. The most affected motif was DPE in both groups. Only one and three variants in shared and individualized YRI variants were in TATA box but none in CEU group, indicating that TATA box sequences in these two ethnical groups are stable (Table 5). Searching for the variants affecting TFBS didn't find any in the shared variants in either groups except two individualized variants in YRI group in ZBTB33 and MYC sites of MAPK15 and UQCRBP1, and two in CEU group in Egr1 and CTCF sites of RICTOR and RMRP (data not shown).
Data from these analyses revealed that the core promoters in YRI population are more variable than in CEU population.

Discussion
The unique features of the EVDC method include: (1) It provides de novo information without the need for a priori knowledge of core promoter sequences. As the promoter sequences are collected indirectly through probes targeting coding exons, they preserve the native information of the core promoter sequences; (2) It detects the variants in core promoters at genome level. This is achieved by taking the advantage of exome sequences targeting the entire coding genes; (3) It extends the value of exome sequences from the coding region to the core-promoter region; (4) It is cost-effective.
Around 80% of gene core promoters were confidently detected by our method but 20% of promoters were missed ( Table 2). A part of the missed promoters was in fact mapped by exome sequences but at lower coverage. Although variants from these promoters could be called, these called variants are less reliable and distributed in more random manner, which is troublesome, particularly when comparing the variants between different genomes. Other cause for the missed promoters can be attributed by the lack of coding-exon connected core promoter templates. We quantified these missed promoters in the 3 exome data sets from the donor. The results showed that 12.4% were missed in all three samples and the remaining had lower sequence coverage. Improvement of designing exon probes closer to the 5′ end of 5′ -UTR, generating longer fragments during random fragmentation should increase the rate of core promoter detection.
Sequence length can influence the promoter region covered by exome sequences. In our study, we used paired-end read at 2 × 100 by Illumina sequencer HiSeq2000. All Illumina sequencers use bridge amplification-based cluster generation and sequencing by synthesis. The nature of these methods determines that the rate of artificial bases will significantly increase after 100 bases 22 . Although new Illumina sequencers can provide longer reads up to 300 bp, causing needs to be taken that increased sequencing errors towards the longer sequence ends can be problematic for variation analysis in core-promoter region.
Our data show that over 80% of exome sequences were from non-coding regions. This rate is higher than previously reported 50-65% 16,17,25 . We quantified the rate in the 27 CEU exome data sets from the 1000 genome project (Supplementary Table 3), which were generated by using early-generation exome kits. The results show that 60% of the sequences were from non-coding regions. This difference can be due to the different exome kits used in the studies. Exome kits are under constant evolution. The early exome data were collected by using the Scientific RepoRts | 6:30716 | DOI: 10.1038/srep30716 Comparison of SNAI1 core promoter between YRI and CEU populations. In the individualized variants, five variants of 3 deletions and two SNPs, of which three are known variants in dbSNP, were present in SNAI1 core promoter between − 48 to − 38 in four YRI individuals, the SNP at − 38 (C > G) converted CCACCCC into a BRE site CCACGCC, whereas all CEU individuals remained the same as in the reference genome sequences (hg19).  first-generation exome kits, which mainly targeted the coding exons 24 . Later on, all exome kits have doubled their targeted regions by inclusion of 5′ UTR, intron-exon boundary, 3′ UTR, and non-coding genes etc. As such, the proportion of the exome sequences from the non-coding region has been substantially increased over these generated by the early exome kits. This feature increases the value of using exome sequences for non-coding sequence analysis. Many variant-affected core promoter motifs were not located at the conventionally defined positions in core promoters. For example, none of the 15 variant-affected DPE motifs identified in the individual genome was located within the DPE position defined as 27-33 bases downstream of TSS site 1,2 . A possibility is that certain variant-affected core promoter motifs at the non-conventional positions could be related to one of the multiple TSS, a common phenomenon existing in many genes 26 . Another possibility can be related with the limited length of the consensus motif sequences. BPE and INR motifs have seven bases but DPE has only five bases, which makes variant-mapped DPE motif less specific or even simply hit by random chance 27 .

Item
We compared the sensitivity of exome sequences and whole genome sequences for core-promoter variant detection by using the exome and whole genome data of the same individual from 1000 Genome project 28 . Comparing the variants between the data sets shows that 79% of exome detected variants were also detected by whole genome sequences, but the later detected 1.4 times more variants than exome sequences did; furthermore, a portion of variants was only detected by each method (Supplementary Figure 1). While whole genome sequences can detect more variants than exome sequences, the sequencing cost is an important factor to consider. Current cost of whole genome sequencing is about 3-5 times higher than that of exome sequencing. We consider that at current art of sequencing technologies, exome sequencing is a more cost-effective solution than whole genome sequencing for core promoter analysis.
Large quantities of publically available exome data have been collected by various genomics projects, such as the Exon Variant Server Project 29 , the 1000 Genome Project 30 , the Cancer Genome Atlas Project 31 , and the Exome Aggregation Consortium 32 . These exome data have been used primarily for coding-exon analyses. Applying the EVDC method to mine these rich exome resources should lead to a comprehensive characterization of core-promoters in humans and other species for better understanding of transcriptional initiation regulation.

Samples used for the study. CD4+ T cells, CD19+ B cells and neutrophils from venous blood of a healthy
Caucasian male donor were purchased from AllCells Company (AllCells, Alameda, California). The use of the cells for research purpose has AllCells company IRB approval (Biomed IRB). The donor signed the written consent form, which is archived with their medical records. All experiments were performed in accordance with relevant guidelines and regulations. According to US Federal Regulations, 45 CFR Part 46.101(b)(4)-Protection of Human Subjects, using this type of human cells for research is exempted from the requirement for IRB review.
Exome sequences. Exome sequences were generated as reported 19 . Briefly, genomic DNA from three purified cell types of CD4+ T cells, CD19+ B cells and neutrophils of the individual was extracted using a FlexiGene DNA kit (Qiagen). An exome library was constructed for each DNA sample by random fragmentation using the Covaris II system (Covaris Inc. Massachusetts, USA), and by isolation of exome templates using the TruSeq Exome Enrichment Kit (Illumina ® ) following the manufacturer's protocols. Exome sequences were collected by using a HiSeq 2000 sequencer (paired-end 2 × 100). Exome sequence data were deposited in the National Center for Biotechnology Information (NCBI) with accession number SRR933549. Exome data from the 1000 genome project were downloaded from 1000 Genomes FTP site 24 . Exome data mapping and variant call from promoter-mapped exome sequences. Exome sequences were mapped to the human genome reference sequence hg19 by Bowtie2 using the default parameters in paired mode 33 . The resulting SAM files were converted to BAM files and sorted. Duplicates were removed using the Picard program. The mapped reads were processed using Genome Analysis Toolkit 34 with RealignerTargetCreator and IndelRealigner for local realignment. BaseRecalibrator was used for base quality score recalibration. GATK HaplotypeCaller was used for variant calls, followed by variant recalibration with VariantRecalibrator. The called variants were annotated with ANNOVAR 35 using RefSeq, dbSNP144 and 1000 Genomes databases. Promoter regions between − 500~+ 100 of the exome kit-targeted 20,794 genes were extracted from hg19. A minimum of four promoter-mapped sequences was set for variant calling. For overlapped genes, the first gene was used for the mapping analysis.
For variants located in core promoter motifs, the following consensus sequences were used as the references 1,2 : SSRCGCC (BRE), TATAWAAR (TATA box), YYANWYY (INR), and RGWYV (DPE). For the variants located at transcription factor binding motif, the motifs listed at JASPAR CORE database for human were used as the reference 36 .