Nucleotide diversity analysis highlights functionally important genomic regions

We analyzed functionality and relative distribution of genetic variants across the complete Oryza sativa genome, using the 40 million single nucleotide polymorphisms (SNPs) dataset from the 3,000 Rice Genomes Project (http://snp-seek.irri.org), the largest and highest density SNP collection for any higher plant. We have shown that the DNA-binding transcription factors (TFs) are the most conserved group of genes, whereas kinases and membrane-localized transporters are the most variable ones. TFs may be conserved because they belong to some of the most connected regulatory hubs that modulate transcription of vast downstream gene networks, whereas signaling kinases and transporters need to adapt rapidly to changing environmental conditions. In general, the observed profound patterns of nucleotide variability reveal functionally important genomic regions. As expected, nucleotide diversity is much higher in intergenic regions than within gene bodies (regions spanning gene models), and protein-coding sequences are more conserved than untranslated gene regions. We have observed a sharp decline in nucleotide diversity that begins at about 250 nucleotides upstream of the transcription start and reaches minimal diversity exactly at the transcription start. We found the transcription termination sites to have remarkably symmetrical patterns of SNP density, implying presence of functional sites near transcription termination. Also, nucleotide diversity was significantly lower near 3′ UTRs, the area rich with regulatory regions.


Results
Genome-wide mutation rates. We analyzed the distribution of SNPs in the genome of Oryza sativa.
We found that the minor allele frequency (MAF) distribution varies between parts of genome. The original 29M dataset of bi-allelic SNPs mapped to the Nipponbare genome contains many rare SNPs: 22% of them have MAF < 0.00035 (occurring in one genome in 3,000 in a homozygous state), and 67% of them have MAF < 0.01 (present in less than 30 genomes). The genome-wide average MAF is 0.052 while the mean MAF for CDS is 0.028 and, contrastingly, 0.061 for introns. Exons contain more SNPs with lower MAF scores as compared to introns and UTRs. With an increase in the MAF cut-off (to control sequencing errors), a larger portion of SNPs in coding regions are excluded from the dataset. Rare exonic SNPs originated from five genomes of Oryza glaberrima, showing massive levels of genomic structural variation, with particularly high instability in defense-related genes 39 . These accessions were excluded from our analysis since they are too evolutionary distant from O. sativa. Twelve other genomes were excluded since one had a very low coverage and others had excessive numbers of heterozygous singleton SNPs, which is highly unlikely for an inbred species. To reduce the number of possible false positive polymorphisms, we imposed two levels of constraints, as described in Materials and Methods, arriving at two datasets ("Base" 16M, and "Filtered" 5M) for the analysis ( Table 1).  Table 1. Fraction of variable nucleotides stratified by genomic regions for the "Filtered" and "Base" datasets. Whole-genome fraction of variable positions is 1.3% for the "Filtered" and 4.2% for the "Base" datasets, respectively.

Fraction of variable positions per region
Scientific RepoRts | 6:35730 | DOI: 10.1038/srep35730 Transcription start and termination sites. Genomic regions in the vicinity of transcription start (TSS) and termination (TTS) sites are enriched in important regulatory elements [40][41][42] . The complex formed by transcription factors and RNA polymerase binds to specific conserved sequences within the promoter (Fig. 1). Between the translation and transcription start and, subsequently, termination sites, there are untranslated regions (UTRs) containing conserved elements that regulate translation. Figure 2 shows the distribution of SNPs near the transcription start and end sites. In order to confirm the expected lower sequence variability in the vicinity of TSS  . Panels (A,B) correspond to the "Filtered" dataset, panels (C,D) to the "Base" dataset. TSS regions contain the following amount of SNPs: 553,800 "Filtered" and 1,443,188 "Base" datasets. TTS regions contain the following amount of SNPs: 482,685 "Filtered" and 1,231,570 "Base" datasets. Blue curve shows the distribution of 5′ (A,C) and 3′ (B,D) UTR lengths. Whole-genome fraction of variable positions is 1.3% for the "Filtered" and 4.4% for the "Base" datasets, respectively. and TTS, we extracted 2 kb long fragments centered at TSS and TTS for 20 K high-confidence rice genes (see Methods) and computed the position-specific density of SNPs. At approximately 250 nucleotides upstream of the TSS (see Fig. 2A,C), the SNP density starts to decline, reaching the minimum at the TSS at approximately half of the intergenic value. The steep decline in the SNP rate in the promoter region is in line with the restrictions imposed by the presence of the conserved TFBS in the core promoter [− 250, 0] upstream of the TSS (Fig. 3).
Within the transcribed region, the density steadily increases from 5′ to 3′ end of the gene and plateaus approximately 300 nucleotides downstream from the TSS as shown in the Fig. 2(A,C). This region of linear increase roughly corresponds to the 5′ UTR. Lower sequence variability in the 5′ UTR regions may be explained by the observation that mutations in this region can change local mRNA structure near the 5′ cap and, therefore, affect translation process. The protein-coding region features fewer variants because of the demand to preserve the amino acid sequence of the product, which is supported by changes mostly in the third codon positions (Fig. 4). The final stage of transcription is its termination, when the complete transcript dissociates and the RNA polymerase is released from the DNA template. The mechanism of termination is the least understood of the three transcription stages; two competing, yet not fully satisfactory 43 models known as "allosteric" and "torpedo" 44 are proposed as possible mechanisms. The distribution of SNPs around the transcription termination site almost mirrors the trend at the TSS, Fig. 2(B,D). SNP density begins to show a decline at about 500 bp upstream from the TTS, reaching the minimum just before the TTS. It then steadily increases for 300 nucleotides downstream from the TTS and then levels off reaching the intergenic level of SNP density. However, the plateau is achieved at over 1000 nucleotides downstream. We suggest that this difference is due to the intrinsic variability of the termination process, with the position of transcription termination being less precisely defined than the transcription start site. This profile of SNP density variation suggests the existence of evolutionary constraints protecting the TTS area, such as requirements to terminate transcription at the appropriate positions 45 , to interact with RNA-binding proteins to regulate mRNA translation 46 , and to accommodate miRNA target sites 47 . SNPs in the protein coding regions. Overall, coding regions are more conserved than intergenic regions, promoters and UTRs ( Table 1). The frequency of SNPs in introns is 30% higher than in exons, as calculated for the "Filtered" dataset. These results agree with previous observations 2 . There are several interesting trends in SNP distributions near the translation starts and terminations (Fig. 4). First, there is a conserved region immediately upstream of the first codon. The context around the start codon is critical for the translation initiation, as the ribosome may bind at the 5′ -most ATG of the mRNA or proceed to the next start codon 48 . Across eukaryotes, the consensus around the start codon is (gcc)gccRccAUGG 49 (Kozak consensus sequence). In our analysis of 3,000 rice genomes, the consensus is C(G/C)GC(G/C)AUGGCGG, adjusting (but not contradicting) the previously reported consensus for rice (g/c)(A/G)(A/C)(G/C)AUGGC 50 . The initiation of translation (and, therefore, conservation requirements for 5′ UTR) is affected by multiple regulatory mechanisms involving binding of regulatory proteins to specific regions within 5′ UTR, open reading frames and ribosome entry sites 17 .
Translation termination depends on codon-specific release factors that recognize the stop codon sequence in mRNA, as well as on GTP-binding release factors 51 . Efficiency of translation termination depends on the consensus sequence immediately downstream from the stop codon 52,53 . Kochetov et al. 54 suggested that nonsense mutations occur significantly more often at the very beginning of 3′ -UTR and in the same reading frame as the CDS. They found that A. thaliana and O. sativa genes with UGA stop codons had more nonsense codons in the first triplet position of 3′ UTR (that may result from weak natural selection for a "reserve" stop signal). Approximately 20-30 nucleotides (nt) upstream of the cleavage-polyadenylation site is the AAUAAA hexamer, which is the consensus signal for the transcript cleavage 17 . Therefore, there are constraints affecting the 3′ UTR sequences manifested as reduced SNP density 55 (Fig. 4(B,D)). SNPs density gradually increases downstream from the TTS, although the incline of the growth function is smoother than the decline upstream of the first codon. This may be explained by differences in distributions of the 5′ and 3′ UTR lengths ( Fig. 2) and weaker constraints on the positional precision of the termination process as compared to transcription initiation.
The third position of the codon, as expected, is more variable than the first two, and that the first one is slightly more variable than the second, which follows from the redundancy of the genetic code and agrees with prior observations in human and mouse 19 .
We also analyzed sequence variability in introns (Fig. 5). In both "Filtered" and "Base" datasets, there are more SNPs in introns as compared to the adjacent exons. The exon/intron boundaries are particularly conserved, with the shape of SNP density function around the splicing sites similar to that around the translation and transcription initiation sites and the stop codon.
Nucleotide composition and SNPs. There are two classes of rice genes that differ in function, nucleotide composition, promoter organization, gene expression, and other features 36,37,[56][57][58] . Here, we investigated distribution of SNPs as a function of the nucleotide composition asymmetry expressed as CG-skew and AT-skew 59-62 ; both dependencies have remarkable shapes (Fig. 6). CG-skew was computed for each 500 nt long genomic region as a difference between the numbers of cytosines and guanines in this region divided by the sum of cytosines and . Nucleotide asymmetry is an important measure, since CG-skew is correlated with the duration a particular DNA region stays in an unpaired, unprotected state during transcription and replication where the longer the duration, the higher the probability of that cytosine transitions to thymine 57,59 . Position-specific imbalance between cytosines and guanines was speculated to be related to translational efficiency 63 , avoidance of "kissing interactions, " 64 DNA methylation 57 , distinct RNA polymerase pause sites in CpG island promoters 65 , and adaptation to extreme environments 37,64,66,67 . In addition, mRNAs tend to be purine-rich (where there is an excess of Gs over Cs) 64,68,69 . We evaluated the CG skew distribution in rice by dividing the genome into 500 nt long fragments and calculating the nucleotide composition and the number of SNPs in each window (Fig. 6B,D). Since the number of genes in both DNA strands is approximately equal, the plot is symmetric around zero CG-skew. Regions with high absolute values of the CG-skew . Panels (A,B) correspond to the "Filtered" dataset, panels (C,D) to the "Base" dataset. ATG regions contain the following amount of SNPs: 221,158 "Filtered" and 583,175 "Base". TER regions contain the following amount of SNPs: 193,941 "Filtered" and 505,877 "Base" datasets. The lines are colored by the position of nucleotide in codons: red -1 st , blue -2 nd , and green -3 rd . Whole-genome fraction of variable positions is 1.3% for the "Filtered" and 4.4% for the "Base" datasets, respectively.
(> 0.3) tend to have high SNP density (Fig. 6B). Genes with high values of the CG-skew are less likely than the genes with low CG-skew to correspond to proteins of unknown function (Cramer's V = 0.34, see Supplement for details). This observation can be explained by an earlier findings that the peak of CG-skew near the TSS is associated with high transcriptional efficiency 36,38,59 and that highly expressed genes have a higher chance to be studied as compared to the genes expressed at a low level. Interestingly, twelve out of top twenty genes by SNP density with |CG-skew|> 0.3 are located on chromosomes 11 and 12, which are enriched in disease resistance genes and recent gene duplications 70 . Chromosomes 11 and 12 constitute the most recently evolved part of the rice genome 70 , containing genes that may work differently in different ecotypes of rice subjected to distinct stress conditions present in their respective sources of origin.
Distribution of the AT-skew is shown in Fig. 6(A,C). Variability of the AT-skew was previously linked to tDNA insertion sites 71 , purine loading of mRNA 64,68,69 , and promoter activity 72 . We found that approximately two thirds of the genes occurring in regions with an absolute value of AT skew above 0.3 correspond to transposable elements (TE), in contrast to genes in genomic regions with AT skew between [− 0.3, 0.3], where only 30% of genes correspond to TEs. This effect may also be explained by the observation that protein-encoding regions have distinct patterns of nucleotide composition (manifested as GC content, AT and CG skew patterns) 37,59,73 and also have fewer SNPs than the intergenic regions.
Motifs in promoters associated with SNP density. In general, DNA sequences in promoters, enhancers and other regulatory elements in eukaryotes are less variable than the rest the genome [74][75][76][77] . However, the patterns of variability may be organism specific and are not well studied in plants. We applied the cisExpress algorithm 78,79 to find regulatory motifs associated with altered SNP density in the region [− 250, 50] around the TSS, corresponding to the core promoter and 5′ UTR. Presence of a TATA-box at − 30 nt from the TSS and motif GCCC at [− 250, − 50] are negatively correlated with promoter sequence variability. On the other hand, an (AT) n repeat at 100 or more nucleotides away from the TSS is positively associated with promoter sequence variability (Fig. 7). To estimate statistical significance, we compared the mean number of SNPs in promoters with and without these motifs. The Z-score for the GCCC[ca] motif, which is possibly related to the GCC-box [80][81][82] and involved in defense mechanisms, was − 16.95, Z-score for the TATA box at − 30 was − 9.95, and the Z-score for TTAT[ca] correspond to the "Filtered" dataset and panels (C,D) to the "Base" dataset. Intron/Exon junctions contain the following amount of SNPs: 40,258 "Filtered" and 99,908 "Base" datasets. Exon/intron junctions contain the following amount of SNPs: 43,138 "Filtered" and 104,797 "Base" datasets. The lines are stratified by position of nucleotide in codons: red -1 st , blue -2 nd , and green -3 rd . Whole-genome fraction of variable positions is 1.3% for the "Filtered" and 4.4% for the "Base" datasets, respectively. was 20.6, suggesting that the GCCC[ca] and TATA-box containing promoters do not tolerate sequence variability, while TTAT[ca]-containing promoters are enriched in SNPs. We suggest that this effect is due to the possible alternative starts of transcription in various ecotypes of O. sativa that require a modified promoter to adapt gene expression level to changing environment. To test this hypothesis, one needs to analyze a large collection of 5′ -full transcripts from various ecotypes.
Gene sequence conservation is associated with functionality. Sequence variability in protein-encoding genes may be related to their group-wise functionality, as defined by type-specific biological pathways and processes 83,84 . Using the complete 29M set of SNPs, we performed the one sample t-test to find possible associations between DNA conservation in rice genes with GO categories (Table 2). Functional bias was, indeed, pronounced, with the most conserved genes enriched for transcription regulation processes, whereas the least conserved genes encoded membrane located proteins that are involved in stress response and other processes associated with adaptation to the changing environment. GO analysis of the filtered set of SNPs ("Base"  dataset) shows similar trends and also indicates that the most conserved genes belonged to the functional category "sequence-specific DNA binding transcription factor" (Supp. Table 5).
Protein family analysis (gene enrichment with PFAM categories) showed that transcription factors feature fewer SNPs than other families. Genes with DUF1618, Myb DNA-binding, zf-C3HC4, and AP2 domains were among the least variable genes. Importantly, we saw the same variability patterns at transcription regulation level. SNP density was the lowest in the promoter regions of the genes belonging to "sequence-specific DNA binding transcription factors".

Discussion
Variability of DNA and protein sequences reflects the balance between adaptation and conservation of essential functionality. Analysis of conserved patterns enables "reverse engineering" of function from the sequence data. Active sites of enzymes can be predicted using conserved motifs in protein families, transcription factor binding sites may be identified using conserved elements in homologous promoter sequences, and gene structures can be predicted from conserved fragments of orthologous genes. In general, an unusually high conservation in a DNA or protein sequence implies existence of a biologically important function.
Here, we analyzed the genetic variability within a unique collection of nearly 30M SNPs from the 3,000 Rice Genomes Project 25 . Although the Human 1000 genome project 85 discovered a larger number of SNPs (80M) from sequencing 2,504 persons, the rice SNP set has a much higher SNP density since the rice genome is almost 10 times smaller than the human genome. To the best of our knowledge, this study has the most precise resolution (one SNP per 20 nucleotides, on average using the "Base" set) of any such study of genome-wide patterns of genic SNP diversity to date.  Here, we demonstrated patterns in genetic variability, established earlier for mammals and observed some novel trends. The rate of polymorphism in introns is known to be higher than in exons, which is consistent with evolutionary constraints in the mutability of protein-coding regions 86 . We confirmed this trend on a subset of common SNPs in rice (MAF> 0.01).
We identified promoter motifs associated with gene conservation using the cisExpress program 79 . We observed that TATA+ promoters with TATA-boxes at − 30 tend to have slightly fewer SNPs than those without the TATA-boxes. This might be due to the rigidly controlled expression mechanisms of these genes. TATA-box binding proteins recruit other transcription factors into the transcriptosome, with controlled expression profiles needed for genes of more conserved functions. This is consistent with our observation of the differences in distribution of SNPs between various GO classes: gene bodies and promoters of transcription-factor encoding genes have fewer SNPs than other functional classes. At the same time, a number of genes carry "alternative" TATA-boxes 100 or more nucleotides upstream from the usual location, and these regions feature higher SNP densities, perhaps indicating that they are evolving at faster rates. For example, TATA-genes annotated as having "kinase activity" have over 55 SNPs per 250 nt promoter region while TATA+ genes annotated as "sequence-specific DNA binding transcription factor" or "DNA-binding" have three or less. There is also a difference in cellular locations where variable TATA-genes predominate in vacuoles while conserved TATA+ genes occur preferably in cell walls.
In terms of gene families, we have shown that transcription factors feature unusually low SNP density agreeing with their importance of as hubs modulating various downstream cascades. The most genetically variable genes tend to encode membrane-bound proteins, ATPases and serine/threonine protein kinases, key regulators of plant responses to abiotic stresses. Similar trends were highlighted in comparative analysis of human and mouse genomes 87 . The relative conservation of transcription factors and higher genetic variability in the genes responsible for interactions apparently correlates with the need to adjust rapidly changing environmental conditions.
Finally, CG-skew coupled with the associated transcriptional start sites, distributions of regulatory elements, expected lengths of UTR and other trends will be useful to refine gene prediction models in plants, especially the positions of TSSs and TSTs, the boundaries of gene models and the foundation for refinement of genotype to phenotype predictions.

Materials and Methods
We used a collection of SNPs obtained using the BWA-mem/GATK pipeline from the 3 K rice genome sequence data. SNP calls were made using the BWA-MEM alignment of short reads against the Nipponbare IRGSP-1.0 RefSeq and GATK pipeline. SNPs are available from the SNP-Seek database (http://snp-seek.irri.org) 25 .

SNP Filtering.
We downloaded 29M bi-allelic SNPs called on Nipponbare genome from the SNP-Seek portal.
This data was used to generate datasets with varying quality cut-offs.
"Base" dataset (16M SNPs). We excluded SNPs detected in 5 genomes of Oryza glaberrima, restricting our analysis of O. sativa accessions. Twelve other genomes were excluded due to excessive amounts (i.e., > 10,000) of heterozygous singleton SNPs. (see Supplementary Data (Supp. Fig. 1) for distribution of singleton heterozygotes per genome). Next, for each remaining SNP we measured the observed proportion of heterozygotes (H obs ) and computed the expected proportion of heterozygosity for its allele frequency (H exp ). The expected ratio H obs /H exp for rice is ~0.05, because of high level of inbreeding; however, many SNPs had H obs /H exp > 1. SNPs that exhibit high degree of heterozygosity could represent alignment errors due to paralogs that do not occur in the reference genome. We estimated inbreeding coefficients for the 3 К dataset (F 3K = 0.9520), indica (F indica = 0.9251), and japonica (F japonica = 0.9689) as median value of 1 − H obs /H exp , excluding all SNPs of low frequency (< 0.05) and high heterozygosity (H obs > H exp ). The SNPs, which had more than 2 heterozygotes and violated the condition H obs /H exp < 10*(1 − F 3K ), were flagged for removal. The same procedure of flagging was repeated separately for the indica and japonica subsets, because paralogs are likely to be population specific (e.g. there are SNPs that are nearly 100% heterozygous in indica but not in the entire 3000 genomes dataset). All SNPs that have been flagged (in all 3 K samples, in indica, and japonica) were removed from the 29M dataset, arriving at the "Base" dataset consisting of 16,854,442 SNPs.
Genomic data and gene selection. The current MSUv7 annotation (http://rice.plantbiology.msu.edu/) of rice contains 55,986 predicted genes and 66,338 gene models 88 . Upon exclusion of pseudogenes, transposable elements, and genes with atypical lengths of 5′ UTR (below 20 nt or above 1000 nt long), we created a high-confidence subset of 20,367 expressed protein-coding genes. Rice genome coverage data were obtained from the Rice 3,000 Genomes project 89 . Genomic regions with zero coverage in all samples were excluded from the analysis. Next, the SNPs were stratified by quality and by genomic regions in terms of their coding, UTR, transcription start and termination, translation start and termination, promoter, etc. For each of the 20,367 genes the number of SNPs per nucleotide was obtained by summing the number of SNPs from the 3000 genomes at this position. A subset of non-synonymous mutations was selected using the SnpEff 87 tool.
Scientific RepoRts | 6:35730 | DOI: 10.1038/srep35730 The map between MSU and RAP-DB locus names was downloaded from the RAP-DB site (http://rapdb.dna.affrc. go.jp/download/archive/RAP-MSU.txt.gz). Enrichment analysis was performed using one-sample t-test implemented in oracle as function stats_t_test_one. We excluded GO classes containing less than ten genes.

Identification of motifs in promoter.
Motifs in promoter were found using the cisExpress 79 tool, which is an improved and enhanced adaptation of an earlier algorithm, Motifer 78 , specifically modified to process large datasets. cisExpress is based an important assumption that function of promoter motifs is position specific. It works in two stages: (1) find 'seed' motifs associated with the phenotype of interest, and (2) optimize the 'seed' motifs using a genetics algorithm. cisExpress was originally developed to find motifs in promoter associated with gene expression patterns of interest. In this work, we used cisExpress in the 'off-label' fashion to find motifs in promoter associated with the SNP density. For each gene, two pieces of information were obtained: sequence of core promoter, defined as region [− 250, 50] around the TSS, and number of SNP in this region. Positions of TSS were obtained from the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu/) database and validated using the NPEST algorithm 90 .
Custom scripts for data analysis. Custom scripts for statistical data analysis and visualization were written in R and C+ + . In order to calculate the SNP density per genomic region, we wrote an R script that counted the number of SNPs per position for each of the 20,367 selected genes. The number of SNPs was divided by the total number of genes (20,367), producing the number of SNPs at each position per 1000 genes.