Extensive Natural Epigenetic Variation at a De Novo Originated Gene

Epigenetic variation, such as heritable changes of DNA methylation, can affect gene expression and thus phenotypes, but examples of natural epimutations are few and little is known about their stability and frequency in nature. Here, we report that the gene Qua-Quine Starch (QQS) of Arabidopsis thaliana, which is involved in starch metabolism and that originated de novo recently, is subject to frequent epigenetic variation in nature. Specifically, we show that expression of this gene varies considerably among natural accessions as well as within populations directly sampled from the wild, and we demonstrate that this variation correlates negatively with the DNA methylation level of repeated sequences located within the 5′end of the gene. Furthermore, we provide extensive evidence that DNA methylation and expression variants can be inherited for several generations and are not linked to DNA sequence changes. Taken together, these observations provide a first indication that de novo originated genes might be particularly prone to epigenetic variation in their initial stages of formation.


Introduction
DNA mutations are the main known source of heritable phenotypic variation, but epimutations, such as heritable changes of gene expression associated with gain or loss of DNA methylation, are also a source of phenotypic variability. Indeed, several stable DNA methylation variants affecting a wide range of characters have been described, mainly in plants [1][2][3]. In most instances, epimutations are linked to the presence of structural features near or within genes, such as direct [4][5][6] or inverted repeats [7,8] or transposable element (TE) insertions [9], which act as units of DNA methylation through the production of small interfering RNAs (siRNAs) [3,10]. Examples of epimutable loci in Arabidopsis thaliana (A. thaliana) include the PAI [7] and ATFOLT1 genes [8], which have suffered siRNA-producing duplication events in some accessions and also the well characterized FWA locus, which contains a set of SINE-derived siRNA-producing tandem repeats at its 59end [4,5]. Repeat-associated epimutable loci are almost invariably found in the methylated form [5][6][7][8][9] in nature, which reflects, at least in part, that DNA methylation is particularly well-maintained over repeats [11,12]. Indeed, epigenetic variation at PAI, ATFOLT1 and FWA has only been observed in experimental settings. Similarly, sporadic gain or loss of DNA methylation associated with changes in gene expression has only been documented in A. thaliana mutation accumulation lines [13,14] and examples of natural epigenetic variation in other plant species are few [15][16][17].
Here we report a case of prevalent natural epigenetic variation in A. thaliana, which concerns a de novo originated gene [18]. We show that expression of this gene, named Qua-Quine Starch (QQS), is inversely correlated with the DNA methylation level of its promoter and that these variations are stably inherited for several generations, independently of DNA sequence changes. Based on these findings, we speculate that epigenetic variation could be particularly beneficial for newly formed genes, as it would enable them to explore more effectively the expression landscape than through rare DNA sequence changes.

QQS Is a Novel Gene Embedded within a TE-Rich Region of the A. thaliana Genome and Is Negatively Regulated by DNA Methylation
The A. thaliana Qua-Quine Starch (QQS, At3g30720) was first described as a gene involved in starch metabolism in leaves [19,20]. Despite being functional and presumably already under purifying selection (dN/dS = 0.5868; p-value,0.045), QQS is likely a recent gene that emerged de novo. Indeed, QQS has no significant similarity to any other sequence present in GenBank [18,19], suggesting that it originated from scratch since A.thaliana diverged from its closest sequenced relative A. lyrata around 5-10 million years ago. Furthermore, QQS encodes a short protein (59 amino acids) and it is differentially expressed under various abiotic stresses [18], which are also hallmarks of de novo originated genes [21][22][23].
As shown in Figure 1, QQS is surrounded by multiple transposable element sequences ( Figure 1A) and contains several tandem repeats in its promoter region and 59UTR ( Figure 1B). In the Columbia (Col-0) accession, these tandem repeats are densely methylated and produce predominantly 24 nt-long siRNAs ( Figure 1B, Figure S1A and S1B). Publically available transcriptome data [24,25] and results of RT-qPCR analyses ( Figure S1C) show that steady state levels of QQS mRNAs are higher in several mutants affected in the DNA methylation of repeat sequences, including met1 (DNA METHYLTRANSFERASE 1), ddc (DOMAINS REARRANGED METHYLTRANSFERASE 1 and 2 and CHROMO-METHYLASE 3), ddm1 (DECREASE IN DNA METHYLATION 1) and rdr2 (RNA-DEPENDENT RNA POLYMERASE 2), which abolishes the production of 24 nt-long siRNAs as well as most CHH methylation. These findings indicate that QQS expression is negatively controlled by DNA methylation and point to the siRNA-producing tandem repeats as being potentially involved in this repression.

Stably Inherited Spontaneous and Induced Epigenetic Variation at QQS
We first observed epiallelic variation at QQS unexpectedly, in a Col-0 laboratory stock (hereafter referred to as Col-0*) with increased expression of the gene and decreased DNA methylation of its promoter and 59UTR repeat elements (Figure 2A). No sequence change could be detected in the Col-0* stock within a 1.2 kb region covering the QQS gene ( Figure 1B), which excluded local cis-regulatory DNA mutations at the QQS locus as being responsible for DNA methylation loss in Col-0*. Additionally, comparative genomic hybridization analysis as well as genomewide DNA methylation profiling using methylated DNA imunoprecipitation assays revealed no major differences between Col-0 and Col-0* ( Figure S2).
We next investigated the QQS epigenetic status in pooled seedlings (S1) derived from the selfing of 12 individual Col-0* plants ( Figure S3). Results revealed a range of QQS epialleles and a strong negative correlation between DNA methylation and expression of the gene ( Figure 2B and 2C). To explore further this variation, a single S1 individual was then selfed for each of the 12 lines and seedlings (S2) were analyzed in pool for each line, as above ( Figure S3). Remarkably, the differences in QQS expression and DNA methylation observed at the S1 generation were also observed at the S2 generation ( Figure 2B and 2C). Taken together, these results suggest therefore the existence of a range of epiallelic variants at QQS, which are stably inherited for at least one generation.
The inheritance of QQS hypomethylated epialleles was also examined in a random sample of 19 ddm1-derived epigenetic Recombinant Inbred Lines (epiRILs) obtained by crossing a Col-0 wild-type (wt) line with an hypomethylated Col-0 ddm1 line [26]. High DNA methylation/low expression and low DNA methylation/high expression of QQS were observed in 14 and 5 epiRILs, respectively ( Figure 2D). This is consistent with Mendelian segregation of the highly methylated/lowly expressed Col-0 wt and lowly methylated/highly expressed Col-0 ddm1 parental QQS epialleles (75%/25% expected because of backcrossing rather than selfing of the F1; Chi 2 = 0,017, p-value.0.05). Indeed, examination of the epi-haplotype obtained for 17 of these epiRILs [27] confirmed the wt or ddm-origin of the QQS locus in each case (data not shown). These results demonstrate therefore that, like many other ddm1-induced epialleles [28,29], QQS hypomethylated epialleles can be stably inherited for at least eight generations and are not targets of paramutation.

QQS Is under Autonomous Epigenetic Control
We next investigated the degree to which DNA methylation of QQS and of flanking TEs are independent from each other. To this end, we first analyzed DNA methylation patterns of TE sequences flanking QQS in a series of epiRIL with contrasted QQS epialleles. Unlike for ddm1-derived QQS, hypomethylation was not inherited for the three TEs located immediately upstream of the gene, as they did systematically regain wt DNA methylation levels ( Figure 3A and 3B), presumably because of their efficient targeting by RNA-directed DNA methylation (RdDM) [28]. In addition, although the TE just downstream of QQS was always hypomethylated when inherited from ddm1, hypomethylation was also observed in one epiRIL that inherited the QQS region from the wt parent. Thus, there is no strict correlation between DNA methylation at QQS and this downstream TE. We next examined the effect of several T-DNA and transposon insertions located ,3.1 kb or 153 bp upstream of the transcription start site (TSS), 653 bp downstream of the 39UTR and within the second coding exon of QQS. Whereas three of these insertions had no effect on DNA methylation and expression levels of QQS, the T-DNA insertion located closest to the TSS was associated with a drastic reduction of DNA methylation of both the promoter and 59UTR of the gene, as well as with an increase in QQS expression ( Figure 3A and 3C). However, this insertion had no impact on DNA methylation of upstream and downstream TEs ( Figure 3A and 3D). Taken together, these results suggest that epigenetic variation at QQS is most likely determined by sequences within the promoter and 59UTR of the gene, not by the TEs that are located immediately upstream or downstream.

QQS Exhibits Epigenetic Variation among Natural Accessions
We next investigated the possibility that QQS is subject to epigenetic variation in natural populations. To this end, we first analyzed QQS expression and DNA methylation in 36 accessions

Author Summary
Epigenetics is defined as the study of heritable changes in gene expression that are not linked to changes in the DNA sequence. In plants, these heritable variations are often associated with differences in DNA methylation. So far, very little is known about the extent and stability of epigenetic variation in nature. In this study, we report a case of extensive epigenetic variation in natural populations of the flowering plant Arabidopsis thaliana, which concerns a gene involved in starch metabolism, named Qua-Quine Starch (QQS). We show that in the wild QQS expression varies extensively and concomitantly with DNA methylation of the gene promoter. We also demonstrate that these variations are independent of DNA sequence changes and are stably inherited for several generations. In view of the recent evolutionary origin of QQS, we speculate that genes that emerge from scratch could be particularly prone to epigenetic variation. This would in turn endow epigenetic variation with a unique adaptive role in enabling de novo originated genes to adjust their expression pattern.
representing the worldwide diversity [30]. QQS was methylated and lowly expressed in 29 accessions, but unmethylated and highly expressed in seven accessions distributed over the entire geographic range ( Figure 4A). This indicates that epigenetic variation at QQS is widespread in nature. In contrast, upstream and downstream TEs were consistently methylated in all accessions ( Figure S4A and S4B), thus confirming that the epigenetic state at QQS is not determined by that of flanking TEs. We then sequenced a 2.8 kb interval encompassing the QQS gene and its flanking regions from the seven accessions carrying the hypomethylated/highly expressed epiallele as well as from three accessions carrying a methylated/lowly expressed epiallele. Although several SNPs and indels were identified ( Figure S4C), no correlation between any specific sequence alterations and QQS DNA methylation or expression states could be established ( Figure 4A). In addition, while Kondara and Shahdara have identical QQS sequences, they have contrasted DNA methylation/ expression patterns at the locus ( Figure 4A and Figure S4C), which provides further evidence that natural epiallelic variation at QQS is independent of local cis-DNA sequence polymorphisms and is thus most likely truly epigenetic. Analysis of a Cvi-0 vs. Col-0 Recombinant Inbred Line (RIL) population revealed in addition that QQS expression is controlled by a large-effect local-expression quantitative trait locus (local-eQTL; http://qtlstore.versailles.inra. fr/) [31]. This suggests that like the Col-0 wt and Col-0 ddm1 QQS epialleles ( Figure 2D), the Cvi-0 hypomethylated QQS epiallele is stably inherited across multiple generations. This further demonstrates that epigenetic variation at QQS is not appreciably affected by sequence or DNA methylation polymorphisms located elsewhere in the genome and indicates also that QQS is not subjected to paramutation [29].
To validate experimentally the causal relationship between DNA methylation and repression at QQS, seedlings of several accessions were grown in the presence of the DNA methylation inhibitor 5-aza-29-deoxycytidine (5-aza-dC). In the two accessions Col-0 and Shahdara, which harbor distinct methylated and lowly expressed QQS alleles, treatment resulted in reduced DNA methylation and increased expression of QQS ( Figure S4D). In contrast, seedlings of Jea, Kondara and Cvi-0 accessions, all of which harbor a demethylated/highly expressed QQS allele, did not show further reduction of DNA methylation or increased expression when grown in the presence of the demethylating agent ( Figure S4D). Moreover, whereas expression of QQS in F1 hybrids derived from crosses between Col-0 (methylated QQS) and Kondara (hypomethylated QQS), was always higher for the epiallele inherited from the hypomethylated parent, further confirming that QQS is not subjected to paramutation [29], treatment with 5-aza-dC reduced dramatically this expression imbalance, most likely as a consequence of demethylation of the Col-0-derived QQS allele ( Figure S4E). Taken together, these results clearly demonstrate that DNA methylation at QQS is causal in repressing expression of the gene.

Wild Populations from Central Asia Exhibit Epigenetic Variation at QQS
Finally, we asked whether epigenetic variation at QQS could be observed in natural settings or if such variation only emerged in the laboratory, where accessions are grown under controlled growth conditions. To this end, we analyzed QQS expression and DNA methylation in plants grown from seeds directly collected from wild populations in Tajikistan, Kyrgyzstan and Iran (NeoShahdara, Zalisky and Anzali populations, respectively). Widespread QQS epiallelic variation was observed, both between and within these diverse wild populations ( Figure 4B). In addition, QQS epigenetic variation was examined in the offspring (after two single seed descent generations) of 25 NeoShahdara individuals. These individuals were randomly sampled among a single patch of several thousands of plants that presumably represent the direct descendants of the Shahdara accession. Based on 10 microsatellite markers and one InDel marker, two genetically distinct subpopulations could be identified. While QQS was highly methylated/lowly expressed in all 16 individuals of subpopulation #1, clear differences in DNA methylation and expression were detected among the 9 individuals of subpopulation #2 ( Figure 4C). Whether epiallelic variation at QQS in subpopulation #2 reflects inherent fluctuations or an intermediary stage in the route to fixation of one of the two epiallelic forms remains to be determined.

Discussion
QQS is a protein-coding gene that likely originated de novo in A. thaliana within a TE-rich region ( Figure 1A). We have shown that this gene, which contains short repeat elements matching siRNAs ( Figure 1B, Figure S1A and S1B), varies considerably in its DNA methylation and expression in the wild (Figure 4). We also show that these variations are heritable and independent of the DNA methylation status of neighboring TEs or of DNA sequence variation, either in cis or trans (Figure 2 and Figure 3, Figures S2  and S4). Thus, we can conclude that QQS is a hotspot of epigenetic variation in nature. Consistent with this, QQS is among the few genes for which spontaneous DNA methylation variation was observed in Col-0 mutation accumulation lines [13].
Cytosine methylation at QQS concerns CG, CHG and CHH sites, which is the pattern expected for sequences with matching siRNAs ( Figure 1B, Figure S1B). All three types of methylation sites likely contribute to silencing of QQS, as judged by the reactivation of QQS in the met1, ddm1, ddc and rdr2 mutant backgrounds ( Figure S1C; [24,25]). Yet, among the different DNA methyltransferases targeting DNA methylation at QQS, MET1 may play a more prominent role, given that DNA methylation at this locus is only fully erased in met1 mutant plants [25]. QQS demethylated epiallelic variants may thus preferentially arise through spontaneous [13] or stress-induced [10] defects in DNA methylation maintenance and be stably inherited for multiple generations as a result of the concomitant loss of matching siRNAs, which would prevent efficient remethylation and silencing of the gene [28,29]. Indeed, although we could not detect QQS siRNAs by Northern blot analysis, presumably because of their low abundance, deep sequencing data indicate that they accumulate less in met1 mutant plants than in wild type Col-0 [25].
Few genes have been shown so far to be subject to heritable epigenetic variation in A. thaliana [5][6][7][8]13,14,32] and QQS is unique among these in exhibiting this type of variation in nature ( Figure 4). This therefore raises the question as to what distinguishes QQS from other genes, such as FWA, for which epigenetic variation can be readily induced in the laboratory in advanced generations of ddm1 and met1 mutant plants [5,33], but for which this type of variation is not observed among accessions [11,34]. One possibility is that unlike QQS epivariants, fwahypomethylated epialleles are strongly counter-selected because of their potentially maladapted phenotype, namely late flowering [5]. Consistent with this explanation, epiallelic variation with no phenotypic consequences has been documented at FWA in other Arabidopsis species. In these cases, however, inheritance across multiple generations has not been rigorously tested [35]. Another possibility is that de novo originated genes, such as QQS, are particularly prone to heritable epigenetic variation. This is a reasonable assumption considering that these genes tend to lack proper regulatory sequences initially, unlike new gene duplicates, which by definition come fully equipped [21]. In turn, given that epigenetic variation enables genes to adjust their expression in a heritable manner much more rapidly than through mutation while preserving the possibility for rapid reversion, it could prove particularly beneficial in the case of genes that are created from scratch. Once the most adaptive expression state is reached, it could then become irreversibly stabilized (i.e. genetically assimilated) through DNA sequence changes. Although speculative, this proposed scenario could be highly significant given the recent discovery that de novo gene birth may be more prevalent than gene duplication [23].

RT-qPCR analysis of QQS expression
Total RNA was isolated as described in [42] and cDNA was synthetized using oligo(dT) primers and IMPROM II reverse transcriptase (Promega). Real time PCR reactions were run on an Applied Biosystems 7500 Real-Time PCR System using Platinum SYBR green (Invitrogen). QQS expression levels relative to Actin2/ PP2A or PP2A/GAPDH internal references were calculated using the formula (2 -(Ct QQS -mean Ct internal references) )*100. Primers are listed in Table S1.

Analysis of DNA methylation by McrBC-qPCR
Total DNA was isolated using Qiagen Plant DNeasy kit following the manufacturer's recommendations. Digestion was carried out overnight at 37uC with 200 ng of genomic DNA and 2 to 8 units of McrBC enzyme (New England Biolabs). Quantitative PCR was performed as described above on equal amounts (2 ng) of digested and undigested DNA samples using  Table S1. Results were expressed as percentage of molecules lost through McrBC digestion (1-(2 -(Ct digested sample -Ct undigested sample) ))*100. As a control, the percentage of DNA methylation for At5g13440, which is unmethylated in wt, was estimated in all analyses.

Allele-specific expression assays
To assess the relative contribution of each allele to the population of mRNA in F1 individuals from reciprocal crosses between Col and Kondara, a single pyrosequencing reaction using the primers QQS_pyro_F1 (PCR) -TCAAAATGAGGGTCA-TATC ATGG, QQS_pyro_R1-biotin (PCR) -ATTGGATA-CAATGGCCCTATAACT and QQS_pyro_S1 (Pyrosequencing) -GATATTGGGCCTTATCAC was set up on a SNP polymorphic between the QQS parental coding sequences ( Figure S4C; position +285). Pyrosequencing was performed on F1 cDNA, as well as on 1/1 pools of parents cDNA to establish the allelic contribution to QQS expression. F1 genomic DNA is used as pyrosequencing control to normalize against possible pyrosequencing biases. Anything significantly driving allele-specific expression in hybrids is by definition acting in cis, since F1 nuclei contain a mix of all trans-acting factors [43,44].

Comparative genome hybridization (CGH)
CGH experiments were performed for Col-0* vs. Col-0 using Arabidopsis whole-genome NimbleGen tiling arrays [45]. The normalmixEM function of the mixtools package on R was used to found the normal distribution for the distribution of the Col-0*/Col-0 ratio with an expected number of gaussians of two. A Hidden Markov model [46] was used to find regions with copy number variation.

Analysis of genome wide DNA methylation (MeDIP-Chip)
DNA was extracted using DNeasy Qiagen kit and MeDIP-chip was performed on 1.8 mg of DNA as previously described in [47]. The methylated tiles were identified using the ChIPmix method [48]. Probes methylated in one line only (Col-0 or Col-0*) were used to create domains. Domains contain at least three consecutive or nearly consecutive (400 nt min, with one gap of 200 nt max) tiles with identical methylation patterns.

Overall codon-based Z-test of purifying selection
Available QQS coding-sequences (464 different accessions) were downloaded from the ''Salk Arabidopsis 1001 Genomes'' database (http://signal.salk.edu/atg1001/index.php). A. suecica QQS sequence (coming from the A. thaliana genome of this allotetraploid [49]) was also included in the analysis. The aligned sequences were used to calculate the probability of rejecting the null hypothesis (H 0 ) of strict-neutrality (dN = dS; where dN = number of nonsynonymous and dS = number of synonymous substitutions per site) in favor of the alternative hypothesis of purifying selection (H A ; dS.dN). The analysis was done using the MEGA5 software under the Nei-Gojobori method [50] with the variance of the difference calculated by the bootstrap method with 100 replicates. Our overall analysis of 465 sequences rejected H 0 in favor of H A (dN/dS = 0.5868; p-value,0.045).