Functional evaluation of transposable elements as transcriptional enhancers in mouse embryonic and trophoblast stem cells

The recurrent invasion and expansion of transposable elements (TEs) throughout evolution brought with it a vast array of coding and non-coding sequences that can serve as substrates for natural selection. Namely, TEs are thought to have contributed to the establishment of gene regulatory networks via their cis-acting elements. Both the embryonic and extraembryonic lineages of the early mouse embryo are thought to have benefited from the co-option of TEs as distal enhancer elements. However, there is little to no evidence that these particular TEs play significant roles in the regulation of gene expression. Here we tested for roles of TEs as enhancers in mouse embryonic and trophoblast stem cells by combining bioinformatic analyses with genetic and epigenetic editing experiments. Epigenomic and transcriptomic data from wildtype cells suggested that a large number of TEs played a role in the establishment of highly tissue-specific gene expression programmes. Through genetic editing of individual TEs we confirmed a subset of these regulatory relationships. However, a wider survey via CRISPR interference of RLTR13D6 elements in embryonic stem cells revealed that only a minority play significant roles in gene regulation. Our results suggest that a small proportion of TEs contribute to the mouse pluripotency regulatory network, and highlight the importance of functional experiments when evaluating the role of TEs in gene regulation.


Introduction
Our knowledge of the tissue-specific regulatory landscape of genomes has vastly increased over the last 10 years, thanks in part to large efforts from consortia such as ENCODE and NIH Roadmap (ENCODE Project Consortium 2012;Roadmap Epigenomics Consortium et al. 2015). But whilst such mapping efforts have been instrumental in categorising the non-coding genome into different types of biochemical activity, our understanding of the associated functional roles remains limited. One of the grand challenges of the post-ENCODE era has been to ascribe regulatory function to the biochemically active non-coding portion of genomes.
This question is particularly pertinent to transposable elements (TEs) (Elliott et al. 2014;Doolittle and Brunet 2017), which often display marks of regulatory activity in a species-specific manner (Jacques et al. 2013;Carninci 2014). On one hand, successful TEs are expected to display such active profiles, which serve the selfish interests of TEs but may act neutrally with respect to host fitness. On the other, TEs can be co-opted (or exapted) by the host to serve gene regulatory roles, such as alternative promoters or enhancers (Chuong et al. 2017). The distinction between these two scenarios relies on approaches that query the causal links between TEs, gene expression and phenotype.
Primarily based on epigenomic data, evidence to date suggests that particular TE subfamilies have contributed to the evolution of tissue-specific gene regulatory networks in contexts such as early development (Kunarso et al. 2010), placentation (Chuong et al. 2013), pregnancy (Lynch et al. 2011) and innate immunity (Chuong et al. 2016), amongst others. Transcription factor (TF) binding motifs hosted within the regulatory portion of TEs enable their expression in specific tissues! (Sundaram et al. 2014;2017), presumably in a manner that enables vertical inheritance of new TE insertions via the germline. In this respect, mammalian preimplantation development is a seemingly well exploited context for TE expansion, driving genetic conflicts with the host, as well as creating opportunities for TE exaptation (Rodriguez-Terrones and Torres-Padilla 2018). In the mouse, TE-derived regulatory activity has been implicated at multiple stages of preimplantation development. Namely, MERVL elements become highly activated upon zygotic genome activation and are thought to play a role in the establishment of the 2-cell stage expression programme (Macfarlan et al. 2012). Transition from the 2-cell stage and development progression to the blastocyst stage appear to depend on LINE-1 expression (Jachowicz et al. 2017;Percharde et al. 2018). Finally, work from embryonic and trophoblast stem cells (ESCs and TSCs, respectively), suggests a divergence in TE regulatory activity that is concomitant with the separation of the embryonic and extraembryonic lineages at the blastocyst stage (Kunarso et al. 2010;Chuong et al. 2013). In ESCs, TE families such as RLTR13D6 bind key ESC TFs (e.g., OCT4, NANOG), whereas a distinct set of TE families (e.g., RLTR13D5) bind factors essential for the maintenance of the TSC state (CDX2, ELF5, EOMES). These elements are enriched for histone marks that are characteristic of distal enhancers and lie near genes that are expressed in the lineages where they are active (Kunarso et al. 2010;Chuong et al. 2013). TE enhancer activity depends on the cooperative action of multiple TFs, whose binding motifs appear to have been already present in the corresponding ancestral TE insertions (Sundaram et al. 2017). However, it remains unclear to what extent such lineage-specific TEs are important for maintaining gene expression programmes during preimplantation development.
Here we have tested the gene regulatory function of specific TE subfamilies in ESCs and TSCs using genetic and epigenetic editing approaches, and compared them with predictions from extensive analyses of epigenomic and transcriptomic data. We identify a number of TEs that are important to drive expression of lineage-specific genes. However, our data suggest that these constitute a minority of all the putative TE-derived enhancers identified through bioinformatic analyses, highlighting the importance of functional tests when assessing the contribution of TEs to gene regulatory networks.

TE-derived enhancers in ESCs and TSCs are highly tissue-specific
To identify TEs with putative regulatory potential in embryonic and extraembryonic lineages of the blastocyst ( Figure 1A), we focused on a set of TE subfamilies that were previously shown to bind key TFs in ESCs (RLTR9, RLTR13D6) (Kunarso et al. 2010) or TSCs (RLTR13B, RLTR13D5) (Chuong et al. 2013). Using publicly available data, we selected TEs bearing the hallmarks of enhancer elements, namely open chromatin status (from ATAC-seq data), binding of at least one of three key TFs (NANOG,OCT4 or SOX2 in ESCs;ELF5,EOMES or CDX2 in TSCs) and enrichment for H3K27ac. To stringently rule out gene promoters we excluded TEs enriched for H3K4me3 and/or lying within 500 bp of known mRNA transcriptional start sites. These putative 'TE+ enhancers' also displayed H3K4me1 marking ( Figure   1B) and were bound by multiple proteins normally associated with enhancer activity, such as p300 and the Mediator and cohesin complexes (Supplementary Figure 1A). This stringent selection led to the identification of 634 TE+ enhancers in ESCs and 358 in TSCs, which represent respectively 9.6% and 13% of all the TE copies in the subfamilies considered.
Strikingly, whilst a substantial proportion of ESC TE-enhancers displayed open chromatin in multiple tissues, TE+ enhancer activity was far more restricted to ESCs ( Figure 1E). Similar results were obtained for TSC enhancers (Supplementary Figure   2), in line with previous work (Chuong et al. 2013). These results suggest that TE+ enhancers are particularly optimised for activity within their respective early embryonic tissues, possibly through the synergistic action of multiple TF binding events (Sundaram et al. 2017). Co-option of TE+ enhancers may therefore particularly benefit genes that require highly tissue-specific expression.
TE enhancer activity cannot be faithfully predicted from TF binding motifs Despite their sequence similarity, only a relatively small fraction of TEs from any given subfamily bear enhancer-like profiles. It was previously suggested that TE enhancer activity in the ESC and TSC contexts is determined by the presence of key TF binding motifs, which have otherwise been mutated in inactive TEs (Chuong et al. 2013;Sundaram et al. 2017). However, it remains unclear whether such motifs are fully determinant of TE enhancer activity. We therefore performed TF motif analyses of TE+ enhancer sequences. For comparison, we identified TEs from the same subfamilies with high read mappability but that did not display enhancer marks, henceforth termed "non-enhancer TEs" (Supplementary Figure 3A). Although we found enrichment of several motifs at TE+ enhancers (versus non-enhancer TEs), no single motif was predictive of enhancer-like profiles (Figure 2A  '2'-'TF'mo4fs'do'not'predict'TE'enhancer'poten4al.!A,B)!Abundance!of!TF!mo<fs!at!TE+! enhancers!and!nonCenhancer!TEs!from!RLTR13D6!(A)!and!RLTR9!(B)!subfamilies.!C)!A!minority! of!TE!copies!with!enhancer!ac<vity!in!a!reporterCbased!assay!(FIREWACh)!harbour!SOX2,! NANOG!or!OCT4!mo<fs,!as!indicated!in!green.!D)!Analysis!of!BSCseq!and!TABCseq!data!shows! that!nonCenhancer!TEs!display!higher!levels!of!DNA!methyla<on!than!TE+!enhancers!and!only! moderately!higher!levels!of!hydroxymethyla<on!(***!p<1EC10,!Wilcoxon!test).!E)!DNaseCseq! and!H3K27ac!ChIPCseq!profiles!of!TE+!enhancers!and!nonCenhancer!TEs!in!wildtype!and!Dnmt! TKO!ESCs,!showing!similar!profiles!between!both!cell!lines.!! Figure 3B). For example, whilst SOX2 binding motifs were present in nearly all (81-91%) RLTR13D6 and RLTR9E TE+ enhancers, a high proportion (48-58%) of nonenhancer TEs also contained this motif. Notably, motifs for other TFs (ESRRB, KLF4) that have been shown to cooperate with SOX2 for RLTR9E enhancer activity (Sundaram et al. 2017) were present in similar abundance at both enhancer and nonenhancer TEs (Figure 2A). We then asked whether TF binding motifs predicted plasmid-based enhancer activity better than they predict enhancer-like chromatin profiles. Using data from a high-throughput reporter assay in ESCs (Murtha et al. 2014), we found that SOX2, OCT4 and NANOG binding motifs were present in only 12-30% of TEs with enhancer activity ( Figure 2C). This suggests that simple sequence features, such as the motifs considered here, are poor predictors of TE enhancer activity, which is in line with recent findings in human ESC enhancers (Barakat et al. 2018).
Notably, 64% of RLTR13D6/RLTR9 copies with enhancer activity in the reporter assay did not display an enhancer-like chromatin profile. We therefore asked whether TF binding and chromatin opening at non-enhancer TEs was repressed by chromatin features. As we previously described (la Rica et al. 2016), non-enhancer TEs display higher levels of DNA methylation than TE+ enhancers ( Figure 2D). However, removal of DNA methylation did not lead to increased enhancer activity at nonenhancer TEs, as judged from DNase-seq and ChIP-seq data from in ESCs lacking DNA methylation (triple knockout of Dnmt1/3a/3b; Figure 2E) (Domcke et al. 2015).
Similar results were obtained with data from hypomethylated naïve ESCs (Supplementary Figure 3C) (Kim et al. 2018). Additionally, we found no evidence of other chromatin marks that could be maintaining TE enhancer activity repressed (Supplementary Figure 3D).
All together, these data show that TE enhancer capacity appears to behave nondeterministically with respect to TF motifs or repressive chromatin marks. Therefore, whilst chromatin profiling and reporter assays are useful probabilistic indicators of potential enhancer activity, the regulatory action of TEs has to ultimately be tested through molecular manipulations in their genomic environment.

TE-derived enhancers interact with lineage-specific genes
To establish correlations between TE+ enhancer activity and gene expression, studies to date have largely relied on the linear proximity between TEs and genes.
This disregards 3D genome conformation, which enables long-range interactions and is not restricted to one-to-one relationships between TEs and genes. We therefore coupled TE+ enhancers to genes they putatively regulate based on promoter capture Hi-C (PCHi-C) data that we recently generated in ESCs and TSCs (Schoenfelder et al. 2018). Only 34-44% of TE+ enhancers interacted with at least one gene promoter, which was nonetheless higher than the proportion of non-enhancer TEs with gene promoter interactions (21-28%, Supplementary Figure 4A). In contrast, a high proportion of TE-enhancers (65-70%) interacted with gene promoters. The contrast between TE+ and TE-enhancers could be explained by the fact that the latter are preferentially positioned within gene-rich, active regions (known as the 'A' spatial compartment), whereas TE+ enhancers tend to be located within gene-poor, inactive regions ('B' compartment; Supplementary Figure 4B). Accordingly, TE+ enhancers and TE-enhancers interact with largely non-overlapping groups of genes (Supplementary Figure 4C).
To analyse correlations between enhancers and gene expression, we only considered genes that interact exclusively with TE+ or TE-enhancers ( Figure 3A).
For both ESCs and TSCs, we found that TE+ enhancers interacted with genes that displayed higher expression levels when compared to the genome-wide average or to genes interacting with non-enhancer TEs ( Figure 3B). Given the tight tissue specificity of TE+ enhancer profiles that we described above ( Figure
Heterozygous deletion of the Smarcad1-interacting TE also led to a small but significant decrease in gene expression ( Figure 4H). It remains unclear whether failure to isolate homozygous clones was due to a loss of ESC self-renewal caused by Smarcad1 depletion (Hong et al. 2009). Excision of the remaining two TEs in ESCs had no effect on the expression of target genes ( Figure 4I,J), including Akap12, suggesting that the previously reported regulatory role of this TE is cell linedependent (Sundaram et al. 2017).
In addition to genetic editing experiments, we also analysed data from an ESC line of a hybrid 129 × Cast background, which displays substantial sequence variation between alleles. Using ATAC-seq data from this line (Giorgetti et al. 2016), we first identified 98 TEs with biased chromatin accessibility signal across the two genetic backgrounds, suggestive of allele-specific enhancer activity. We then analysed RNAseq data from the same cell line (Gendrel et al. 2014) to test for effects on allelic Analysis of the ChIP-seq data revealed a reduction in H3K27ac signal at RLTR13D6 elements that were predicted to be targeted by each of the sgRNA sets, whereas H3K27ac levels at RLTR9 elements were unaffected ( Figure 5A,B). Notably, elements targeted by multiple sgRNAs displayed a larger reduction in H3K27ac than those targeted by a single sgRNA ( Figure 5A,B), in line with recent findings (Fuentes et al. 2018). For sgRNA set 1, CRISPRi resulted in a >2-fold loss of H3K27ac signal at 30 (56%) RLTR13D6-targeted elements that overlapped a H3K27ac peak. A similar number was obtained for sgRNA set 2 (n=25, 46%). We then asked how these changes affected gene expression. Strikingly, only three genes were significantly differentially expressed upon CRISPRi across both sets of sgRNAs: Tdrd12, Spp1 and Hook3 ( Figure 5C,D). In all three cases there was an associated RLTR13D6 element targeted by one or both of the sgRNA sets ( Figure 5E-G). Most
The effect of CRISPRi on Hook3 expression was far more subtle, despite a pronounced loss of H3K27ac at the associated RLTR13D6 element ( Figure 5G).
Notably, Spp1 downregulation occurred in a manner that seemed largely independent of changes in H3K27ac levels ( Figure 5F), suggesting that enhancer inactivation occurred through deposition of repressive marks and/or via impairment of TF binding by dCas9-KRAB. For the remainder of the targeted TEs, although as a group the associated genes displayed a significant downregulation upon CRISPRi, these changes were limited to at most 1.4-fold (Supplementary Figure 10). This included genes interacting with TE+ enhancers displaying a similar or greater H3K27ac loss to that observed at Tdrd12-or Spp1-associated RLTR13D6 elements.
These results suggest that only a small minority of TE+ enhancers play a substantial role in the regulation of gene expression in ESCs. This is in contrast with the broader correlations found by analysis of epigenomic and transcriptomic data in wildtype cells, highlighting the need to establish causal roles via direct molecular manipulation of TEs.

Discussion
TEs are increasingly being presented as major contributors to gene regulatory networks in a variety of contexts (Chuong et al. 2013;Kunarso et al. 2010;Jacques et al. 2013;Sundaram et al. 2014;Lynch et al. 2011;Fuentes et al. 2018;Imbeault et al. 2017). Yet most of these studies have relied largely on the idea that biochemical activity at the chromatin level is indicative of function, a concept famously associated with the findings of the ENCODE project (ENCODE Project Consortium 2012) that triggered a still ongoing debate (Graur et al. 2013;Ponting 2017;Doolittle and Brunet 2017). The use of genetic and epigenetic editing tools as presented here, and also used by other labs (Chuong et al. 2016;Jachowicz et al. 2017;Fuentes et al. 2018), initiate a much needed move to evaluating causal roles for TEs in gene regulation.
Our work has revealed that a set of TEs with regulatory potential in ESCs act mostly neutrally with respect to their effects on gene expression, which contrasts with earlier suggestions from analyses of chromatin profiling experiments (Kunarso et al. 2010).
In the absence of a substantial contribution of ESC TE+ enhancers to gene regulation, the striking correlations emerging from profiling efforts most likely reflect the fact that TE insertions are best tolerated in regions where their tissue-specific enhancer action matches the expression profiles of nearby genes. Whilst the enhancer activity of these TEs could be inconsequential for gene expression, this genomic 'safe niche' would be permissive for fixation by genetic drift. Additionally, we cannot exclude the possibility that some TE+ enhancers act redundantly with TEenhancers, despite our attempt to isolate the effects of each of these enhancer groups. Such redundant TE+ enhancers could still be important to ensure regulatory robustness, but selective pressure would be largely reduced.
In a contrasting example to our findings, epigenetic editing work by the Wysocka lab has revealed that a large proportion of LTR5Hs elements play significant roles in the regulation of nearly 300 genes in an human embryonal carcinoma cell line (Fuentes et al. 2018). It is worth noting that in the latter study a larger number of elements were targeted by CRISPRi and underwent H3K27ac loss, probably due to the use of twelve sgRNAs simultaneously (Fuentes et al. 2018). Whilst it remains possible that a similar approach in mouse ESCs would uncover additional regulatory RLTR13D6 elements, our data suggests that they would likely remain a minority amongst all silenced elements. A more parsimonious explanation to the differences between LTR5Hs and RLTR13D6 action is that the regulatory effects of TEs are variable between subfamilies and cellular contexts. Indeed, other TEs considered here and whose effects we did not test by CRISPRi, may play important roles in ESC and TSC gene regulation. These considerations further emphasize the need to perform functional experiments on a case-by-case basis.
Despite the neutral action of most TEs analysed here, we have uncovered a number of elements that are act as enhancers of gene expression in ESCs (Tdrd12, Smarcad1, Spp1 and Hook3) and TSCs (Map3k8 and Scarf2). But do these TE insertions impact on cellular and organismal phenotypes, ultimately affecting host fitness? TDRD12 is a protein essential for secondary piRNA production in mice and is essential for male fertility (Pandey et al. 2013;. It is therefore possible that the RLTR13D6 element we identified is also active during male germ line development, which would ironically implicate it in genome defence against the mobility of younger, piRNA-targeted TEs. Smarcad1 knockout mice are subviable, displaying growth defects and low fertility (Schoor et al. 1999). It remains to be seen whether the activity of the RLTR9E element we identified plays any role in these phenotypes, possibly by affecting early embryonic differentiation (Hong et al. 2009).
Other genes that we found to be regulated by TEs are not essential for development: both Map3k8 (Dumitru et al. 2000) or Spp1 (Rittling et al. 1998) knockout mice are viable and appear to develop normally. In these cases it is likely that the action of the associated TEs is inconsequential to the organism, although there could be more subtle embryonic phenotypes or these TEs could play additional roles outside of the early developmental context. Notably, most exemplars of phenotypically relevant TEderived regulatory elements have been uncovered by analysing naturally occurring phenotypes (Lisch 2013;Chuong et al. 2017). The reverse approach of searching for phenotypes linked with putative cis-acting TEs has the potential to reveal a wide array of adaptive TE insertions, although this is nonetheless challenging and examples to date are limited (Chuong et al. 2017).
At a time when epigenomic data are providing abundant indications that TEs play functional regulatory roles, our findings place these observations into perspective and provide a reminder that a large proportion of mammalian genomic sequences are neutrally evolving. Yet evolutionary tinkering (Jacob 1977) may still benefit from a large amount of 'junk DNA' that occasionally can be put to good use. As was fittingly put by Goodier and Kazazian, "evolution has been adept of turning some 'junk' into treasure" (Goodier and Kazazian 2008).
Single GFP-positive cells were sorted into 96-well plates 48 hours post-transfection.
After 7-10 days, growing colonies were genotyped using DNA isolated using QuickExtract (Lucigen) and the primers listed on Supplementary Table 1 ChIP-seq Cells were fixed with 1% formaldehyde in PBS for 12 minutes, which was then quenched with glycine (final concentration 0.125 M). Fixed cells were washed and lysed as previously described (Latos et al. 2015). Chromatin was sonicated to an average size of 200-700 bp using a Bioruptor Pico (Diagenode). Immunoprecipitation was performed using 15 µg of chromatin and 2.5 µg of anti-H3K27ac antibody (Active Motif #39133). DNA purification was performed using the GeneJET PCR Purification Kit (Thermo Fisher) with DNA eluted in 80 µL of elution buffer. ChIP-seq libraries were prepared from 1-5 ng eluted DNA using NEBNext Ultra II DNA library Prep Kit (New England BioLabs). Libraries were sequenced on an Illumina NextSeq 500 with single-ended 75 bp reads at the Barts London Genome Centre.

Primary data processing
ChIP-seq and ATAC-seq data generated here or from external datasets (Supplementary Table 2) were mapped by trimming reads using Trim_galore! and aligning to the mm10 genome assembly using Bowtie2 v2.1.0 (Langmead and Salzberg 2012), followed by filtering of uniquely mapped reads. ChIP-seq peak detection was performed using MACS2 v2.1.1 (Zhang et al. 2008) with -q 0.05; for histone marks the option --broad was used. ATAC-seq peak detection was performed using F-seq v1.84 (Boyle et al. 2008) with options -f 0 -t 6. For multi-tissue DNAseseq data, peak annotation files generated by ENCODE were used.
RNA-seq data generated here or from external datasets (Supplementary Table 2) were mapped by trimming reads using Trim_galore! and aligning to the mm10 genome assembly with Tophat v2.0.9 (Trapnell et al. 2009) using a transcriptome index from Illumina's iGenomes. For ENCODE multi-tissue RNA-seq data, FPKM expression values were downloaded directly from ENCODE and the data were normalised by histogram matching.
Processed CpG calls from publicly available BS-seq and TAB-seq data were downloaded from the respective GEO submissions (Supplementary Table 2).

TE and enhancer annotations
To identify TE+ enhancers, coordinates for RLTR9, RLTR13D6, RLTR13D5 and RLTR13B elements were taken from the mm10 RepeatMasker annotation and filtered to remove elements either intersecting H3K4me3 ChIP-seq peaks or lying within 500 bp of a TSS from the NCBI RefSeq annotation. Enhancer-like elements were then selected if they intersected with all three of the following: H3K27ac ChIPseq peaks, ATAC-seq peaks and binding sites for any of three key TFs (OCT4, NANOG or SOX2 for ESC TE+ enhancers; CDX2, ELF5 or EOMES for TSC TE+ enhancers).
To define TE-enhancers, ATAC-seq peaks that did not overlap any TEs annotated by RepeatMasker were used as a basis for potential enhancer regions. These regions were then filtered in the same manner as described for TE+ enhancers.
Non-enhancer TEs were defined as mappable elements displaying low ATAC-seq signal. Mappability scores were obtained by mapping in silico generated reads and measuring the proportion of the element's length covered by uniquely mapped reads.
Elements with a score higher than 0.5 were kept and from those with the lowest ATAC-seq signal selected as non-enhancer TEs. The number of elements selected for each class equated to twice the number of TE+ enhancers identified for the same class.
Linking TEs and enhancers to gene promoters Processed promoter-genome spatial interactions were downloaded from the respective ArrayExpress submission (Supplementary Table 2). Each element of interest (TE+ enhancers, TE-enhancers, non-enhancer TEs) was intersected with the list of non-promoter restriction fragments in the PCHi-C data and coupled to the gene promoter(s) it interacted with. Expression values from RNA-seq data were assigned to each element based on these relationships. To distinguish the putative effects of TE+ and TE-enhancers, only genes interacting exclusively with one type of enhancer were considered.

Motif analysis
Motif analysis of TE copies was performed using the FIMO tool of the MEME SUITE v5.0.1 (Bailey et al. 2015) using the HOCOMOCO v11 TF motif database. For the FIREWACh enhancer reporter assay, coordinates for the identified regulatory elements and the input library fragments were obtained from the respective publication (Murtha et al. 2014).
Hybrid 129 × Cast ESC data Processed allele-specific data from ATAC-seq experiments on 129 × Cast hybrid ESCs were downloaded from the respective GEO submission (Supplementary Table   2). Peaks containing at least 5 reads in one of the alleles and an allelic ratio (129 reads / total reads) larger than 0.8 or lower than 0.2 were selected as allele-specific regulatory elements. These were intersected with TE annotations to identify putative allele-specific TE-derived regulatory elements. RNA-seq data from the same cell line (Supplementary Table 2) were used to extract expression values for genes within 100 kb of allele-specific TE regulatory elements.

Datasets
ChIP-seq and RNA-seq data generated in this study are available from the NCBI Gene Expression Omnibus repository under the accession number GSE122856.
Sources of external datasets used are detailed in Supplementary Table 2.