Genome-wide mapping of binding sites of the transposase-derived SETMAR protein in the human genome

Introduction
Transposons of the mariner family are present in a wide variety of eukaryotic genomes, including humans [1][2][3]. These transposons contain a single gene encoding the transposase, flanked by short, <30-bp inverted terminal repeat (ITR) sequences. Mariner elements mobilize through a cut-and-paste mechanism catalyzed by the transposase, which belongs to a large family of recombinase proteins including retroviral/retrotransposon integrases and transposases, characterized by the DDE/D signature in the catalytic domain of the proteins [2,3]. Transposition results in the accumulation of hundreds or thousands of transposon copies over evolutionary time. However, most mariner copies appear to be dead remnants of once active transposons inactivated by mutations [4].
Mariner elements are represented by two subfamilies in the human genome: Hsmar1 [5] and Hsmar2 [6]. The first Hsmar1 element entered the primate genome lineage approximately 50 million years (Myr) ago, and transposition was ongoing until at least 37 Myr ago, producing 114 "full-length" Hsmar1 copies [5] (Fig. 1). However, none of the present copies encodes a functional transposase protein due to mutational inactivation. The Hsmar1 transposon copies are accompanied by 42 "gappy" Hsmar1 elements containing internal deletions in their transposase coding sequences, 5252 copies of solo-ITRs (containing a single ITR) and 2679 copies of an Hsmar1-related, paired-ITR element, MADE1 [5,7] (Fig. 1). Such miniature inverted-repeat transposable elements (MITEs) are thought to have been generated by internal deletions of longer transposons (median MADE1 length: 68 bp, Fig. 1); they make up the predominant fraction of DNA elements in flowering plants, and are often found in animal genomes [8].
Despite their parasitic nature, there is increasing evidence that transposable elements are a powerful force in gene evolution. Indeed, about 50 human genes are derived from transposable elements [7], among them genes that are responsible for immunoglobulin gene recombination in all vertebrates [9]. One of these "domesticated", transposase-derived genes is SETMAR (also called Metnase), a fusion gene containing an N-terminal SET domain fused in-frame to an Hsmar1 transposase [5,10]. The SETMAR gene has apparently been under selection; the transposase open reading frame is conserved, and shows only 2.4% divergence from a consensus Hsmar1 transposase gene sequence (vs. 8% average divergence between Hsmar1 transposase genes) [5]. The SET domain can be found in histone methyltransferases that regulate gene expression by chromatin modifications [11]. Accordingly, the SETMAR protein has been shown to methylate histone H3 lysines 4 and 36 in vitro, and has been proposed to play a role in DNA double-strand break (DSB) repair [12].
The cellular function(s) of SETMAR remain enigmatic. Cordaux et al. have found that selection has been preserving the ITR-binding activity of SETMAR [10]. Accordingly, both the transposase domain of SETMAR and the full-length SETMAR protein were shown to bind to Hsmar1 ITR sequences in vitro [10]. Thus, a function of the SETMAR protein is likely associated with its ability to specifically recognize numerous genomic binding sites represented by the Hsmar1 ITRs. Through its ability to bind to Hsmar1 transposon ITR sequences, and to catalyze specific histone modifications [12], SETMAR could contribute to transcriptional gene regulation by inducing targeted chromatin modifications. Indeed, mariner transposase domains were recently described to have a propensity to undergo domestication by recurrent fusion to host transcriptional regulatory domains, especially the Krüppel-associated box (KRAB) domain; these KRAB-transposase fusion proteins repress gene expression in a sequence-specific fashion [13].
SETMAR is broadly expressed in human tissues (Supplementary Fig. S1) and cell lines (Supplementary Fig. S2), suggesting a housekeeping function [12]. In addition, transcriptional variants of SETMAR show a broad expression pattern in human diseases including cancer [14][15][16]. Overexpression of SETMAR is prognostically favourable in kidney cancer and unfavourable in liver cancer, while most TCGA cancers have no significant survival association with SETMAR (https://www.proteinatlas.org/ENSG00000170364-SETMAR/pathology). The molecular explanation for these heterogeneous relationships is still unknown. To elucidate the pro- and anti-tumorigenic activities of SETMAR in mechanistic detail, it is crucial to identify the genomic targets to which SETMAR specifically binds in cancer cells and to link these sites to the regulation of gene expression. A recent study used the ChIP-exo approach to map Flag-tagged SETMAR binding sites in the hyper-aneuploid U2OS osteosarcoma cell line [17], which allowed the first evaluation of the SETMAR cistrome in human tumour cells. However, the majority of ChIP-exo peaks (69%; 605 out of 875) did not map to the expected target ITRs of the Hsmar1 transposons, which are considered the natural landing sites for SETMAR chromosome binding. Significant off-target binding has been reported in another (unpublished) study [18], but the reason for SETMAR's non-ITR binding remained unexplained. We therefore decided to map the genomic landscape of SETMAR in a near-haploid human leukemia cell line (HAP1) to identify on-target and off-target binding sites at high resolution and to elucidate their role in terms of gene expression. Our analysis revealed a perfect correlation between SETMAR binding sites and ITR sequences, without any off-target events, calling into question the previously proposed off-target regions. In addition, we identified ITR sequence conservation as a key factor determining the affinity of SETMAR for chromosomes.

Cell line and plasmids
The HAP1 cell line was maintained in complete Iscove's Modified Dulbecco's Medium (IMDM, Sigma) supplemented with 10% heat-inactivated, tetracycline-free foetal bovine serum (iBiotech) and 1% penicillin/streptomycin (Sigma) at 37°C with 5% CO2. The SETMAR knockout cell line was generated by CRISPR/Cas9 technology. The CCTGATCATGTAGTTGGACC gRNA sequence was designed to target the endonuclease cleavage to the beginning of the 2nd exon of the SETMAR gene (chr3: 4,312,904 (hg38); transcript: NM_006515). The mutated cells harbour a 10 bp deletion at the target site, which resulted in a frame shift and a premature stop codon 30 bp downstream of the cleavage locus. The generation of the knockout cell line was performed by Horizon Genomics (https://horizondiscovery.com/). The SETMAR knockout cell line was made transgenic with the Sleeping Beauty (SB) technology to express an N-terminally hemagglutinin (HA)-tagged version of the SETMAR protein as follows. The SB transposon donor was created by blunt-end cloning the BamHI/XbaI fragment of pcDNA-HA/SETMAR into the SalI/NotI site of pTOV-T11-SV40puro [19]. 500 ng of the resulting SB transposon donor plasmid, pTOV-HA-SETMAR-puro, was co-transfected with 100 ng of the pcGlobinSB100X transposase-expressing vector [20] using polyethylenimine into the knockout HAP1 cells, which were subjected to 1 µg/ml puromycin selection to obtain the polyclonal HA-SETMAR-expressing cell line. The expression of the HA-SETMAR transgene and its doxycycline inducibility were verified by Western blot analysis using an anti-HA antibody (11867423001, Roche).

Illumina sequencing and bioinformatic analysis
NGS libraries were prepared with the NuGEN Ovation Ultralow System V2 library preparation kit (NuGEN Technologies) following the manufacturer's instructions. 241 million paired-end reads were sequenced from two independent biological replicate experiments using an Illumina NextSeq 500 machine and the NextSeq® 500/550 Mid Output Kit v2 (Illumina). 97.31% of reads were mapped to the GRCh38 (hg38) human reference genome by Bowtie2 version 2.3.4.1 [21] using default parameters. Picard was used with default parameters to remove PCR duplicates from BAM files created by Samtools version 1.10 [22]. Repetitive segments of the genome were blacklist filtered (blacklist as of 05.05.2020, Anshul Kundaje Lab, Stanford University), and BAM files containing 185 million mapped reads were RPKM normalised using deeptools version 3.3.1 [23] with bamCoverage (bin size = 100 bp; smooth length = 300 bp). MACS2 version 2.2.6 [24] was used to identify ChIP peaks from bedGraph files, applying default parameters. IP and corresponding Input data were processed in parallel. Peaks identified in Input were filtered out from IP samples using Bedtools version 2.29.0 [25]. Eleven ChIP peaks fell into unmappable segments of the hg38 reference genome and were therefore excluded from further analysis. Computer-randomized peak sets were generated by Bedtools as a null model for significance tests. Blacklisted regions were excluded from random peak set generation. Annotation of SETMAR-HA chromosomal binding sites was performed according to the genomic categories of HOMER [26]. Peaks (observed and random) were extended by +/-500 bp and their overlap ratios with the appropriate annotation categories were determined. In the pie charts, only peak summits were considered (peak sizes were not extended).
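The peak-extension and null-model steps described above can be illustrated with a minimal, standard-library Python sketch. It is not the pipeline used in the study (which relied on Bedtools and HOMER): the function names, the tuple layout `(chrom, start, end)` and the empirical p-value are illustrative assumptions, and blacklist exclusion during shuffling is deliberately left out for brevity.

```python
import random

def extend(peaks, flank=500):
    """Extend each (chrom, start, end) interval by `flank` bp on both sides."""
    return [(c, max(0, s - flank), e + flank) for c, s, e in peaks]

def overlaps(a, b):
    """True if two half-open intervals lie on the same chromosome and overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def overlap_ratio(peaks, features, flank=500):
    """Fraction of flank-extended peaks that overlap at least one feature."""
    ext = extend(peaks, flank)
    hits = sum(any(overlaps(p, f) for f in features) for p in ext)
    return hits / len(ext) if ext else 0.0

def shuffle_peaks(peaks, chrom_sizes, rng):
    """Move each peak to a random position on its own chromosome, keeping its width.
    Assumption: no blacklist handling, unlike the analysis described in the text."""
    out = []
    for c, s, e in peaks:
        width = e - s
        start = rng.randrange(0, chrom_sizes[c] - width + 1)
        out.append((c, start, start + width))
    return out

def empirical_p(peaks, features, chrom_sizes, n_iter=1000, seed=0):
    """Observed overlap ratio and the share of random peak sets that reach it."""
    rng = random.Random(seed)
    observed = overlap_ratio(peaks, features)
    at_least = sum(
        overlap_ratio(shuffle_peaks(peaks, chrom_sizes, rng), features) >= observed
        for _ in range(n_iter)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_iter + 1)
```

The quadratic overlap search is acceptable for a toy example; for genome-scale peak sets a sorted sweep or an interval tree would be the more realistic choice.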

Results and discussion
To map the chromatin binding sites of SETMAR with high spatial resolution, we set up an experimental system in the nearly haploid HAP1 lymphoblastoid leukaemia cell line [33] in which the endogenous SETMAR locus was knocked out by CRISPR/Cas9 technology, followed by complementation with a doxycycline-inducible isoform of SETMAR carrying an N-terminal hemagglutinin tag (pTOV-HA-SETMAR-puro, Fig. 2A). The haploid chromosome set allows us to maximize NGS resolution and peak calling accuracy, while knockout of the parental allele is expected to prevent competition between endogenous and epitope-tagged SETMAR isoforms during chromatin binding. Western blot analysis shows that the level of SETMAR-HA induction scaled linearly with the doxycycline (dox) concentration, while the tagged protein was not expressed in the absence of drug treatment (Fig. 2B). [...] significance thresholds (Supplementary Table S1) associated with the 23 chromosomes except the mitochondrial genome (mtDNA), which was used as an internal negative control. Representative binding sites were validated by ChIP-qPCR measurements in dox-treated and untreated samples (Fig. 3 and Supplementary Fig. S4), confirming the specificity of our peak detection. We next analysed the overlap of SETMAR-HA binding sites with annotated genomic categories of the hg38 reference genome (Fig. 4). Functional annotation revealed that most SETMAR-HA sites were located in intergenic regions (52%; 398 peaks) and introns (43%; 329 peaks; Fig. 4A); however, the observed frequencies did not differ from the expected (theoretical) distribution. Statistically sig[...] Fig. 4B). The number of peaks in TSS/promoter regions represented only 4% of the binding sites (27 peaks); however, SETMAR-HA was bound to 288 protein-coding genes when TSS-exon-intron-TTS regions were considered (we note that there may be multiple peaks within the same gene).
GO-term analysis of SETMAR-associated genes showed enrichment of the MAPK signalling pathway (summarized in Table 1), suggesting a possible role for SETMAR in cell cycle control. Indeed, overexpression of SETMAR significantly reduced the proliferation rate of U2OS osteosarcoma cells [17], consistent with this model. Regarding intergenic regions, all the identified peaks (398 sites) were located in ITR sequences (Fig. 5) or MADE1 elements flanked by ITRs (Fig. 4B). Pileup and Venn diagram analysis (Fig. 5B-C) highlights the perfect colocalization between peak summits and ITR motifs within genic and intergenic regions. We note, however, that only a subset of ITRs was accessible for SETMAR binding (1227 motifs; 11.3%), which is still significant compared with a randomized distribution (p < 2.2 × 10⁻¹⁶). The unavailability of ITRs at a given time may be related to local chromatin openness, cis- and trans-acting factors, cell cycle stage, or other unknown elements of chromatin structure that have yet to be explored. To address the variance of ITR frequencies and SETMAR binding sites related to chromosome size, we plotted the number of ITRs and SETMAR-HA peaks per chromosome as a function of chromosome length (Fig. 5C). The results clearly show that the distribution of SETMAR-HA binding sites and ITR motifs was strongly correlated with chromosome length and showed significant covariation (Spearman r = 0.89; p < 0.001). The X chromosome is a notable exception, as SETMAR binding sites did not correlate with ITR numbers and chromosome size. This unexpected behaviour of chromosome X awaits explanation. To identify critical nucleotide positions in the core ITR motif that are required for SETMAR's efficient chromatin binding, we grouped ITRs based on the number of mismatches in the 19-nt 5′-GGTGCAAAAGTAATTGCGG-3′ sequence (0-3MM groups, Supplementary Table S1) and plotted the SETMAR-HA signal over the categories (Fig. 6).
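The grouping of ITRs by mismatch number can be sketched as a position-wise comparison against the 19-nt core motif quoted above. This is only an illustrative sketch, not the authors' script: the function names and the label format (e.g. `C17T`) are assumptions, candidate sites are assumed to be already aligned and given on the motif strand, and reverse-complement handling is omitted.

```python
from collections import defaultdict

CORE_ITR = "GGTGCAAAAGTAATTGCGG"  # 19-nt core ITR motif quoted in the text

def mismatches(site, motif=CORE_ITR):
    """List substitutions relative to the motif as labels like 'C17T'
    (motif base, 1-based position, observed base)."""
    site = site.upper()
    if len(site) != len(motif):
        raise ValueError("site and motif must have the same length")
    return [
        f"{m}{i}{s}"
        for i, (m, s) in enumerate(zip(motif, site), start=1)
        if m != s
    ]

def group_by_mismatch(sites, max_mm=3):
    """Bin aligned sites into '0MM' ... '3MM' groups; sites with more
    mismatches than `max_mm` are skipped."""
    groups = defaultdict(list)
    for site in sites:
        mm = mismatches(site)
        if len(mm) <= max_mm:
            groups[f"{len(mm)}MM"].append((site, mm))
    return dict(groups)
```

For instance, `mismatches("GGTGCAAAAGTAATTGTGG")` returns `["C17T"]`, i.e. a single C-to-T change at position 17 of the motif.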
We found that ChIP-seq scores, related to the affinity of SETMAR-HA binding, were inversely related to the number of mismatches in the ITR motif (Fig. 6A), i.e. the greater the number of mismatches, the lower the affinity of SETMAR (p < 2.2 × 10⁻¹⁶). Furthermore, nucleotide positions G2, G4, T14, C17, G18 and G19 appeared to be essential for the association of SETMAR and ITRs, as single-nucleotide changes in these bases significantly reduced the affinity of SETMAR-HA binding (Fig. 6B).
Based on the degree of affinity loss and the prevalence of mutational change, C-to-T and C-to-A substitutions at position C17 proved to be the most critical mutations (change in affinity: greater than 4-fold; cumulative allele frequency: 33%). Compared to position C17, G-to-A and G-to-T mutations of G18 were also widespread (38%) but did not cause similar affinity changes, while G-to-A mutations of G2 and G4 led to a large decrease in affinity but were rare (2%). Functional annotation of the ITR categories defined by the number of mismatches showed no difference in their genomic localization (Fig. 6C). It is noteworthy that more than 60% of the identified mutations were G-to-A and C-to-T changes that correspond to "clock-like" mutation signatures in the COSMIC database [34]. Clock-like mutations are known to form continuously in normal (and cancerous) human cell types, generating mutations at a steady rate throughout the lifetime of cells [35]. Since many ITRs occur in pairs along the chromosomes and one motif is typically of high fidelity (0MM group), the neutral allele is free to mutate during evolution while the conserved motif can still bind and position SETMAR. We found 454 paired ITRs, of which 388 (85.4%) were 0MM/1-3MM ITR pairs. In this way, clock-like ITR polymorphisms provide a rationale for fine-tuning SETMAR's biological function related to transcription. Accordingly, when SETMAR-associated genes were grouped by the number of mismatches in the ITR motif, the high-fidelity group (0MM) showed significantly reduced mRNA expression levels compared to the random gene group (Fig. 7). The number of ITR mismatches was positively related to gene expression levels, i.e., the lower the number of ITR mismatches, the stronger the repression of SETMAR-bound gene loci.
The preferential association of SETMAR and repressed genes is fully consistent with previous results [10], which provide strong evidence for SETMAR binding to the most lowly expressed genes with FPKM values between zero and one.
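The base changes discussed above (C-to-T, C-to-A, G-to-A, G-to-T) fall into two standard classes: transitions exchange a purine for a purine or a pyrimidine for a pyrimidine, whereas transversions exchange a purine for a pyrimidine or vice versa. The short helper below is a hypothetical illustration of that classification and is not part of the analysis reported here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_class(ref, alt):
    """Classify a single-base DNA substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

Under this definition, the G-to-A and C-to-T changes highlighted as clock-like are both transitions, while C-to-A and G-to-T are transversions.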

Conclusions
The results presented in this study clearly show that, in living cells, SETMAR preferentially targets Hsmar1 transposon ends (ITRs), which are dispersed throughout the human genome. In contrast to previous studies, we could not detect any off-target binding events at non-ITR sequences. Possible reasons for the differences may include the use of different cell lines (U2OS osteosarcoma cells vs. HAP1 lymphoblastic leukaemia cells), tags (FLAG vs. HA), NGS platforms (SOLiD vs. Illumina), and the low NGS coverage of the previous study [17]. In our experiment, SETMAR was bound to the theoretically expected sequences [10] targeted by its transposase domain. The probability that SETMAR binds to ITR sequences by chance is extremely low (p-value < 2.2 × 10⁻¹⁶; Fig. 4B). In addition, several ChIP peaks were validated by qPCR in samples with and without doxycycline induction (Fig. 3 and Supplementary Fig. S4), confirming the specificity of ChIP peak detection.
In conclusion, sequence fidelity of the ITR motif has been identified as the only factor that determines the affinity of SETMAR to chromosomes, such that higher ITR fidelity and increased SETMAR chromatin binding resulted in stronger suppression of SETMAR-bound gene loci. This mechanism may be part of a subtle evolutionary strategy to fine-tune transcriptional processes regulated by SETMAR.

Key points
1. SETMAR/Metnase preferentially targets Hsmar1 transposon ends (ITRs) in living cells
2. Sequence fidelity of the ITR motif determines the affinity of SETMAR/Metnase to chromosomes
3. Higher ITR fidelity results in increased affinity for chromatin and stronger repression of SETMAR-bound gene loci

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.