Subfamily-specific differential contribution of individual monomers and the tether sequence to mouse L1 promoter activity

The internal promoter in L1 5’UTR is critical for autonomous L1 transcription and initiating retrotransposition. Unlike the human genome, which features one contemporarily active subfamily, four subfamilies (A_I, Gf_I and Tf_I/II) have been amplifying in the mouse genome in the last one million years. Moreover, mouse L1 5’UTRs are organized into tandem repeats called monomers, which are separated from ORF1 by a tether domain. In this study, we aim to compare promoter activities across young mouse L1 subfamilies and investigate the contribution of individual monomers and the tether sequence. We observed an inverse relationship between subfamily age and the average number of monomers among evolutionarily young mouse L1 subfamilies. The youngest subgroup (A_I and Tf_I/II) on average carry 3–4 monomers in the 5’UTR. Using a single-vector dual-luciferase reporter assay, we compared promoter activities across six L1 subfamilies (A_I/II, Gf_I and Tf_I/II/III) and established their antisense promoter activities in a mouse embryonic fibroblast cell line and a mouse embryonal carcinoma cell line. Using consensus promoter sequences for three subfamilies (A_I, Gf_I and Tf_I), we dissected the differential roles of individual monomers and the tether domain in L1 promoter activity. We validated that, across multiple subfamilies, the second monomer consistently enhances the overall promoter activity. For individual promoter components, monomer 2 is consistently more active than the corresponding monomer 1 and/or the tether for each subfamily. Importantly, we revealed intricate interactions between monomer 2, monomer 1 and tether domains in a subfamily-specific manner. Furthermore, using three-monomer 5’UTRs, we established a complex nonlinear relationship between the length of the outmost monomer and the overall promoter activity. The laboratory mouse is an important mammalian model system for human diseases as well as L1 biology. 
Our study extends previous findings and represents an important step toward a better understanding of the molecular mechanism controlling mouse L1 transcription as well as L1’s impact on development and disease.

Genomic L1 sequences are grouped into subfamilies according to their evolutionary history. Among L1s in the human genome, the oldest subfamilies L1MA to L1ME are shared with other mammals, but the younger L1PB and L1PA subfamilies are only found in primates. The youngest subfamily, L1PA1 (also called L1Hs), is specific to humans [21]. A remarkable feature of L1 evolution is that new subfamilies frequently emerged by acquiring distinct 5'UTRs unrelated to those found in existing subfamilies [22]. In the last ~ 70 million years during primate evolution, there were at least eight episodes of 5'UTR replacement. It is believed that new 5'UTRs provide a mechanism for emergent subfamilies to avoid competition of host factors or to escape host suppression [22]. The latest 5'UTR acquisition occurred ~ 40 million years ago (MYA) in ancestral anthropoid primates and gave rise to subfamily L1PA8 [23]. The overall architecture of this new 5'UTR had been maintained as a single lineage in later subfamilies from L1PA7 to L1PA1. Nevertheless, these subfamilies were subjected to continued host-L1 conflicts. For example, subfamilies L1PA6 to L1PA3 had evolved a ZNF93 binding motif in their 5'UTRs, which recruits ZNF93, triggering KAP1-mediated transcriptional silencing [24,25]. In contrast, a 129-bp deletion in the 5'UTR (inclusive of the binding site) allowed a subset of L1PA3, L1PA2, and L1PA1 to escape ZNF93 suppression [25]. In addition, a single nucleotide change at position 333 created a functional m6A site, which first appeared in a subset of L1PA3 and then dominated in L1PA2 and L1PA1 [26]. Primate L1 5'UTRs also possess an antisense promoter, which drives the expression of a third open reading frame (ORF0) as well as chimeric fusion transcripts with upstream cellular genes [27][28][29].
The laboratory mouse is an important mammalian model system for human diseases as well as L1 biology [30][31][32][33]. Despite sharing many ancestral L1 subfamilies with the human genome, the mouse genome is dominated by lineage specific L1 subfamilies, which were initially evolved from ancestral L1MA6 elements ~ 75 MYA at the divergence of the two species [4]. A comprehensive analysis of full-length L1 sequences in the mouse genome identified 29 L1 subfamilies that have undergone amplification since the split between mouse and rat about 13 MYA [34]. Overall, the evolution of mouse L1 subfamilies fits in the single lineage model as seen in the human genome. Similarly, young mouse L1 subfamilies frequently evolved by acquiring new 5'UTR sequences. Since the split from the rat, the mouse genome has experienced at least 11 episodes of 5'UTR replacement [34]. The 29 L1 subfamilies feature seven types of 5'UTR sequences: Lx, V, Fanc, Mus, F, A and N (ordered by their first appearance in the genome from old to new) [34]. The F type 5'UTR was resurrected from Fanc ~ 6.4 MYA and led to the formation of subfamilies F_V to F_I, the youngest of which ceased amplification about 2 MYA. The A type 5'UTR was recruited approximately 4.6 MYA and appeared in seven L1 subfamilies (A_VII to A_I), with A_I being the youngest and active since 0.25 MYA. Remarkably, the F type 5'UTR had been revived three times through recombination of the 5' portion of an F element with the 3' portion of an A_III element, forming subfamilies Gf_II, Gf_I, and Tf_III/II/I respectively [34]. As in the human genome, the evolutionary timeline of mouse L1s is also interspersed with episodes of multiple subfamilies coexisting over extended periods of time. For both human and mouse L1s, concurrently active subfamilies often possessed distinct 5'UTR promoter sequences [23,34]. This observation has led to a hypothesis that different promoters enabled subfamilies not to compete for the same transcription factors. 
Unlike the human genome, which features one contemporarily active subfamily, at least three subfamilies (Gf_I, Tf_I/II, and A_I) have been amplifying in the mouse genome in the last one million years [34,35]. Interestingly, phylogenetic evidence suggests that Gf_I and Tf_I/II in the laboratory mouse genome might be acquired through inter-specific hybridization rather than evolved from within its own genome [34]. In any case, it is unclear whether all three subfamilies remain currently active in the germ line of the laboratory mice.
Owing to their lineage-specific nature, human and mouse L1 5'UTRs share no sequence homology. Moreover, mouse L1 5'UTRs differ from their human counterparts in that they are organized into tandem repeats called monomers [11,36]. Such monomeric structures are also present in some other vertebrate L1s, including rat, hyrax, horse, elephant and opossum, but mouse L1 5'UTRs boast the highest number of monomers among all vertebrates [37]. The number of monomers varies among individual L1s. For example, two recent full-length Tf insertions carried 5.7 and 7.5 monomers, respectively [38]. Reporter assays have demonstrated that two monomers constitute the minimal promoter structure with significant transcriptional activity for L1 spa, a Tf subfamily member [39]. Similar tests have not been conducted for other mouse L1 subfamilies. Between the monomers and ORF1 is a non-monomeric sequence, termed the tether [40]. In both A and Tf subfamily mouse promoters, tethers lacked significant transcriptional activity in reporter assays [39,41]. In this study, we aim to compare promoter activities across young mouse L1 subfamilies and investigate the contribution of individual monomers and the tether sequence using reporter assays.

Most full-length L1s from young mouse L1 subfamilies possess two or more monomers
To profile mouse L1 promoter activities, we first analyzed the length distribution of mouse L1 5'UTRs by counting the number of monomers for full- or near full-length elements. Since elements from the old subfamilies would have accumulated numerous debilitating mutations, we limited our analysis to seven recently active subfamilies, including A_I, Tf_I, Tf_II, Gf_I, Tf_III, A_II, and A_III (listed from young to old). The estimated age for these L1 subfamilies ranges from 0.21 MYA for A_I to 2.15 MYA for A_III (Fig. 1A) [34]. To tabulate elements carrying a specific number of monomers, L1 loci containing at least a partial 5'UTR were binned according to their respective 5' start point (Fig. 1B). For example, if the 5'UTR of an element starts within the third monomer, it is placed into the monomer 3 (M3) bin. We observed a trend of 5'UTR length shortening as subfamilies age. The vast majority of A_I elements (1032 out of 1125 or 91.7%), the youngest among this group, have at least two intact monomers. The distribution of A_I elements peaks at M3 (357 out of 1125 or 31.7%). In other words, more loci start within the third monomer than at any other 5'UTR position. In contrast, 87.6% (816/931) of the A_III loci, the oldest among this group, have fewer than two intact monomers, and 71.6% (667/931) of the loci start in monomer 2 (M2). This shortening trend is also evident if a comparison is made among closely related subfamilies (e.g., comparing among A_I, A_II and A_III, or among Tf_I, Tf_II and Tf_III). Overall, among the loci with at least a partial 5'UTR from these seven mouse L1 subfamilies, 61.0% (3515/5765) have > 2 intact monomers, 29.7% (1710/5765) have > 3 intact monomers, 14.9% (858/5765) have > 4 intact monomers, and 7.8% (230/5765) have > 5 intact monomers. At the extreme end of the spectrum, there are seven loci that have > 10 intact monomers (i.e., falling into M11 + bin), all belonging to A_I, Tf_I, Tf_II, and Gf_I subfamilies.
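The binning rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the study: the function name, the input (number of 5'UTR bases retained upstream of ORF1), and the assumption that every monomer has the same length are all hypothetical simplifications. The 205-bp tether and 212-bp monomer lengths are the Tf_I values quoted elsewhere in the text, and the example loci are invented.

```python
from collections import Counter


def bin_start_position(retained_utr_len, tether_len, monomer_len, max_bin=10):
    """Return the bin label (T, M1..M10, M11+) for an element's 5' start point.

    retained_utr_len is the number of 5'UTR bases present upstream of ORF1
    (hypothetical input; a uniform monomer length is assumed for simplicity).
    """
    if retained_utr_len <= 0:
        raise ValueError("element has no 5'UTR sequence")
    if retained_utr_len <= tether_len:
        return "T"  # the element starts within the tether
    into_monomers = retained_utr_len - tether_len
    # Ceiling division: a partial outermost monomer still defines the bin.
    index = -(-into_monomers // monomer_len)
    return f"M{index}" if index <= max_bin else f"M{max_bin + 1}+"


# Invented retained-5'UTR lengths for five loci, binned with Tf_I-like lengths.
example_lengths = [150, 300, 630, 640, 2326]
tally = Counter(bin_start_position(n, 205, 212) for n in example_lengths)
print(dict(tally))
```

Under this rule, a locus in the M3 bin carries two intact monomers plus a partial third one, which is how the counts of intact monomers in the text relate to the bins.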
To calculate the average number of monomers for each subfamily, we excluded loci with either > 10 monomers or truncated within the tether (T) (Fig. 1C). On average, L1 loci from the youngest subgroup carry > 3 monomers (3.7, 3.5 and 3.1 monomers for A_I, Tf_I and Tf_II, respectively), followed by Tf_III (2.5 monomers), Gf_I (2.3 monomers), A_II (2.3 monomers), and A_III (1.5 monomers). An inverse relationship was observed between subfamily age and the average number of monomers among these seven mouse L1 subfamilies (simple linear regression: R = -0.91, p = 0.004).

Two-monomer consensus sequences from six L1 subfamilies differ in their sense promoter activities
To quantitatively evaluate L1 promoter activity, we developed a single-vector dual-luciferase reporter assay (Fig. 2A). In this vector design, a variant of L1 promoter drives the expression of firefly luciferase (Fluc), and an invariable HSV-TK promoter drives the expression of the Renilla luciferase (Rluc). The Rluc reporter cassette is embedded on the plasmid backbone as an internal control to normalize transfection efficiency. The L1 promoter activity is reported as the average Fluc/Rluc ratio among four replicate wells of NIH/3T3 cells. For this assay to work properly, it is important that Fluc and Rluc signals are both within the linear dynamic range (i.e., not saturated). Furthermore, there should be minimal crosstalk between the two reporter cassettes. To this end, we performed a titration experiment using varying amounts of pCH117 plasmid per reaction in a 96-well assay format. Note that the L1 promoter in the pCH117 plasmid was derived from an active human L1, L1 RP [42]. The Fluc and Rluc signals scaled proportionally to the amount of plasmid from 5 to 20 ng but started to plateau when 25 ng or more plasmid was used (Fig. 2B). The Fluc/Rluc ratio was relatively stable within this range (Fig. 2C). In subsequent assays, 10 ng plasmid DNA was used per well for all promoter assays.
To compare promoter activities across mouse L1 subfamilies, we first synthesized the consensus 5'UTR sequence of six subfamilies (Tf_I, Tf_II, Tf_III, Gf_I, A_I, and A_II). As the length of the consensus 5'UTR varies among these subfamilies [34], we retained only the first two monomers plus the tether in this experiment (Fig. 2D) (promoter sequences in Additional file 1: Table S1). This decision was based on two observations. First, for the L1 spa element, it has been reported that a minimum of two monomers is required for detectable promoter activity [39]. Second, as described earlier (Fig. 1B), most of the elements from the young L1 subfamilies retain at least two intact monomers. We removed the A_III subfamily from this experiment as only a small fraction of A_III elements have two intact monomers (Fig. 1C). We incorporated two control plasmids in our dual-luciferase assays. pLK037 is a no-promoter negative control. It lacks a promoter sequence upstream of the Fluc coding sequence but contains an intact Rluc cassette; hence, its Fluc/Rluc ratio represents the assay background. To facilitate comparison of activities among different L1 promoters, we normalized the Fluc/Rluc ratio of each promoter construct to pLK037 (i.e., setting the Fluc/Rluc ratio of pLK037 to 1; Fig. 2D). pCH117 is a positive control. The normalized promoter activity for pCH117 ("L1 RP") is 914, which means that the human L1 RP 5'UTR possesses a promoter activity 914-fold above the assay background. As pCH117 usually shows the highest promoter activity among all the constructs tested, its normalized promoter activity is also an indication of the assay dynamic range. Note that the assay dynamic range fluctuates to some extent from experiment to experiment (e.g., 700- to 1200-fold above background), likely due to unpredictable variations in cell status and transfection procedures. However, such fluctuations should not substantially alter the relative fold difference among promoters.
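As a minimal sketch of the normalization described above, the following Python snippet averages the per-well Fluc/Rluc ratios of a construct and divides by the same quantity for the no-promoter control. The function names and all luminescence readings are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the study.

```python
from statistics import mean


def fluc_rluc_ratio(wells):
    """Mean Fluc/Rluc ratio across replicate wells; wells = [(fluc, rluc), ...]."""
    return mean(fluc / rluc for fluc, rluc in wells)


def normalized_activity(construct_wells, background_wells):
    """Promoter activity as fold over the no-promoter control (control set to 1)."""
    return fluc_rluc_ratio(construct_wells) / fluc_rluc_ratio(background_wells)


# Hypothetical raw readings (RLU) for four replicate wells each.
background = [(100, 20000), (110, 22000), (90, 18000), (105, 21000)]
construct = [(120000, 24000), (125000, 25000), (115000, 23000), (130000, 26000)]

activity = normalized_activity(construct, background)
print(f"{activity:.0f}-fold above assay background")
```

Because each well is first reduced to a ratio, well-to-well differences in transfection efficiency affect Fluc and Rluc together and largely cancel out before the replicates are averaged.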
For two-monomer consensus sequences, we found the highest activity in Tf_II subfamily (394-fold above assay background), followed by A_I (274-fold), Tf_I (214-fold), Tf_III (189-fold), A_II (114-fold), and the lowest activity in the Gf_I subfamily (59-fold) (Fig. 2D). Overall, there appears to be a weak, but not statistically significant, inverse relationship between subfamily age and two-monomer consensus promoter activity among these six subfamilies in NIH/3T3 cells (simple linear regression: R = -0.62, p = 0.19) (Fig. 2E). In this regard, subfamily Gf_I may be considered as an outlier, which is relatively middle-aged (0.75 MYA) but showed significantly less activity (15% of that of Tf_II). The same experiment was repeated in F9 cells, a mouse embryonal carcinoma cell line known to display high levels of endogenous L1 expression [14,43,44]. In F9 cells, the highest promoter activity was found in Tf_II (243-fold), followed by Tf_I (207-fold), Tf_III (106-fold), A_I (70-fold), A_II (50-fold), and Gf_I (14-fold) (Additional File 2: Fig. S1A). As in NIH/3T3 cells, the weak inverse relationship between subfamily age and promoter activity is not statistically significant (simple linear regression: R = -0.54, p = 0.27) (Additional File 2: Fig. S1B).
Fig. 1 […] Table 1 of the original publication). We applied styling changes to highlight the Tf, Gf, and A subfamilies. Note, historically, the nomenclature of Tf and Gf subfamilies features a subscripted F (e.g., T F and G F ) [35,38] but here we follow the conventions established in Sookdeo et al. [34], which accommodate multiple Tf and Gf subfamilies (e.g., Tf_I/II/III and Gf_I/II). B Distribution of the 5'UTR start position in different L1 subfamilies. For each subfamily, the number of L1 loci is tallied according to their starting nucleotide position relative to the tether (T), the first ten individual monomers (M1 to M10), M11 and beyond (M11 +). C Inverse relationship between the average number of monomers and subfamily age. A simple linear regression line and the corresponding equation were shown along with individual data points

Differential and subfamily-dependent contribution of monomer 2, monomer 1, and tether to mouse L1 promoter activity
DeBerardinis and colleagues have previously investigated the interactions among monomers and the tether sequence based on a single promoter variant, L1 spa, a prototypic mouse Tf element [38,39]. Specifically, they observed that tether alone lacked promoter activity, monomer 1 (M1) alone had some activity, either M1-T or M2 alone had about twofold activity above assay background, M2-M1 had about threefold activity, but three or more monomers showed even higher activity. These observations led to the conclusion that two monomers are required for L1 promoter activity [39]. When aligned to Tf_I and Tf_II consensus sequences, L1 spa showed similar levels of divergence to Tf_I and Tf_II in the 5'UTR and ORFs, but much higher similarity to Tf_I than Tf_II in the 3'UTR (e.g., all 6 SNPs are against Tf_II). Thus, we consider L1 spa as a member of the Tf_I subfamily.
Fig. 2 Comparison of sense and antisense promoter activities for two-monomer mouse L1 5'UTR consensus sequences in NIH/3T3 cells. A Schematic of the dual-luciferase L1 promoter reporter assay vectors. An L1 promoter, cloned in via flanking SfiI sites, drives the firefly luciferase (Fluc) expression. A built-in Renilla luciferase (Rluc) expression cassette is used to normalize transfection efficiency. Each reporter cassette ends in a polyadenylation signal (illustrated as letter A in a hexagon). Amp, ampicillin resistance gene; HSV-TK, herpes simplex virus thymidine kinase promoter; Puro, puromycin resistance gene. Not drawn to scale. B Titration of plasmid DNA for the cell-based reporter assay. Amount of plasmid DNA is titrated in NIH/3T3 cells in quadruplicate using the control vector pCH117 in which the promoter of human L1 RP drives Fluc expression. The mean and standard error are shown for both Fluc and Rluc signals in raw relative luminescence units (RLU). For example, at 5 ng, the mean Fluc and Rluc signals from pCH117 are 124,238 and 25,159 RLUs, respectively. As a reference, the raw background Fluc and Rluc signals are ~ 100 RLUs in the dual-luciferase assay. C The calculated ratio of Fluc/Rluc from above titration experiment. Mean and standard error are shown. D Normalized activity of two-monomer consensus promoter sequences from six mouse L1 subfamilies. Sequence organization of the promoters is illustrated on the left side. The length of M2, M1, and tether (T) for each promoter is annotated (in base pairs). For each subfamily, the promoter activity was tested in both sense (S) and antisense (AS) orientation. The x-axis indicates the normalized promoter activity (i.e., the Fluc/Rluc ratio of a control no-promoter vector, pLK037, was set to 1). Note a broken x-axis is used to contrast sense and antisense promoter activities. E Inverse relationship between the sense promoter activity and subfamily age. A simple linear regression line was shown along with individual data points
To validate and expand previous findings, we conducted similar studies using consensus promoter sequences for three different subfamilies, including Tf_I, A_I, and Gf_I (promoter sequences in Additional file 1: Table S2). For Tf_I subfamily (Fig. 3A), consistent with the previous report using L1 spa 5'UTR [39], the promoter construct with two tandem monomers and the tether (M2-M1-T) showed 6.0-fold higher activity than the construct containing M1 and the tether (M1-T) in NIH/3T3 cells. The previous study showed minimal activity from tether alone or M1 alone, but M2 alone was not tested. The wide dynamic range of our assay allowed us to differentiate the relative activities of M2, M1, and tether. In the context of the consensus sequence, M2 alone displayed an activity equivalent to 22.2% of the M2-M1-T sequence. M1 alone is about twofold less active (13.0% of M2-M1-T) but remains 36.9-fold above the assay background (p < 0.05 via pairwise t-test with Benjamini-Hochberg correction for multiple testing; adjusted p values for all pairwise t-tests are provided in Additional file 3). Tether alone showed even less activity (4.1% of M2-M1-T) but remained 11.6-fold above the assay background (p < 0.05). To confirm such residual promoter activities, we included two additional control plasmids (Fig. 3A). First, we replaced the promoter sequence with a 205-bp fragment from the green fluorescent protein (GFP) coding sequence, equivalent to the length of Tf_I tether. As expected, this 205-bp GFP (GFP205) sequence showed no promoter activity (0.6-fold relative to the assay background; p > 0.05). Second, we placed the tether sequence in its antisense orientation (T_AS). Interestingly, the antisense Tf_I tether had 8.2-fold higher activity than the assay background (p < 0.05). These results suggest that the Tf_I tether sequence has some weak transcriptional activities in both sense and antisense orientations. 
To aid in the interpretation of the contribution of individual domains, we diagrammed promoter activities along with domain locations in an integrated manner (Fig. 3B). For Tf_I subfamily, M2-M1-T has the highest activity, 3.2-fold higher than any other permutations of its subdomains. Comparing M1-T with T and M1, it seems that the activity of M1-T is the sum of M1 and T alone, suggesting an additive role. The addition of M2 to M1-T appears to be synergistic, as the resulting M2-M1-T construct is sixfold higher than M1-T. To probe the contribution of M1 to overall two-monomer promoter activity, we generated a synthetic construct in which M2 is directly placed upstream of the tether (M2-T) (Fig. 3A). Comparing M2-T with M2-M1-T, the deletion of M1 reduced the promoter activity by at least threefold. This result suggests that M1 positively contributes to the two-monomer promoter activity for Tf_I subfamily. Taken together, all three domains contribute positively to the overall two-monomer 5'UTR activity in Tf_I subfamily. Comparable results were obtained from F9 cells (Additional File 2: Fig. S2A-B).
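The additive versus synergistic reading of the Tf_I data can be made explicit with a small calculation. The sketch below is not the authors' analysis: it simply divides the activity of a combined construct by the sum of its parts, using the Tf_I percentages reported above for NIH/3T3 cells (with M1-T derived from the stated 6.0-fold difference). The function names and the 20% tolerance used to call a combination "additive" are hypothetical choices.

```python
def interaction_ratio(combined, parts):
    """Activity of the combined construct divided by the sum of its parts."""
    return combined / sum(parts)


def classify(ratio, tolerance=0.2):
    """Label an interaction; the tolerance is an arbitrary illustrative cutoff."""
    if abs(ratio - 1.0) <= tolerance:
        return "additive"
    return "synergistic" if ratio > 1.0 else "sub-additive"


# Tf_I activities in NIH/3T3 cells, as % of M2-M1-T (values from the text).
tf1 = {"M2": 22.2, "M1": 13.0, "T": 4.1, "M2-M1-T": 100.0}
tf1["M1-T"] = tf1["M2-M1-T"] / 6.0  # M2-M1-T is 6.0-fold higher than M1-T

m1_plus_t = interaction_ratio(tf1["M1-T"], [tf1["M1"], tf1["T"]])
m2_plus_m1t = interaction_ratio(tf1["M2-M1-T"], [tf1["M2"], tf1["M1-T"]])

print(f"M1-T vs M1 + T: {m1_plus_t:.2f} ({classify(m1_plus_t)})")
print(f"M2-M1-T vs M2 + M1-T: {m2_plus_m1t:.2f} ({classify(m2_plus_m1t)})")
```

A ratio near 1 corresponds to the additive behavior described for M1 and the tether, while a ratio well above 1 corresponds to the synergy seen when M2 is added to M1-T.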
For A_I subfamily, M1-T displayed 30.4-fold lower activity than M2-M1-T in NIH/3T3 cells (Fig. 3C). The reduction is even more dramatic than that observed for the Tf_I subfamily. Then we examined the activities of each domain: M2, M1, and tether alone. Surprisingly, the A_I M2 showed remarkable promoter activity on its own, with 3.6-fold higher activity than the two-monomer construct. In contrast, M1 and tether had low but detectable activity relative to the assay background. Specifically, both had less than 3% of the M2-M1-T activity but were still 7- to 8-fold above the assay background (p < 0.05). However, combining M1 and T together did not lead to any substantial increase in promoter activity (tenfold above background for M1-T). The deletion of M1 from M2-M1-T reduced the promoter activity by a mere 7% (p > 0.05; comparing M2-T with M2-M1-T), suggesting M1 contributes little to the overall two-monomer promoter. On the other hand, the presence of tether sequence reduced M2 activity by fourfold (p < 0.05; comparing M2 and M2-T), indicating that A_I tether significantly suppresses the promoter activity of M2 and likely plays a negative role in the context of two-monomer promoter. Thus, M2 dominates in its contribution to the overall A_I promoter activity. Similar to the experiment with Tf_I promoters, a 202-bp fragment from the GFP coding sequence (GFP202), equivalent to the length of A_I tether, showed little promoter activity (1.5-fold above background; p < 0.05). The antisense A_I tether had threefold higher activity than the assay background (p < 0.05). These results suggest that the A_I tether sequence also has some weak transcriptional activities in both sense and antisense orientations. To summarize, M2 is the major contributor of two-monomer promoter activity for A_I subfamily, the tether negatively regulates M2 activity in the context of two-monomer 5'UTR, while the role of M1 is minimal (Fig. 3D). 
Comparable results were obtained from F9 cells (Additional File 2: Fig. S2C-D).
Fig. 3 […] A_I (C), and Gf_I (E). Sequence organization of the promoters is illustrated on the left side. The length of M2, M1, and tether for each promoter is annotated (in base pairs). The dashed line represents domain(s) that were removed in reference to the two-monomer 5'UTR sequence (M2-M1-T). The tether was tested in both sense (T) and antisense (T_AS) orientation. A short version of Gf_I tether was additionally included (T249 and T249_AS) in panel E. The x-axis indicates the normalized promoter activity (i.e., the Fluc/Rluc ratio of a control no-promoter vector, pLK037, was set to 1). Note a broken x-axis was used to highlight the wide range of promoter activities. On the right hand are 2-D representations of the promoter data for subfamily Tf_I (B), A_I (D), and Gf_I (F), corresponding to panel A, panel C, and panel E, respectively. Each domain tested is represented by a filled box. The domains are arranged in the order of M2, M1, and tether from left to right. The height of the box corresponds to the normalized promoter activity (to scale). A scale is shown in panel F; its height corresponds to a normalized promoter activity of 100. The hatched lines represent the missing M1 domain in the M2-T promoter construct
A similar trend was observed for Gf_I promoter (Fig. 3E). Gf elements were first described in 2001 by Goodier and colleagues [35]. The Gf_I subfamily [34] conforms to pattern II of Gf promoters in the original scheme. As described earlier, the consensus Gf_I M2-M1-T construct had much weaker promoter activity than the corresponding Tf_I and A_I constructs in NIH/3T3 cells (27.4% and 21.4%, respectively; Fig. 2D). Nevertheless, it remained 3.2-fold more active than M1-T (p < 0.05), although the magnitude of reduction was not as dramatic as in A_I and Tf_I. The activities of individual domains, M2, M1 and the 313-bp tether, were 20.2%, 9.8%, and 20.4% of M2-M1-T, respectively, but remain significantly above the assay background (p < 0.05). 
The antisense 313-bp tether (T_AS) also had substantial amount of promoter activity (26.6% of M2-M1-T; p < 0.05 against assay background). Note the 313-bp tether includes a truncated 64-bp monomer at its 5' end. We also subcloned the tether sequence without the 64-bp truncated monomer. The shortened 249-bp tether had detectable activities in both sense (T249, 11.8% of the two-monomer promoter; p < 0.05 against assay background) and antisense orientation (T249_AS, 13.7% of two-monomer promoter; p < 0.05 against assay background). The interactions among individual domains for subfamily Gf_I are distinctly different from both Tf_I and A_I (Fig. 3F). For Gf_I, the interaction between M1 and T appears to be additive when comparing M1-T with M1 and T alone. On the other hand, M2 and M1-T are somewhat synergistic as M2-M1-T is about twofold the sum of M2 and M1-T. In comparison, the deletion of M1 only reduced the promoter activity for Gf_I by 13% (p > 0.05; comparing M2-M1-T with M2-T), suggesting M1 plays a minor role in Gf_I subfamily. Thus, the two-monomer activity of Gf_I is mainly the result of interaction between M2 and tether. Comparable results were obtained from F9 cells despite the overall weaker activity of Gf_I promoter sequences in F9 cells (Additional File 2: Fig. S2E-F).

Length of monomer 3 has a complex nonlinear effect on overall promoter activity
Thus far, we have shown the contribution of individual M2, M1, and T sequences in the context of a two-monomer 5'UTR for Tf_I, A_I, and Gf_I subfamilies. However, many L1 promoters contain more than two monomers. Indeed, for the two youngest mouse L1 subfamilies, Tf_I and A_I, more L1 promoters start in M3 than in any other position (157 out of 513 or 30.6%, and 357 out of 1125 or 31.7%, respectively) (Fig. 1B). On the other hand, the distribution of the 5' start positions in M3 is, albeit varied, nonrandom. For example, 16.6% (26/157) of the M3-containing Tf_I loci start at nucleotide position 83 (Fig. 4A) and 26.3% (94/357) of the M3-containing A_I loci start at nucleotide position 86 (Fig. 4B). To dissect the role of varied lengths of monomer 3, we conducted a direct comparison between M3-M2-M1-T and M2-M1-T for both Tf_I and A_I subfamilies in NIH/3T3 cells (Fig. 4C-D) (promoter sequences in Additional file 1: Table S3). Indeed, both three-monomer consensus constructs were more active than the two-monomer counterparts (p < 0.05). For Tf_I subfamily, the three-monomer promoter was 2.4-fold higher than the two-monomer version and was only 17.4% lower than the reference L1 RP promoter (Fig. 4C). For A_I subfamily, the three-monomer promoter was 4.0-fold higher than the two-monomer version and even outperformed the highly active L1 RP promoter by 19.3% (Fig. 4D). To study the impact of an incomplete monomer on the overall promoter activity, we created a series of A_I and Tf_I promoter constructs by truncating the third monomer in 40-bp steps. For Tf_I subfamily, the deletion of the first 40 bp reduced the promoter activity to 74.0% of the three-monomer construct (p < 0.05) (Fig. 4C). The removal of the first 80 bp reduced the promoter activity further to 36.5% of the three-monomer construct (p < 0.05 between 40-bp and 80-bp deletion constructs). 
Deletion of the first 120 bp had an additional effect (down to 23.6% of the three-monomer construct) (p < 0.05 between 80-bp and 120-bp deletion constructs). However, this diminishing trend was reversed when the promoter was further truncated. The promoter activity was restored to 31.6% of the three-monomer construct when the first 160 bp was deleted (not statistically different between 120-bp and 160-bp deletion constructs). The deletion of the entire third monomer (212 bp), giving rise to the two-monomer construct, restored the activity to 42.3% of the three-monomer construct (p < 0.05 between 160-bp deletion construct and M2-M1-T construct). Similar patterns were seen with the vector series for A_I subfamily (Fig. 4D). The promoter activity was reduced to 45.6%, 18.0%, 15.7% of the three-monomer construct with 40-, 80-, 122-bp deletions, respectively, and then rebounded back to 18.1% and 25.3% of the promoter activity with deletion of 160 bp and the entire 208-bp M3, respectively. Thus, for both subfamilies, the first 80 bp of M3 has a positive impact on overall promoter activity but the last 80 bp negatively regulates the promoter activity. The interaction between the length of M3 and the overall promoter activity is nonlinear and characteristic of an asymmetrical U-shaped relationship (Fig. 4C-D). Comparable results were obtained from F9 cells for both Tf_I (Additional File 2: Fig. S3A) and A_I subfamilies (Additional File 2: Fig. S3B).
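The U-shaped pattern can be checked directly from the percentages reported above. The following Python sketch tabulates the two truncation series for NIH/3T3 cells (deleted length of M3 in bp, activity as % of the three-monomer construct), locates the trough, and tests whether activity falls to a single interior minimum and then rises. The helper names are hypothetical, and the check ignores the statistical significance of individual steps.

```python
# (bp deleted from the 5' end of M3, activity as % of M3-M2-M1-T), from the text.
tf1_series = [(0, 100.0), (40, 74.0), (80, 36.5), (120, 23.6), (160, 31.6), (212, 42.3)]
a1_series = [(0, 100.0), (40, 45.6), (80, 18.0), (122, 15.7), (160, 18.1), (208, 25.3)]


def trough(series):
    """Return the (deletion length, activity) pair with the lowest activity."""
    return min(series, key=lambda point: point[1])


def is_u_shaped(series):
    """True if activity strictly falls to one interior minimum and then rises."""
    values = [activity for _, activity in series]
    low = values.index(min(values))
    if low == 0 or low == len(values) - 1:
        return False  # minimum at an end point: monotonic, not U-shaped
    falling = all(a > b for a, b in zip(values[:low + 1], values[1:low + 1]))
    rising = all(a < b for a, b in zip(values[low:], values[low + 1:]))
    return falling and rising


for name, series in [("Tf_I", tf1_series), ("A_I", a1_series)]:
    deleted, activity = trough(series)
    fold = series[0][1] / series[-1][1]  # three-monomer over two-monomer
    print(f"{name}: trough at {deleted} bp deleted ({activity}%), "
          f"U-shaped={is_u_shaped(series)}, 3- vs 2-monomer = {fold:.1f}-fold")
```

The last value in each series is the two-monomer construct, so its reciprocal recovers the 2.4-fold and 4.0-fold differences between three-monomer and two-monomer promoters stated earlier in this section.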

Two-monomer consensus sequences have antisense promoter activities
The human L1 contains an antisense promoter activity [27], which affects as many as 4% of the human genes [45]. An antisense promoter activity has been previously reported in ORF1 region of the mouse L1 [46]. However, it remains unclear whether mouse L1 5'UTRs have antisense promoter activities. To uncover potential antisense promoter activities, we inverted the two-monomer consensus sequences from the six young mouse L1 subfamilies and compared them to their sense-oriented counterparts in NIH/3T3 cells (Fig. 2D). In our control experiment, the antisense oriented L1 RP 5'UTR showed 106.2-fold activity above the assay background, equivalent to 11.6% of that of the sense promoter. In this context, all six L1 subfamilies demonstrated detectable levels of antisense promoter activities (p < 0.05 as compared pairwise to assay background) (Fig. 2D). The three youngest subfamilies (A_I, Tf_I, and Tf_II) all had > 40-fold activity above the assay background in the antisense orientation, equivalent to 15.0%, 21.0%, and 12.1% of the activity from the corresponding sense promoter, respectively. The antisense sequence of A_II subfamily showed 21.5-fold activity in the reporter assay, which is equivalent to 18.8% of the sense promoter. Gf_I and Tf_III subfamilies had the lowest antisense promoter activities (13.1 and 10.1-fold above assay background, respectively), corresponding to 22.3% and 5.3% of their sense promoter counterparts. In F9 cells, the antisense activity of the control L1 RP 5'UTR was 32.9% of the sense sequence. In comparison, the antisense activities of mouse L1 two-monomer consensus sequences ranged from 3.1% to 26.6% of the corresponding sense promoters (Additional File 2: Fig. S1A). It should be noted that, unlike those of Tf_I and Tf_II subfamilies, the antisense promoter activities of A_I, A_II, Gf_I, and Tf_III were relatively weak in F9 cells (as low as 3.2-fold above assay background despite being statistically different from the assay background).
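The relationship between fold-above-background values and the "percent of sense" figures quoted above is simple arithmetic, sketched below in Python with values taken from the text (sense activities from the NIH/3T3 two-monomer comparison). The function names are hypothetical. Because the published folds are rounded, recomputed percentages can differ from the reported ones in the last digit for some subfamilies.

```python
def percent_of_sense(antisense_fold, sense_fold):
    """Antisense activity expressed as a percentage of the sense activity."""
    return 100.0 * antisense_fold / sense_fold


def implied_antisense_fold(percent, sense_fold):
    """Antisense fold-above-background implied by a reported percentage."""
    return percent / 100.0 * sense_fold


# Reported folds above assay background in NIH/3T3 cells.
l1rp = percent_of_sense(106.2, 914)   # human L1 RP control
tf3 = percent_of_sense(10.1, 189)     # Tf_III
print(f"L1 RP: {l1rp:.1f}% of sense; Tf_III: {tf3:.1f}% of sense")

# Sense folds and reported antisense percentages for the three youngest subfamilies.
youngest = {"A_I": (274, 15.0), "Tf_I": (214, 21.0), "Tf_II": (394, 12.1)}
for name, (sense_fold, percent) in youngest.items():
    fold = implied_antisense_fold(percent, sense_fold)
    print(f"{name}: about {fold:.1f}-fold above background in antisense")
```

The implied antisense folds for A_I, Tf_I, and Tf_II all exceed 40, in agreement with the statement in the text.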

Discussion
The two-monomer 5'UTRs tested in this study are consensus sequences as defined by the Boissinot group in 2013 [34]. For subfamilies with recent periods of activity, it is expected that individual copies are similar to the consensus sequence [47]. Indeed, this prediction is true for the three youngest subfamilies (A_I, Tf_I, and Tf_II; Additional file 1: Table S4). The reference mouse genome contains 21 identical loci and 134 single-mismatch loci for the 608-bp A_I two-monomer 5'UTR sequence, three identical loci and 33 single-mismatch loci for the 614-bp Tf_I two-monomer sequence, and 18 single-mismatch loci for Tf_II two-monomer sequence. In contrast, for the middle-aged Gf_I subfamily, only three single-mismatch loci are found for its 726-bp two-monomer 5'UTR sequence. The older Tf_III and A_II subfamilies do not have any loci carrying less than three mismatches. Therefore, our results not only reflect the promoter activities of the consensus 5'UTR sequences tested but can potentially be extended to a number of endogenous mouse L1 loci, especially for A_I, Tf_I, Tf_II, and Gf_I.
In the context of two-monomer 5'UTRs, the inclusion of M2 upstream of M1 is essential for the enhanced promoter activity. The enhancement by M2 is 6.0-fold for Tf_I, 30.4-fold for A_I, and 3.2-fold for Gf_I in NIH/3T3 cells (Fig. 3; comparing M2-M1-T with M1-T for each subfamily), mirroring the 7.4-fold enhancement for Tf_I, 44.2-fold for A_I, and 2.6-fold for Gf_I in F9 cells (Additional file 2: Fig. S2). When normalized to the control L1 RP promoter, it is evident that the activity of A_I M2 consensus (108.6% of L1 RP) far exceeds that of Tf_I (7.7% of L1 RP) and Gf_I (1.2% of L1 RP) in NIH/3T3 cells (Fig. 3). In comparison, in F9 cells, A_I M2, Tf_I M2, and Gf_I M2 display 55.0%, 25.7%, and 0.8% of L1 RP's activity, respectively (Additional file 2: Fig. S2). Note that the definition of individual monomers is not necessarily consistent in the literature across mouse L1 subfamilies. As expected, sequence alignment shows extensive sequence divergence among A_I, Tf_I, and Gf_I M2 sequences used in this study (Additional file 2: Fig. S4). For the 208-bp A_I M2 consensus sequence (5'-GTG CCT GCCC…GTG GAA CACA-3'), we defined its boundary in the A_I 5'UTR consensus sequence by following the convention established by Loeb and colleagues when type A monomer was first described [11] (Additional file 2: Fig. S5). Compared with previously described A monomer consensus sequences [41,48], the A_I M2 sequence has three mismatches. BLAST search of this A_I M2 sequence in the mm10 mouse genome assembly returns 67 identical hits and 138 single-mismatch hits (Additional file 1: Table S4). Coincidentally, this A_I M2 sequence is identical to the A monomer subtype 1 recently defined by the Smith group using a profile-HMM based unsupervised approach [49]. For the 212-bp Tf_I M2 consensus sequence (5'-GAC AGC CGGC…GTG GGC CGGG-3'), we followed the convention initially established by the Kazazian group [38,39] (Additional file 2: Fig. S6).
It differs from Naas's version [38] by one nucleotide at position 171 and from DeBerardinis's version [39] by an additional nucleotide at position 24. Seventeen copies identical to the consensus Tf_I M2 sequence are present in the mouse genome (Additional file 1: Table S4). Note the T monomers recently identified by the profile-HMM approach would start at nt 135 (5'-GGT GCG CCAG…-3') [49]. The 212-bp Tf_I M2 tested here displays a single mismatch with T monomer subtype 22 at nt 24 and with subtype 25 at nt 102, respectively. The 206-bp Gf_I M2 consensus sequence (5'-TGA GAG CACG…ACC TTC CTGG-3') follows the original boundary definition but differs from Goodier's version by two nucleotides at nts 152-153 [35] (Additional file 2: Fig. S7). It has 121 identical copies in the mouse genome (Additional file 1: Table S4). Note the Gf monomer subtype 2 defined by the profile-HMM approach [49] would start at position 204 but is otherwise identical to the Gf_I M2 sequence tested in this study. How individual SNPs affect each monomer variant's activity necessitates future studies.
Our study highlights the difference between M2 and M1 in promoter activity. The most dramatic example is from the A_I subfamily. In head-to-head comparison in NIH/3T3 cells, its M1 alone has a mere 7.7-fold activity above assay background but its M2 is 145-fold more active than M1 (Fig. 3C). A_I M2 is also 52-fold more active than M1 in F9 cells (Additional file 2: Fig. S2C). This functional difference reflects the sequence divergence between them. The A_I M2 and M1 are 86.5% (180 out of 208 nucleotides) identical (Additional file 2: Fig.  S5). Besides 18 single nucleotide variants, M1 possesses three short deletions, including the deletion of one copy of the tandem ACT CGA G motif noted previously [49]. For Tf_I subfamily, the M2 and M1 are 76.6% (164/214) identical overall (Additional file 2: Fig. S6). The divergence is concentrated in the second half of the monomers, with the putative YY1 binding motif preserved in M1. Despite the larger difference than seen in subfamily A_I, Tf_I's M2 and M1 only differed in promoter activity by 1.7-fold in NIH/3T3 cells (Fig. 3A) and by 2.8-fold in F9 cells (Additional file 2: Fig. S2A). For subfamily Gf_I, its M2 and M1 are highly similar with 96.6% identity (200/207) (Additional file 2: Fig. S7). The seven mismatches are located toward the 3' end of the sequence. At the functional level, Gf_I M2 is twofold more active than M1 in NIH/3T3 cells (Fig. 3E) and 1.1-fold of M1 activity in F9 cells (Additional file 2: Fig. S2E). Future studies are necessary to pinpoint the key nucleotide positions that are responsible for differential promoter activity between these M2 and M1 sequences. It should also be noted that, while our study focused on a few consensus monomers, the mouse genome contains a large number of A or Tf monomer subtypes, which display different modes of position preference within a 5'UTR monomer array [49]. 
It is entirely possible that a strong monomer, similar to A_I M2, is positioned directly upstream of a tether, forming a highly active one-monomer-tether 5'UTR. Therefore, one could not automatically assume low promoter activity for a shortened M1-T like locus.
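The M2-versus-M1 identity values quoted above follow directly from the reported counts of identical positions and alignment lengths. The short Python sketch below only reproduces that arithmetic; the function name and the dictionary layout are illustrative and are not part of the original analysis.

```python
def percent_identity(matches: int, aligned_length: int) -> float:
    """Percent identity = identical positions / alignment length x 100."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return round(100.0 * matches / aligned_length, 1)


# (identical positions, alignment length) as reported in the text
m2_vs_m1 = {
    "A_I": (180, 208),
    "Tf_I": (164, 214),
    "Gf_I": (200, 207),
}

identities = {
    name: percent_identity(matches, length)
    for name, (matches, length) in m2_vs_m1.items()
}

for name, value in identities.items():
    print(f"{name}: M2 and M1 are {value}% identical")
```

The sketch makes no attempt to align sequences; it assumes the match counts and alignment lengths are taken from the supplementary alignments.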
Unlike monomer sequences, the tether sequences share a significant amount of homology among the three subfamilies (Additional file 2: Fig. S8). The tethers for subfamily Tf_I and A_I are similar in length and 76.6% identical. Both have modest activities in NIH/3T3 cells (11.6-fold and 7.7-fold above assay background, respectively) (Fig. 3A,C) but near baseline activities in F9 cells (2.8-fold and 1.9-fold above assay background, respectively) (Additional file 2: Fig. S2A,C). For subfamily Gf_I, two different versions of tether were tested. One is 249 bp long, which can be divided into a 3' 208-bp segment (with 84.1% identity to Tf_I tether) and a 5' 41-bp segment (equivalent to a 5' extension into the corresponding Tf_I M1 region). It showed 8.8-fold activity above assay background in NIH/3T3 cells (Fig. 3E) and 4.2-fold above assay background in F9 cells (Additional file 2: Fig. S2E). The other is 313 bp long. The addition of the extra 64 bp of truncated Gf_I monomer rendered the longer tether sequence slightly more active (15.2-fold above assay background in NIH/3T3 cells (Fig. 3E) and 4.5-fold above background in F9 cells (Additional file 2: Fig. S2E)). Despite its modest activity on its own, the tether sequence seems to always augment the activity from M2 or M1 to some extent. The only exception is when it is coupled with A_I M2 as described earlier. The molecular mechanism via which the tether contributes to the overall promoter activity is unknown. The high level of tether sequence conservation among all A, Tf, Gf and F subfamilies reflects its common ancestry [34]. Though highly speculative, it is possible that the tether region has other regulatory roles during the L1 replication cycle.
We demonstrated antisense promoter activity for two-monomer 5'UTR constructs from all six evolutionarily young mouse L1 subfamilies examined. The amount of antisense promoter activity is a fraction of the corresponding sense promoter activity, ranging from 5 to 22% in NIH/3T3 cells (Fig. 2D) and from 3 to 27% in F9 cells (Additional file 2: Fig. S1A). Notably, when tested in multiple cell lines, the antisense promoter activity of human L1PA1 5'UTR falls within this range (12.5% in the HeLa cell line [50], 7.8% in the human embryonal carcinoma 2102Ep cell line [29], and 25% to 33% in human embryonic stem cell lines [29]) and is reproduced in our assays using L1 RP 5'UTR in two mouse cell lines (11.6% in NIH/3T3 cells and 32.9% in F9 cells). The relative contribution of M2, M1, and tether domains to the overall antisense promoter activity remains unclear. When the tether sequences from subfamilies Tf_I, A_I, and Gf_I were tested in the antisense orientation in NIH/3T3 cells, they showed 2.9%, 1%, and 26.5% of the activity of the corresponding two-monomer promoter, respectively (Fig. 3), suggesting that only the Gf_I tether contributes substantially to the antisense promoter activity. Indeed, when tested in F9 cells, the activity of the antisense Gf_I tether sequence alone is equivalent to 64.0% of Gf_I M2-M1-T promoter (Additional file 2: Fig. S2E), while the antisense Gf_I M2-M1-T sequence displays only 26.6% of the sense promoter activity (Additional file 2: Fig. S1A). Our findings on antisense promoter activity in mouse L1 5'UTRs contrast with a previous study, which found minimal activity for two individual A type monomers and a tether sequence when tested in the antisense orientation [41]. This discrepancy may be explained by differences in the sensitivity of the reporter assays used and the promoter sequences tested.
On the other hand, our results are consistent with cap analysis of gene expression (CAGE) data from mouse embryonic testes, showing strong antisense transcription start site (TSS) signals for Gf and T monomers [49].
In reference to the computationally defined monomers, the 5' termini of endogenous L1 loci tend to start from certain nucleotide positions. The 5' truncation points of Tf monomers, including the two prototypic full-length Tf insertions, are clustered at nts 70-110 [38,39,49]. This region overlaps with a putative YY1 binding motif GCC ATC TT at nts 80-87, which has been postulated to play a similar function in controlling transcription initiation as reported for human L1 5'UTR [39,49,51]. Earlier observations from a limited number of A type loci indicated two clusters of 5' truncation points relative to a complete monomer (two loci start at nts 24-25 and ten start at nts 70-85) [11,48,52]. A recent genome-wide analysis confirmed the predominance of truncation points within a 30-bp region at nts 70-100 for the 5' most A monomers [49]. Notably, a tandem ACT CGA G motif of unknown function is present at nts 98-111 [36,49]. Our own analysis at single-base resolution replicated these findings, showing a broader distribution with a dominant peak at nt 83 for Tf_I monomers (Fig. 4A) and a much tighter distribution with a dominant peak at nt 86 for A_I monomers (Fig. 4B). However, the role of a partial or incomplete monomer at the beginning of a mouse L1 5'UTR had not been addressed by previous studies. Using the consensus A_I and Tf_I 5'UTRs as a model, we found a complex nonlinear relationship between the length of the outer M3 and the overall promoter activity in both NIH/3T3 (Fig. 4C-D) and F9 cells (Additional file 2: Fig. S3). As expected, promoters with three full monomers are much more active than those with two monomers for both subfamilies. However, the lowest promoter activities were found when 122 bp was removed from the 5' end of M3, but not when additional sequence was removed. Thus, the contribution of M3 sequence to overall promoter activity is not simply proportional to its length.
This phenomenon is consistent with a model in which both M3 and its downstream monomers promote parallel transcription initiation events [11]. Under this model, the deletion of 122 bp from M3 abolishes transcription initiation from M3 and unmasks negative regulation of transcription initiation from M2 by the remaining M3 sequence, leading to much reduced overall transcription output. Additional deletion of M3 sequence eliminates the negative regulation and enables unimpeded transcription initiation from M2. The consensus M3 and M2 sequences are not identical though: they differ by two nucleotides in A_I (Additional file 2: Fig. S5), and by three nucleotides in Tf_I (Additional file 2: Fig. S6). Nevertheless, according to the distribution of the 5' start positions of endogenous loci that are 5' truncated within M3 (Fig. 4A-B), one would predict that most such Tf_I and A_I elements are transcribed at lower levels than an element with either three or two full-length monomers. This observation raises an interesting question about the molecular processes leading to such a 5' truncation pattern and any advantages or disadvantages toward subsequent rounds of L1 replication.
This study has several limitations. The first is that all promoter activities were measured by a luciferase-based reporter system. Although widely adopted, a reporter system that is based on protein activity may be confounded by uncharacterized post-transcriptional regulation, including alternative splicing and polyadenylation [53]. The prototypic Tf_I element, L1 spa, is predicted to harbor two cryptic splice donor sites (at the cognate position in M8 and M7, respectively) and two cryptic splice acceptor sites (in M1 and tether, respectively) [54]. Alternative splicing events utilizing these splice sites might be responsible for the generation of 22 endogenous L1 copies in the reference mouse genome [54]. All the Tf_I promoter constructs tested in this study lack the two cryptic splice donor sites, but the two cryptic splice acceptor sites are retained in selected Tf_I constructs (i.e., those bearing M1 and/or tether; Fig. 3A and Fig. 4C). What effect these cryptic splice sites might exert on the promoter activity is undetermined, but a potential role for aberrant splicing should be considered when interpreting the data, especially when sequence deletion is involved. The second limitation is that our experiments were conducted in only two cell lines: a mouse embryonic fibroblast cell line and a mouse embryonal carcinoma cell line. As different cell types are regulated by distinct transcriptional programs [55], some aspects of our results may not extrapolate to other cellular environments. Indeed, transcriptional activation of individual endogenous L1 loci is highly cell-type specific across a panel of human cell lines [56]. Thus, additional efforts should be devoted to cellular niches that are known to support high levels of L1 activity, such as during early embryogenesis, gametogenesis, neurogenesis, and tumorigenesis (reviewed in [57]).
Lastly, while there has been significant progress in mapping transcriptional regulators of human L1 expression [58,59], the field has been lagging in understanding transcription regulation of mouse L1 promoters. Future efforts should also focus on characterizing both cis and trans regulatory elements for mouse L1 expression, including those both common and unique to specific mouse L1 subfamilies.

Conclusions
The multimeric nature of mouse L1 5'UTRs presents a challenge to investigating mouse L1 transcriptional regulation. Accordingly, unlike the human L1 5'UTR, many aspects of mouse L1 transcription remain poorly understood. In this study, aided by synthetic biology and reporter assays with a wide dynamic range, we compared sense promoter activities and discovered antisense promoter activities from six evolutionarily young mouse L1 subfamilies. Expanding upon a pioneering study featuring a single Tf_I element, we determined the contribution of monomer and tether sequences among three main lineages of evolutionarily young mouse L1s: A_I, Tf_I and Gf_I. Our work validated that, across multiple subfamilies, a construct carrying the second monomer is always much more active than the corresponding one-monomer construct. For individual promoter components (M2, M1, and tether), M2 is consistently more active than the corresponding M1 and/or the tether for each subfamily. More importantly, we revealed intricate interactions between M2, M1 and tether domains, and such interactions are subfamily specific. Using three-monomer 5'UTRs as a model, we established a complex nonlinear relationship between the length of the outmost monomer and the overall promoter activity. Overall, our work represents an important step toward elucidating the molecular mechanism of mouse L1 transcriptional regulation and L1's impact on development and disease.

Computational analysis of mouse L1 5'UTR start positions
BLAST+, a suite of command-line tools to run BLAST locally [60], was used to search for the promoter region (query sequence) in each L1 sequence (subject sequence). For each subfamily, we created a query sequence containing 11 monomers and the corresponding tether sequence by removing the 5' partial monomer from the consensus sequence [34] and appending copies of the last full-length monomer to the 5' end of the consensus sequence until there was a total of 11 monomers. The monomers duplicated in the 11-monomer query sequences were the 212-bp M3 for Tf_I and Tf_II, the 214-bp M3 for Tf_III, and the 208-bp M3 for A_I, A_II and A_III. We derived four separate 11-monomer query sequences for Gf_I, corresponding to the four 5'UTR monomer organization patterns defined previously [35]. However, pattern III was later excluded from downstream analyses since nearly all its alignments were short and overlapped with alignments for other patterns. Patterns I, II and IV differ from each other in tether length (377, 313, and 250 bp, respectively). Pattern II is considered as a prototype for Gf_I; its 206-bp M2 was duplicated to make the 11-monomer query. The same M2 was used to populate all monomer positions for patterns I and IV. L1 sequences belonging to subfamilies Tf_I, Tf_II, Tf_III, Gf_I, A_I, A_II and A_III were extracted from the mouse genome assembly GRCm38/mm10 using SeqTailor [61], and saved as subfamily-specific subject sequence files. The input BED files containing genomic coordinates for individual L1 loci were derived from mm10 Repeat Library db20140131, which is available from the RepeatMasker website [62]. For each subfamily, the query sequence was searched against each subject sequence in the subject sequence file using BLAST+. The parameters used were "-perc_identity 0, -num_threads 4, -max_target_seqs n" (where n is a number greater than the total number of sequences in the local database).
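The construction of an 11-monomer query described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: the function name `build_query` is hypothetical, the toy sequences are placeholders rather than real monomers, and the input is assumed to already have the 5' partial monomer removed.

```python
def build_query(monomers_5_to_3, tether, total_monomers=11):
    """Pad a consensus 5'UTR to a fixed number of monomers.

    monomers_5_to_3: full-length monomers listed 5'->3' (e.g. [M3, M2, M1]),
                     with any 5' partial monomer already removed.
    tether:          tether sequence that follows M1.
    The 5'-most full-length monomer is duplicated at the 5' end until the
    query holds `total_monomers` monomers.
    """
    if not monomers_5_to_3:
        raise ValueError("at least one full-length monomer is required")
    if len(monomers_5_to_3) > total_monomers:
        raise ValueError("consensus already exceeds the requested monomer count")
    outermost = monomers_5_to_3[0]
    padding = [outermost] * (total_monomers - len(monomers_5_to_3))
    return "".join(padding + list(monomers_5_to_3)) + tether


# Toy placeholder sequences (NOT real L1 monomers), 4 nt each
m3, m2, m1, toy_tether = "AAAC", "CCCG", "GGGT", "TTTT"
query = build_query([m3, m2, m1], toy_tether)
print(len(query))  # 11 monomers x 4 nt + 4-nt tether
```

With real inputs, the monomer list would hold the consensus monomers for a given subfamily, so that the duplicated monomer is the M3 (or, for Gf_I pattern II, the M2) named in the text.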
The output alignment file was then parsed in RStudio with R version 3.6. We filtered out alignments that do not end in the last 10 bases of the corresponding tether region of the query sequence and alignments that do not start within the first 10 bases of the subject L1 sequence. This filtering step removed potential loci with a 3' truncated tether and/or with a chimeric 5'UTR composed of monomers from divergent L1 subfamilies. For Gf_I, five loci were shared between patterns I and II, and three of them were also shared with pattern IV. The redundant entries were removed, and the five loci were retained under pattern II only. To plot the 5' start position of L1 sequences in reference to the monomer or tether positions in the query sequence, the start of the alignment in query was separated into 12 bins (tether, and M1 to M11; see Fig. 1B). To calculate the average number of monomers for each subfamily, we excluded the small number of loci that start either in the tether or M11+ (see Fig. 1C). The 5' start position of each locus relative to the specific monomer position in the query was used to determine the fractional length of the 5'UTR. The copy number of two-monomer promoters and individual monomer/tether domains in the mouse genome (see Additional file 1: Table S4) was determined in a similar fashion using BLAST+.
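The filtering and binning steps above were carried out in R; the Python sketch below is only an illustration of the logic under stated assumptions. It assumes 1-based, plus-strand alignment coordinates (as in BLAST tabular output), and it interprets the "fractional length" of a 5'UTR as the number of complete downstream monomers plus the retained fraction of the monomer containing the start. The function names and the toy lengths are hypothetical.

```python
def keep_alignment(qend, sstart, query_len, window=10):
    """Keep hits that end within the last `window` bases of the query
    (the 3' end of the tether) and start within the first `window`
    bases of the subject L1 sequence."""
    return qend > query_len - window and sstart <= window


def locate_start(qstart, monomer_lengths, tether_len):
    """Assign a query start position to a bin and a monomer count.

    monomer_lengths: monomer lengths listed 5'->3' (M_n ... M1).
    Returns (bin_label, monomer_count); the count is 0.0 when the
    alignment starts inside the tether.
    """
    n = len(monomer_lengths)
    left = 0  # number of query bases 5' of the current monomer
    for i, length in enumerate(monomer_lengths):
        if qstart <= left + length:
            offset = qstart - left - 1       # bases missing from this monomer
            number = n - i                   # M_n at the 5' end, M1 at the 3' end
            retained = (length - offset) / length
            return f"M{number}", (number - 1) + retained
        left += length
    if qstart <= left + tether_len:
        return "tether", 0.0
    raise ValueError("qstart lies outside the query sequence")


# Toy layout: three 10-nt monomers followed by a 5-nt tether (35 nt total)
toy_lengths, toy_tether_len = [10, 10, 10], 5
print(locate_start(1, toy_lengths, toy_tether_len))
print(locate_start(16, toy_lengths, toy_tether_len))
print(keep_alignment(qend=30, sstart=1, query_len=35))
```

Passing a list of monomer lengths, rather than a single length, allows for monomers of unequal size within one query.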

Plasmid construction
A detailed list of the promoter constructs, including primers and the corresponding promoter sequences, is provided as supplemental tables (Additional file 1). pCH036 is the base vector for inserting individual promoter sequences between two heterotypic SfiI sites (Fig. 2A; SfiI_L = GGC CAA AA/TGGCC and SfiI_R = GGC CTG TC/AGGCC; "/" indicates the cleavage site) immediately upstream of the Fluc reporter gene. It looks nearly identical to all the derivative dual-luciferase assay vectors except that the "L1 promoter" sequence is substituted by a 48-bp multiple cloning site segment. Originating from pESD202, the double-SfiI cassette enables directional insert swapping via a single, robust restriction/ligation cycle [63]. We derived pCH036 from pLK003. The latter was similar in vector architecture to pCH036 but, instead of the Fluc reporter gene, pLK003 had a firefly luciferase based retrotransposition indicator cassette (FlucAI). To make pCH036, we amplified the Fluc reporter gene from pGL4.13 (Promega) using PCR primers WA1312 5'-AAA ACC TAG GGG CCT GTC AGG CCA TGG AAG ATG CCA AAA ACA TTA AGA AG-3' and WA1314 5'-AAA AGG TAC CTT ACA CGG CGA TCT TGCCG-3'. The backbone fragment of pLK003 was prepared by a double digestion with AvrII and KpnI, removing the FlucAI cassette, and subsequently ligated to the Fluc PCR fragment with the same sticky ends. In the resulting pCH036, the second SfiI site (i.e., SfiI_R) is immediately upstream of the start codon of Fluc.
pCH117 is a positive control vector that contains the human L1 RP 5'UTR as the "L1 promoter". To make pCH117, we amplified the L1 RP 5'UTR from pYX014 [64]. The PCR product was digested with SfiI (New England Biolabs), gel purified, and ligated with SfiI-digested pCH036. pLK037 is a negative control vector that contains an empty double-SfiI cassette upstream of the Fluc reporter gene. It was derived by SfiI digestion of pCH117, blunting of the 3' overhangs with Klenow fragment of E. coli DNA polymerase I (New England Biolabs), and self-ligation of the backbone fragment. pLK043, pLK044, and pLK045 are control vectors that contain 202-, 205-, and 250-bp of EGFP coding sequence in the double-SfiI cassette, respectively. The corresponding EGFP sequences were amplified from pWA003 [64] by using the same reverse primer paired with three different forward primers. The PCR product was digested with SfiI, gel purified, and ligated with SfiI-digested pCH036.
The three-monomer Tf_I consensus promoter in pLK086 was derived from a synthetic DNA fragment that is flanked by SfiI_L and SfiI_R restriction sites. All synthetic DNA fragments in this study were purchased from either Genewiz (part of Azenta Life Sciences) or Twist Biosciences. Primers were designed to serially truncate M3 by 40, 80, 120, and 160 bp from the 5' end. The resulting PCR products were SfiI digested and ligated into SfiI-digested pCH036, giving rise to pLK094, pLK095, pLK096, and pLK097. The two-monomer Tf_I promoter in pLK050 was derived from a synthetic DNA fragment. Primers were designed to amplify M2, M1, and T. The resulting PCR products were digested and ligated into pCH036, resulting in pLK057, pLK056, and pLK054. The antisense version of the tether fragment was similarly cloned into pLK055. M2-T sequence in pLK098 and M1-T sequence in pLK047 were derived from synthetic DNA fragments.
The three-monomer A_I consensus promoter in pLK085 was derived from a synthetic DNA fragment. Primers were designed to serially truncate M3 by 40, 80, 122, and 160 bp from the 5' end. The resulting PCR products were SfiI digested and ligated into SfiI-digested pCH036, giving rise to pLK090, pLK091, pLK092, and pLK093. The two-monomer A_I promoter in pLK049 was derived from a synthetic DNA fragment. Primers were designed to amplify M2, M1, M1-T and T. The resulting PCR products were digested and ligated into pCH036, resulting in pLK053, pLK052, pLK040 and pLK041. The antisense version of the tether fragment was similarly cloned into pLK042. M2-T sequence in pLK046 was derived from a synthetic DNA fragment.
The two-monomer Gf_I consensus promoter in pLK051, the M2-T promoter in pLK099, and the M1-T promoter in pLK048 were derived from separate synthetic DNA fragments. Primers were designed to amplify M2 and M1, respectively. The resulting PCR products were digested and ligated into pCH036, resulting in pLK063 and pLK062. Two different lengths of tether were considered. Primers were designed to amplify and clone the tether as a 313-bp fragment in either sense (pLK060) or antisense (pLK061) orientation. A shortened 249-bp version of the tether was also cloned in either sense (pLK058) or antisense (pLK059) orientation.
The two-monomer consensus promoters for A_II (pLK087), Tf_II (pLK088), and Tf_III (pLK089) were derived from separate synthetic DNA fragments. pJT01, pJT02, pJT03, pJT04, pJT05, pJT06, and pJT07 contain antisense versions of the two-monomer promoters in pLK049, pLK050, pLK051, pLK087, pLK088, pLK089 and of the L1 RP promoter in pCH117, respectively. To make these antisense promoter constructs, primers were designed to amplify the sense-oriented promoters from the respective precursor constructs so that the resulting PCR fragments would reverse the orientation of the promoter with respect to the two heterotypic SfiI sites.
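The orientation swap used for the antisense constructs can be illustrated in silico. The Python sketch below is a simplified model, not a description of the actual cloning: it places either a promoter or its reverse complement between the SfiI_L and SfiI_R sequences given in this section. The helper names and the 6-nt toy promoter are hypothetical.

```python
# SfiI recognition sequences as given in the plasmid construction section
SFII_L = "GGCCAAAATGGCC"
SFII_R = "GGCCTGTCAGGCC"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_insert(promoter: str, antisense: bool = False) -> str:
    """Model an insert flanked by the two heterotypic SfiI sites.

    For an antisense construct, the promoter is reverse-complemented
    while the flanking sites keep their left/right positions, so the
    fragment still ligates directionally into the same vector.
    """
    body = reverse_complement(promoter) if antisense else promoter
    return SFII_L + body + SFII_R


toy_promoter = "ACCTGA"  # placeholder, not a real L1 sequence
sense_insert = build_insert(toy_promoter)
antisense_insert = build_insert(toy_promoter, antisense=True)
print(sense_insert)
print(antisense_insert)
```

Because the two SfiI sites differ in their spacer nucleotides, the insert can only ligate in one direction, which is why reversing the promoter relative to the sites is sufficient to test the antisense strand.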

Cell line authentication
We maintained a subline of NIH/3T3 mouse embryonic fibroblast cells in our lab. To confirm cell identity, we submitted an aliquot of the cells to American Type Culture Collection (ATCC) for mouse short tandem repeat (STR) testing. The testing involved the analysis of 18 mouse STR loci as well as two specific markers to screen for potential cell line contamination by human or African green monkey species [65]. The STR profile of our cells is nearly identical to the ATCC reference NIH/3T3 cell line (ATCC CRL-1658). Specifically, our subline shares all 26 alleles that are present in ATCC NIH/3T3 at the 18 mouse STR loci analyzed. In addition, it has evolved a second allele at the STR locus 6-4 (the new allele is one repeat longer than the reference allele). The complete cell line authentication report is available as a supplemental document (Additional file 2: Fig. S9). F9 mouse embryonal carcinoma cell line (ATCC CRL-1720) was gifted by Dr. Michael Griswold, Washington State University. Both cell lines were propagated in a complete culture medium composed of DMEM/High Glucose, 1% SG-200, and 10% fetal bovine serum (Cytiva Life Sciences).

Dual-luciferase promoter assay
Assays were performed in 96-well format. NIH/3T3 cells were first trypsinized from a stock dish, diluted into a suspension at 200,000 cells per ml in complete medium, and kept at 37 °C before seeding into a 96-well plate. F9 cells were grown in a stock dish coated with 0.1% gelatin, trypsinized, and diluted into a suspension at 400,000 cells per ml before seeding into a 96-well plate coated with 0.1% gelatin. Lipofectamine 3000 (Invitrogen) was used following a reverse transfection protocol. Briefly, for each plasmid, two separate tubes were prepared. In one tube, 0.3 µL of Lipofectamine 3000 was diluted and well mixed into 10 µL of Opti-MEM I reduced serum medium (Gibco). In the other tube, 10 µL of Opti-MEM I was first mixed with 0.45 µL of the P3000 reagent by vortexing and then mixed with 45 ng of plasmid DNA (up to 1.75 µL volume) by flicking. The two tubes were then combined, mixed by a brief vortex, and incubated at room temperature for 10 min. For each plasmid, 5 µL of the above DNA/Lipofectamine complex was added to each well for a total of four wells. The amount of plasmid DNA was equivalent to 10 ng for each well, which was determined to be optimal in a separate titration experiment (Fig. 2B-C). Then 100 µL of cells (20,000 NIH/3T3 cells or 40,000 F9 cells) were added to each well, mixed with the transfection complex, and returned to a CO2 incubator (48 h for NIH/3T3 cells or 24 h for F9 cells). To measure promoter activity, cells were processed using Promega's Dual-Luciferase Reporter Assay System. To minimize assay background, all steps were conducted in the dark. Firefly luciferase and Renilla luciferase signals were sequentially measured on a GloMax Multi Detection System (Promega). Signal integration time was set to one second per well. Mock transfected cells and empty wells were included to evaluate the assay background.
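The per-well DNA amount and cell number stated above can be checked with simple arithmetic. The Python sketch below assumes that the DNA is brought to the full 1.75 µL, so that the combined complex has a volume of 22.5 µL; the variable names are illustrative.

```python
# Volumes (µL) and DNA mass (ng) for one plasmid, as listed in the protocol
lipofectamine_ul = 0.3
optimem_tube1_ul = 10.0
optimem_tube2_ul = 10.0
p3000_ul = 0.45
dna_ul = 1.75          # assumption: DNA volume brought to the stated maximum
dna_ng = 45.0

complex_ul = (lipofectamine_ul + optimem_tube1_ul
              + optimem_tube2_ul + p3000_ul + dna_ul)

per_well_ul = 5.0
ng_per_well = dna_ng * per_well_ul / complex_ul

# Cells seeded per well: 100 µL of suspension at the stated density
seed_volume_ul = 100
cells_per_well = {
    "NIH/3T3": 200_000 * seed_volume_ul // 1000,
    "F9": 400_000 * seed_volume_ul // 1000,
}

print(f"complex volume: {complex_ul:.2f} µL")
print(f"DNA per well:   {ng_per_well:.1f} ng")
print(cells_per_well)
```

Under this assumption, four wells consume 20 µL of the 22.5 µL complex, leaving a small excess for pipetting.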

Data analysis and statistics
The raw luminescence readouts were processed in Excel in a stepwise manner. First, the Fluc signal was normalized to the corresponding Rluc signal for each well. Second, the average Fluc/Rluc ratio for the no-promoter vector, pLK037, was calculated from its four replicate wells. Third, the Fluc/Rluc ratio of each well was divided by the average pLK037 ratio from step 2 above. This step effectively sets the average Fluc/Rluc ratio of pLK037 to 1, which represents the assay background. Lastly, the normalized promoter activity for each promoter construct was calculated as the average of the normalized Fluc/Rluc ratios among the four replicate wells. The corresponding standard error was calculated as the standard deviation divided by the square root of the number of replicates. Statistical comparison between any two promoter constructs was performed in RStudio using the pairwise.t.test function with Benjamini-Hochberg correction for multiple testing (adjusted p values for all data figures are provided in Additional file 3). Simple linear regression was conducted with the "stats" base package of R version 3.6. The significance level was set at 0.05 for all statistical tests.
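The normalization steps above, together with the Benjamini-Hochberg adjustment, can be sketched in Python. This is an illustration only: the original processing was done in Excel and R, the luminescence values below are invented for demonstration, and the use of the sample standard deviation (n - 1 denominator) is an assumption. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def normalize_plate(fluc, rluc, background="pLK037"):
    """Return {construct: (mean fold over background, standard error)}.

    fluc, rluc: dicts mapping construct name -> replicate well readings.
    """
    # Step 1: Fluc/Rluc ratio for every well
    ratios = {c: [f / r for f, r in zip(fluc[c], rluc[c])] for c in fluc}
    # Step 2: average ratio of the no-promoter vector
    bg = mean(ratios[background])
    summary = {}
    for construct, values in ratios.items():
        # Step 3: divide each well by the background average
        norm = [v / bg for v in values]
        # Step 4: mean and standard error (sample SD / sqrt(n), assumed)
        summary[construct] = (mean(norm), stdev(norm) / sqrt(len(norm)))
    return summary


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Invented readings for demonstration only (four replicate wells each)
demo_fluc = {"pLK037": [100, 110, 90, 100], "promoter_X": [5000, 5200, 4800, 5000]}
demo_rluc = {"pLK037": [1000, 1000, 1000, 1000], "promoter_X": [1000, 1000, 1000, 1000]}
demo_summary = normalize_plate(demo_fluc, demo_rluc)
print(demo_summary)
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

By construction, the background vector averages to 1 after normalization, so every other construct is reported as fold activity above assay background.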
Additional file 1: Promoter constructs and corresponding sequences. Table S1. Promoters assayed in Figure 2. Table S2. Promoters assayed in Figure 3. Table S3. Promoters assayed in Figure 4. Table S4. Copy number of two-monomer promoters and individual monomer and tether domains in the mouse genome.
Additional file 2: F9 data, sequence alignments and cell line authentication report. Figure S1. Comparison of sense and antisense promoter activities for two-monomer mouse L1 5'UTR consensus sequences in F9 cells. Figure S2. Differential contribution of monomer 2, monomer 1 and tether to overall promoter activity in F9 cells. Figure S3. Contribution of different lengths of monomer 3 to overall promoter activity in F9 cells. Figure S4. Alignment of M2 from A_I, Gf_I and Tf_I subfamilies. Figure S5. Alignment of A_I monomers. Figure S6. Alignment of Tf_I monomers. Figure S7. Alignment of Gf_I monomers. Figure S8. Alignment of tether sequences. Figure S9. Cell line authentication report for NIH/3T3 subline used.

Additional file 3: Adjusted p values from pairwise t-tests with Benjamini-Hochberg correction for promoter constructs in all data figures.