In vivo interrogation of regulatory genomes reveals extensive quasi-insufficiency in cancer evolution

Summary In contrast to mono- or biallelic loss of tumor-suppressor function, effects of discrete gene dysregulations, as caused by non-coding (epi)genome alterations, are poorly understood. Here, by perturbing the regulatory genome in mice, we uncover pervasive roles of subtle gene expression variation in cancer evolution. Genome-wide screens characterizing 1,450 tumors revealed that such quasi-insufficiency is extensive across entities and displays diverse context dependencies, such as distinct cell-of-origin associations in T-ALL subtypes. We compile catalogs of non-coding regions linked to quasi-insufficiency, show their enrichment with human cancer risk variants, and provide functional insights by engineering regulatory alterations in mice. As such, kilo-/megabase deletions in a Bcl11b-linked non-coding region triggered aggressive malignancies, with allele-specific tumor spectra reflecting gradual gene dysregulations through modular and cell-type-specific enhancer activities. Our study constitutes a first survey toward a systems-level understanding of quasi-insufficiency in cancer and gives multifaceted insights into tumor evolution and the tissue-specific effects of non-coding mutations.


In brief
Fischer et al. report screening methods interrogating the non-coding genome in mice, which uncovered widespread roles of subtle gene dysregulations in cancer evolution. They show that such quasiinsufficiency is extensive across entities. The discoveries give insights into the impairment of tumor suppression by human cancer risk variants and the tissue-specific effects of non-coding mutations. Human cancer risk variants linked to QI, understand their effects and tissue-specificity 6 INTRODUCTION Cancer evolution is driven by altered cellular signaling states, resulting from structural genome changes or dysregulated gene expression. 1,2 Early work on hereditary cancer syndromes, such as retinoblastoma, 3,4 linked oncogenesis to biallelic loss of tumor suppressor genes (TSGs). Experimental proof for this two-hit hypothesis was provided by the first TSG knockout mice, 4-6 while subsequent studies uncovered that, for some TSGs, inactivation of one allele is sufficient to promote oncogenesis. 7,8 This phenomenon, referred to as haploinsufficiency, can be obligate and often displays context dependencies. 9-12 Even more pronounced TSG dosage sensitivity became evident in studies analyzing hypomorphic Pten alleles, which showed that very minor variation of gene expression can lead to impaired tumor suppression-a state termed TSG quasi-insufficiency. 13,14 In analogy to TSGs, transformation induced by oncogenes often relies on an optimal dosage, which varies depending on the cellular or co-mutational context. 10, [15][16][17][18][19][20] While this dosagedependent continuum model of cancer gene function is documented for few genes, 14 scalable methods to systematically map and causally connect subtle gene dysregulations with cancer development in organisms are largely missing.
Subtle dysregulation affects thousands of genes in a cell and can result from interference with regulatory elements (REs). The protein-coding exome is 50 times smaller than the nonprotein-coding (nPC) genomic space, of which a considerable part is thought to constitute regulatory sequence. 21 During oncogenesis, the regulatory genome undergoes extensive changes, either through structural alterations (such as somatic mutations or copy-number variation) or adaptive processes (such as global chromatin remodeling through cell-intrinsic and -extrinsic triggers). 21,22 However, functional annotation of cancer-causing non-coding regulatory alterations, their combinatorial effects, and cell-type-specific functions remains a major challenge. 23 Likewise, while up to 90% of the genome is transcribed 24 (of which only a smaller part encodes for mRNAs) global functional interrogation of non-coding RNAs (ncRNAs) in cancer is in its infancy.
Genomic alterations in the nPC cancer genome are frequent, but their functional relevance is largely unexplored. 21 T-ALL is a prominent example for a disease characterized by low numbers of mutations in PC sequence (on average 6 per tumor), 25 but almost 1,000 in the nPC genome. 26 The effects of these non-coding mutations are not understood, barring few examples, [27][28][29][30][31] but could indicate a possible role of quasi-insufficiency in T-ALL evolution. Human T-ALL is a heterogeneous disease. The latest WHO classification added early T cell precursor ALL (ETP-ALL, which develops from immature T cells) as a biologically distinct-but in itself heterogeneous-sub-entity with poor prognosis. 32-34 The molecular principles shaping sequential evolution of different T-ALL subtypes is, however, not well understood.
The use of transposon systems for insertional mutagenesis in mice 35-37 made important contributions to the census of cancer genes. 20 Such screens proved particularly powerful in the discovery of drivers that are typically not mutated in human cancer but dysregulated by other means-and are hence difficult to identify by genome-sequencing approaches. Transposon insertions can also affect REs, 20,38 thereby likely causing subtle gene dysregulations. Here, we exploited insertional mutagenesis for systematic functional interrogation of the regulatory genome. We developed screening and analytical approaches, which allowed us to perform genome-wide surveys for quasi-insufficiency in solid and hematopoietic cancers. We also devise forward-directed screening approaches to interrogate in vivo cancer evolution. Using T-ALL as a model, these screens show how combinatorial codes of molecular, cellular, and temporal parameters dictate tumor subtype evolution, and highlight extensive quasi-insufficiency, which displays marked context dependencies, including cell-of-origin associations.

RESULTS
In vivo interrogation of the coding and non-coding genome using T-ALL as a model entity We previously developed PiggyBac screening systems for gene discovery in mice. 20,[37][38][39] We now set out to develop methods for systematic exploration of the nPC genome. Whole-body mutagenesis using the PiggyBac transposase and ATP2 type transposons induces tumorigenesis in the B, T, or myeloid lineage 37 ( Figure S1). To allow subtype-specific analyses, we generated a large cohort (n = 256) of Rosa26 PB/+ ;ATP2 mice, which we monitored for cancer development ( Figure S1; Table S1). Tumors were characterized using immunohistochemistry and T cell (acute) lymphoblastic lymphoma/leukemia (T-LBL/T-ALL; hereafter referred to as T-ALL; n = 51) was used as a model to investigate quasi-insufficiency in cancer ( Figures S1C and S2A).
Quantitative insertion site sequencing (QiSeq) 40 of all T cell tumors revealed 170,075 non-redundant transposon integrations ( Figure S2B). To map genomic regions affected by transposon insertions more significantly than expected by chance, we performed statistical analyses based on Gaussian kernel convolution (GKC). 41 Using CIMPL (common insertion site mapping platform), we identified 1,062 common insertion sites (CISs), of which 994 CISs were found in at least 10% of samples. Figure 1A displays the top 50 CIS genes, including: (1) known T-ALL drivers (such as Notch1, Pten, or Bcl11b), 42 (2) genes that have not been linked to T-ALL before, but to other hematologic malignancies (e.g., Cux1, Mecom, Crebbp), and (3) genes that have not yet been associated with hematopoietic cancers so far. Although the latter are typically poorly studied, some have been linked to signaling (Sh3kbp1, Sipa1l1) or immune functions (Slamf6, Ly6e, Mgat5). Moreover, we found that several of these genes are strongly regulated during T cell development (Gfra1, Nck2, Prim2, Serbp1, Fam169b) (Figure S3), indicating a function in the T cell lineage. The full list of CISs and information on known association to human cancer is provided in Tables S2 and S3.
To examine the suitability of our screening system for interrogation of the nPC genome, we first assessed general characteristics of PiggyBac transposition. By examining the global distribution of insertions we found that nearly half are located in intergenic regions ( Figure 1B). This is comparable with our hematopoietic screens performed with Sleeping Beauty (55% of insertions), a transposon system that does not have insertion biases toward intragenic insertions. This indicates that selection rather than integration preference is the source of non-coding CISs, thus supporting the functional relevance of regulatory regions in tumorigenesis.
Next, we compared PiggyBac insertion profiles with epigenetic features in the T cell lineage and investigated differences between the PC (the sum of exonic and intronic sequence; approximately 25% of the genome) and nPC genome. The overlay of all CIS regions with H3K27ac enhancer histone marks in healthy and malignant T cells revealed enrichment of active chromatin in CISs ( Figure 1C). Looking specifically at the PC genome, we found-as expected-a substantial accumulation of active chromatin marks and depletion of repressive marks at CIS-overlapping genes ( Figure 1D) compared with genes not overlapping with CISs. Notably, the enrichment of active chromatin marks in CISs is also true for the nPC genome ( Figures 1E and S4). Thus, beyond its preference for transcribed genes, 45-48 PiggyBac has a general propensity for active chromatin, supporting its application to perturb cancer-relevant REs.
Finally, because cancer driver insertions are more likely to support clonal outgrowth than passenger insertions, we examined sequencing read coverage profiles of genic and intergenic insertions. We found no major differences between these groups ( Figure 1F), suggesting comparable functional relevance of genic and regulatory CISs.

Annotation of epigenetic features in nPC CISs
For GKC statistics, commonly used scale parameters to identify protein-coding CISs range between 30 and 240 kb. Because the average size of REs (1.5 kb) is smaller than of PC genes (8 kb), we speculated that the scale parameter needs to be adjusted. Systematic comparison of different CIS window sizes used for GKC analyses indeed revealed that reducing the scale parameter to 5k increases the sensitivity of regulatory CIS discovery (Figures 2A, 2B, and S5; Table S4).
To annotate functional properties of intergenic CISs, we developed ARCIS (annotation pipeline for regulatory common insertion sites) (Figures 2C, S5, and S6; see STAR Methods for details), which we used to overlap CIS regions with epigenetic data from different T cell developmental stages or T-ALL (chromatin accessibility, histone modifications, and information on 3D organization, 43 Tables S5 and S6). The ARCIS output supports fast explorative analyses by allowing one to (1) rank CISs according to their regulatory potential (RE score), (2) sort for an RE category of interest, and (3) search for an RE target gene of interest (Tables S7 and S8). We also developed rules for final RE assignment based on manual inspection of ARCIS output data ( Figure S6). Overall, the analyses created a catalog of cancer-relevant REs in T-ALL. Specific results are shown for the 45 high-scoring REs in Figure 2E, which indicates for each CIS the related RE category and target gene/transcript. Data S1-S5 provide detailed visualizations of all related genomic regions.
To explore the human relevance of identified REs, we examined the regulatory activity of CIS syntenic human regions. To this end, we performed lift-over of mouse CIS coordinates to the human genome followed by annotation of a range of epigenomic human data ( Figures 2C, 2D, and S7; Table S9). Subsequent cross-species analyses revealed that the syntenic human regions of mouse regulatory CISs display high concordance in their regulatory activity ( Figure 2D).

Perturbation of regulatory CISs causes subtle dysregulations of target gene expression
We first examined intergenic REs and assigned potentially linked genes ( Figure 2E; Data S1; Table S8). Beyond known T-ALL drivers (such as Runx1, Lef1, Bcl11b, and Rasgrp1), this list includes genes with a role in T cell biology (Satb1 and Rag2) as well as developmental genes for which a function in T cells has not been described before (such as Sall3 and the Hoxd cluster). In principle, transposon insertions in regulatory regions can positively or negatively affect expression of target genes. Possible mechanisms include disruption of transcription factor binding sites, interference with 3D chromosomal conformation, and topology-associated domain structure (see Table S3 for details on the putative cancer relevance of target genes).
We next examined intragenic (intronic) REs, which are difficult to identify in screens, as common analytical approaches assign CISs primarily to overlapping genes. We therefore exploited 3D connectivity data 43 to assign intronic REs to their putative distant target genes. These analyses identified 30 CISs categorized as intronic REs (Data S5). Their main characteristics are: (1) clustered insertion peak in a narrow intronic area, (2) unbiased transposon orientation, (3) Hi-C connection to a distant gene, and (4) often absent CIS gene expression in the relevant tissue. Examples of genes regulated by newly identified REs include Pten (a known T-ALL tumor suppressor 49,50 ), Themis and Nrp1 (not implicated in T-ALL so far, but in T cell biology 51,52 ), or Txn1 and Iqgap2 (not studied in T cells so far).
The validity of the screen is exemplified by a narrow intronic CIS region in Rnls, which has a Hi-C connection to the $400 kb distant Pten promoter ( Figure 3A) and was recently described as a Pten enhancer. 53 Using global run-on sequencing (GRO-seq), we examined the relevance of this RE in human T-ALL patient data ( Figure 3B) and found cell-type-specific enhancer activity, with enhancer RNA signal peaks being present in T-ALL patients but not in HEK293T cells. Accordingly, CRISPR-Cas9-based deletion of the 7-8 kb RE region led to a stronger decrease of PTEN expression in human and murine T cells (34% and 24% reduction) than in HEK293 cells (15% reduction) ( Figure 3C).
We identified CISs affecting 54 nPC transcripts ( Figure 2E; Data S2; Table S8). More than 70% of these ncRNAs are expressed during T cell development ( Figure S5D). Several are in proximity to known T-ALL genes, such as Myb, Myc, and Ptprc. [54][55][56] Others are potential regulators of transcription factors and signaling genes, such as Fam126a, Il6st, and We performed detailed studies on a CIS overlapping with Zeb1, which was annotated as ''PC transcript plus ncRNA'' by ARCIS. Manual inspection of insertion patterns revealed two peaks with opposite transposon orientations, predicted to activate either Zeb1 or the Zeb1 antisense transcript Gm10125 ( Figure 3D). Human ZEB1-AS RNA can activate ZEB1 expression through recruitment of H3K4 methyltransferases. 57 Accordingly, we  Table S5). The intronic CIS region in the Rnls gene shows overlap with active chromatin and a Hi-C link to the Pten promoter. (B) Human PTEN locus. Indicated are the CIS-syntenic human regions (top, green) and GRO-seq tracks (red) of two T-ALL patients, the Jurkat and HEK293 cell lines. The syntenic region of the narrow regulatory CISs shows a typical bidirectional enhancer RNA GRO-seq signal peak in T-ALL patients and in Jurkat cells. Of note, RNLS is not expressed in T-ALL, supporting the notion that the CIS target is not Rnls itself, but its intronic RE. (E) Zeb1 expression in clones with/without CRISPR-Cas9-based knockout of Gm10125 exons 2 and 3 ($2 kb) in EL4 cells (KO n = 17, 9/17 homozygous, ctrl n = 16; deletion boundaries 8 kb upstream of the Zeb1 promoter). Each dot represents relative Zeb1 gene expression in a single-cell-derived clone normalized to Gapdh expression. *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, Wilcoxon test. (F) Number of cancer-associated GWAS variants in CIS-syntenic human regions. Cancer-risk variants were filtered from the NHGRI-EBI GWAS catalog and pruned for linkage disequilibrium. Variants were overlapped with CIS syntenic human coordinates. The sum of all CIS sizes (width) was used for statistical calculation. **p < 0.01, c 2 test. CIS, common insertion site; HSCs, hematopoietic stem cells; DN2, double-negative stage 2; DP, double-positive stage; Rel, relative. EL4, mouse T lymphoblastic cell line; Jurkat, human T-ALL cell line; HEK293, immortalized human embryonic kidney cells. observed decreased Zeb1 expression (39% reduction) upon heterozygous Zeb1-AS deletion in mouse EL4 cells ( Figure 3E). Thus, the functional outcome of both insertion clusters is induction of Zeb1 expression. Of note, Zeb1 was shown to have tumor-suppressive function in T cells: Zeb1 knockout mice develop mature (classical) T-ALL. 58 In our screen, however, Zeb1 insertions were (1) enriched in immature T-ALL and (2) predicted to be oncogenic, as in AML. 59 These results therefore suggest a dual role of Zeb1 in T-ALL, depending on cellular context (before or after T cell commitment).
We also validated regulatory CISs affecting other loci, including Ikzf1 ( Figures S8A-S8C) or Lncpnt/mir29 (see details in Figures S8D-S8F). The leukemia-associated transcription factor Ikzf1 was marked by two CISs ( Figure S8A). One CIS overlaps with Ikzf1 itself and displays the expected gene inactivation-type insertion pattern, consistent with the known tumor-suppressive function of Ikzf1 ( Figure S8B). The second CIS overlaps with a region 100 kb upstream of Ikzf1 harboring enhancer sequence as well as a lncRNA (Gm11998), which is subject to regulation during T cell development ( Figure S8A). Deleting this region in T-ALL cell lines caused subtle but significant effects on Ikzf1 expression (32% reduction, Figure S8C), confirming the predictions of the screen.
Taken together, these data show that nPC insertions are functional and exert subtle effects on target gene expression. Their large number suggests extensive quasi-insufficiency in T-ALL.
nPC CISs are enriched with human cancer risk variants Cancer risk variants identified in human genome-wide association studies (GWASs) frequently affect non-coding sequence, suggesting subtle gene regulatory effects. To explore a possible link between putative human and mouse regulatory alterations, we intersected human GWAS and our regulatory CIS lists. We indeed found that regulatory CIS targets (n = 149 genes) were highly significantly enriched for GWAS-associated human cancer variants (p = 0.001, pan-cancer variants; p = 2.98 3 10 À6 hematopoietic cancer variants; c 2 test; Table S10).
In an orthogonal approach we performed lift-over of mouse CISs coordinates to the human genome and used the syntenic human regions to analyze their overlap with cancer-associated GWAS variants (Table S11). To exclude SNPs located on the same haplotype block, the list of GWAS variants was pruned for linkage disequilibrium using LDlink. 60 We found an enrichment of pan-cancer GWAS risk variants in human genomic regions syntenic to mouse CISs (3.09 variants per Mb) as compared with their overall frequency in the human genome (1.48 pan-cancer GWAS risk variants per Mb; p = 2.3 3 10 À5 , c 2 test; Table S11). This enrichment was more pronounced for human regions syntenic to mouse intergenic CISs (4.48 variants per Mb; p = 0.0029; c 2 test) as compared with regions syntenic to intragenic CISs (2.82 variants per Mb; p = 0.001; c 2 test, Figure 3F). These results support the human relevance of the screens.
Gene desert deletions in mice drive oncogenesis through subtle regulatory effects Statistically, it is extremely unlikely for transposon insertions to occur on both alleles of a gene or regulatory region in the same cell. We therefore assume that interference with RE function in our screen is largely mono-allelic. This suggests that even very subtle interference with gene regulation can promote malignant transformation. To date there is, however, little evidence that this assumption holds true in mouse cancer models.
The most common pediatric T-ALL translocation is t(5;14)(q35;q32), fusing a gene desert (a genomic region without protein-coding genes) downstream of BCL11B to TLX3 (20%-25% pediatric, 5% adult cases), or more rarely to NKX2-5 or ZEB2. 61 Thereby, hijacking of BCL11B REs leads to overexpression of these translocation partners, which has been shown to be oncogenic. 62,63 However, it is unclear whether mono-allelic enhancer de-commissioning in itself is sufficient to induce tumors in organisms (for example, through reduced BCL11B expression). To address this question, we first explored the syntenic mouse region in our screen, which revealed several CISs in the gene desert downstream of Bcl11b, suggesting that interference with Bcl11b enhancers can indeed in itself be oncogenic (mice do not have translocations).
The translocation breakpoints in human T-ALL are almost exclusively located downstream BCL11B. This $1 Mb region displays regulatory activity, 64,65 which we confirmed at high resolution by GRO-seq in human T-ALL ( Figure 4A). The syntenic mouse regulatory region was marked by several independent CISs ( Figure 4B). Moreover, there are multiple physical interactions of CIS-marked putative REs with the Bcl11b promoter in the T cell lineage ( Figure 4B).
Guided by the human translocation coordinates (and the mouse CISs locations), we engineered two mouse models with kilo-to megabase-scale germline deletions (Bcl11b D105kb , n = 148 and Bcl11b D1Mb , n = 49) in the gene desert with regulatory activity downstream of Bcl11b ( Figure 4B). We found that not only biallelic deletion but also heterozygous knockout mice displayed reduced Bcl11b expression in healthy tissues, although the effects were very subtle for the smaller mutant Bcl11b D105kb ( Figure 4C). In both cohorts, animals started to develop signs of sickness at a young age (Table S12). A subset of animals developed symptoms reminiscent of neurodevelopmental phenotypes ( Figure S9A), such as tremor, consistent with a function of Bcl11b in brain development. 66 The second major phenotype was cancer. Overall, 22%-45% of animals developed tumors ( Figure 4D), while none of the animals in the wild-type cohort (n = 21) developed cancer. These numbers even underestimate the oncogenic effect of the knockouts, considering that a large subset of animals had to be sacrificed at a young age because of neurodevelopmental phenotypes. The tumor spectrum comprised hematologic cancers, including T and B cell malignancies as well as a range of solid cancers ( Figures 4E, S9B, and S10; Table S12). Of note, although all genotypes displayed highly penetrant cancer phenotypes, tumor onset differed substantially between groups: median tumorrelated survival was lowest in Bcl11b D1Mb/D1Mb mice (195 days), followed by Bcl11b D105kb/D105kb (338 days), Bcl11b D1Mb/WT (466 days), and Bcl11b D105kb/WT (640 days) mice ( Figure 4F).
We next examined whether individual genotypes give rise to different cancer phenotypes and found strongly biased representation for T-ALL ( Figures 4E and 4G). Bcl11b D1Mb/D1Mb mice  Table S5). (C) Bcl11b expression in thymi of healthy (no tumor) wild-type and knockout mice (ctrl, n = 5; 105 kb-het, n = 3; 105 kb-hom, n = 6; 1 Mb-het, n = 3; 1 Mb-hom, n = 1). qPCR was performed in duplicate and gene expression was normalized to Gapdh. Data are presented as mean ± SEM (**p < 0.01, Wilcoxon test). Article ll OPEN ACCESS developed no solid tumors, but almost exclusively T cell malignancies (9/10). The difference to other genotypes (11/45 T-ALL) is highly significant (p = 0.0002, Fisher's exact test) and also holds true in age-matched analyses (animals younger than 500 days: T-ALL in 8/9 vs. 10/27 mice, p = 0.0089, Fisher's exact test). At the other end of the genotype spectrum, we found that Bcl11b D105kb/WT animals did not develop T-ALL. Almost all cancers (12/13) were other than T-ALL (p = 0.013, Fisher's exact test) ( Figure 4G). Thus, deletions of regulatory DNA on an otherwise wild-type background are not only sufficient to induce striking cancer phenotypes, but their nature (position, size) and dosage (hetero-or homozygosity) also profoundly affect the outcome (tumor type and frequency), suggesting additive effects and enhancer modularity. Overall, these data support a model in which even subtle gene dysregulation can significantly contribute to oncogenesis.

Genome-wide screens reveal extensive quasiinsufficiency across entities
To explore whether subtle gene dysregulation is of broad relevance beyond T-ALL, we examined 1,450 cancers across 15 PiggyBac insertional mutagenesis screens, including 8 different cancer types and their subentities (Table S13). We found that 10%-38% of CISs in these screens are located in intergenic regions without an overlap to a PC gene ( Figure 5A), suggesting broad relevance of subtle gene dysregulation in oncogenesis.
To examine the distribution of functional traits in non-coding CISs, we performed transcription factor motif analyses, which uncovered cancer-type-specific enrichment profiles ( Figure 5B). For example, T-ALL non-coding CISs were enriched with binding motifs for ETS transcription factors (Ets1, Fli1, Erg), which have well-described roles in T cell leukemogenesis. In contrast, nPC CISs in the pancreatic screen displayed motif enrichment for transcription factors that have known functions in pancreatic acinar cell de-diferrentiation and transformation (such as Fra1/ 2 and Fos/Junb/Ap1. To explore the allelic status of insertions, we manually inspected all 581 high-coverage nPC insertions in the two screens. We found only 22 cases where two insertions were within a distance of 50 kb in the same cancer ( Figure 5C). These few insertions likely reflect independent hits on the same allele (in different cell clones) occurring through local transposon hopping, which is commonly observed and orders of magnitude more likely to occur than insertions on the homologous chromosome. Thus, This conclusion is further supported by functional studies, which we performed to examine effects of intergenic transposon insertions in non-hematopoietic tumors. In a pancreas screen conducted in a Kras G12D mutant background ( Figure 5), one of the top intergenic CISs marks a region with putative regulatory function located $145 kb downstream of the nuclear receptorinteracting protein 1 (Nrip1) promoter. Using CRISPR-Cas9, we generated 4.5 kb deletion knockouts of this region in Kras G12D mutant mouse pancreatic cancer cells. Comparative analyses using 17 wild-type and knockout clones revealed an overall reduction of Nrip1 expression by 30% in the knockout clones ( Figure S11). Another CIS-linked regulatory region in this screen is located 80 kb downstream of Enpp1, a nucleotide pyrophosphatase involved in anticancer immunity. We again performed CRISPR-Cas9 knockouts of this region (3.5 kb), which resulted in downregulation of Enpp1 ( Figure S11). Thus, both experimental series support the functional relevance of intergenic transposon insertions.
We next performed mouse to human lift-over of CIS coordinates for all intergenic CISs identified in the mouse screens (n = 2,337). We observed a significant enrichment of cancerassociated GWAS variants in the syntenic human CIS regions compared with the rest of the genome (p = 7.95 3 10 À7 , c 2 test), supporting the human relevance of discoveries made in mice.
Comparison of mouse screens exposed that individual entities and subentities display striking differences in the occurrence of intergenic CISs ( Figure 5A), suggesting context dependencies. In the hematopoietic system, for example, immature AMLs (defined by morphology/IHC) displayed much lower numbers of regulatory CISs than more mature forms (10% vs. 18%-19%, Figure 5A, immature AML vs. AML_P: Fisher's exact test, p = 0.045). Likewise, immature ETP-ALL have fewer regulatory CISs than their more mature ''classic'' T-ALL counterparts (detailed information on these analyses and T-ALL subtyping is provided below).
Altogether, these studies imply pervasive roles of quasi-insufficiency across cancers and suggest the existence of substantial context dependencies.
T-ALL subtyping for the study of regulatory context dependencies To explore possible context-dependent roles of quasi-insufficiency, we investigated whether our T-ALL model develops different disease subtypes. Human T-ALL is a heterogeneous disease. However, in the latest WHO classification one subset with unique biology has been recognized as a distinct sub-entity: ETP-ALL is characterized by the retention of myeloid and stem cell markers 32-34,67 ( Figure 6A). To characterize mouse tumors, we analyzed transcriptomes and determined immunophenotypes of T-ALL, mature T cell lymphomas (MTL), and healthy thymus ( Figure S12). IHC-based profiling revealed CD4-positive and -negative tumors, suggesting T-ALL heterogeneity in the cohort ( Figure S12). Model-based clustering of gene expression data identified three major subgroups (Figures 6B and S13; Table S14). We first compared the two largest clusters using gene set enrichment analysis ( Figure 6C) and found enrichment of signatures characteristic for human ETP-ALL in one group (e.g., IL6/Jak/Stat), while the second group was enriched for cell-cycle-associated genes, a characteristic of human non-ETP T-ALL (hereafter referred to as ''classical'' T-ALL). 67,68 Moreover, the enrichment of signatures specific for hematopoietic stem cells (HSCs), myeloid progenitors, and early T cells (ETP/DN1) in ETP-like tumors reflects their origin in early precursors, as in humans. In contrast, classical T-ALL displayed  Table S15). To account for the lack of mouse T-ALL classifiers, we built a 20 gene panel that separates subgroups ( Figure 6D).
The characteristics of the third subgroup were initially difficult to classify based on gene-expression profiles. We therefore inspected the insertion profiles in this group and found activating insertions in Mef2c as a top hit in the majority of samples, as defined by high-coverage Mef2c insertions with sense orientation ( Figure 6E). In contrast, classical or ETP-like tumors had no or predominantly low-coverage Mef2c insertions, respectively. We therefore refer to the third tumor cluster as the ''Mef2c-driven'' group, which indeed had high Mef2c expression levels ( Figure 6F). Mef2c-driven T-ALLs are CD4 negative (as are ETP-like tumors), indicating their development from precursor T cells ( Figure S12C). Indeed, MEF2C is strongly expressed in human HSCs and CLPs but not in T cells. 70 Activation of human MEF2C can occur through different translocations, which were associated with immature T-ALL and MEF2C-dependent suppression of Notch signaling. [70][71][72] Multi-scale mapping of T-ALL subtype evolution reveals context-dependent quasi-insufficiency patterns We next investigated whether quasi-insufficiency displays context dependencies at cellular, molecular, and temporal levels in tumor evolution. To this end we integrated analyses of chromatin profiles along the T cell lineage and transposon insertion landscapes. Cell of origin T-ALL is prototypical for tumor types that can arise from different developmental precursors or cell types. While the cell of origin can profoundly affect the biological properties of the evolving tumor, it often cannot easily be inferred through standard phenotyping. We hypothesized that, in scenarios of insertion biases (such as the preference of PiggyBac for open chromatin), the transposon insertion landscape in a cell population reflects a vague screenshot of global chromatin conformation at the stage of genome integration. To examine whether insertion profiles can give indications on a tumor's cell of origin, we overlapped subtype-specific CISs with regions of accessible chromatin in different cell types along the T cell developmental lineage (stage-specific ATAC-seq from Johnson and co-workers 69 ). We found that 79% of CISs in ETP-like tumors overlap with ATAC peaks specific for progenitor, natural killer, B, or early T cells, while classical T-ALL CISs overlap predominantly (68%) with regions of accessible chromatin that are specific for intermediate and late stages of T cell development ( Figure 6G). These data suggest that, despite ongoing transposon mobilization, some insertions carry a certain level of ''historical information'' that allows us to infer developmental origin, a concept that could possibly be expanded to the study of other cancer types.

Sequentiality
The molecular determinants driving individual stages of tumor evolution in different T-ALL subtypes are poorly understood. 73,74 Interrogation of such evolutionary principles in our dataset requires clonal deconvolution, which is, however, not supported by standard CIS calling algorithms. This is due to the non-quantitative statistical concept used to search for genomic ''insertion hot-spots'' in a cohort of mice. To overcome this problem, we conducted a second type of analysis, which integrates quantitative data for each of the 170,000 non-redundant insertions based on quantitative insertion site sequencing, a method we developed earlier for this purpose. 40 For each cancer, read coverages supporting individual insertions (range: 2 to 10,000) reflect their clonal distribution and likely position at the tumor's evolutionary tree.
These analyses revealed that, while top CISs are shared between T-ALL subtypes, their clonal distribution differs markedly, indicating distinct evolutionary hierarchies ( Figure 6H). The most prominent hit in ETP-like tumors was Ikzf1, which is supported by very high read coverages in virtually all tumors ( Figure 6H). Strong positive selection for Ikzf1 insertions in virtually all ETP-ALL establishes a critical role of Ikzf1 in the initiation of this T-ALL subtype. In contrast, Pten, which was characterized by highly subclonal insertions in ETP-like tumors, was the dominant high-coverage hit in classical T-ALL ( Figure 6H), indicating differential temporal orders of tumor-driving events in T-ALL subtypes. Importantly, chromatin accessibility at Ikzf1 and Pten (or other main drivers, including Notch1 and Mef2c) is similar at different stages of T cell development (that is in the different cell types from which T-ALL subtypes arise; Figure S15), excluding the possibility of integration biases driving the differential distribution or sequentiality of driver gene insertions in T-ALL subtypes. Intratumor heterogeneity We next exploited screening data to infer global characteristics of clonal architecture in different T-ALL subtypes. The analyses revealed that the Mef2c-driven group differs from other subtypes in that it displays: (1) fewer CISs, which is also true in samplesize-matched analyses (Figures S16A and S16B) and (2) reduced numbers of total insertions per tumor (Figures 6I and S16C). In humans, there is controversy as to whether MEF2C dysregulated and ETP-ALL feature a single or distinct disease entities. 75,76 Our results support the latter by highlighting substantial biological differences between subtypes. Subtype-specific driver genes We next performed subgroup-specific CIS analyses (Figures S16D-S16F; Tables S16-S18). CISs specific for ETP-ALL affected mature T cell genes (inactivation of Ikzf2, Ikzf3), Ras pathway components (Rapgef2, Nf1), and potential negative regulators of Wnt signaling (Kremen1, Tmem170b; not linked to T-ALL so far). Moreover, several genes linked to stemness or the myeloid lineage were among the top hits in this group (Cnr2, Chd2, Crebbp, Mecom) ( Figures S16D and S16E), which sheds light on several open questions in human ETP-ALL biology (detailed in the discussion). In the classical T-ALL subgroup, two observations stood out in addition to the predominance of Pten hits described above: (1) recurrent insertions in genes linked to late thymocyte development (Tcf12, Rpl5), consistent with the notion of classical T-ALLs arising from post-commitment DP cells, and (2) a large number of CISs affecting intergenic REs, especially among the top CISs ( Figures 6J, S16D, and S16F), suggesting a so far unappreciated importance of subtle gene regulation, specifically in classical T-ALL.
Collectively, these results uncovered key characteristics of different T-ALL subtypes (Figures 6K and S16G) and highlight the ability of our experimental system to interrogate tumor evolutionary principles in space and time. Their application exposed how the complex interplay of distinct cellular contexts, molecular triggers, and the temporal dynamics of their alterations and regulatory interactions drive cancer evolution along different trajectories that give rise to distinct cancer phenotypes/subentities.

DISCUSSION
During oncogenesis, regulatory landscapes undergo extensive changes, captured by global profiling studies, but not well understood at the functional level. Determining whether an alteration is cancer causing (driver) or neutral (passenger) is a challenge in epigenetics, more so than in structural cancer genomics and mutational profiling. Systematic in vivo perturbation of the noncoding genomic space is-given its enormous size and limited annotation-difficult to achieve by targeted approaches, such as library-based CRISPR screens (which face size limits). Even greater restraints to screening scalability arise from the hurdles associated with somatic delivery of libraries to many organs and cell types. 20,77 Insertional mutagenesis using endogenous transposon systems addresses these limitations. Continuous mobilization and random reintegration of multiple transposons in every single cell of the mouse produces enormous mutational complexity, constituting a pool of hundreds of billions of alterations in non-transformed cells that are then subject to selection. We developed universally applicable methods for interrogation of the nPC genome using transposon mutagenesis. Applied to the evolution in T-ALL, the screens assembled catalogs of cancer-relevant REs and nPC transcripts, constituting the first systematic survey of its kind.
Another area of cancer research that faces methodological constraints is evolutionary genetics. In human cancer samples, genetic evolution is inferred retrospectively. Owing to selection and clonal sweep, such analyses typically capture sequentiality of the latest mutations only. Even more challenging is the discovery of subtle, often temporally restricted (and reversible) regulatory processes during tumor evolution, or the capture of other determinants, such as the cell of origin. We devised methods for prospective interrogation of cancer evolution by intersecting epigenomic data along the T cell developmental lineage and insertional mutagenesis to induce different T-ALL subentities. We mapped perturbations driving evolution, including their type and quality, regulatory fine-tuning, combinatorial code, temporal sequence, and cellular/evolutionary history ( Figure S16G). Capturing the complex interactions of these different layers provided multifaceted new insights into T-ALL subtype evolution ( Figure 6K) and explain several open questions in the field.
It was reported earlier that IKZF1 alterations are enriched in human ETP-ALL compared with classic T-ALL. 67 The screens in mice add both functional and temporal information to this association by linking Ikzf1 to the initiation of this T-ALL subtype. Moreover, ETP-ALL displayed exclusivity for a number of screening hits, including genes or REs/gene pairs linked to stemness or the myeloid lineage. Coupled with the observation that stem cell and/or myeloid markers are expressed in human ETP-ALL, 78 the latter suggests a specific vulnerability of pre-commitment progenitors (but not committed T cells) to transformation by insults sustaining the lympho-myeloid program. The variety of myeloid/stem cell-related CISs we found in ETP-like ALL indicates that various such triggers can contribute to ETP cell transformation. This might explain the phenotypic diversity of human ETP-ALL, which express different combinations of stem cell and/or myeloid markers. 33 We speculate that distinct genetic interactions define whether perturbation of such ''myeloid/stem cell'' genes in precursors promotes AML or rather ETP-ALL development. Indeed, while in our AML screen Mecom activation was found as a top truncal driver (unpublished data), Mecom insertions in ETP-ALL were preceded by truncal Notch1 and Ikzf1 insertions.
Our screening system constitutes a model for Mef2c-driven T-ALL, which gave new insights into the biology of this sub-entity. In humans, there is controversy as to whether MEF2C dysregulated and ETP-ALL feature a single disease entity or should be regarded as distinct groups (there is often partial discordance of immunophenotypic characteristics 75,76 ). The mouse model supports the latter. It shows that there are biological differences between ETP-like and Mef2c-driven tumors at various levels, including their transcriptional profiles, driver genes or clonal architectures. This might also explain differences in treatment response and resistance between groups as human MEF2C T-ALL respond poorly to glucocorticoids. 76 The dominance of high-coverage Pten hits in classical mouse T-ALLs (vs. low-coverage hits in ETP-ALL) highlights differential sequentiality of tumor driving events in T-ALL subtypes. Strong selection for Pten inactivation in our functional screens likely reflects subtype-specific constraints or exigencies during early evolution of classical T-ALL and rationalizes the enrichment of PI3K pathway alterations in the equivalent mature forms of human T-ALL. 49, 79 Another observation that stood out in the classical subgroup was the large number of CISs affecting intergenic REs, suggesting a so far unappreciated importance of subtle gene regulation specifically in classical T-ALL. The evolutionary pressures underlying these observations remain to be explored. It is possible, for example, that classical T-ALL rely on establishing a fine-tuned dosage reduction of T cell commitment genes rather than their complete inactivation, which might either be deleterious at this stage or lead to a phenotypic switch toward a less differentiated state.
Over the past two decades, much effort in cancer genetics has focused on identifying coding mutations, a process that had transformative impact in cancer biology. Our results suggest, however, that-beyond coding mutations-a vast and so far understudied layer of molecular dysregulations contributes to oncogenesis. The screens across many cancer types described here provide evidence for widespread haplo-and quasi-insufficiency in tumor evolution, and we show in mouse models that even small or temporal gene expression changes of tumor suppressors, such as Bcl11b, can be oncogenic. Transposon insertions affecting REs predominantly cause subtle gene dysregulations, as do human cancer risk variants, which are commonly located in the nPC genome. The strong enrichment of GWAS hotspots in our catalogs of regulatory CISs thus reinforces the human relevance of this study.
Functional experiments targeting cancer-relevant REs provided mechanistic insights into the pathogenic outcomes of gradual gene dysregulations. Allelic deletion series that we engineered in a large ''gene desert'' downstream Bcl11b in mice Cell Genomics 3, 100276, March 8, 2023 13 Article ll OPEN ACCESS displayed striking developmental and cancer phenotypes. This not only confirmed the predictions made by the screen, but also provided new insights into the biology of the most frequent human translocation in pediatric T-ALL. Allele-specific differences in tumor penetrance, latency, and spectra/types reflect gradual gene dysregulations through enhancer modularity, with additive effects and tissue-specific phenotypic outcomes-concepts that require organismal models, such as the ones developed here, for their interrogation and proof.
Global interrogation of quasi-insufficiency requires methodology capable to induce genome-wide subtle perturbations-in experimental systems that can capture the relevant readout, that is cancer development in an organism. Our screening approach fulfills these requirements and enabled a comprehensive survey toward a systems-level understanding of subtle gene dysregulation in cancer. Both, our studies covering multiple-entities as well as the focused hematopoietic screens revealed (sub)entity-specific differences in the global extent of RE alterations and quasiinsufficiency. Subtle gene dysregulations were less predominant, for example, in immature T-ALL than in tumors originating from committed T cells, an observation that was also mirrored in screens for myeloid malignancies. Selection of fine-tuned rewiring of signaling networks during transformation is not surprising, nor is its context-specific variance: depending on the cell of origin and oncogenic insult, the path between oncogenic cell fate changes and cell death can be narrow, requiring precise orchestration of molecular reprogramming during transformation.
Limitations of the study One limitation of the screening approach is the dependence of transposon-induced gene activation on splicing. As a result, all genes having their translation initiation codon in exon-1 (first exons do not have a splice acceptor) cannot be activated by the promoter engineered into transposons, unless the genes possess an alternative in-frame ATG. Another limitation is the difficulty to predict effects of insertions in the non-coding genomic space. This contrasts the analysis of the protein-coding CISs, where transposon insertion patterns predict whether gene activation or inactivation is the cancer-driving mechanism. In principle, transposon insertions in regulatory regions can positively or negatively affect expression of target genes. Possible mechanisms underlying repressive effects include disruption of transcription factor binding sites, interference with 3D chromosomal conformation and topology-associated domain structure. In contrast, gene activation can in principle be mediated by the transposon's activating elements or the disruption of silencer/insulator sequences. In this study, effects related to selected intergenic CISs were validated functionally. Finally, some of the analyses comparing different T-ALL subentities would have benefited from a larger sample size, which may increase the power of the discovery approach.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:  x 129Sv x FVB background. Different ATP2 lines were used to generate final cohorts, which differ in their number of transposon copies and the donor locus (ATP2-S1: donor locus chr17, 15 copies; ATP2-H27: donor locus chr4, 20 copies; ATP2-H32: donor locus chr2, 25 copies). The T-ALL cohort included 48 ATP2-S1, 2 ATP2-H27 and 1 ATP2-H32 mice (Table S1) In addition to in depth analyses of tumors originating from these mouse cohorts, we have also explored specific questions (on regulatory CISs) in datasets emanating from a large number of -mostly so far unpublished -constitutive and conditional PiggyBac screens 38,87,88 (Table S13). In total we analyzed 1450 tumors from 15 screens. Additionally, for comparison of integration profiles, we analyzed a cohort of hematologic malignancies (n = 11) derived from a whole-body Sleeping Beauty Screen. 82 We generated intergenic knockout mouse models using MICER targeting vectors as previously described. 81 For the 1 Mb deletion, the MHPN-250E23 and the MHPP-53N24 and for the 105 kb deletion, the MHPN-262H24 and the MHPP-53N24 targeting vectors from the MICER 3'Hprt (MHPP) and 5'Hprt (MHPN) library (CloneDB database) were used. Mouse embryonic stem (ES) cells were transfected by electroporation and those carrying the vector were selected. After transient Cre expression, the Hprt minigene recombines in ES cells carrying both vectors and mediates hypoxanthine/aminopterin/thymidine (HAT) resistance. ES cells carrying the deletion were selected with HAT-Medium and injected into C57BL/6-derived blastocysts to generate the mice. Experimental mice were maintained on a mixed C57BL/6 x 129Sv x FVB background.
Necropsy and histopathological analysis All animals were monitored regularly for signs of sickness (e.g., inactivity, palpable masses and poor grooming). During necropsy, a gross inspection of all internal organs was carried out. For DNA/RNA isolation, tissue samples were stored in RNAlater (Sigma). For histology, tissue samples were fixed in 4% formaldehyde, paraffin-embedded, sectioned, and stained using hematoxylin and eosin following standard protocols.
Immunohistochemistry Immunohistochemistry (IHC) was performed on a Bond Rxm (Leica) using a Polymer Refine detection kit without post-primary antibody. Slides were deparaffinized and pretreated with Epitope retrieval solution 1 (ER1, citrate buffer, pH = 6) or solution 2 (ER2, EDTA buffer, pH = 9) as indicated. The following primary antibodies were used: rat anti-B220/CD45R (B220, BD Bioscience, 1:50 dilution, ER1, 20 min), rat anti-CD138 (281-2, BD Bioscience, 1:59, ER2, 20 min), rat anti-MPO (A0398, DAKO, 1:100, ER2, 20 min), rabbit anti-CD3 (Sp7, DCS, 1:100 dilution, ER1, 20 min), rabbit anti-Tdt (005, Supertechs, 1:100, ER2, 20 min) and rat anti-CD4 (GHH4, Dianova DIA-404, 1:50 dilution, ER2, 40 min). Rabbit anti-rat secondary antibody (Vector, 1:400) was applied for primary rat antibodies. Slides were counterstained with hematoxylin and coverslipped after manual rehydration. Slides were scanned with a Leica AT2 scanning system. HE stainings and IHCs were evaluated by experienced mouse pathologists, who were blinded to the mouse genotypes according to the Bethesda proposals for classification of lymphoid neoplasms. 89 Quantitative transposon insertion site sequencing QiSeq is a method for semi-quantitative transposon insertion sequencing that we developed earlier. 40 Briefly, DNA samples were sheared with a Covaris AFA sonicator to a mean fragment length of 250 bp. The fragmented DNA was then end-repaired, A-tailed and a splinkerette adapter was ligated to each DNA end. For the 5' and 3' transposon end, subsequent steps (amplification and sequencing of transposon-genome junctions) were conducted separately. The specific structure of the splinkerette adapter (Y-shaped design with a template and a hairpin strand) ensures that only transposon-genome junction fragments (and not genomic fragments without transposon insert) can be amplified in the first PCR step (which was conducted with transposon-and splinkerettespecific primers). Afterwards, a second nested PCR step was performed for further amplification, barcoding of samples and extension with Illumina flow cell binding sites P5 and P7. Each sample was then quantified with quantitative real-time PCR (using P5-and P7-specific primers). Subsequently, samples were equimolarly mixed and the library pool was again quantified. Libraries were sequenced on the Illumina MiSeq sequencer (75 bp, paired-end). Mapping of integrations to the mouse genome (mm10) was performed using the SSAHA2 algorithm and sequences containing transposon-genome junctions were selected for downstream analyses. Read coverage of insertions was normalized to the top hit of each sample (normalized read coverage).

CIS calling and downstream analyses
For the identification of common insertion sites (CISs; genomic regions that are more frequently hit by transposons than expected by chance), ATP2 insertions were subjected to statistical analysis using CIMPL (Common Insertion site Mapping PLatform), 41 which is based on a Gaussian kernel convolution framework. CIMPL assigns a p value to each CIS (listed in Table S2) and controls the errors at an average of 5%. Insertions within a 3 Mb region upstream and downstream of the transposon donor locus were excluded from the analysis (local hopping area of the transposon as described in 37 ). CISs were ranked according to the number of contributing insertions. Sfi1, a known artefact frequently detected in insertional mutagenesis screens, was removed from the list of CIS genes. 90 Additionally, Arid1b and Mmp16 were excluded due to their close proximity to the donor locus on chr17 and chr4, respectively. CIS genes for Table S2 were ranked according to the number of contributing samples. A scale parameter of 30 k was used for CIS identification. Profile plots and profile heatmap plots for visualization of ChIP-Seq peak enrichment in CIS regions were created using deeptools. 91 Subgroup specific CIS analysis were performed using a scale parameter of 5 k and were ranked according to the number of contributing insertions. For comparisons of the number of intergenic CISs in multiple cancer types, a scale parameter of 5 k was used. The Cell Genomics 3, 100276, March 8, 2023 e3 Article ll GENCODE Biotype annotation was used to differentiate genic (protein-coding and immunoglobulin genes) and intergenic (all other biotypes) CISs. CIS genes were compared to human cancer sequencing studies [92][93][94][95][96][97] (Table S3).
To exclude the possibility that insertion biases rather than selection are underlying non-coding CISs accumulation, we also used Transmicron 82 for CIS calling, which corrects for insertion biases by modelling neutral insertion rates based on chromatin accessibility, transcriptional activity, and sequence context. CISs were filtered for an adjusted p value < 0.05 and overlapped with CIMPL 5 k CISs. Of the 537 evaluated CIMPL-CISs (Table S8) and the top-ranked CIMPL-CISs ( Figure 2E), 79% and 84% were also called by Transmicron, respectively.
Footprint plots and transcription factor motif search Footprint plots were generated using intergenic T-ALL CISs (n = 227) as an input. To show the specificity of the footprint signature, a random background model was generated. Instead of selecting arbitrary loci in the genome, a set of ''control CISs'' was generated with specific characteristics as similar as possible to the original CISs. For each CIS, the chromosome, width, number of insertions as well as the number of comprised TTAA positions was used to identify a matching region in the genome resulting in a large pool of ''control CISs'' for each original CIS. ''Control CISs'' overlapping with any CIS were removed. The peak position for each mimicry-CIS was determined by the relative position of the peak in the original CIS. The procedure of identifying the overlap density with annotated regions was identical for the original and the control data, whereby for the latter this step was repeated 100 times in a bootstrapping approach. Every time one randomly selected ''control CIS'' from the previously generated pool of candidates was selected, and the final density line was then generated based on the 97.5% quantile of all values. ChIP-Seq input files are listed in Table S5. For analysis of transcription factor binding sites, regions flanking 200 bp of single intergenic insertions sites were used as input. The find-MotifsGenome.pl tool of HOMER (v4.11) 83 was used for known motif analysis using default parameter.

CIS annotation pipeline (ARCIS)
For the identification of CISs using CIMPL, the scale parameter was set to 5 k to identify narrow regions with regulatory potential (Figures 2A, 2B, and S5A). The resulting CIS coordinates were overlapped with a collection of publicly available datasets listed in Table S5 43,44,69,98-104 using the GenomicRanges R package. 105 The data was post-processed into a BED3 format with an additional column for name assignment. For overlap with peak-based files (ChIP-Seq, DNase-Seq), the number of overlapping peaks and the distance to the closest peak are reported. For interaction datasets (Hi-C from different stages of development and data from dbSuper [mouse thymus]), linked target genes are annotated. Chromatin regions specifically increasing or decreasing during T cell development (change in A and B compartment scores called from Hi-C data in Hu et al. 43 ) were also overlapped with CIS regions (all publicly available datasets are listed in Table S5, detailed ARCIS scheme shown in Figure S5B).
Additionally, we run a chromatin Hidden Markov Model (chromHMM) 84 on a collection of thymus ChIP-Seq data from ENCODE to define chromatin states, based on distinct combinations of histone marks. Chromatin states were used as an additional input dataset for ARCIS. We used six thymus-specific ChIP-Seq datasets: H3K4me1, H3K27ac, H3K4me3, H3K27me3, H3K36me3 and CTCF. The observed chromatin combinations resulted in eight manually assigned chromatin states: active/weak/poised/insulated enhancer, active promotor, gene body, CTCF binding sites and quiescent ( Figure S5C, Table S7). For the human T-ALL cell lines DND41 and Jurkat chromatin states are shown in Figure S7. The predicted regions showed a median range of 400-1000 bp. To overcome the issue of genomic sections with various small interrupted states, we implemented a merging and smoothing step. For each chromatin state, neighboring regions within 3000 bp of each other were merged into a single larger region, while bridging other states in between. Resulting smooth chromatin states were filtered to only include regions with a minimum size of 4000 bp. As an additional attempt to obtain large and coherent regulatory elements, the same procedure was also applied to combinations of interrelated chromatin states in close distance, like active with weak enhancers as well as promoters with active enhancers. Regarding the many different states, only active and weak enhancer chromatin states were used for CIS annotation. The number of overlaps as well as the distance to the closest element is reported. As silencers are less well-studied, and CISs affecting insulators were rare, we focused our downstream analyses on enhancers and ncRNAs.
ARCIS computes the overlap between CIS regions and all input datasets and reports number of observed overlaps as well as the putative target gene. For each CIS, ARCIS reports: (i) transcript annotation, (ii) number of peaks or distance to the closest peak (for ChIP-, ATAC-and DNase-Seq), (iii) intersection with super-enhancers, (iv) connected target gene (for Hi-C and dbSuper), (v) information on chromatin access change during T cell development, and (vi) annotation from chromHMM (Table S7). ARCIS calculates a 'RE-score' based on a combination of features and ranks the CISs according to their regulatory potential reflecting an additional layer of information to support data interpretation (details shown in Figure S6). In brief, overlaps with super-enhancers (dbSuper), increasing chromatin accessibility, active/weak enhancers according to chromHMM data and Hi-C connections were used for scoring. The ARCIS output comprises a full (Table S7) as well as a user-friendly ''reduced'' format (Table S8).
The classification of intergenic enhancers without overlap to other functional elements is straightforward. However, because functional elements in the genome often overlap (e.g. enhancer overlapping with lncRNAs and/or mRNAs) or are found in close proximity to each other, a substantial part of putative REs has to be further inspected individually (Figures 2E and S6).
Main criteria for discriminating different RE categories are the position of CIS-insertions in relation to potential overlapping/neighboring functional units in the genome, their orientation as well as the pattern of insertion clusters across samples. Integrating lncRNA and mRNA expression profiles in respective tissues further aids discrimination of the RE type targeted by transposons in regions with e4 Cell Genomics 3, 100276, March 8, 2023 Article ll identifiers (UMIs) and an adaptor. Ends of the cDNAs were extended by a template switch oligo (TSO) and full-length cDNA was amplified with primers binding to the TSO-site and the adaptor. NEB UltraII FS kit was used to fragment cDNA. After end repair and A-tailing a TruSeq adapter was ligated and 3'-end-fragments were finally amplified using primers with Illumina P5 and P7 overhangs. In comparison to Parekh et al., 115 the P5 and P7 sites were exchanged to allow sequencing of the cDNA in read1 and barcodes and UMIs in read2 to achieve a better cluster recognition. The library was sequenced on a NextSeq 500 (Illumina) with 63 cycles for the cDNA in read1 and 16 cycles for the barcodes and UMIs in read2.

RNA-seq data analysis
Gencode gene annotations M25 and the mouse reference genome GRCm38 were derived from the Gencode homepage (EMBL-EBI). Data was processed using the published Drop-Seq pipeline (v1.12) to generate sample-and gene-wise UMI tables. 116 The resulting UMI filtered count matrix was imported into R v4.0.1. Lowly expressed genes were filtered so that 80% of samples have at least three read counts per gene. The data was normalized to sequencing depth (within sample normalization) and variance stabilized (between sample normalization). This was done via the rlog transformation implemented in the DESeq2 package and dispersion of the data was estimated with an intercept only model using DESeq2 v1.28.1. 117 Details on statistical analysis are described in the chapter 'Quantification and Statistical Analysis'.

GSEA
For gene set enrichment analysis GSEA v4.0.3 85 118,119 A pathway was considered to be significantly associated with an experimental condition if the FWER was below 0.05. All statistical values can be found in Table S15. Details on statistical analysis are described in the chapter 'Quantification and Statistical Analysis'.
We additionally performed lift-over of mouse CISs (5 k size parameter) coordinates to the human genome (hg38) using the UCSC liftover tool and used the syntenic human regions to analyze their overlap with cancer-associated GWAS variants. Details on statistical analysis are described in the chapter 'Quantification and Statistical Analysis'.

QUANTIFICATION AND STATISTICAL ANALYSIS
General statistical analyses All statistical analyses were performed using R v4.0.1. Methods used for statistical hypothesis testing and exact n numbers are directly stated in the figure legends. In general, the significance level was set to 0.05. Boxplots were generated using the default ggplot2 geom_boxplot settings (middle, median; lower hinge, 25% quantile; upper hinge, 75% quantile; upper/lower whisker, largest/smallest observation less/greater than or equal to upper/lower hinge ±1.5 * IQR).

CIS analysis
For CIS analysis, CIMPL assigns a p value to each CIS (listed in Table S2) and controls the errors at an average of 5%.

RNA-seq data analysis
Principal Component Analysis (PCA) was conducted with the 10 percent top variable genes in the rlog transformed dataset. The cola R package was used to compare different clustering methods. 86 The cola package provides a general framework for subgroup classification by consensus partitioning. The rlog transformed expression matrix was used as an input and cola was run with default parameters. The results show that 2 or 4 clusters were predicted as best choice (statistical details in Table S14, Figure S13A). After careful review of the biology behind the clusters, the combination of CV (coefficient of variance) as top value method and mclust as clustering method was chosen for downstream analyses. This approach predicted 4 as best k parameter and resulting cluster assignments are shown in the PCA embedding ( Figure S13). Samples showing ambiguous clustering were not used in downstream analyses (Table S1). Detailed subtype analyses (Figures 6C-6J) were only performed on this subset of samples (ETP-like n = 14, classical n = 8, Mef2c-driven n = 7). Cluster assignments were then used as explanatory variable during model fitting with DESeq2. The Wald test was used for determining differentially regulated genes between all pairwise clusters. Shrunken log2 fold changes were calculated afterward. A gene was determined to be significantly regulated if the p value was below 0.05. Rlog transformation of the data was performed for visualization and further downstream analysis.  85 was used to perform gene set enrichment analysis in the preranked mode using the apeglm shrunken log2 fold changes as ranking metric. apeglm shrinkage is a process to correct foldchanges that are overestimated due to low expression of genes or highly variable genes. 120 A pathway was considered to be significantly associated with an experimental condition if the FWER was below 0.05. All statistical values can be found in Table S15.

Analyses of GWAS data
The GWAS catalog was downloaded from https://www.ebi.ac.uk/gwas/ (EMBL-EBI). 80 We assessed whether reported genes in the GWAS catalog were over-represented among CIS-target genes. For enrichment calculation, the number of protein-coding genes in the genome was used as a control (19,370;Gencode). c 2 test was used to calculate enrichment p values (Table S11).
We additionally performed lift-over of mouse CISs (5 k size parameter) coordinates to the human genome (hg38) using the UCSC liftover tool and used the syntenic human regions to analyze their overlap with cancer-associated GWAS variants. Variants (n = 8,677) were pruned for linkage disequilibrium using the SNPclip tool (https://analysistools.cancer.gov/LDlink/?tab=snpclip) from LDlink 60 with a threshold of R 2 = 0.8 and MAF = 0.01. The thinned list of variants (n = 4,625) was used for overlap with CIS regions. c 2 test was used to calculate enrichment p values considering the sum of all human CIS regions (size in basepairs) in comparison to the size of the complete genome (Table S12).