Insulation of gene expression by CTCF and cohesin-based subTAD (ubiquitous intra-TAD) loop structures in mouse liver

CTCF and cohesin are key drivers of 3D-nuclear organization, anchoring the megabase-scale Topologically Associating Domains (TADs) that segment the genome. Here, we present a computational method to identify cohesin-and-CTCF binding sites that form intra-TAD DNA loops (subTADs). We show that predicted subTAD anchors are structurally indistinguishable from those of TADs regarding their binding partners, sequence conservation, and resistance to cohesin knockdown; further, the subTAD loops retain key functional features of TADs, including insulation of chromatin contacts, blockage of repressive histone mark spread, and ubiquity across tissues. We propose that subTADs form by the same loop extrusion mechanism as larger loops, and that their shorter length enables finer regulatory control over gene expression. 4C-seq analysis using an Alb promoter viewpoint illustrates the role of subTADs in restricting enhancer-promoter interactions. These findings elucidate the role of intra-TAD cohesin-and-CTCF binding in nuclear organization, and demonstrate that distal enhancer insulation by subTADs is widespread.

2/23/18 Figure 1E, each TAD was designated as active, weakly active, inactive, or weakly inactive, based on the signal 149 distribution around the boundary and the eigenvalue from Hi-C PCA analysis along the length of the TAD (see 150 Suppl. Methods and listing of TADs in Table S1A). Striking differences in gene expression were seen between 151 active and inactive compartment TADs (median expression 1.095 FPKM for 12,258 genes in active TADs vs. 152 0.003 FPKM for 4,643 genes in inactive TADs; Figure 1G).

154
We sought to determine the tissue-specificity of the genes in active vs. inactive TADs. We used the 155 expression level of each gene across ENCODE tissues to calculate Tau scores, a robust metric for tissue 156 specificity [45,46]. Tau values are close to 1 are highly tissue specific, while lower values (< ~0.3) are widely 157 expressed and considered housekeeping genes (see Methods). A greater fraction of genes located in inactive

160
Furthermore, genes whose TSS is close to a TAD boundary (i.e., TSS within 2% of the total TAD length in either 161 direction from the boundary) tend to be less tissue specific than the genomic average ( Figure S1H). Active 162 transcription may be a key driver of dynamic cohesin movement in the nucleus [47], and RNA polymerase II, in 163 vitro, is capable of translocating cohesin rings along DNA [48]. Thus, the ubiquitous expression of genes at TAD 164 boundaries could be either a driver or an initiator of loop extrusion, although the exact mechanism remains  Figure S2A). We considered two possibilities: 1) TAD-internal 174 CAC sites form subTAD loops that are too short to be detected in standard Hi-C datasets; and 2) additional 175 factors associated with TAD boundary CTCF sites differentiate them from other such binding sites in the 176 genome (see Discussion). To examine the first possibility, we used an algorithm for analysis of CTCF loops [31] 177 and adapted it to predict subTAD-scale loops in silico, using CTCF and cohesin peak strength and CTCF Page 7 2/23/18 orientation as inputs (Figure 2A), and building on the finding that >90% of CTCF-based loops are formed 179 between inwardly-oriented CTCF sites [12]. Each mouse liver CAC site was given a score that represents its 180 peak strength and CTCF motif score, and an orientation was assigned based on whether the non-palindromic 181 CTCF motif was present on the (+) strand or the (-) strand, considering the highest scoring CTCF motif at each 182 CAC site. Scanning the genome, each (+) strand CAC peak was connected to putative downstream (-) strand 183 CAC sites. Low scoring CAC peaks were removed and the process was iteratively repeated until the top 20,000 184 candidate subTAD loops remained. The set of loops was then filtered, as detailed in Methods, to take into 185 account cohesin scoring, and to ensure TSS overlap and <80% TAD overlap, to restrict our definition of analyses, we excluded from the subTAD anchor group the 1,632 subTAD anchors that are shared with TAD 203 anchors to ensure that the groups compared are mutually exclusive ( Figure S3B). We first sought to determine if 204 the subTAD loop anchors show conserved CTCF binding across multiple ENCODE tissues, as seen for TADs in 205 Figure 1C. Figure 2C shows the tissue distribution of CTCF binding at CTCF binding sites found at subTAD 206 anchors in liver, where an X-axis value of 1 indicates the liver CTCF binding site is occupied by CTCF in only 207 one other tissue, and a value of 15 indicates binding occurs in all 15 mouse tissues where CTCF ChIP-seq data 2/23/18 is available. Results show that a large majority of these CTCF binding sites are bound by CTCF in at least 10 of 209 the 15 mouse tissues examined. Indeed, CTCF binding at the TAD-internal subTAD loop anchors is more 210 deeply conserved across mouse tissues than that at TAD boundaries. In contrast, CTCF sites not associated 211 with cohesin binding (lone CTCF sites), and to a lesser extent CAC sites not at subTAD or TAD anchors (other 212 CAC sites) showed much greater tissue specificity for CTCF binding ( Figure 2C).

214
The enrichment of knockdown-resistant cohesin binding sites at TAD boundaries, seen in Figure 1B, may 215 explain the persistence of domain structures following CTCF or cohesin depletion [15,41]. Further, we found 216 that 80% of TAD anchors and 90% of subTAD anchors are resistant to the loss of cohesin binding in Rad21 +/-217 mice vs. only 52.8% for all CAC sites ( Figure 2D). Moreover, a large fraction of TAD and subTAD anchors, 218 70.6% and 77.6%, respectively, are comprised of "triple sites", where cohesin and CTCF are co-bound with 219 Top2b, a potential component of the loop extrusion complex [18], vs. only 46.6% for the set of all CTCF sites 220 ( Figure 2D). Top2b binding appears to be associated with cohesin binding rather than with CTCF binding, as it 221 is present at enhancer-like CNC sites much more frequently than at CTCF sites in the absence of cohesin 222 ( Figure S3C).

224
TAD and subTAD loop anchors show greater sequence conservation within the core 18 bp CTCF motif than 225 other CTCF sites ( Figure 2E). Analysis of sequences surrounding the CTCF core motif did not provide evidence 226 for loop anchor-specific motif usage or cofactor binding ( Figure S3D,E). Downstream from the core CTCF motif 227 (within the loop interior) we observed a shoulder of high sequence conservation, as well as additional complex 228 motif usage, likely due to the multivalency of CTCF-DNA interaction, as described in [49]. TAD and subTAD 229 anchors showed broader and more complex CTCF motif usage outside of the core ( Figure

237 2/23/18
Given the highly tissue-conserved binding of CTCF at sites that we predicted to serve as subTAD anchors in 238 liver ( Figure 2C), we sought direct experimental evidence for the presence of these loops in other tissues. We 239 examined CAC-anchored loops identified in mESC using ChIA-PET [5,32]. This comparison revealed a striking 240 68% overlap of our predicted liver subTAD loops with the CAC-anchored loops found in mESC ( Figure 2B, 241 Figure S4A). The majority of subTADs are also observed in newer cohesin HiChIP datasets (ChIP followed by

242
Hi-C) [50], and substantial overlap is also observed between high resolution Hi-C loop domains and chromatin 243 contact domains, which tend to be larger in size ( Figure S4B,C). Overall ~80% of subTADs are experimentally 244 observed in at least one other tissue. Further, the loops shared between liver and mESC are more robust and 245 showed stronger interactions than mESC-unique loops ( Figure S4D).

249
investigated whether subTAD loops demonstrate these dual insulating properties, canonically ascribed to TADs.

251
CTCF sites were divided into three groups, based on whether they were predicted to anchor TAD loops or 252 subTAD loops, or were not predicted to interact (non-anchor CTCF sites) based on our model. Figure 3A shows 253 the extent to which individual CTCF sites within each group show directional interactions towards the loop 254 interior based on Hi-C data, using a chi-squared metric derived from the same directionality index used to 255 predict TADs genome-wide. This inward bias index quantifies the strength of interaction from a 25 kb bin 256 immediately downstream from the CTCF motif, with a positive sign indicating a downstream (inward) bias with 257 regard to motif orientation, i.e., towards the center of a TAD/subTAD in the case of loop anchors. TAD and 258 subTAD anchors both show strong directional interactions compared to non-anchor CTCF sites, and TAD 259 anchors show stronger interactions than subTAD anchors. For the non-anchor CTCF sites, only those sites 260 containing a strong CTCF motif (FIMO score > 10) were considered, and the predicted directionality was 261 oriented relative to this motif (as there is no left versus right anchor distinction).

263
To visualize features of these anchors, aggregate Hi-C profiles spanning 1 Mb around each anchor were 264 generated for each group of CTCF sites ( Figure 3B). Each group was further subdivided into a left (upstream) 265 and a right (downstream) anchor based on its CTCF motif orientation. By aggregating many sites, we can 266 visualize the overall interaction properties of each group of sites at high resolution (5 kb bins), revealing features Page 10 2/23/18 that are much harder to discern at an individual CTCF site ( Figure 3B). TADs. This may also explain the lower inward bias scores of subTAD loops seen in Figure 3A. Depletion of 272 interactions that span across loop anchors (dark blue density above anchor points in Figure 3B) was seen for 273 both TAD and subTAD anchors, however, this local insulation was substantially greater for TAD anchors. This 274 may be due to the compounding impact of adjacent TAD loops, i.e., the end of one TAD is often close to the 275 beginning of an adjacent TAD loop anchor (median distance between TAD anchors = 33.5 kb, Figure S4E). GO 276 term analysis of genes whose TSS fall within these inter-TAD regions revealed enrichment for housekeeping 277 genes (ribosomal, nucleosome, and mitosis-related gene ontologies), whereas neighboring genes found just 278 within the adjacent TADs were enriched for distinct sets of GO terms ( Figure S4F; Table S3B, S3C). The nearby 279 but oppositely oriented TAD anchors flanking the inter-TAD regions likely contribute to the more bidirectional 280 interaction pattern for TADs seen in Figure 3B. For instance, the left TAD anchor plot in Figure 3B shows a well-281 defined pattern of interaction enrichment downstream from the anchor, but also a more diffuse enrichment

304
CNC sites, as did the CAC group. As a control, the distribution of IgG signal (input signal) showed much less 305 insulation overall and no significant differences between the various classes of CTCF/cohesin binding sites 306 ( Figure S4G).

307
308 Figure 3D and 3E show heat map representations of H3K27me3 ChIP-seq signal around the top 500 active 309 and top 500 inactive subTADs, based on a ranked list of JSD insulation scores. These represent subTADs that 310 have significantly lower (or higher) H3K27me3 signal in the loop interior based on the combined rank of JSD 311 insulation scores for each anchor, respectively (p < 0.05, two-sided t-test). For example, Figure 3E shows 312 subTADs with lower H3K27me3 signal within the loop than in neighboring regions (and thus the left anchor 313 transitions from high to low signal, while the right anchor transitions from low to high signal). No such pattern 314 was seen for the IgG (control) ChIP-seq signals for these same regions, indicating this is not an artifact of the 315 sequence read mappability of these regions ( Figure S4H).

317
Classifying open chromatin regions and defining super-enhancers in mouse liver. TADs, and possibly subTADs,

318
are structurally conserved across tissues, but the activity of enhancers and promoters contained within them is 319 highly tissue-specific [5,52]. Accordingly, it is important to understand the ability of TADs and subTADs to 320 insulate active enhancer interactions, i.e. enhancer-blocking activity. To address this issue, we examined the 321 distribution of CTCF and cohesin binding sites in relationship to promoters and enhancers across the genome, 322 as well as the impact of TADs and subTADs on their associated genes and enhancers.

343
A subset of intra-TAD CAC loops are well characterized as insulators of tissue-specific genes with highly 344 active enhancer clusters, termed super-enhancers [5,57]. To determine if some subTADs correspond to these 345 "super-enhancer domains" we first identified super-enhancers in mouse liver. We used 19 publicly available 346 mouse liver H3K27ac ChIP-seq datasets to score clusters of individual enhancers + weak enhancers DHS 347 identified above ( Figure 4C). Super-enhancers were identified separately in male and female liver, as well as in 348 male liver at various circadian time points, to take into account these three key sources of natural variation in 349 mouse liver [53,58]. In total, we identified 503 core super-enhancers, i.e., super-enhancers that show strong

354
Genes neighboring super-enhancers, both protein coding and lncRNA genes, are more highly expressed and 355 tissue-specific when compared to all genes or to all genes neighboring typical enhancers (KS test p-value < 356 0.0001; Figure 4D). Consistent with this, only 6.8% of genes proximal to liver super-enhancers are targets of Page 13 2/23/18 super-enhancers in mESCs or pro-B cells ( Figure 4E), whereas 55.2% of genes proximal to liver typical 358 enhancers are proximal to typical enhancers in the other two cell types. Super-enhancers showed much greater 359 accumulation of RNA polymerase 2, despite the lack of the promoter mark H3K4me3 ( Figure 4F) and are 360 transcribed to yield eRNAs ( Figure S6B). GO terms associated with genes targeted by either typical enhancers 361 or super-enhancers are enriched for liver functions (such as oxidoreductase activity), however, super-enhancer 362 target genes also show enrichment for transcription regulator activity and steroid hormone receptor activity 363 ( Figure 4G). These data support the model that super-enhancers drive high expression of select liver-specific 364 genes, including transcriptional regulator genes (Table S2).

375
To determine the impact of subTADs on the expression of genes with neighboring super-enhancers, we 376 considered two possible gene targets for each super-enhancer, with the requirement that the TSS of each gene 377 target be within 25 kb of one of the individual enhancers that constitute the overall super-enhancer: one gene 378 target is located within the subTAD, and the other gene target crosses a subTAD anchor ( Figure 5A, scheme at 379 top; genes inside (red) and genes outside (black) of the subTAD). We hypothesized that the true gene target of 380 the super-enhancer will be more highly expressed. Indeed, we found that genes neighboring super-enhancers 381 and found within the same subTAD are significantly more highly expressed than the alternative potential gene 382 targets, located outside of the subTAD ( Figure 5A). Similarly, when comparing the tissue specificity of genes 383 within TADs and subTADs to a random shuffled set of subTADs, we observed less variance in the Tau value

384
(index of tissue specificity) for genes within (sub)TAD loops compared to the shuffled set ( Figure 5B; KS test p-385 value < 0.0001). Thus, groups of genes within subTADs are more uniformly tissue specific (as in the case of 386 some super-enhancer-regulated genes) or tissue ubiquitous (as in the case of groups of housekeeping genes). 2/23/18 Examples are shown in Figure 5C and 5D, which illustrate the impact of a TAD loop on the expression of Cebpb 388 and the impact of a subTAD loop on the expression of Hnf4a. The TSS of two other nearby genes, Ptpn1 and 389 Ttpa1, are closer in linear distance to the adjacent super-enhancer than Cebpb and Hnf4a, respectively, 390 however, any super-enhancer-promoter interactions involving Ptpn1 and Ttpa1 would need to cross a TAD or 391 subTAD boundary. In both cases, the genes within the super-enhancer-containing (sub)TAD loop are expressed 392 at least 10-fold higher than genes outside the (sub)TAD loops. Therefore, based on the 3D-structure imposed by 393 these TADs and subTADs, one predicts that the super-enhancers are restricted from interacting with Ptpn1 or 394 Ttpa1, in agreement with their comparatively low expression levels.

396
Given the ability subTADs to insulate repressive histone marks ( Figure 3C-3E), we considered whether 397 subTAD loops enable high expression of genes within otherwise repressed genomic compartments. As seen in 398 Figure 1G, a minority of genes within inactive TADs are expressed (939 genes expressed at FPKM > 1). The

416
These interactions become more diffuse with increasing genomic distance from the viewpoint, which may Page 15 2/23/18 represent dynamic interactions, or alternatively, the effect of averaging across a heterogeneous cell population.

418
Comparing the combined interaction profiles between male and female livers, we observed highly reproducible 419 results, with 92.9% of male interactions also present in female liver ( Figure S7C).

421
Looking beyond local interactions, we observed interaction frequencies that span several orders of 422 magnitude going from local (intraTAD) to cis (intra-chromosomal) and trans (genome-wide) interactions. Thus,

423
for the Alb promoter, the TAD 4C signal per TAD was up to >1000 RPKM for local interactions, >100 RPKM for 424 cis interactions within 3 TADs of the viewpoint, and ~10 RPKM beyond that. Trans interactions were almost 425 exclusively < 10 RPKM ( Figure 6B, 6C). Using a separate background model for far-cis and trans 4C signals, we 426 categorized TADs as either high, medium, low, or non-interacting (see Methods). As Alb is the most highly 427 expressed gene in liver and is proximal to a strong super-enhancer, we expected that distal interacting regions 428 would also be active genomic regions, as proposed in the transcription factory model of nuclear 429 compartmentalization [59]. Indeed, we found that >80% of the distal high interacting TADs (both far-cis and 430 trans) were active, and >90% were either active or weakly active. In contrast, only 16.6% of the non-interacting

431
TADs in cis and 25.9% of those in trans were considered active ( Figure 6D). Furthermore, genes in the 432 interacting regions are more highly expressed than genes in Alb 4C non-interacting regions, and the vast 433 majority are found in active TADs ( Figure 1E), as determined by analysis of the Hi-C data alone ( Figure S7D,E).

436
We present a computational method that uses 2D (ChIP-seq) and 1D (DNA sequence) information to predict 437 3D-looped intra-TAD structures anchored by cohesin and CTCF (CAC sites), and we provide evidence that the 438 subTAD loops predicted underpin a general mechanism to constrain the interactions of distal enhancers to 439 specific gene targets. While select instances of CAC-mediated loop insulation within TADs have been described

440
[5, 60, 61], our work establishes that this phenomenon is a more general feature of genomic organization and 441 regulation than previously appreciated. The subTADs described here are nested, CAC-mediated loops whose 442 formation may be a result of extrusion complex pausing within larger domains (i.e., TADs). Their ultimate impact 443 is likely to constrain the available promoter contacts for a given distal enhancer, or vice versa [35]. While the 444 precise mechanism that differentiates TAD-and subTAD-forming CTCF sites (anchors) from other CTCF 445 binding sites (non-anchors) in the genome is unknown, we provide evidence that loop-forming CTCF sites at 446 TAD and subTAD anchors, but not other CTCF site, are highly insular. This difference in insulation is apparent 2/23/18 from the blockage of repressive histone mark spread and by the inhibition of chromatin contacts across subTAD form as a result. We have used the term subTAD to describe these loops to highlight their intra-TAD range,

463
although they are functionally similar to loop domains, isolated cliques, and insulated neighborhoods, which tend 464 to overlap or be contained within TADs [12,13,35]. It should be noted that our approach cannot predict CTCF-465 independent loops such as those mediated cohesin alone (enhancer-promoter loops), although such loops are 466 likely constrained by CAC driven subTADs, as was highlighted by our Albumin 4C-seq results.

468
The method for subTAD identification described here builds on the preference for inward-facing CTCF motifs 469 evident from high resolution Hi-C data [12,13], and will be highly useful for the identification of intra-TAD CAC 470 loops for the large number of cell lines and tissues that lack high resolution Hi-C data. In these cases, intra-TAD 471 loop domains cannot be identified because there is not sufficient local enrichment to calculate a corner score 472 with the arrowhead algorithm [12]. Further, while we used liver Hi-C data to filter out longer CAC loops, the 473 general conservation of TADs across both tissues and species [1, 3] broadens the applicability of our method to 474 other organisms with well-annotated genomes, even when Hi-C data is not available and TAD boundaries have 475 not been determined. Thus, even in the absence of TAD coordinates, our method identifies TAD and subTAD 476 looping events, which provides an invaluable first approximation for understudied organisms. As we have tuned Page 17 2/23/18 our parameters to identify loop structures comparable in size and number to those found previously in mouse 478 models, the exact cutoff values may need to be adjusted for other model organisms.

480
The identification of subTADs at high resolution from Hi-C data requires extremely deep sequencing of Hi-C 481 libraries, something that has only been achieved in a small number of cell lines [12], and is costly, both in terms 482 of computational and experimental laboratory resources. Alternative strategies have been proposed to reduce 483 the need for extreme deep sequencing to identify interactions at high resolution [36,63]. While these 484 approaches are beginning to make 5-10 kb resolution possible, the sequencing depth and cost will likely remain 485 out of reach for many labs. The computational method described here considers both CTCF and cohesin peak 486 strength as the primary predictor of subTAD loop strength, which is a reasonably good predictor of interactions 487 [13,31]. Antibody enrichment for select genomic regions followed by chromosome conformation capture, as 488 implemented in ChIA-PET, is a common experimental alternative, and can identify "many to many" interactions,

499
We found that TAD and subTAD loop anchors together comprise 27% of all liver CTCF binding sites, previously appreciated, although the presence of distinct active and inactive genomic compartments is 508 consistently seen across most cells [40,68]. Truly high-resolution elucidation of single cell intra-TAD structures 509 may not be possible due to the intrinsic limitation of two potential ligation events per fragment in any given cell.

511
Regarding the genomic distributions of cohesin and CTCF binding sites, we found that CNC sites are 512 primarily present at enhancers and weak enhancers, while CAC sites are found at insulators and also at 513 promoters, which we defined as DNase hypersensitive sites (DHS) with high a histone-H3 K4me3/K4me1 ratio.

514
Others find that promoters, when defined as the set of all TSS upstream regions, including those not at a DHS, such sites lack insulating activity and also lack strong directional interactions. Since the binding of CTCF is 522 always intrinsically directional, due to its non-palindromic motif, the absence of directional interactions from 523 CTCF-non-cohesin sites suggests that the directionality of interactions with CTCF sites at TAD and subTAD 524 loop anchors is conferred by additional factors, such as cohesin as a part of the extrusion complex [11,13].

526
Top2b is another factor that is proposed to be a key contributor to loop extrusion complexes [18]. However,

527
our findings indicate that the interactions of Top2b involve binding to cohesin, and not CTCF, as suggested by 528 the high frequency of CNC sites bound by Top2b vs. very low frequency of binding to CTCF-non-cohesin sites 529 ( Figure S3C). Furthermore, binding by Top2b does not distinguish TAD from subTAD loop anchors. Indeed, by 530 all metrics tested, we found no TF or motif that differentiates TAD from subTAD loop anchors, although the 531 existence of some unknown differentiating factor cannot be ruled out. Cohesin has the ability to stabilize large 532 protein complexes, as indicated by its role in stabilizing clusters comprised of up to 10 distinct TFs at enhancers 533 [17], and could thus facilitate the binding of additional unknown proteins to the loop extrusion complex.

535
Cohesin is continuously recycled throughout the genome by loading and release factors [47], and so it is 536 unclear how insulator activity is effectively maintained at TAD and subTAD anchors in such a dynamic 2/23/18 environment. We found that CNC sites, which are primarily at enhancers, consistently show the least insulation here to improve gene target assignments for distal regulatory elements, based on the insulating capacity of 563 these looped domains. Such an approach will be beneficial for the many model systems where distal enhancer 564 activity is the clear driver of tissue specificity or a given disease state [35]. The ability to identify subTAD loops 565 based solely on CTCF motif orientation and CTCF and cohesin ChIP-seq binding data, and then use these Page 20 2/23/18 loops to improve gene target assignments for distal regulatory elements is likely to constitute a substantial 567 improvement over "nearest gene" or other more nuanced target assignment algorithms, such as GREAT [71].

569
In conclusion, our studies reveal that while TAD structures are readily apparent in routine Hi-C experiments, 570 and thus are better characterized than internal TAD sub-loops, their structural organization and functional 571 impact on the genome is not unique. Thus, the 9,543 subTAD loops that we identified have strong intra-TAD 572 cohesin-and-CTCF-bound anchor regions. Structurally, these subTADs appear to be formed by the same loop 573 extrusion mechanism that is responsible for TAD formation. Functionally, we hypothesize that these subTADs 574 contribute to nuclear architecture as intra-TAD scaffolds that further constrain enhancer-promoter interactions.  any results performed using the core JASPAR motif (MA0139.1). These motifs were explicitly used in Figure   656 S3E, where no difference between subTAD and TAD anchor motif usage was observed.
657 658 4C-seq protocol. Four male and four female mouse livers were processed for Albumin-anchored 4C-seq 659 analysis using published protocols, with some changes for primary tissue [72]. To adapt the protocol for liver,

660
care was taken to rapidly isolate single cell or nuclei suspensions prior to crosslinking. Specifically, two 661 approaches to crosslinking were taken and both gave similar results. One male and one female mouse liver 662 sample was processed through the crosslinking step as described for the ChIP protocol, above, prior to

757
First, CAC sites were identified from ChIP-seq data as described above. Next, each chromosome was scanned  (Table S1B) 796 are those obtained from the merged sample. mouse tissues (including liver), we used Tau, which was shown to be the most robust in a recent study [46].

802
Testis was excluded from this analysis because a large proportion of testis-expressed genes are highly tissue 803 specific. For each tissue, the maximum FPKM per gene between the two replicates was used. These FPKM 804 values were log transformed and a Tau value, ranging from 0 to 1, was calculated, where 1 represents high 2/23/18 tissue specificity: τ = ∑ni = 1(1−yi)n−1, where yi = ximax1≤i≤n(xi), n is the number of tissues, and xi is the   Table S4. H3K9me3, H3K27me3, H2AK5ac, and H3K36me3 marks were processed from the raw 836 sequencing data (fastq files) through the standard ChIP-seq pipeline, above. H3K9me3 and H3K27me3 mark 837 data were expressed as log2(ChIP/IgG signal). Lamina-associated domain coordinates and GRO-seq data were 838 downloaded as pre-processed data. Heat maps were generated using Deeptools reference point, with a bin size 839 of 10 kb. TAD boundaries were grouped according to k-means clustering (k=4) using signal within a 1 Mb 840 window from 3 datasets: H3K9me3, H2AK5ac, and the eigenvalue of the Hi-C PCA analysis (above). Based on 841 these clusters, TADs were classified as active, weak active, weak inactive, or inactive, as follows. If both the 842 start and end boundary of a given TAD were classified as active, then the TAD was designated active.

843
Specifically, a TAD was considered "active" if the boundary at the start of a TAD fell into clusters 1 or 2 (as 844 marked in Figure 1E) and the boundary at the end of the same TAD fell into clusters 1 or 3. The corresponding 845 metric was applied to identify inactive TADs. If the two ends of a given TAD were not in agreement, then the 846 TAD was designated weakly active if the median Hi-C PC1 eigenvalue within the TAD was positive, or weakly 847 inactive if the median Hi-C PC1 eigenvalue was negative. Gene expression and tissue specificity metrics 848 represent expression or Tau values of genes whose TSS overlap active or inactive TADs.

850
Additional Hi-C analysis. Contact profiles around TAD, subTAD, and non-loop-anchor CTCF sites were 851 generated using Homer (v4.9) using the command analyzeHiC and the options "-size 500000 -hist 5000" to

864
Page 30 2/23/18 TAD anchor identification. TAD anchors were predicted using a similar approach. The merged list of CTCF 866 peaks was filtered to only consider peaks that were found across all four biological replicates, that contained 867 CTCF motifs, and that were within 50 kb of a TAD boundary, as defined previously for mouse liver [3]. This 50 868 kb distance was chosen based on the ambiguity of binned Hi-C data to more accurately determine the precise 869 TAD boundary. Then, for each TAD boundary, all pairs of (+) and (-) CTCF peaks were considered and scored 870 based on their combined distance to the called TAD boundary. Pairs of "+/-" peaks that were comprised of a (+) 871 anchor upstream of a (-) anchor (i.e., CTCF peak pairs that were not divergently oriented) were considered an 872 invalid combination to define the end of one TAD and the beginning of the next TAD, and were not considered.

873
The valid pairs with the shortest combined distance to the previously defined liver TAD boundary [3] were 874 retained and all others were removed. If no valid pair for a TAD boundary was identified, the single CTCF peak 875 closest to the TSD boundary was retained as the TAD anchor. A complete listing of TAD anchors can be found 876 in Table S1, and a listing of inter-TAD regions and associated gene ontology analysis is presented in Table S3.

878
Alternative loop anchor analysis. We sought to compare the relative insulation of loops identified by our 879 computational approach to alternative loops identified using the original core algorithm [31]. This provides an 880 objective measure to compare the performance of each implementation in identifying TAD-like loops and loop 881 anchors within TADs. To this end, we used the complete CTCF peak list from the merged sample as input and

882
implemented the loop prediction algorithm exactly as described previously (60% proportional peak cutoff, CTCF 883 signal + motif scores as above, retaining only loops <200 kb) [31]. As summarized in Figure S2B, this analysis 884 yielded many more loops (60,678), which were shorter (median size of 61 kb) and showed less overlap with 885 cohesin-mediated loops in the mESC ChIA-PET dataset (25.5% versus 68.3% for subTADs). 59% of the 886 subTADs characterized in our study are found in this larger loop list (termed "60k loops"). To characterize loops 887 unique to the 60k loop set, we had to first filter out anchors found subTAD or TAD loops (to insure that each list 888 was mutually exclusive, as above). Any 60k loop anchor within 50 kb of a TAD boundary was excluded from 889 downstream analysis. Similarly, we also excluded any 60k loop with at least one anchor that coincided with a 890 subTAD anchor. These mutually exclusive lists of subTAD anchors and the filtered set of 60k loop anchors 891 (25,983 loop anchors in total; representing a subset of "Non Anchor CTCF" in the main text) were then 892 compared based on insulation of repressive histone marks (see "Repressive histone mark insulation", below)

893
and Hi-C interaction profiles (see "Additional Hi-C analysis", above). Figure S2B,C presents these insular Page 31 2/23/18 features of subTAD anchors compared to alternative loop anchors that are not subTAD or possible TAD 895 anchors.

897
Anchor/Loop overlap. CTCF ChIP-seq data for 15 non-liver tissues from the ENCODE Project were 898 downloaded (https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=mm9&g=wgEncodeLicrTfbs) and intersected with 899 replicates to form a single peak list for each tissue [42]. These single peak lists per tissue were then compared 900 to liver CTCF peaks using the Bedtools multiinter command with the -cluster option to generate a union CTCF 901 peak list for all tissues with a score representing the number of tissues in which a peak is present. "Lone" CTCF 902 (CTCF sites lacking cohesin bound), other/non-anchor CAC sites, TAD anchors, and subTAD anchors were 903 compared to this list to generate the histograms in Figure 2C (See also,

911
Comparisons to mESC Smc1a ChIA-PET and Smc1a Hi-ChIP datasets [5,50] were based on merged 912 replicates, and reciprocal overlaps with subTADs were required (Bedtools intersect -wa -u -r -f 0.8 -a 913 subTAD.bed -b mESC.bed). mESC interactions were filtered to only include anchor regions that also contain a 914 CTCF peak, to exclude CNC-mediated enhancer-promoter interactions.  tissue. If a tissue had only two replicates (as was the case for 12 of the 20 non-liver tissues), we required that 955 the DHS be present in both replicates. If a tissue had 3 or 4 replicates, then the DHS were required to be 956 present in all or all but one replicate (this was the case for 7 of the 20 non-liver tissues). For whole brain tissue, 957 the merged peak list required that a DHS was present in at least 5 of the 7 replicates. These regions were 958 compared to each other using the Bedtools multiinter command with the -cluster option to generate a union 959 DHS peak list for all tissues, where the score column represents the number of non-liver tissues in which a 960 given region was found. For all liver DHS assigned to one of the above five DHS classes (Table S2A) (Table S2A) was used as the region input list. A set of 19 publicly available H3K27ac

973
ChIP-seq datasets from mouse liver was used as signal input (see Table S4 for sample information). This set 974 includes male, female, and circadian time course data [53]. A strict intersection of super-enhancers identified 975 across all 19 samples was used to define a set of 503 "core" super-enhancers in mouse liver using the Bedtools 976 multiintersect command, as shown in Figure S5E. Any enhancer cluster (i.e., constituent enhancers within 12.5 977 kb, as above) not identified as a super-enhancer in any sample was termed a typical enhancer.

979
Gene targets for enhancers were assigned as the nearest gene up to a maximum distance cutoff of 25 kb. Gene 980 expression values and tissue specificity were defined as described above. Aggregate plots were generated 981 using Deeptools (version 2.3.3). In Figure 4F, the scale-regions option of Deeptools was used to scale super-982 enhancers and typical enhancers to their median sizes of 44 kb and 1 kb, respectively. Figure S5F used the

CTCF Orientation
A.

TAD Boundary
Inactive   available ChIP-seq data tested, these data are representative of the vast majority of the >50 publically available ChIP peak lists for liver-expressed TFs. Notable exceptions related to promoter-associated features, marks, and factors are shown in Figure S1B,D. This enrichment does not appear to be an artifact of CTCF overlap ( Figure S1E).

E. A subset of TAD boundaries represent transitions from active to inactive chromatin compartments.
Shown is the distribution of activating and repressive marks or features for a 1 Mb window around each TAD boundary. TAD clusters, numbered at the left, are defined using k means clustering (k=4).
The boundaries between TADs transition from active to inactive genomic compartments (or vice versa) for TAD clusters 2 and 3. For downstream analyses, a TAD was considered active if the boundary at the start of a TAD fell into clusters 1 or 2 (the first two clusters from the top) and the boundary at the end of the same TAD fell into clusters 1 or 3 (see Methods). See Table S1A for a full listing of the 3,538 TADs analyzed and their active/inactive status.
F. UCSC browser screenshot for a transitional TAD boundary on chromosome 13 from TAD cluster 3 in Figure 1E. Arrows at bottom indicate CTCF motif orientation.
G. Genes found in active compartment TADs are more highly expressed, with the majority of genes showing >1 FPKM, than genes found in inactive TAD compartments. Box plots are based on 12,258 genes in 1,930 active TADs and 4,643 genes in 1,000 inactive TADs. 939 genes in 473 of the inactive TADs are expressed at > 1 FPKM (see Table S1E). Genes in weakly active and weakly inactive TADs were not included in these analyses.
H. Genes whose TSS are located in inactive TADs (B compartments) are more tissue specific in their expression pattern than genes found in active TADs (A compartments). The top GO category for expressed genes in the A compartment is RNA binding, while the top category for expressed genes in the B compartment is monooxygenase activity (not shown). C. TAD and predicted subTAD loop anchors are more tissue ubiquitous than other categories of CTCF/CAC sites. Each of the four CTCF site subgroups was defined in mouse liver as detailed in Table S1C. The x-axis indicates the number of ENCODE tissues that also have CTCF bound, where a higher value indicates more tissue-ubiquitous CTCF binding. These data are shown for "lone" CTCF binding sites (10,553), non-anchor cohesin-and-CTCF sites ("Other CAC"; 26,970), TAD anchors  Table S1C, and for the set of CNC sites as a control (Table S1D).

! "#"$%&
C. TAD and subTAD anchors show greater insulation of the repressive histone marks as measured by  D. Genes associated with super-enhancers (SE) are more highly expressed than those associated with typical enhancers (TE), for both protein coding genes and multi-exonic lncRNA genes. Gene expression is presented as log2(FPKM+1) values. Also, these super-enhancer-regulated genes are more tissue specific (higher Tau score) than typical enhancer-regulated genes (KS t-test, p < 0.0001).

Signal at SE/TE Comparison
p-values shown are for pairwise comparisons of SE-regulated genes vs. TE-regulated genes (KS ttest).
E. Venn diagrams show the substantial overlap between typical enhancer gene targets across tissues (liver, ESCs, ProB cells), but limited overlap between super-enhancer gene targets (all within 10 kb) for the same tissues. The numbers represent the percent of genes targeted in a given tissue by the indicated class of enhancer (typical enhancers or super-enhancers) that are not targets of the corresponding class of enhancers in the other two tissues. For example, 93.2% of genes targeted by liver super-enhancers are not targeted by the set of super-enhancers identified in either Pro-B or mouse ESCs. Gene targets of each enhancer class were identified by GREAT using default parameters.
F. ChIP and DNase-seq signal at typical enhancers and super-enhancers, scaled to their median length (1 kb and 44 kb respectively; indicated by hash marks) with ±10 kb. In addition to a 1.5-2.5-fold higher signal for general enhancer marks, super-enhancers show greater accumulation of RNA polymerase 2, despite no apparent enrichment for the promoter mark H3K4me3.
G. Super-enhancers (SE) generally regulate distinct categories of genes than typical enhancers (TE) in mouse liver. While GO terms such as oxidoreductase activity are regulated in common by both classes of enhancers, only super-enhancers are enriched for transcription-regulated terms such as Regulation of Transcription and Steroid hormone receptor activity. Numbers represent the overlap of GO terms (either Molecular Function or Biological Process) in any DAVID annotation cluster (with an enrichment score >1.3) for genes regulated by either typical enhancers or super-enhancers.   A. For those subTADs that include a super-enhancer, two possible gene targets were assigned: one gene within the subTAD and another crossing the subTAD anchor but within 25 kb of the TSS. Box plots show that gene targets within a subTAD are significantly more highly expressed than the alternative, linearly more proximal, gene target.

Expression of Genes in Screenshot
B. Genes within subTADs tend to be more uniformly tissue-specific or tissue-ubiquitous when compared to all genes within TADs or when compared to a shuffled set of random regions matched in size to subTADs. Shown is the standard deviation in Tau values (tissue-specificity index) of genes whose TSS's are within TADs or subTADs containing at least three TSS. This indicates that sets of three or more genes within subTADs are consistently either more or less tissue specific than random clusters of genes within the same sized genomic spans.
C-D. TAD and subTAD loops insulate a subset of super-enhancers (black horizontal bars) with key liver genes, allowing high expression of genes such as the TFs Cebpb and Hnf4a, relative to their immediate neighbors. Cebpb is an example at the TAD scale, while Hnf4a shows a subTAD loop. In both cases, the most linearly proximal gene is outside the TAD or subTAD and is expressed at a lower level than the loop-internal genes (and presumptive gene target).
E. Insulated subTAD loops allow for expression of select gene targets within otherwise repressed genomic compartments. Shown is a UCSC genome browser screenshot of a transition from an active to a repressed TAD, with the expression of genes within the region shown in a bar graph, below. The obesity-related gene Scd1 is insulated in a subTAD loop and is the only expressed gene in its TAD (FPKM >100). H3K27me3 marks are shown both as a reads per million signal track (below) and as a signal over an IgG input control (above), expressed as log 2 (H3K27me3 signal / Input signal).

exemplifies subTAD insulation and super-enhancer interaction
A. The Alb promoter makes multiple directional contacts with the adjacent super-enhancer region in both male (M) and female (F) mouse liver, as determined by 4C with a viewpoint at the Alb promoter. All reproducible interactions occur within the TAD loop containing the Alb TSS and its super-enhancer (red), and all but two contacts in male liver occur within the predicted subTAD loop (pink). 4C-seq interaction scores are shown as -log10(pval) across replicates as calculated by R3C-seq (see Methods). Also see Figure S7.
B. The 4C interaction signal within the Alb TAD is orders of magnitude above the background signal and generally decays with distance. Far-cis and trans interactions are represented on a per TAD basis, expressed as RPKM per TAD, to control for sequencing depth and TAD length. The overall background within chromosome 5 is significantly higher than all trans chromosomes; immediately adjacent TADs also show higher 4C signal than the overall cis background. The 4C signal decayed to background levels within approximately 3 TADs of the Alb viewpoint TAD. Each data point represents a single TAD and each color represents a 4C-seq replicate. C. Background model used for distal cis interactions, showing a rapid decay in per TAD signal intensity.
Each data point represents a single TAD along chromosome 5.
D. Distal cis and trans TADs that highly interact with the Alb promoter tend to be active TADs, while a majority of TADs that interact less than the background model are predicted to be inactive. Along the cis chromosome, a simple inverse logarithmic decay of signal per TAD was used to determine the background, while the 4Cker package was used to determine high, medium, low, and non-interacting TADs in trans based on a hidden markov model with adaptive windows better suited for low signal regions.  Table Outline-Table S1: TAD coordinates with A/B compartment assignment, subTAD coordinates with loop score, CTCF peak list with CAC classification, CNC peak list, 939 liver-expressed genes in inactive TADs   A.

C.
In "plus" oriented motif In "minus" oriented motif Figure S1 D.