Ultrafine mapping of chromosome conformation at hundred basepair resolution reveals regulatory genome architecture

Significance In this research, we developed Tri-4C and Tri-HiC, innovative approaches that significantly improve the resolution of chromosome conformation mapping. These methods effectively overcome the limitations of traditional chromatin conformation capture techniques, enabling robust detection of intricate chromatin structures formed at cis-regulatory elements (CREs) in sub-TAD (topologically associating domain) levels. By unveiling the fine-gauge architecture of the regulatory genome with extensive detail, this work offers an insight for how distal CREs physically communicate and drive long-range gene regulation.

To identify significant and reproducible peak regions, bins were scored with the -log10(-log10(p)) value, and those with a score > 0 (p < 0.1) in all replicates were collected and analyzed by IDR package 19 (Github https://github.com/nboley/idr)using the following settings --initial-mu 1.5 --initial sigma 0.3 --initial-rho 0.8 Bins with IDR < 0.05 (score ≥ 540) were considered significant and merged.We defined a minimum length of 300 bp for calling significant distal loop peaks.
The UMI-4C 11 and 1D adaptation of in situ Hi-C 3 loop calling algorithms were used for comparison.For both algorithms, distance-dependent decay of interaction frequency was calculated at genome-wide level using IMR90 in situ Hi-C data at 5 kb resolution.The decay function was further smoothed to 500 bp bins (W) in 100 bp step size resolution by using linear interpolation to obtain the F(d) (UMI-4C) or E* (Hi-C) suitable for high resolution analysis in Tri-4C.For the UMI-4C algorithm, the background of each Tri-4C profile was obtained by re-allocating the total intra-TAD Tri-4C read counts (N) to each bin according to F(d).The enrichment p value of a bin with expected read count of E and actual count of E1 was calculated by fitting to binomial distribution: Pr(B(N, E/N) > E1).For the 1D in situ Hi-C algorithm, the adjusted expected read count for each bin E d* i can be calculated by the filter formula Where p and w were set at 2 and 100 to be corresponding with 500 bp peak and 20 kb background size.This expected count was compared with actual read count Mi using the Poisson statistics.The p values obtained from UMI-4C and Hi-C algorithms were directly compared with Tri-4C raw p values by the CRE ROC analysis.
To investigate for potential artifacts during Tri-4C loop calling due to mapping bias, the GC content and restriction site density under triple enzyme digestion for the locus analyzed by Tri-4C were obtained by directly analyzing the hg19 genome sequence of the region.Mappability of the region was obtained from the ENCODE mappability track available on the USCS genome browser.The average GC content, restriction site density, and mappability for the 10 kb neighboring regions for all loop sites called by Tri-4C were calculated at 100 bp resolution.To generate the background for comparison, a set of 10,000 equal size genomic intervals were randomly selected in the locus.The mean for each set was calculated after removing intervals whose center fell within Tri-4C loops or repeat regions.

Data Reproducibility
Each Tri-4C and single RE UMI-4C was performed in two technical replicates.Reproducibility of intrachromosomal interaction was measured by Pearson's correlation r using the Python numpy.corrfunction after binning the contacts in the size described in Fig S1c.

Analysis of Interaction Frequency and Loop Strength
For frequency-based analysis, the read count for each bin was converted to interaction frequency by normalizing against (1) total intrachromosomal interactions, which we referred to as normalized interaction or (2) total intrachromosomal interactions of a reference viewpoint in a multiplexed run.We refer to this as relative interaction frequency.The purpose of the latter method was to control for the significant change in total read count generated between different experimental conditions, as in the IFNB1 viewpoint at the baseline control compared to the IFNB1 induced condition (Fig S10c,e).Thus, the reference point was selected based on the standard of exhibiting the least variation in unit read count yield among different conditions, and in our experiment the Boundary viewpoint was chosen for that reason.Subtraction analysis was performed by directly calculating the interaction frequency difference between two tracks in comparison after normalization.A multiplication factor of 10,000 was applied to the normalized interaction frequency to simplify the display when presenting the interaction map on UCSC/WashU track.
For loop-based analysis, loop strength, i.e. log fold enrichment (logFE) was calculated by logFE = log10(M/N), where M and N are the actual and expected read count for each sliding window as indicated in the Peak Calling section.This step can be performed with MACS2 using -m logFE.A 1.0 pseudo count was given to calculate logFE to resolve zero division as well as attenuating noise level at regions with sparse mapped counts.Differential loop strength was calculated by measuring the logFE difference between two tracks.For presentation in Fig S10e , the logFE was weighed by frequency at basal condition.

Allele-specific Loop Calling
We used a likelihood ratio test, based on the loop strength, to determine whether an interaction loci displayed allelic bias.Specifically, for each 100 bp sliding window we obtained four vectors: Mref, Malt, Nref, and Nalt, where ref and alt denoted the allele genotype, and M and N denoted the actual and expected read count, as indicated above.Each element in the vector represented one replicate.Firstly, the elements in Nref and Nalt were normalized to their respective mean, and the scalars were used to normalize their corresponding M.Then, we had H1: M ̂ref ~ Pois(N ̂ref); M ̂alt ~ Pois(N ̂alt), and H0: M ̂ref ~ Pois(N0); M ̂alt ~ Pois(N0), where N0 denotes the mean of N ̂ref and N ̂alt.The likelihood ratio L was converted to p value by applying the Wilks' theorem, namely -2ln(L) ~ χ 2 (1), subjected to subsequent Bonferroni correction where n equal to total number of assayed intervals in the locus.

Analysis of Cis-regulatory Element Interaction Network
To annotate CREs and active enhancers, respectively, DNase and H3K27Ac peak position and intensity for IMR90 cells were obtained from the Roadmap Project web portal (https://egg2.wustl.edu/roadmap/web_portal/). For the comparison of loop strength and interaction frequency between viewpoints with DNase peak intensity, linear regression models were built in Python using the scipy.stats.linregressfunction.
Receiver Operating Characteristic (ROC) Analysis CRE positions, defined by Roadmap DNase peaks, were converted to 100 bp 1/0 tracks, with 1 indicating the presence of peaks inside the bin.The ROC curves were built by using loop scores (-log(p)) obtained from Tri-4C and UMI-4C as predictors for the peak positions in the converted DNase track, using Python sklearn.metrics:roc_curve and auc function with default settings.

Motif Analysis
The DNA sequences for all Tri-4C peak regions in both baseline control and IFNB1-induced conditions were extracted.Motif prediction was performed using TFBSTools (R platform) with accession to the JASPAR2018 database 39,40 .A minimum score of 90% was set to discover matched motifs.The Lasso CV model (CV=10, iter=10,000) was applied to all motifs to identify factors correlated with ΔlogFE of Tri-4C peaks during induction.Significance of correlation was determined by F statistic and subject to Bonferroni correction.

Hi-C and Topologically-Associated Domain (TAD) Definition
The IFNB1 TAD (chr9:19480000-2120000) was defined by the in situ Hi-C data of IMR-90 7 .Hi-C Browser (http://promoter.bx.psu.edu/hi-c/index.html)was used to visualize the Hi-C interactions in the TAD.

Contact Map Alignment
The raw Fastq for Tri-HiC was aligned to hg19 genome and processed to obtain the interaction contact matrix in .mcoolformat by applying the Distiller pipeline 28 with default configurations at resolutions of 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 250000, 500000, and 1000000 bp.The pairsam intermediate output from Distiller was processed by Juicebox Pre 27 to generate contact maps in .hicformat with the same resolution setup.We kept both the alignments with filtering of MAPQ > 0 and > 30.For all analyses downstream we chose MAPQ > 30 by default, although in regions reported from this study, we did not observe distinguishable differences between the two parameters.For this study, we used HiGlass 41 to visualize the contact matrices in .mcoolformat.

Reproducibility Test
Reproducibility of Tri-HiC at 500 bp to 20 kb resolution range (Fig S13a) was evaluated by the Pearson correlation coefficient between replicates calculated by the HicCorrelate function in the HiCExploer 42 package.Due to the sparsity of distal contacts at extreme long range which results in mostly empty bins for the high resolutions in our analysis, we restrained the test to 8 Mb intrachromosomal range, which is in line with the distance limit for loop calling (see sections below).

Decay Function Analysis
Valid intra-chromosomal contact pairs were binned in 100 bp intervals, and the contact frequency for each bin was calculated by dividing to the total read count.We sorted contact pairs by orientations noted as "in-in (+/-)," "in-out (+/+)," "out-in (-/-)," and "out-out (-/+)".Smoothed decay curves for each orientation were generated by using the CubicSpline function in the scipy.interpolatepackage from Python.

Interaction Hotspot Analysis
According to the decay function analysis (Fig S13b), the majority of contacts from Tri-HiC were short range (< 1 kb) in-in self-ligation products, which can be considered as a roughly even background over the genome.Taking the advantage of this property, we defined the interaction hotspots as regions having a significantly higher fraction of long range contacts, with the short range and long range being respectively defined as smaller and greater than 1.5 kb.To identify these hotspots, bam alignments obtained from the Distiller pipeline were split to short-and longrange interaction tracks based on the interaction distances recorded in the .pairsfile.Reads were piled up and converted to coverage signals in .bedGraphformat, using a binning window of 500 bp in 100 bp sliding step.For each bin, the expected long range read count was calculated by multiplying the short-range count of the bin by the average long-range to short-range ratio over the 20 kb neighboring background..The statistical significance for the difference between expected and actual long range contacts was determined by Poisson statistics using the MACS2 bdgcmp -m ppois function 18 .Significant bins with p values smaller than the Bonferroni threshold (3E-7) were merged, and intervals with at least 300 bp span were considered interaction hotspots.The -log10(p) score from the analysis corresponded to the distal interactivity track displayed in Fig 4a , and the average bin score for each peak was reported as its interactivity score for Fig 4c.
The HiCCUPS reported both observed read count in the loop regions as well as the expected values from the donut background (expectedDonut), and their ratio defined the signal-tobackground fold change of the loop, which we referred to as the loop strengths.Similarly, horizontal and vertical stripe strengths were calculated by dividing the expectedH and expectedV columns, respectively, to the expectedDonut column.The residual loop strength used for motif analysis was defined by the subtraction of log2 horizontal and vertical stripe strengths from log2 loop strength.
The intensities of significant loops identified by HiCCUPS were assessed by aggregate peak analysis (APA) also available from Juicebox with the following parameters: -n 500 -w 125 -r 200 -q 80 -k KR.APA scores from the analysis were directly reported in For the pile-up study at the vicinity of interaction hotspots, the APA was applied with the following parameters: -n 0 -w 250 -r 100 -q 320 -k KR.A distance-based normalization was applied to the obtained matrix by dividing each diagonal to the corresponding diagonal mean from a larger APA obtained with the parameters -n 0 -w 2500 -r 100 -k KR.The resulting matrix was Z-score transformed and shown in Fig S18a .Motif Analysis To annotate the interaction hotspots and loop anchors identified by Tri-HiC, we used the AME 43 package in the MEME suite to predict motifs in the intervals, with the accession to the JASPAR2018 and TFBSshape databases and default threshold parameters.

Insulation Analysis
Genome-wide insulation scores were calculated at 200 bp resolution by using the diamondinsulation function in the cooltools 44

Fig 3d .
The intensity matrices obtained from the analysis were transferred to Z-scores and presented in Fig 3d.The raw APA matrices were also used to evaluate the relative intensities of corners, stripes, and loops for the insulation analysis in Fig S18d by calculating the log 2 mean intensity of each interval indicated in the figure and normalizing against the left-bottom corner.
package with the default parameters.In Fig S18b, we piled up the scores within the 20 kb region of all interaction hotspots categorized by their annotations.Insulation strengths for each hotspot in Fig S18c represented the range (maximum-minimum) of the insulation score of the interval.

Figure S1
Figure S1 Tri-4C improves both yield and reproducibility by finer digestion of the genome.(a) Distribution of DNA fragment size of human genome digested by indicated restriction enzymes.Numbers on top indicate median and in the parentheses indicate percentage of fragments smaller than 500 bp window size.(b) Yield of unique intrachromosomal reads for UMI-4C and Tri-4C on the three viewpoints.(c) Reproducibility of interaction profiles binned in 50 kb-50 bp.

Figure S2
Figure S2Overview of Tri-4C experimental design at the chromosome 9p21 IFNB1 TAD.Tri-4C profiles of three viewpoints (Boundary, MLLT3, and IFNB1) are displayed under IMR90 in situ Hi-C matrix (5kb resolution Rao 2014) obtained from 3D genome browser (Hi-C loops are highlighted with squares).The Y axis of Tri-4C tracks denotes interaction frequency multiplied by 10,000.The interaction profiles are aligned with regulatory marks (DNase, H3K4me1) and boundary markers (CTCF, RAD21) for IMR90 cells obtained by the Roadmap Project.Bottom panel shows significant loop interactions between the viewpoints and CREs.

Figure S3
Figure S3Tri-4C loop annotation and quality control.(a) Overlap between intra-TAD Tri-4C loops and intra-TAD DHS and H3K27ac peaks.(b) Overlap of DHS-marked CRLs among the three viewpoints.(c) GC content, (d) mappability, and (e) restriction site density around regions looped with any of the three viewpoints.Gray background indicates confidence intervals estimated by using 1,000 randomly selected intra-TAD regions not looped with any viewpoints with mappability > 0.5.

Figure S4
Figure S4 Comparison of Tri-4C profiles analyzed in 500 bp (High res) and 3000 bp (Low res) resolution.(a) Overlap of loops falling in CREs.(b) Interaction of MLLT3 with neighboring CREs shown by Tri-4C in two resolutions.DHS peaks showing looping at 500 bp resolution are highlighted.

Figure S5
Figure S5 Comparison of loop calling algorithms.(a) ROC analysis using loop scores for each 100 bp bin steps calculated by Tri-4C (Dynamic), 1D Hi-C, and UMi-4C algorithms as predictors of intra-TAD DHS peaks.(b) 2D plots comparing loop strength (logFE) on all intra-TAD CREs determined by Tri-4C normalization, or read count using (b) UMI-4C or (c) Hi-C normalization between viewpoints.Color indicates log distance ratio between the x and y viewpoint (blue = closer to x and red = closer to y).Pearson correlation coefficient r and p value from linear regression are indicated.

Figure
Figure S6 (a) Venn diagram of reproducible CRLs (N=2) called for MLLT3 and IFNB1 using Tri-4C and UMI-4C digested by three different restriction enzymes (b) ROC analysis using loop scores for each 100 bp bin as predictors of intra-TAD DHS peaks.(c) ROC analysis using loop scores for each 100 bp bin as predictors of intra-TAD H3K27Ac peaks.

Figure S7
Figure S72D plots the loop strength difference between UMI-4C and Tri-4C (Y axis, ΔlogFE) and the distance between the nearest restriction site and the DHS peak center (X axis, log scale) of all intra-TAD CREs.The p values were calculated by fitting with linear regression.

Figure S8
Figure S8 Analysis of Tri-4C loops that do not overlapped with CREs (a) Overlap between off-CRE loops within intra-TAD regions and ChIP-seq peaks of 5 or more transcription factor binding tracks combined in the ENCODE (Transcription Factor ChIP V3, 161 factors, all cell lines combined).(b) Validation of Cas9 deletion of S1-S4 indicated in (c).(c) Tri-4C, but not DpnII-UMI-4C, indicates looping of MLLT3 with 4 neighboring regions (S1-S4) lacking enhancer marks and CTCF/cohesin.(d) Expression of MLLT3 after deletion of these regions using Cas9 and two pairs of guide RNAs (sg1, sg2) quantified by real-time PCR (N=3).

Figure S9
Figure S9 Analysis of Tri-4C loop strength.(a) Comparison between loop strength (logFE) from three viewpoints and DHS peak log fold enrichment on all intra-TAD CREs.Pearson correlation coefficient r and p value from linear regression model are indicated.(b) Association between loop strength and CTCF motif presence for Tri-4C and UMI-4C based on three enzyme digestion.

Figure S10
Figure S10 CRL alterations after IFNB1 induction.(a) Expression profiling of IFNB1 and MLLT3 expression before and after IFNB1 induction by using real-time PCR (N=3, t test).(b) Comparison of Tri-4C yield for three viewpoints (N=2) (c) 2D plot showing read count of IFNB1 Tri-4C (normalized against Boundary) at all intra-TAD CREs before (X axis) and after (Y axis) induction.(d) Venn diagram showing the overlap of CRLs called from IFNB1 before and after induction.(e) Alteration of IFNB1 interaction before (Ctrl) and after (Induced) induced expression in IMR90.Top track aligns interaction read count, while second denotes loop strength alteration (ΔlogFE) and shows the loop gain is specific to S1 despite increased count on both S1 and S2.Third track indicates ATAC-seq peak signal of the two enhancers corresponding to the two conditions.S1 is a known enhancer of IFNB1 45 .(f) Comparisons of loop strength alterations (ΔlogFE) between three viewpoints and (g) with ATAC-seq peak log fold enrichment changes on all intra-TAD CREs.Pearson correlation coefficient r and p value from linear regression model are indicated.(h)

Figure S12
Figure S12 Reproduction of Tri-4C using alternative digestion by CviAII.(a) 2D plot showing read count in 500 bp bins obtained from original Tri-4C and alternative digestion protocol.Pearson correlation coefficient r is indicated.(b) Loop score (log(-log(p)) comparison for all bins scored above 0 (p<0.1).

Figure
Figure S13 (a) Reproducibility of Tri-HiC at various resolutions for intrachromosomal contacts within 8 Mbp range (same as the distance restraint for loop calling).(b) Contact frequencies versus distance for read pairs in indicated directions determined by Tri-HiC.

Figure
Figure S15 (a) Annotation of in situ HiC loops.(b) Examples of virtual 4C derived from Tri-HiC for promoters interacting with multiple enhancers Additional annotations of Tri-HiC-identified loops.(c) Annotation of 1 kb Tri-HiC loops.

Figure S16
Figure S16 Examples of H3K27me3-associated super-stripes (highlighted by gray bars) identified by Tri-HiC.

Figure S17
Figure S17 Decomposition analysis of Tri-HiC loops.(a) Scatter plot showing correlations between loop fold enrichment (FE) and product of stripe fold enrichments of the two corresponding loop anchors around the loop sites.See Methods for precise definition of loop and stipe regions.(b) Categorical comparisons between loop and stripe fold enrichments.Loops annotated into multiple categories are only included in the leftmost category.Numbers above the

Figure S18
Figure S18 Tri-HiC reveals chromatin insulation associated with cis-regulatory elements.(a) Distance-normalized and Z score-transformed piled up heatmaps of distal interaction hotspots annotated with indicated epigenetic marks at 100 bp resolution.(b) Piled-up insulation scores for regions indicated in (a).Comparison of insulation strength of interaction hotspots, defined as the range of insulation scores within their 20 kb neighboring regions, annotated with indicated marks.

Fig
Fig S19Examples of loci where lncRNAs are located between the oncogene and its looped enhancer networks.For MEIS1 (green arrow), interactions with the downstream enhancers is interfered with by LINC01798(red arrow) promoter, whereas for CCND1, FZD8, and VEGFA, the promoters of respective lncRNAs (LINC01488, PCAT5, and LINC01512, dotted red arrow) are not occupied with active histone markers and show no sign of interaction insulation.