Main

Single-cell Hi-C (scHi-C) technologies have been developed to map chromatin architecture in individual cells, enabling the measure of spatial proximity between transcriptional regulatory elements in a cell type-specific manner1,2,3. However, due to the lack of tools tailored for scHi-C data, identifying loops from scHi-C data mainly relies on applying methods developed for bulk Hi-C4,5 to the aggregated scHi-C data of the same cell type. Due to the extreme sparsity of scHi-C data, such a strategy would require a large number of cells (>500–1,000), which is both cost prohibitive and impractical for rare cell types. To overcome these issues, we developed Single-Nucleus Analysis Pipeline for Hi-C (SnapHiC), a computational framework customized for scHi-C data to identify chromatin loops at high resolution and accuracy from a small number of cells.

SnapHiC (Fig. 1a) first imputes intrachromosomal contact probability between pairs of 10-kilobase (kb) bins in each cell with the random walk with restart (RWR) algorithm6. Next, it normalizes the imputed contact probability based on linear genomic distances. SnapHiC then applies paired t-test to the matrices of normalized contact probability of all cells to identify candidate bin pairs (or loop candidates) with higher-than-expected contact probability in a population of cells. To minimize false positives, SnapHiC considers a bin pair as a loop candidate only when its normalized contact probability is significantly higher than expected by chance based on both global and local background. Finally, SnapHiC groups the loop candidates into clusters7 and identifies the summit(s) within each cluster. In SnapHiC, individual cells are treated as independent datasets instead of being aggregated into pseudo bulk data. Therefore, the variability of contact frequency within the cell population can be estimated to boost the statistical power in loop detection, especially when the number of cells is low.

Fig. 1: SnapHiC reveals chromatin loops at high resolution and accuracy.
figure 1

a, Overview of the SnapHiC workflow. b, The number of chromatin loops at 10-kb resolution identified by SnapHiC and HiCCUPS (with default or optimal parameters) from different numbers of mES cells. c, F1 score of SnapHiC- and HiCCUPS-identified loops (with default or optimal parameters) from different numbers of mES cells. In b,c, the dots and the error bars represent the mean values and the standard deviations calculated across six randomly sampled subsets, respectively.

Source data

We first benchmarked the performance of SnapHiC against a commonly used loop detection method for bulk Hi-C data, HiCCUPS4. We applied SnapHiC to the published scHi-C data1 generated from mouse embryonic stem (mES) cells. Besides the full set of 742 cells, we also randomly subsampled 10, 25, 50, 75, 100, 200, 300, 400, 500, 600 and 700 cells from this dataset, and determined 10-kb-resolution intrachromosomal loops within the 100 kb–1 Mb range. For each subsampling, we also pooled the scHi-C data and identified chromatin loops at 10-kb resolution using HiCCUPS with both default and ‘optimal’ parameters for sparse data (Supplementary Note and Extended Data Fig. 1). For each subsampling dataset, SnapHiC found substantially more loops than HiCCUPS, suggesting SnapHiC has a much higher sensitivity than HiCCUPS (Fig. 1b and Supplementary Table 1). Even from 75 cells, SnapHiC identified 1,050–1,420 loops, whereas HiCCUPS found only 0–2 loops with default parameters and 3–10 loops with optimal parameters. Additionally, HiCCUPS-identified loops tended to be a subset of SnapHiC-identified loops (Extended Data Fig. 2a). Moreover, SnapHiC achieved higher reproducibility. From two replication datasets with 371 cells each, reproducibility was 50.8% for SnapHiC versus 38.7% for HiCCUPS with default parameters (paired t-test two-sided P = 7.86 × 10−8), while 50.8% for SnapHiC versus 39.7% for HiCCUPS with optimal parameters (paired t-test two-sided P = 9.90 × 10−11).

We used the F1 score, the harmonic mean of the precision and recall rates, to evaluate the overall performance of each method. To calculate the F1 score, we combined the chromatin loops identified by HiCCUPS from bulk in situ Hi-C data8, with long-range interactions identified by MAPS (model-based analysis of long-range chromatin interactions from PLAC-seq (proximity ligation-assisted ChIP–seq (chromatin immunoprecipitation assays with sequencing)) and HiChIP experiments) from H3K4me3 PLAC-seq data9, cohesin10 and H3K27ac HiChIP data11, all from mES cells. At each subsampling of scHi-C data, SnapHiC consistently attained a greater F1 score than HiCCUPS (Fig. 1c and Extended Data Fig. 2b,c). The reliability of SnapHiC-identified loops was supported by two lines of evidence: (1) significant focal enrichment at anchors of loops identified from at least 25 cells was observed from aggregate peak analysis (APA) plots using aggregated contact matrices of 742 cells (Extended Data Fig. 3) and (2) for SnapHiC-identified loops with CTCF (CCCTC-binding factor) binding on both anchors, there was a clear preference in convergent orientation—ranging from 63.6% to 78.7% when at least 50 cells are used (Supplementary Table 2), as predicted by the loop extrusion model4,12. The advantages of SnapHiC were more obvious when the number of cells profiled is limited. As illustrated in Extended Data Fig. 4, SnapHiC detected previously verified long-range interactions at Sox2, Wnt6 and Mtnr1a loci13,14 from scHi-C data of as few as 75 cells, whereas HiCCUPS required at least 200–600 cells to detect the same loops.

We next compared the performance of SnapHiC with three additional methods designed to identify long-range interactions from bulk Hi-C-FastHiC15, FitHiC2 (ref. 5) and HiC-ACT16 (Supplementary Note). Considering their default thresholds may not be optimal for the sparse scHi-C data, we also tested different thresholds for each method. Results on different numbers of mES cells demonstrated that SnapHiC consistently identified more loops and achieved greater F1 scores than the other methods, with higher recall rates and equivalent or slightly lower precision rates (Extended Data Fig. 5). For the three loci examined above (Extended Data Fig. 4), SnapHiC also detected the known long-range interactions with much fewer cells than the other methods (Extended Data Fig. 6). Taken together, our results suggested that SnapHiC can identify loops from a small number of cells with high sensitivity and accuracy.

To demonstrate the use of SnapHiC on complex tissues, we applied it to the published single-nucleus methyl-3C-seq (sn-m3C-seq) data3 generated from human prefrontal cortex, which simultaneously profiled DNA methylome and chromatin organization from the same cells. In this study, 14 major cell types were classified from single-cell methylome data based on cell type-specific CG and non-CG methylation patterns (Extended Data Fig. 7a). We applied SnapHiC to the Hi-C component of each cell within the 14 cell clusters, and identified roughly 817–27,379 loops at 10-kb resolution (Fig. 2a). Consistent with our observation on mES cells, SnapHiC identified more chromatin loops than tools developed for bulk Hi-C (Extended Data Fig. 7b) and yielded the highest F1 score on all cell types except for the oligodendrocytes (Extended Data Fig. 8 and Supplementary Table 3), which had comparable sequencing depth to routine bulk Hi-C data after aggregating (roughly 278 million intrachromosomal reads >20 kb, 1,038 cells).

Fig. 2: Application of SnapHiC to sn-m3C-seq data from human prefrontal cortex uncovered chromatin loops in diverse brain cell types.
figure 2

a, Scatter plot showing the numbers of cells and SnapHiC-identified loops in each of the 14 major cell types identified in human prefrontal cortex in Lee et al.3. ODC, oligodendrocyte. Astro, astrocyte. MG, microglia. OPC, oligodendrocyte progenitor cell. Endo, endothelial cell. L2/3, L4, L5 and L6 denote excitatory neuron subtypes located in different cortical layers. Pvalb and Sst, medial ganglionic eminence-derived inhibitory subtypes. Ndnf and Vip, CGE-derived inhibitory subtypes. NN1, nonneuronal cell type 1. b, Boxplot of ATAC-seq log2(CPM+1) value (left), H3K27ac ChIP–seq log2(CPM+1) value (middle) and RNA-seq log2(FPKM+1) value (right) at Astro-specific, MG-specific, ODC-specific and L2/3-specific SnapHiC loops. For each set of SnapHiC loops, the values are calculated using ATAC-seq/H3K27ac ChIP–seq/RNA-seq data generated from astrocyte, microglia, oligodendrocytes and neurons, respectively (Methods). ***P < 2.2 × 10−16, two-sided P values by the paired Wilcoxon signed-rank test. In each box, the upper edge, horizontal center line and lower edge represent the 75th percentile, median and 25th percentile, respectively. The upper whiskers represent the 75th percentile + 1.5× the interquartile range (IQR). The lower whiskers represent the minimum values (0). Data points with a value above the 75th percentile + 1.5× the IQR are outliers and reported as dots. c, SnapHiC-identified loops from astrocyte and microglia around gene APOE. There is no loop identified in this genomic region from oligodendrocytes or L2/3 excitatory neurons, so no corresponding tracks are shown. Two astrocyte-specific loops linking the APOE promoter (highlighted in gray) and the active enhancers in astrocyte (highlighted in pink) containing two Alzheimer’s disease (AD) -associated GWAS SNPs are marked by black arrows. Only APOE transcription start site-distal Alzheimer’s disease-associated GWAS SNPs are shown in the figures (residing in the region chr19: 45,440,000–45,630,000).

Source data

The accuracy and sensitivity of the above SnapHiC-identified loops were supported by two lines of evidence. First, APA analysis confirmed SnapHiC-identified loops show significant enrichment compared to their local background on the aggregated contact matrices (Extended Data Fig. 9). Second, anchors of SnapHiC-identified loops displayed corresponding cell type-specific chromatin accessibility, histone acetylation and gene expression in four distinct cell types: astrocytes, L2/3 excitatory neurons, oligodendrocytes and microglia, where assay for transposase-accessible chromatin using sequencing (ATAC-seq), H3K27ac ChIP–seq and RNA sequencing (RNA-seq) data are available17,18. To minimize the effect of cell number variation on the performance of SnapHiC, we randomly selected the same number of cells (N = 261) from astrocytes, oligodendrocytes and microglia to match the number of cells from L2/3 excitatory neurons, and applied SnapHiC to identify loops from these subsampled data. We found that most chromatin loops were cell type-specific (Supplementary Table 4), and the anchors of cell type-specific loops showed significantly higher ATAC-seq and H3K27ac ChIP–seq signals in the matched cell type compared to those in different cell types (Fig. 2b). The genes whose promoters linked to cell type-specific loops also showed significantly higher expression levels in the matched cell type (Fig. 2b and Supplementary Table 5). Moreover, they were associated with gene ontology terms19 related to cell type-specific biological processes (Extended Data Fig. 10a). Taken together, our results indicated that SnapHiC can detect chromatin loops reliably from scHi-C data in complex tissues.

Furthermore, we assigned candidate target genes to noncoding genome-wide association study (GWAS) single nucleotide polymorphisms (SNPs) based on the loops identified in specific cell types. We first collected 3,471 unique GWAS SNPs associated with seven neuropsychiatric disorders and traits that resided within the active enhancers of astrocytes, neurons, microglia or oligodendrocytes17 (Supplementary Table 6). Using SnapHiC-identified loops from the matching cell types (L2/3 excitatory neurons to represent neurons), we defined 788 SNP-gene linkages, connecting 445 disease-associated SNPs to 189 genes (Supplementary Table 7). The list included several known disease genes, such as INPP5D (Alzheimer’s disease), RAB27B (major depressive disorder, MDD), SORL1 (Alzheimer’s disease) and ZNF184 (MDD and schizophrenia). Figure 2c and Extended Data Fig. 10b showed an illustrative example of gene APOE, which was specifically expressed in astrocyte. Two astrocyte-specific loops connected the transcription start site of APOE to two active enhancers containing Alzheimer’s disease-associated GWAS SNPs (rs112481437 and rs138137383) in astrocyte. Our results indicated that APOE was the putative target gene of these two GWAS SNPs specifically in astrocytes.

In summary, we describe SnapHiC, a method to identify chromatin loops at high resolution and accuracy from sparse scHi-C datasets. Reanalyses of published scHi-C data from mES cells demonstrate that SnapHiC greatly boosts the statistical power in loop detection. Application of SnapHiC to sn-m3C-seq data from human prefrontal cortical cells reveals cell type-specific loops, which can predict putative target genes of noncoding GWAS SNPs. SnapHiC has the potential to facilitate the study of cell type-specific chromatin spatial organization in complex tissues.

Methods

Single-cell Hi-C (scHi-C) data processing

For scHi-C data from mES cells1, we downloaded the raw fastq files of all diploid serum cells. We first aligned scHi-C read pairs to mm10 genome with BWA-MEM with the ‘-5’ option, to report the most 5′ end alignment as the primary alignment, and the ‘-P’ option to perform the Smith–Waterman algorithm to rescue chimeric reads. We only used primary alignments in the next steps. We then deduplicated read pairs with the Picard tool to keep only one read pair at the exact same position. We further applied two filtering steps to remove duplications: (1) we split each chromosome into consecutive nonoverlapping 1-kb bins and only kept one contact for each 1-kb bin pair, and (2) we removed 1-kb bins that contact with more than ten other 1-kb bins, since they are likely mapping artifacts. The number of contacts per cell for all 1,175 cells has a bimodal distribution, and therefore only the top 742 cells with >150,000 contacts per cell were selected for downstream analysis.

Single-nucleus methyl-3C-seq (sn-m3C-seq) data processing

For sn-m3C-seq data from human prefrontal cortex, we performed data processing using reference genome hg19 as described in the previous study3. Afterward, we applied the same filtering steps to remove duplications as described in the Single-cell Hi-C (scHi-C) data processing section. Again, the number of contacts per cell for all 4,238 cells showed a bimodal distribution and the top 2,869 cells with >150,000 contacts per cell were used for downstream analysis. The method for clustering and cell type annotation for these 2,869 cells was the same as previously described3.

SnapHiC algorithm

Step A. Contact probability imputation using the RWR algorithm

We first partitioned each autosome into 10-kb bins and dichotomized contact for each 10-kb bin pair (binary contact matrix with 1 indicating nonzero contact and 0 otherwise). Next, we modeled each autosome as an unweighted graph, where each 10-kb bin is one node and each nonzero contact between any two 10-kb bins is one edge. We also added edges to all adjacent 10-kb bins. We then implemented the RWR algorithm6 with a restart probability of 0.05 to impute the contact probability between all intrachromosomal 10-kb bin pairs. We used the Python ‘NetworkX’ package to construct the graph and adopted the ‘linalg.solve’ function in the Python ‘SciPy’ package to solve the linear equation in the RWR algorithm. The systematic biases in imputed contact probabilities in scHi-C data are negligible, and thus normalization against effective fragment size, GC content or mappability is not needed (Supplementary Note).

Step B. Contact probability normalization based on one-dimensional (1D) genomic distance

Since the contact probability between any two genomic loci is dependent on their 1D genomic distance, normalization of the imputed contact probability against 1D genomic distance is needed before loop calling. To achieve this, we first removed bin pairs residing in the first 50 kb or the last 50 kb of each chromosome, which often have unusually high imputed contact probability due to the edge effect of the RWR algorithm. We then stratified all 10-kb bin pairs by their 1D genomic distance. Specifically, let xij represent the contact probability between bin i and bin j. Define set Ad as all bin pairs (i,j) with the 1D genomic distance d. For simplicity, we only considered bin pairs (i,j) in the upper triangle of the contact matrix where i < j. We removed the top 1% bin pairs in Ad with the highest contact probability, and then computed the mean μd and the standard deviation σd of the contact probability using the remaining bin pairs in Ad. We further calculated the normalized contact probability (that is, z score), defined as zij = (xij − μd)/σd, for all bin pairs in Ad. For single cells with very few contacts, the imputed contact probabilities xij at a specific 1D genomic distance d are close to zero, leading to very small standard deviation σd and numerical errors in the z score transformation. To avoid this issue, when σd is less than 10−6, we defined zij = 0 for all bin pairs in Ad. After the calculation described above, bin pair (i,j) with higher normalized contact probability zij suggests that bin i and bin j are more likely to interact with each other than other loci pairs.

Step C. Identification of loop candidates

To minimize false positives in loop calling results, we defined a bin pair as a loop candidate only if it shows a higher contact probability compared to both its global and local background. Specifically, we required the loop candidate to satisfy the following criteria:

  1. (1)

    Its average normalized contact probability from all single cells is greater than 0 (that is, with respect to global background).

  2. (2)

    More than 10% of single cells have normalized contact probability above 1.96 (that is, z score >1.96, corresponding to a z-test two-sided P value <0.05, with respect to global background).

  3. (3)

    For each 10-kb bin pair (i,j), we defined its local neighborhood as all 10-kb bin pairs (m,n) such that 30 kb ≤ max{d(i,m), d(j,n)} ≤ 50 kb (Supplementary Fig. 1), where d(i,m) is the 1D genomic distance between the center of bin i and the center of bin m. Here we did not consider the bin pairs within 20 kb of bin pair (i,j) as part of its local neighborhood because they can be part of the same loop cluster centered at bin pair (i,j). We then compared the normalized contact probability at bin pair (i,j) with the mean of the normalized contact probability of all ninety-six 10-kb bin pairs within its local neighborhood region, and applied the paired t-test across all single cells to obtain a P value. We further converted P values into false discovery rates (FDRs) using the Benjamini–Hochberg procedure, again stratified by 1D genomic distance. A loop candidate must have FDR < 10% and t-statistics greater than three in the paired t-test (that is, with respect to local background).

  4. (4)

    Motivated by the HiCCUPS algorithm4, we also required each loop candidate to have at least 33% higher average normalized contact frequency than its circle, donut and lower left background and 20% higher average normalized contact frequency than its horizontal and vertical background (Supplementary Fig. 1) (that is, with respect to local background).

  5. (5)

    Finally, we removed loop candidates with either end having low mappability score (≤0.8), or overlapping with the ENCODE blacklist regions (http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz for mm10 and https://www.encodeproject.org/files/ENCFF001TDO/ for hg19). The sequence mappability for each 10-kb bin is calculated based on our previous study20; it can be downloaded from http://enhancer.sdsc.edu/yunjiang/resources/genomic_features/.

Step D. Clustering of loop candidates and identifying the summit(s) as final outputs

For each loop candidate (i,j), we defined its surrounding area as all 10-kb bin pairs (m,n) such that max {d(i,m), d(j,n)} ≤ 20 kb, where d(i,m) is the 1D genomic distance between the center of bin i and the center of bin m. We defined a loop candidate as a singleton if there is no other loop candidate within its surrounding area, and removed all singletons from the downstream analysis since the singletons are likely to be false positives.

To group the remaining nonsingleton loop candidates into clusters, we adopted the Rodriguez and Laio’s algorithm7. Specifically, for each loop candidate (i,j), we first counted the number of loop candidates in its adjacent neighborhood regions: (m,n): max {d(i,m)},d(j,n) ≤ 10 kb, and defined this number as its local density ρ(i,j). Next, we calculated the minimum Euclidean distance between the loop candidate (i,j) and any other loop candidate with higher local density on the same chromosome, defined as δ(i,j):

$$\delta \left( {i,j} \right) = \mathop {{\min }}\limits_{\left( {m,n} \right):\rho \left( {m,n} \right) > \rho (i,j)} \sqrt {\left( {i - m} \right)^2 + \left( {j - n} \right)^2} .$$

If a loop candidate (i,j) has the highest local density (that is, ρ(i,j) = 9), δ(i,j) is defined as:

$$\delta \left( {i,j} \right) = \mathop {{\max }}\limits_{(m,n)} \sqrt {\left( {i - m} \right)^2 + \left( {j - n} \right)^2} .$$

We then selected loop candidates that have high local density ρ, and are relatively far away from other loop candidates with higher local density, that is, high δ, as loop cluster centers. To achieve this, let ρmax and δmax represent the maximal value of ρ and δ of all loop candidates on each chromosome, respectively. We defined \(\rho ^{\prime} \left( {i,j} \right) = \rho \left( {i,j} \right)/\rho _{\mathrm{max}}\) and \(\delta ^{\prime} \left( {i,j} \right) = \delta \left( {i,j} \right)/\delta _{\mathrm{max}}\) such that both \(\rho ^{\prime} \left( {i,j} \right)\) and \(\delta ^{\prime} \left( {i,j} \right)\) are within range [0,1]. We then defined \(\eta \left( {i,j} \right) = \rho ^{\prime} \left( {i,j} \right) \times \delta ^{\prime} \left( {i,j} \right)\), ordered all loop candidates by their η in the descending order and plotted the rank of η against the value of η. In this plot, we selected the reflection point such that the slope at the reflection point is one. All loop candidates with η larger than η at the reflection point were chosen to be the loop cluster centers. After finding the loop cluster centers, we assigned each remaining loop candidate to the same loop cluster as its nearest neighbor with a higher local density ρ.

Within each loop cluster, we defined the loop candidate with the lowest FDR as the first summit of the cluster. For the first summit (i, j), we defined its surrounding area as all 10-kb bin pairs (m,n) such that max{d(i,m),d(j,n)} ≤ 20 kb, and removed all loop candidates within its surrounding area. Next, we selected the loop candidate with the lowest FDR among the remaining ones (if there is any) as the second summit of this cluster. We then removed all loop candidates within the surrounding area of the second summit in the same way as we did for the first summit, and searched for the third summit (if there is any) with the lowest FDR among the remaining loop candidates. Such a procedure was iterated until there were no loop candidates left in this cluster. Note that one loop cluster may contain multiple summits. The SnapHiC algorithm outputs a file containing the summit(s) of each loop cluster as its final chromatin loop list.

Details about the justification of the thresholds implemented in SnapHiC can be found in Supplementary Note and Supplementary Figs. 2 and 3.

Identification of chromatin loops with SnapHiC

We applied SnapHiC to scHi-C data from 10, 25, 50, 75, 100, 200, 300, 400, 500, 600, 700 and 742 mES cells and each of the 14 cell clusters from sn-m3C-seq data of human prefrontal cortex to call chromatin loops at 10-kb resolution between the 100 kb and 1 Mb region on autosomal chromosomes.

We did not take bin pairs within 100 kb into consideration because they do not have complete information in their local neighborhood (refer to SnapHiC algorithm). We also evaluated the performance of SnapHiC beyond 1 Mb 1D genomic distance or at a different resolution; the results are summarized in the Supplementary Note.

Visualization of scHi-C and sn-m3C-seq data using percentage (%) of outlier cells matrix

Due to the sparsity of the raw count matrix of scHi-C data, the SnapHiC-identified loops can be visualized by the percentage of the outlier cells matrix. Specifically, we first computed the percentage of outlier cells (that is, cells with normalized contact probability >1.96), and then took the integer ceiling of 100 × (% of outlier cells) to create a count matrix. We then used the Juicer21 software to convert the count matrix into a .hic file and visualize it in Juicebox22.

Generation of aggregated contact matrix for scHi-C and sn-m3C-seq data

We pooled contacts from single cells of interest to create the aggregated contact matrix in .hic format using Juicer with KR normalization21. Only intrachromosomal contacts >2 kb away were used.

Identification of loops/interactions using HiCCUPS, FastHiC, FitHiC2 and HiC-ACT from aggregated contact matrix

We applied the HiCCUPS4 to the aggregated contact matrix after pooling the contacts from single cells of interest and calling loops at 10-kb resolution with the two sets of parameters: (1) default parameter: ‘-ignore_sparsity -r 10000 -k KR -f.1 -p 2 -i 5 -t 0.02,1.5,1.75,2 -d 20000’ and (2) optimal parameter: ‘-ignore_sparsity -r 10000 -k KR -f .1 -p 4 -i 15 -t 0.4,1.5,1.75,2 -d 20000’.

We applied FitHiC2 (ref. 5), FastHiC15 and HiC-ACT16, with the default setting to the aggregated contact matrix after pooling the contacts from single cells of interest at 10-kb bin resolution. We also tested different significance thresholds to accommodate the sparse scHi-C data: FDR < 1%, <5% and <10% for FitHiC2; the posterior probability of significant interactions >0.9, >0.99 and >0.999 for FastHiC and the local neighborhood smoothed P values <10−6, <107 and <108 for HiC-ACT. After getting the raw output, we further removed significant chromatin interactions supported by fewer than six reads to minimize false positives. We then applied the same algorithm implemented in SnapHiC (Step D in SnapHiC algorithm) to identify their summits.

To ensure a fair comparison with SnapHiC-identified loops, we further filtered the above identified loops/interactions by selecting the intrachromosomal ones within 1D genomic distance roughly 100 kb–1 Mb and removing the loops whose anchor bins had low mappability (≤0.8) or overlapped with the ENCODE blacklist regions.

Definition of loop overlap

Let bin pair (i,j) represent a loop in set A. We define that it overlaps with a loop in set B, if and only if there exists a loop (m,n) in set B such that max(dim, djn) ≤ 20 kb, where dim is the 1D genomic distance between the middle base pair of bin i and the middle base pair of bin m.

Subsampling of scHi-C and sn-m3C-seq data

For scHi-C data from mES cells, we randomly permuted the order of all 742 cells that passing our quality control six times, and selected the first 10, 25, 50, 75, 100, 200, 300, 400, 500, 600 and 700 cells from all 742 cells to create a series of subsampled datasets.

For sc-m3C-seq data from human prefrontal cortex, we randomly permuted the order of all 338 astrocytes, 323 microglia and 1,038 oligodendrocytes and selected the first 261 cells to create the subsampled datasets for astrocytes, microglia and oligodendrocytes, respectively.

Reproducibility of SnapHiC- and HiCCUPS-identified loops

Suppose we have two sets of loop list A and B. Let PA represent the proportion of loops in set A overlapped with loops in set B (Definition of loop overlap) and let PB represent the proportion of loops in set B overlapped with loops in set A. We used (PA + PB)/2 to measure the reproducibility of loops in the two sets.

To assess the reproducibility of SnapHiC and HiCCUPS, we first randomly split all 742 mES cells into two groups where each group consists of 371 cells, and then applied SnapHiC and HiCCUPS to identify loops for each group. The reproducibility of SnapHiC- and HiCCUPS-identified loops between two sets of 371 cells was calculated as described above. We repeated such random splitting and loop calling analysis ten times, and reported the mean of reproducibility of SnapHiC and HiCCUPS-identified loops. We further used the paired t-test to evaluate the statistical significance of the difference in reproducibility between the methods.

Generation of the reference loop/interaction lists for calculation of precision, recall and F1 score

For mES cells, the HiCCUPS loops at 10-kb resolution from bulk in situ Hi-C data were called as previously described14 using the pooled datasets of all four biological replicates from the Bonev et al. study8. MAPS9 was applied to H3K4me3 PLAC-seq data9, cohesin HiChIP data10 and H3K27ac HiChIP data11 to call significant intrachromosomal interactions at 10-kb resolution within 1 Mb. We combined the above four lists and further filtered by removing interactions where anchor bins had low mappability (≤0.8) or overlapped with the ENCODE blacklist regions to create the final reference loop list.

For oligodendrocytes, microglia and eight neuronal subtypes from human prefrontal cortex, we used MAPS-identified interactions from H3K4me3 PLAC-seq data of purified oligodendrocytes, microglia and neurons as their reference list, respectively17. We first filtered the list by selecting loops with 1D genomic distance roughly 100 kb–1 Mb and removing loops where anchor bins had low mappability (≤0.8) or overlapped with the ENCODE blacklist regions. We further selected the loops in which at least one end contains active promoters of the corresponding cell type to create the final reference interaction list.

Calculation of precision, recall and F1 score

Let N represent the number of loops in the reference loop list for the cell type of interest. Suppose method A identifies M loops from the same cell type, and m of them overlapped with loops in the reference loop list (Definition of loop overlap). The precision is calculated as m/M. Suppose among all N loops in the reference loop list, n loops overlapped with method A-identified loops. The recall is calculated as n/N. Notably, m and n may not be equal since we allow up to a 20-kb gap between two overlapped loops. The F1 score is the harmonic mean of the precision and recall and is calculated as below:

$${\mathrm{F1}}\,{\mathrm{score}} = 2 \times \frac{{{\mathrm{Precision}} \times {\mathrm{Recall}}}}{{{\mathrm{Precision}} + {\mathrm{Recall}}}} = 2 \times \frac{{m/M \times n/N}}{{m/M + n/N}}.$$

For mES cells, we used all SnapHiC-, HiCCUPS-, FastHiC-, FitHiC2- or HiC-ACT-identified loops/interactions for the above calculation. For oligodendrocytes, microglia and eight neuronal subtypes, we only selected the SnapHiC-, HiCCUPS-, FastHiC-, FitHiC2- or HiC-ACT-identified loops/interactions in which at least one end contains active promoters of the corresponding cell type for this calculation, since the available reference loop lists are called from H3K4me3 PLAC-seq data, which can only detect interactions centered at promoter regions.

APA

We used the Juicer21 software with the command ‘java -jar juicer_tools_1.19.02.jar apa -r 10000 -k KR -u input.hic loops.txt APA’ to perform the APA. We reported ‘P2LL’ (also known as the APA score) and ‘ZscoreLL’ to evaluate the enrichment of SnapHiC-identified loops with respect to the lower left background.

CTCF motif orientation analysis

We obtained the CTCF ChIP–seq peaks of mES cells from Kubo et al.23 and used FIMO24 with default parameters and the CTCF motif (MA0139.1) from the JASPAR25 database to search for CTCF sequence motifs among CTCF ChIP–seq peaks. We then selected a subset of testable SnapHiC-identified loops in which both ends contain either a single CTCF motif or multiple CTCF motifs in the same direction and calculated the proportion of convergent, tandem and divergent CTCF motif pairs among all testable loops.

Visualization of CTCF and H3K27ac ChIP–seq data from mES cells

We downloaded the signal tracks from the ENCODE portal (https://www.encodeproject.org/) with the following identifiers: ENCFF230RNU (for H3K27ac) and ENCFF069PTO (for CTCF) for Extended Data Fig. 4a.

Definition of cell type-specific SnapHiC loops

We used the SnapHiC loops identified from subsampled astrocytes, microglia and oligodendrocytes datasets and L2/3 excitatory neurons (all with 261 cells) to define cell type-specific loops. We defined a loop identified from one cell type as cell type-specific, if it did not overlap (Definition of loop overlap) with loops identified from any of the other three cell types.

Processing of ATAC-seq and H3K27ac ChIP–seq data from four brain cell types

The ATAC-seq and H3K27ac ChIP–seq data from human astrocytes, oligodendrocytes, microglia and neurons are from Nott et al.17 and are processed with ENCODE ATAC-seq and ChIP–seq pipelines as previously described17. The normalized bigwig tracks with reads per kilobase of a transcript, per million mapped reads as the y axis are generated for visualization in Fig. 2b.

Processing of RNA-seq from four brain cell types

The RNA-seq data from human astrocytes, oligodendrocytes, microglia and neurons were acquired from Zhang et al.18. The alignment and quantification are performed with pipeline: https://github.com/ren-lab/rnaseq-pipeline. Briefly, we first aligned RNA-seq raw reads to hg19. Next, we used Gencode GTF gencode.v19.annotation.gtf for hg19 with STAR following the ‘ENCODE’ options outlined in the STAR manual (https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf). We then used Picard (http://broadinstitute.github.io/picard/) to remove PCR duplicates. We also generated the normalized bigwig tracks with reads per kilobase of a transcript, per million mapped reads as the y axis for visualization in Fig. 2b.

Enrichment analysis of ATAC-seq or H3K27ac ChIP–seq signals at cell type-specific loops

To quantify the intensity of ATAC-seq or H3K27ac ChIP–seq signals at cell type-specific loops in the cell type of interest, we first calculated reads per million (CPM) values in each 10-kb anchor of the cell type-specific loops using ATAC-seq or H3K27ac ChIP–seq data from the cell type of interest. To minimize the background noise, we only considered the reads falling into the ATAC-seq or H3K27ac ChIP–seq peak regions defined in the cell type of interest but not all the reads in the entire 10-kb bin. If there are multiple ATAC-seq or H3K27ac ChIP–seq peaks in the same 10-kb bin, we then added up the CPM values and took the sum as the value for that 10-kb bin. Since each loop has two anchors, we took their average CPM to represent the intensity of ATAC-seq or H3K27ac ChIP–seq signal for that loop in the cell type of interest. Last, we applied the paired Wilcoxon signed-rank test on log2(CPM+1) values from different combinations of cell types of interest and the cell type-specific loop sets to test whether there is a significant difference.

Gene expression analysis at cell type-specific loops

We obtained the fragments per kilobase of transcript per million mapped reads (FPKM) values of each protein-coding gene in human astrocytes, neurons, microglia and oligodendrocytes from Supplementary Table 4 provided in Zhang et al. (Col P-U for astrocytes, Col AB for neurons, Col AC-AG for oligodendrocytes and Col AH-AJ for microglia in the ‘Human data only’ tab)18. For each gene, we took the average of FPKM across biological replicates of the same cell type. For the selected genes where promoters overlapped with cell type-specific loops, we applied the Wilcoxon signed-rank test to evaluate whether they were highly expressed in the matched cell type.

Selection of GWAS SNPs associated with neuropsychiatric disorders and traits

We first collected 30,262 genome-wide significant (P < 5 × 10−8) noncoding GWAS SNPs associated with neuropsychiatric disorders and traits. We considered seven neuropsychiatric disorders, including Alzheimer’s disease26, attention deficit hyperactivity disorder27, autism spectrum disorder28, bipolar disorder29, intelligence quotient30, MDD31 and schizophrenia32, resulting in a total of 28,099 unique GWAS SNPs (Supplementary Table 6). We then overlapped these GWAS SNPs with active enhancers of astrocytes, neurons, microglia or oligodendrocytes defined in the previous study17 and this resulted in 3,639 SNP-disease associations (3,471 unique GWAS SNPs) for analysis (Supplementary Table 6).

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.