scEnhancer: a single-cell enhancer resource with annotation across hundreds of tissue/cell types in three species

Abstract Previous studies on enhancers and their target genes were largely based on bulk samples that represent ‘average’ regulatory activities from a large population of millions of cells, masking the heterogeneity and important effects from the sub-populations. In recent years, single-cell sequencing technology has enabled the profiling of open chromatin accessibility at the single-cell level (scATAC-seq), which can be used to annotate the enhancers and promoters in specific cell types. A comprehensive resource is highly desirable for exploring how the enhancers regulate the target genes at the single-cell level. Hence, we designed a single-cell database scEnhancer (http://enhanceratlas.net/scenhancer/), covering 14 527 776 enhancers and 63 658 600 enhancer-gene interactions from 1 196 906 single cells across 775 tissue/cell types in three species. An unsupervised learning method was employed to sort and combine tens or hundreds of single cells in each tissue/cell type to obtain the consensus enhancers. In addition, we utilized a cis-regulatory network algorithm to identify the enhancer-gene connections. Finally, we provided a user-friendly platform with seven useful modules to search, visualize, and browse the enhancers/genes. This database will facilitate the research community towards a functional analysis of enhancers at the single-cell level.

Assay of Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a powerful tool for epigenomic profiling of cell type-specific chromatin accessibility (25). It was reported that at least 50% and around 25% of the bulk ATAC-seq peaks fell into the enhancer and promoter regions, respectively (26). Thus, the peaks called from ATAC-seq mainly represent cis-regulatory elements, including enhancers and promoters, and can be used to annotate tissue/cell type-specific enhancers or promoters (22,27,28). At the single-cell level, scATAC-seq studies have been D372 Nucleic Acids Research, 2022, Vol. 50, Database issue widely applied to identify cell type-specific enhancers or enhancer-promoter interactions (1,3,4,(29)(30)(31)(32)(33)(34). Especially, accessible sites in the human genome identified by scATACseq displayed a high overlap with 75% of experimentally validated active enhancers in VISTA (3,9). Using the Graphical LASSO, a single-cell cis-regulatory network algorithm, Cicero, was developed (33) and enabled identification of genome-wide enhancer-promoter connections on a large scale (1,3,4,(29)(30)(31)(32)34). For example, Domcke et al utilized Cicero to identify 6.3 million unique pairs of cis-regulatory elements in 54 human cell types (3). These indicate that scATAC-seq may be an ideal single-cell sequencing technique for annotating enhancers and enhancer-promoter interactions.
Here, we constructed a single-cell enhancer database, scEnhancer, based on an improved unsupervised learning approach previously developed in our bulk enhancer database, EnhancerAtlas 2.0 (6). This method was used to integrate many genomic datasets to derive a consensus annotation of enhancers. It displayed several characteristics: (i) it was based on a well-designed score voting strategy for ranking and combining a large set of unlabelled data (7,35); (ii) we replaced the Pearson correlation with the Jaccard index, which was appropriate for the binary nature of scATAC-seq data, for computing similarity among all single-cell datasets (36); (iii) in contrast to the only 12 independent high-throughput datasets used in EnhancerAtlas 2.0, the new method could process tens or hundreds of single-cell datasets with different qualities as measured by the average number of fragments per cell; (iv) the Cicero results were used as a filtering condition for identification of the final single-cell enhancers (33). We also leveraged Cicero to generate enhancer-promoter connections of high quality (1,3,4,(29)(30)(31)(32)34). In some aspects, as a comprehensive single-cell enhancer resource, scEnhancer possesses tremendous advantages: (i) it has profiled 1 196 906 single cells and annotated a total of 14 527 776 enhancers in 775 tissue/cell types across three species; (ii) a suitable combination of an improved unsupervised learning method and a cis-network algorithm was applied to identify the enhancers and enhancer-gene interactions and (iii) a user-friendly platform with seven functional modules and the browser options were designed for searching, visualizing, drawing, and browsing enhancer or enhancer-promoter profiles. These will facilitate the analysis of enhancers at the single-cell level for the research community.

Integration of tissue/cell-type specific binary matrix
To obtain the tissue/cell-type specific binary matrix, we first converted the processed or raw scATAC-seq data into a large standard matrix with labelled cell types. In most scATAC-seq projects, the single-cell data are usually presented in many file formats, such as MatrixMarket (4), h5 (38), txt (41), RangedSummarizedExperiment (42), Seurat RDS (3), or even the raw fastq files (43). Here, we used the functions in R language and cellranger-atac 1.2.0 (30) to transform these different formats of single-cell data into large standard matrices for the subsequent extraction of small cell-type specific matrices (Supplementary Table S1). Data in MatrixMarket or h5 formats were transformed into the standard matrix via 'Matrix::readMM' and 'Read10X h5' in Signac (38) while the txt datasets could be read as data.frame and then converted into the matrices. For the RangedSummarizedExperiment dataset, we first parsed its R structure to obtain the matrix, peaks, cell barcodes and cell type annotations from its 'assay', 'colnames', 'colData', 'rowRanges' slots (42). Analogously, we parsed the dataset in Seurat RDS and extracted the matrix with peak, barcode, and cell type information from the slots 'GetAssayData', 'assays RNA/peaks' and 'meta.data', respectively (3). The raw fastq single-cell datasets could be transformed into a large standard matrix by cellranger-atac (30) that processes peak and cell callings into MatrixMarket or h5 formats (43). Finally, we removed the irregular datasets with fewer than 200 peaks. Signac and cellrangeratac tools can be downloaded and installed from: https:// satijalab.org/signac/ and https://support.10xgenomics.com/ single-cell-atac/software/downloads/1.2.
After integrating the large standard matrix containing barcodes of all mixed cell types, we extracted cell-type specific single-cell datasets from the matrix based on the cell type annotation information in the metadata. To make the single-cell datasets comparable, we binarized them to normalize each dataset. In each dataset, the peak signal was set to '1' ('open') for at least one read and '0' ('closed') otherwise in the absence of reads (36). Thus, all the singlecell datasets for one tissue/cell type were merged and consolidated into a binary cell-type matrix. In this cell-type matrix, single cells (i.e. columns) with less than 200 peaks (i.e. rows) were removed, as well as the peaks without any signal in all single cells or in uncommon chromosomes (e.g. chr1 random). We also removed the cell types with <50 single cells. Genomic coordinates of peaks with other genome builds in human and mouse were converted by liftOver to GRCh37/hg19 and GRCm37/mm9, respectively (40).

Generation of consensus single-cell enhancers
We designed an improved unsupervised method to identify the bulk consensus enhancers from 12 types of independent datasets in EnhancerAtlas 2.0 (6). Here, we modified the method to determine the weights of hundreds of single-cells and combine them to generate consensus singlecell enhancers. On average, there were hundreds of single cells per cell type in our integrated data (Table 1). For these datasets we hypothesized that one dataset was high-quality if it was highly correlated with the other datasets and lowquality otherwise (7). Since single-cell datasets are binary, Nucleic Acids Research, 2022, Vol. 50, Database issue D373 to weigh more on the 'open' peaks across the whole genome rather than the 'close' peaks, we applied the Jaccard coefficient that had been used for scATAC-seq clustering analysis (36,44), to measure the correlation between any two single cells (e.g. c i and c j ) based on the overlapping degree in open chromatin regions: where |C i |, |C j | and |C i ∩ C j | represent the number of 'open' peaks in C i , C j and their overlap, respectively. For a cell type with n single-cell datasets, a Jaccard similarity matrix for all combined datasets was integrated as: ⎡ Using the Jaccard matrix, we measured the weight of any single-cell dataset C i as: By combining all single cells into one cell type, the signal score of any consensus single-cell peak i could be defined as: where the S C j (i ) represents the signal score of the peak i in the single-cell dataset C j .
Furthermore, we removed the single-cell peaks overlapping with promoter or exon regions. We also used the experimentally validated silencers in SilencerDB (45) as a key filter to remove the single-cell peaks that overlapped with silencers. Finally, one consensus single-cell peak should satisfy two the requirements: (i) The signal of single-cell peak should be larger than 95% of the random signals calculated by shuffling the peaks in each single cell; (ii) Cicero connection score (≥0.1) is required to display the interaction of single-cell peak with at least one gene promoter.
To evaluate the accuracy of single-cell enhancers identified by this approach, we extracted the experimentally validated active enhancers from the VISTA database (9) as the gold standard. We compared them with bulk enhancers from the EnhancerAtlas 2.0 (6) in four human tissue/cell types. For single-cell or bulk enhancers in one cell type, the ones that overlapped with VISTA enhancers were classified as the positives while the others remained as the negatives. The sensitivity and specificity of the single-cell or bulk enhancers were computed on a base pair basis. We used the area under the receiver operating characteristic (AROC) to evaluate the performance for single-cell or bulk enhancers. The results showed that single-cell enhancers in brain, heart, eye and cranial nerve had much more overlaps with VISTA enhancers than the ones in bulk enhancers, as well as an average higher performance measured by AROC than bulk enhancers (Supplementary Figure S2). This indicated that single-cell enhancers were more accurately annotated than bulk enhancers.

Identification of enhancer-promoter interactions in scATACseq data
To identify the enhancer-promoter interactions in cell types, we employed the single-cell cis-regulatory network tool, Cicero, which had been widely used in many scATAC-seq projects to identify all the distal elements (e.g. enhancers)-promoter connections on a genome-wide basis (3,4,(29)(30)(31)(32)(33)(34). Because the scATAC-seq binary data for each tissue/cell type are extremely sparse, it is difficult to make accurate estimates of the co-accessibility score of chromatin accessibility loci with no normalization of the matrix (33,36). To successfully use Cicero, we transformed the binary matrix into a term frequency-inverse document frequency (TF-IDF) matrix using the latent semantic indexing (LSI) method for aggregating similar single cells to obtain denser counts in each peak (1,4). For each cell type, the binary count matrix M was converted into TF-IDF matrix as following: We then performed the Singular Value Decomposition (SVD) on the transformed TF-IDF matrix to reduce the dimensionality (1,4). Since the first dimension was highly correlated with the read depth, only the 2nd to 50th dimensions were passed to UMAP for 2D visualisation. Finally, Cicero used the UMAP coordinates and normalization matrix as the standard input to calculate the co-accessibility scores among peaks within a limited distance in DNA by the Graphical LASSO algorithm. Applying Cicero to each cell type across the three species with a cut-off of value >0.1, we annotated 63 658 600 enhancer-promoter connections involving 4 942 303 promoters, and 14 527 776 enhancers across 775 tissue/cell types in human, mouse and fly.

Implementation of scEnhancer
We developed a powerful web server, scEnhancer, for singlecell analysis of enhancers and enhancer-promoter interactions. scEnhancer adopted the Linux CentOS7 with a new web configuration of nginx (1.20.1)-php (5.4.16)-mySQL (5.7.34) in a new web configuration to build the website. In addition, we employed perl (5.16.3) for the fast processing of text files with large data. Moreover, the HTML5 Canvas API and Javascript with a drawing module were utilized together to establish a genome browser for displaying enhancer/gene distributions for single-cell or consensus datasets of tissue/cell types. We also set up a two-handle

Database statistics
We employed a modified unsupervised learning approach and the Cicero algorithm to build the single-cell enhancer resource scEnhancer (Supplementary Figure S3). To date, scEnhancer has catalogued 775 tissue/cell types, including 14 527 776 consensus single-cell enhancers and 63 658 600 enhancer-gene interactions from 1 196 906 single cells in three species (Table 1). We overlaid SNPs from GWAS (46) or TF binding motifs from JASPAR (47) with enhancer regions and found that the single-cell enhancers were much enriched for SNPs and TF binding sites. We also summarized the number of single cells, enhancers, and enhancergene connections in all tissue/cell types for each species (Supplementary Tables S2-S4). The integrated cell types covered many cancers and nearly all tissues in human and mouse (Supplementary Tables S2 and S3). Most cell types display high quality with an average of >3000 peaks per cell (Supplementary Tables S2-S4). The final consensus singlecell enhancer was determined by the possible functional evidence from enhancer-promoter interactions as defined by Cicero (33). As more and more cell type-specific marker genes were identified (48), we will use these marker genes to confirm the cluster's cell type in the scATAC-seq clustering analysis to confirm the cell types of clusters and then predict single-cell enhancers even when single-cell datasets are not labeled with cell type information.

Simple search
We designed five user-friendly analytical modules in scEnhancer for a simple search of single-cell enhancers ( Figure  1): (i) Search for single-cell enhancers by region ( Figure 1A).
(ii) Search for single-cell enhancers by a gene ( Figure 1B). (iii) Compare single-cell enhancers from different tissue/cell types ( Figure 1C). (iv) Compare enhancers of a gene in different tissue/cell types ( Figure 1D). (v) Predict target genes in genomic regions at the single-cell level (e.g. peaks from ChIP-Seq) ( Figure 1E). Users can search for the enhancers by region in any tissue/cell of any species. In each module, an 'Example' button can facilitate users to give input in one-step for a simple search. We allowed the gene name or ID input from several common gene/protein resources, including Ensembl, EMBL,UCSC, PDB, FlyBase, Ref-Seq and UniProt (49)(50)(51)(52)(53)(54)(55). These modules serve as easy-touse web interfaces for users to search, visualize and download single-cell enhancers and enhancer-promoter connections in any genomic region or any tissue/cell type(s).

Advanced search
We developed two powerful modules with several R packages as 'advanced search' to graphically display the differences/similarities among cell types or reveal the cell type specificity of enhancers at the single-cell level: (i) Display the differences among tissue/cell types by singlecell enhancer clustering (Figure 2). (ii) Identify tissue/cell type-specific enhancers at the single-cell level (Figure 3).
To avoid the batch effects to the greatest extent, we assigned each tissue/cell type to a batch and compared different tissue/cell types within the same batch. Based on the equipment or technology platforms, we classified all the cell types into 10, 8 and 1 batches in human, mouse and fly, respectively.
In the first module, the users can select a group of cell types of interest in the same batch to observe their differences or similarities among them (Figure 2A). Merging the scATAC-seq matrices of the selected cell types, the differences among selected tissue/cell types can be displayed by DimPlot of Signac (38) (Figure 2B). In addition, this module can also calculate and present the similarities among selected tissue/cell types by Jaccard index ( Figure 2C).
To reveal the cell type specificity of single-cell enhancers, the users can select a batch of interest and a primary cell type in which specific enhancers were found and select the reference cell types to compare with ( Figure 3). By clicking on 'Search', a list of cell type-specific enhancers will be obtained ( Figure 3A). Moreover, the cell-type specificity of the identified single-cell enhancers can be displayed by particular analyses, such as VlnPlot, FeaturePlot, and Heatmap ( Figure 3B-D).

Browser of single-cell enhancers
A browser page was provided in scEnhancer for accessing the single-cell enhancers. By clicking on the species name or image, the users can browse any tissue/cell type, any chro-mosome, and any single-cell enhancer, generating a summary table including the genomic coordinate of the enhancer, the contained GWAS SNPs (46), TF binding sites (47), relative super-enhancers (8), diseases (17) and DNA sequences (Figure 4).

CONCLUSIONS
scEnhancer is the first database to annotate enhancers or enhancer-promoter interactions at the single-cell level. It contains 50 861 554, 11 458 766 and 1 338 280 enhancerpromoter connections involving 3 905 212, 942 549 and 94 542 promoters, 11 373 862, 2 694 616 and 459 298 enhancers across 543, 185 and 47 tissue/cell types in human, mouse and fly, respectively. We believe this is the most comprehensive enhancer database that includes the largest number of enhancer-related datasets at the single-cell level.

DATA AVAILABILITY
A webserver with multiple analytic tools and deep browser capabilities is available at http://www.enhanceratlas.net/ scenhancer and no login is required for all users to access the website. Tutorials for performing the scEnhancer analytic tools are freely provided at http://www. enhanceratlas.net/scenhancer/help.php. All the data including single-cell enhancers, promoters, enhancer-promoter interactions, SNPs/Motifs in enhancers, and scATAC matrix in tissue/cell types could be downloaded in http://www. enhanceratlas.net/scenhancer/download.php