Pig-eRNAdb: a comprehensive enhancer and eRNA dataset of pigs

Wang, Yifei; Jin, Weiwei; Pan, Xiangchun; Liao, Weili; Shen, Qingpeng; Cai, Jiali; Gong, Wentao; Tian, Yuhan; Xu, Dantong; Li, Yipeng; Li, Jiaqi; Gong, Jing; Zhang, Zhe; Yuan, Xiaolong

doi:10.1038/s41597-024-02960-7

Data Descriptor
Open access
Published: 01 February 2024

Yifei Wang¹^na1,
Weiwei Jin²^na1,
Xiangchun Pan¹^na1,
Weili Liao¹,
Qingpeng Shen¹,
Jiali Cai¹,
Wentao Gong¹,
Yuhan Tian¹,
Dantong Xu¹,
Yipeng Li¹,
Jiaqi Li¹,
Jing Gong²,
Zhe Zhang ORCID: orcid.org/0000-0001-7338-7718¹ &
…
Xiaolong Yuan ORCID: orcid.org/0000-0002-3743-0130¹

Scientific Data volume 11, Article number: 157 (2024)

Abstract

Enhancers and enhancer RNAs (eRNAs) have been strongly implicated in the regulation of transcription. Pig-eRNAdb is a dataset that comprehensively integrates pig enhancers and eRNAs, built from public multi-omics data (ATAC-seq, ChIP-seq and RNA-seq) with a machine learning strategy. It incorporates 82,399 enhancers and 37,803 eRNAs from 607 samples across 15 pig tissues. The dataset provides in-depth annotation of pig enhancers and eRNAs. The coordinates of enhancers and the expression patterns of eRNAs are downloadable. In addition, thousands of eRNA regulators, the target genes of eRNAs, the tissue-specific eRNAs and the housekeeping eRNAs are accessible, as is the sequence similarity between pig and human eRNAs. Tissue-specific eRNA-trait associations encompassing 652 traits are also provided. As a reference dataset, Pig-eRNAdb should facilitate investigations of enhancers and eRNAs in pigs.

Background & Summary

During the growth and development of organisms, gene transcription is subject to a variety of complex regulators¹. Previous studies have reported that transcription is controlled by cis-regulatory elements^2,3. Enhancers, first discovered in 1981, are widely regarded as important cis-regulatory elements that significantly increase the transcription of target genes⁴. They have been shown to bind the promoters of target genes at a distance and to activate transcription regardless of orientation^5,6. Notably, enhancers are active in a highly cell type-specific and tissue-specific manner^7,8. The ENCODE⁷, FANTOM5^9,10 and Roadmap Epigenomics¹¹ projects have been established to collect enhancers of humans and animals.

Recently, enhancers have been identified genome-wide in humans and mice by applying machine learning to multi-omics data, which greatly reduces the time and cost of defining enhancers compared with experimental methods¹². Numerous machine learning strategies for enhancer prediction now exist. For instance, PEDLA used a large body of heterogeneous data to predict enhancers in H1 cells, e.g., chromatin accessibility (DNase-seq), RNA-seq, DNA methylation and ChIP-seq of 27 histone modification marks and 15 transcription factors (TFs), and achieved 97.7% accuracy with a deep neural network framework in 2016¹³. In 2017, He et al. developed REPTILE¹⁴, a random forest classifier that predicts enhancers from epigenomic signatures, e.g., H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K27me3 and H3K9ac as well as DNA methylation; the accuracy of REPTILE was 94.4%, which was higher than that of DELTA¹⁵ in H1 cells. In 2019, Ramisch et al.¹² used two binary random forest classifiers for enhancer prediction based on the signatures of H3K4me1, H3K4me3 and H3K27ac, and obtained stable results with an area under the precision-recall curve in the range [0.91, 0.95], which was superior to REPTILE¹⁴. In 2021, Zhanlin Chen et al. introduced DECODE, which uses STARR-seq data, chromatin accessibility (ATAC-seq and DNase-seq) and signatures of H3K27ac, H3K4me3, H3K4me1 and H3K9ac to extract accurate cell-type-specific enhancers¹⁶; the average area under the precision-recall curve of DECODE was 24% higher than that of Matched-Filter¹⁷. Together, these studies suggest that combining multi-omics data with machine learning is likely to improve the accuracy of enhancer prediction.

Enhancer RNAs (eRNAs) are non-coding RNAs produced by bidirectional, RNA polymerase II-dependent transcription of enhancer regions¹⁸. Previous studies have shown that 40,000–65,000 eRNAs are expressed in humans, and eRNAs have been reported to regulate the transcription of target genes¹⁹ and the activation of enhancers^10,20. For instance, eRNAs promote transcriptional condensate formation to activate enhancers in MCF7 breast cancer cells²¹, and the eRNA transcribed from the Pcdh-α HS5-1 enhancer is likely to form an R-loop structure and alter the chromatin structure to strengthen the expression of Pcdh-α²². Furthermore, several eRNA databases have been built for humans^23,24 and animals²⁵. For example, HeRA²³ characterizes human eRNAs using RNA-seq files (9577 samples across 54 human tissues) from GTEx and enhancer annotations from the ENCODE, FANTOM and Roadmap Epigenomics projects. GPIeR characterizes the impact of genetic variants on eRNA expression using large-scale omics data from The Cancer Genome Atlas²⁶. In addition, Animal-eRNAdb²⁵ has been developed for the eRNAs of animals, e.g., chickens, sheep, rats and mice, based on 5085 RNA-seq datasets from NCBI and enhancer annotations from SEA 3.0 and EnhancerAtlas v2.0.

The pig is not only an important agricultural animal that provides pork and animal protein, but also a valuable biomedical model for humans²⁷. However, eRNA profiles have not been characterized in pigs. In this study, we developed CNNEE (a convolutional neural network (CNN)-based pipeline to track enhancers and eRNAs, https://github.com/WangYF33/CNNEE), which uses multi-omics data on chromatin accessibility (ATAC-seq) and histone modifications (H3K27ac and H3K4me3) across multiple tissues, together with RNA-seq data, because enhancers can be transcribed into eRNAs. Moreover, we collected RNA-seq data from 607 samples across 15 pig tissues to characterize the eRNA profiles, the target genes and regulators of eRNAs, the tissue-specific eRNAs, the housekeeping eRNAs (HKeRNAs), and the associations between eRNAs and phenotypes in pigs. Pig-eRNAdb will facilitate functional investigations of enhancers and eRNAs in pigs.

Methods

Pig-eRNAdb is an integrated dataset of enhancers and eRNAs, containing the coordinates of enhancers and eRNAs, the target genes and regulators of eRNAs, and the sequence similarities of eRNAs. The analysis workflow of Pig-eRNAdb is shown in Fig. 1.

Data collection and processing

We collected data from multiple developmental stages and tissues to achieve comprehensive prediction of enhancers. We downloaded the bigwig and narrow peak files of ATAC-seq and of ChIP-seq for H3K27ac and H3K4me3, as well as RNA-seq data, across five pig tissues (heart, liver, spleen, muscle and fat) (Supplemental Table S1). Signal tracks in the bigwig files were extracted for ATAC-seq and for ChIP-seq of H3K27ac and H3K4me3 using Python (version: 3.6.4). A total of 607 RNA-seq datasets were downloaded from the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) across 15 pig tissues (Supplemental Table S2): hearts, livers, spleens, backfat, gallbladders, jejunums, kidneys, lungs, ovaries, pituitaries, muscles, skeletal muscles (Sk-Muscle), longissimus dorsi muscles (Ld-Muscle), pineal and ileum. The raw data were converted to fastq files using fastq-dump in SRA Toolkit (version: v2.8.2), subjected to quality control and cleaning with fastp (version: 0.23.0)²⁸, and mapped to the pig reference genome (https://ftp.ensembl.org/pub/release-105/fasta/sus_scrofa/dna/) using HISAT2 (version: v2.1.0)²⁹ to obtain bam files, which were sorted with SAMtools (version: 1.11)³⁰. Furthermore, we merged all files using the merge parameter of STRINGTIE (version: 2.1.1) to unify the standard. Finally, we used STRINGTIE (version: 2.1.1) to compute transcripts per million (TPM)³¹ to normalize gene expression, and obtained a gene expression matrix. The genome annotation file was downloaded from Ensembl (https://ftp.ensembl.org/pub/release-105/gtf/sus_scrofa/). The annotations of enhancers in humans and mice were downloaded from ENCODE (http://screen.encodeproject.org/) (Supplemental Table S3). The repeat elements of pigs were downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/susScr11/database/).

Input data matrix construction

We used multi-omics features from RNA-seq, chromatin accessibility and histone modifications (H3K27ac and H3K4me3) as input data. First, the raw RNA-seq data were processed into bam files following the transcriptome processing workflow described above. The bam files were then converted into bigwig files using the bamCoverage tool from deepTools together with the bigWigToBedGraph and bedGraphToBigWig tools from UCSC.

The peaks of H3K27ac and H3K4me3 were intersected with the ATAC-seq peak regions, and then intersected with transcripts with expression (TPM) greater than 0.1, which served as the narrow peaks of the transcript data. ATAC-seq peaks overlapping both a transcript peak and a peak of an active histone mark were defined as active enhancers and used as the positive training set. Negative training samples were selected from the background by downsampling to 10 times the size of the positive set. Positive and negative samples were unified into 4 kb windows, and the signals were aggregated in 10 bp bins.
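The selection rule for training regions can be illustrated with a small, hedged sketch. It is not the CNNEE code: the function names, the tuple layout `(chrom, start, end)`, the pooling of H3K27ac and H3K4me3 peaks into one list, and the random sampling of background windows are all assumptions made for illustration. Only the TPM cutoff of 0.1 and the 10:1 negative-to-positive ratio come from the text.

```python
import random

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def positive_regions(atac_peaks, histone_peaks, transcripts, min_tpm=0.1):
    """ATAC-seq peaks that overlap a histone peak and an expressed transcript.

    transcripts are (chrom, start, end, tpm) tuples; histone_peaks pools the
    H3K27ac and H3K4me3 peaks (an assumption of this sketch).
    """
    expressed = [t[:3] for t in transcripts if t[3] > min_tpm]
    return [p for p in atac_peaks
            if any(overlaps(p, h) for h in histone_peaks)
            and any(overlaps(p, t) for t in expressed)]

def sample_negatives(background, n_positive, ratio=10, seed=0):
    """Downsample background windows to ratio x the number of positives."""
    rng = random.Random(seed)
    k = min(len(background), ratio * n_positive)
    return rng.sample(background, k)

# Hypothetical toy data
atac = [("1", 100, 600), ("1", 5000, 5400), ("2", 100, 500)]
histone = [("1", 500, 900), ("2", 50, 150)]
transcripts = [("1", 0, 1000, 2.5), ("2", 0, 1000, 0.05)]
positives = positive_regions(atac, histone, transcripts)
background = [("3", i * 4000, (i + 1) * 4000) for i in range(50)]
negatives = sample_negatives(background, len(positives))
```

In this toy input only the first ATAC-seq peak qualifies: the second lacks a histone peak, and the third overlaps a transcript whose TPM is below the cutoff.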

The input data were placed into a matrix of size 4 × 400, which contained the signal tracks of the three epigenomic data types and the transcriptome data in the 4 kb region. In the input matrix, each value represented the signal strength at the corresponding genomic location. The precision of the detected boundaries of enhancer regions was determined by the resolution of the signal integration. In this study, each value in the input matrix represented the average signal value in a bin with a resolution of 10 bp.
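As a minimal sketch of how a 4 kb window becomes a 4 × 400 matrix, the following averages per-base signal in 10 bp bins and stacks four tracks. The row order in `TRACK_ORDER` and the function names are assumptions for illustration; the text fixes only the window size, the bin size and the use of the bin average.

```python
# Assumed row order; the source does not state it.
TRACK_ORDER = ("ATAC", "H3K27ac", "H3K4me3", "RNA")

def bin_signal(per_base, bin_size=10):
    """Average a per-base signal vector in consecutive bins of bin_size bp."""
    if len(per_base) % bin_size != 0:
        raise ValueError("signal length must be a multiple of bin_size")
    return [sum(per_base[i:i + bin_size]) / bin_size
            for i in range(0, len(per_base), bin_size)]

def build_input_matrix(tracks, window=4000, bin_size=10):
    """Stack the binned tracks into a len(TRACK_ORDER) x (window / bin_size) matrix."""
    rows = []
    for name in TRACK_ORDER:
        signal = tracks[name]
        if len(signal) != window:
            raise ValueError(f"track {name} must cover exactly {window} bp")
        rows.append(bin_signal(signal, bin_size))
    return rows

# Hypothetical signals: a ramp for ATAC-seq, constants for the other tracks
tracks = {
    "ATAC": [float(i) for i in range(4000)],
    "H3K27ac": [1.0] * 4000,
    "H3K4me3": [0.0] * 4000,
    "RNA": [2.0] * 4000,
}
matrix = build_input_matrix(tracks)
```

Because each column summarizes 10 bp, any boundary inferred from the matrix can be no finer than 10 bp, which is the resolution limit the paragraph above refers to.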

Construction of the prediction model in pigs

We constructed a CNN-based binary classifier with reference to ResNet³² and DECODE¹⁶, including convolutional layers, pooling layers and dense fully connected layers. The convolutional layers were responsible for comprehensively extracting features from the multi-omics data. Squeeze-and-Excitation blocks, which calculate residual features and adaptively recalibrate channel-wise feature responses, were placed between the convolutional layers³³. The max-pooling layers reduced the number of parameters passed on from the preceding convolutional layer while preserving important feature information. The pooled features were then fed into the fully connected layer, which made a sigmoid prediction of the probability that the region contains an enhancer.
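The channel recalibration performed by a Squeeze-and-Excitation block can be sketched in plain Python. This follows the generic formulation of Hu et al.³³ (global average pooling, two dense layers with ReLU and sigmoid, channel-wise scaling); the weights, layer sizes and function names below are hypothetical and are not taken from CNNEE.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def squeeze_excite(feature_maps, w1, b1, w2, b2):
    """Recalibrate 1D feature maps (one list per channel).

    Squeeze: average each channel over positions.
    Excite: dense -> ReLU -> dense -> sigmoid gives one gate per channel.
    Scale: multiply every channel by its gate.
    """
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    hidden = [relu(h) for h in dense(squeezed, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    scaled = [[g * x for x in ch] for g, ch in zip(gates, feature_maps)]
    return scaled, gates

# Hypothetical example: 2 channels, bottleneck of size 1, zero weights
maps = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
scaled, gates = squeeze_excite(maps, w1=[[0.0, 0.0]], b1=[0.0],
                               w2=[[0.0], [0.0]], b2=[0.0, 0.0])
```

With all-zero weights every gate is sigmoid(0) = 0.5, so each channel is simply halved; trained weights would instead up-weight informative channels and down-weight the rest.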

In this study, we constructed six models. Model 1: 6 convolutional layers, 1 max-pooling layer, 1 fully connected layer; Model 2: 7 convolutional layers, 2 max-pooling layers, 1 fully connected layer; Model 3: 6 convolutional layers, 2 max-pooling layers, 1 fully connected layer; Model 4: 7 convolutional layers, 3 max-pooling layers, 1 fully connected layer; Model 5: 1 convolutional layer, 1 max-pooling layer, 1 fully connected layer; Model 6: 5 convolutional layers, 1 max-pooling layer, 1 fully connected layer.

The classification threshold for the positive set (threshold > 0.5) established by previous studies^12,14 was used in the present study. To better understand the decision process of the CNN and to narrow the boundaries of candidate enhancer regions, we added the Gradient-weighted Class Activation Mapping (Grad-CAM) method³⁴. Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to assess the importance of each neuron for class recognition, and yields a high-resolution map of the content that is most informative for the target class. We therefore used the global-average-pooled gradients of the positive class to generate an importance score for each activation map produced by the final convolutional layer; these scores were the weights of the linear combination of the corresponding activation maps. We multiplied the activation maps of the final convolutional layer by their respective weights and summed them. We then applied the ReLU function to retain the values that had a positive effect on the classification and to suppress those that had a negative effect. Finally, within the positive set, we filtered out the regions whose importance scores were lower than the average importance score.
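The Grad-CAM steps above reduce to a few lines once the activations and gradients of the final convolutional layer are available. The sketch below assumes 1D feature maps given as plain lists (one per channel) and hypothetical function names; obtaining the gradients from a trained network is outside its scope.

```python
def grad_cam_1d(activations, gradients):
    """Importance score per position from final-layer activations and gradients.

    activations, gradients: one list per channel, all of equal length.
    """
    # Global-average-pool the gradients: one weight per channel
    weights = [sum(g) / len(g) for g in gradients]
    length = len(activations[0])
    # Weighted sum of activation maps, followed by ReLU
    return [max(0.0, sum(w * a[i] for w, a in zip(weights, activations)))
            for i in range(length)]

def keep_informative(cam):
    """Indices of positions whose score is at least the mean score."""
    mean = sum(cam) / len(cam)
    return [i for i, score in enumerate(cam) if score >= mean]

# Hypothetical example with 2 channels and 4 positions
activations = [[0.0, 1.0, 2.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
gradients = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
cam = grad_cam_1d(activations, gradients)
kept = keep_informative(cam)
```

Here the channel weights are 1 and -1, the weighted sum is [-1, 0, 1, 0], ReLU clips the negative value, and only position 2 exceeds the mean score of 0.25, so the candidate region shrinks to that position.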

Enhancers and eRNAs annotation

We used BEDTools (version: 2.26.0)³⁵ to remove the predicted enhancers that overlapped transcription start sites (TSSs), and retained the remainder as the pig enhancers. We then extended ± 3 kb from the midpoint of each enhancer and defined these 6 kb regions as potential eRNA regions, following HeRA²³ and Animal-eRNAdb²⁵. To reduce the effect of known protein-coding genes, we excluded the eRNA regions that overlapped the exons of protein-coding genes. Next, BEDTools (version: 2.26.0)³⁵ was used to calculate the read counts per sample mapped to the eRNA regions, and the read counts were normalized using the trimmed mean of M values from the edgeR package of R (version: 4.1.0). Furthermore, the expression of eRNAs was normalized by the reads per million (RPM) method³⁶. We defined eRNAs with an average RPM ≥ 1 in at least one tissue as detectable eRNAs. eRNAs whose expression in a given tissue was more than 3 times their average expression in the other tissues were defined as tissue-specific eRNAs^37,38.
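The definitions in this subsection can be written down as small functions. This is a sketch under stated assumptions, not the pipeline itself: the function names are hypothetical, the TMM step from edgeR is omitted, the RPM denominator is taken to be the total number of mapped reads, and the fold comparison uses the mean of per-tissue averages.

```python
def erna_region(enh_start, enh_end, flank=3000):
    """6 kb candidate eRNA region: +/- 3 kb around the enhancer midpoint."""
    mid = (enh_start + enh_end) // 2
    return max(0, mid - flank), mid + flank

def rpm(read_count, total_mapped_reads):
    """Reads per million mapped reads (TMM scaling is not applied here)."""
    return read_count * 1e6 / total_mapped_reads

def is_detectable(tissue_mean_rpm, min_rpm=1.0):
    """Average RPM >= 1 in at least one tissue."""
    return any(v >= min_rpm for v in tissue_mean_rpm.values())

def specific_tissues(tissue_mean_rpm, fold=3.0):
    """Tissues where expression exceeds fold x the mean of all other tissues."""
    result = []
    for tissue, value in tissue_mean_rpm.items():
        others = [v for t, v in tissue_mean_rpm.items() if t != tissue]
        if others and value > fold * (sum(others) / len(others)):
            result.append(tissue)
    return result

# Hypothetical values
region = erna_region(10_000, 10_540)
value = rpm(50, 25_000_000)
means = {"liver": 6.0, "heart": 1.0, "lung": 2.0}
detectable = is_detectable(means)
specific = specific_tissues(means)
```

In the example, liver expression (6.0) exceeds three times the mean of heart and lung (1.5), so the eRNA would be labelled liver-specific under these assumptions.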

Comparison of sequence similarity and conservation between pig and human eRNAs

Previous studies have confirmed that enhancers are conserved between pigs and humans³⁹. Therefore, we compared the sequence similarity and the degree of conservation between pig and human eRNAs. First, we downloaded the sequence file of human eRNAs from HeRA²³, and used BEDTools (version: 2.26.0) to extract the sequences of pig eRNAs from the pig reference genome. Next, the similarity of each pig eRNA to all human eRNAs was calculated using blastn, and hits with similarity ≥ 0.5 and expectation value (E-value) ≤ 1e-5 were considered statistically significant. We used LiftOver (minMatch = 0.5)⁴⁰ to screen for pig eRNAs that were sequence-conserved with human eRNAs. After conversion to the human genome version, the pig eRNA regions that overlapped human eRNA regions were defined as functionally conserved with human eRNAs.

Identification of target genes and putative regulators of pig eRNAs

In each tissue, genes that were located close to eRNAs (distance < 1 Mb) and significantly co-expressed with them, as measured by Spearman’s correlation, were defined as potential target genes of eRNAs²³. |Rho| ≥ 0.3 and P-value < 0.05 were considered statistically significant. We performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on the target genes of eRNAs using the clusterProfiler package of R (version: 4.1.0). We collected annotations of pig TFs from AnimalTFDB 3.0 (http://bioinfo.life.hust.edu.cn/AnimalTFDB/)⁴¹, and extracted the expression of TFs in the 607 pig samples. TFs that were significantly co-expressed with eRNAs (|Rho| ≥ 0.3 and P-value < 0.05) were considered potential regulators of eRNAs²³.
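A hedged sketch of the target-gene rule follows: Spearman’s rho is computed as the Pearson correlation of average ranks, and a pair is kept when distance, |Rho| and P-value all pass. Measuring distance from the eRNA midpoint to the gene TSS is an assumption, the function names are hypothetical, and the P-value is taken as an input that would normally come from a statistics package.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)  # undefined if either vector is constant

def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def is_candidate_target(erna_mid, gene_tss, rho, p_value,
                        max_dist=1_000_000, min_abs_rho=0.3, alpha=0.05):
    """Apply the distance, |Rho| and P-value cutoffs stated in the text."""
    return (abs(erna_mid - gene_tss) < max_dist
            and abs(rho) >= min_abs_rho
            and p_value < alpha)

# Hypothetical expression vectors across five samples
erna_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_expr = [1.0, 4.0, 9.0, 16.0, 25.0]
rho = spearman_rho(erna_expr, gene_expr)
```

The same cutoffs on |Rho| and P-value apply to the eRNA-TF pairs, where no distance constraint is stated.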

eRNA-trait analysis

We downloaded the pig quantitative trait loci (QTL) from AnimalQTLdb (https://www.animalgenome.org/cgi-bin/QTLdb/SS/index), and the eRNAs located within 2 Mb of QTL regions were denoted as QTL-associated eRNAs. QTL enrichment was tested with Fisher’s exact test using an in-house R script, and results with P-value ≤ 0.05 and relative enrichment > 1 were considered significant.
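The in-house R script is not reproduced in the text, so the following is only an illustrative sketch of an enrichment test on a 2 × 2 table. It assumes a one-sided (greater) Fisher’s exact test and uses the sample odds ratio as the "relative enrichment"; both choices, the table layout and the function names are assumptions.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided P(X >= a) for the table [[a, b], [c, d]] with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(row1, k) * comb(n - row1, col1 - k)
                     for k in range(a, upper + 1))
    return favourable / comb(n, col1)

def odds_ratio(a, b, c, d):
    """Sample odds ratio (undefined when b or c is zero)."""
    return (a * d) / (b * c)

def is_enriched(a, b, c, d, alpha=0.05):
    """P-value <= 0.05 and enrichment > 1, as in the text."""
    return fisher_greater(a, b, c, d) <= alpha and odds_ratio(a, b, c, d) > 1

# Hypothetical layout:
# a = tissue-specific eRNAs near QTLs of a trait, b = tissue-specific eRNAs elsewhere,
# c = other eRNAs near those QTLs,                d = other eRNAs elsewhere
p_small = fisher_greater(3, 1, 1, 3)
enriched = is_enriched(30, 10, 20, 40)
```

Note that R's `fisher.test` reports a conditional maximum-likelihood estimate of the odds ratio, which generally differs slightly from the sample odds ratio used in this sketch.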

Housekeeping and tissue-specific eRNAs

Considering the temporal activity of enhancers, we further annotated the tissue-specific eRNAs and housekeeping eRNAs. Constitutively expressed eRNAs (RPM ≥ 1) that were expressed in all tissues were defined as HKeRNAs, according to previous studies^38,42. The HKeRNAs were further classified into three groups using the coefficient of variation (CV). Specifically, HKeRNAs with CV ≤ first quartile were defined as having low expression variability, those with CV > first quartile and < third quartile as having medium variability, and those with CV ≥ third quartile as having high variability. To demonstrate the reliability of our eRNAs, we used the results in PigGTEx (http://piggtex.farmgtex.org/) to verify the target genes of HKeRNAs and tissue-specific eRNAs. eRNAs whose expression in a given tissue was more than 3 times their average expression in the other tissues were defined as tissue-specific eRNAs^37,38.
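The housekeeping rule and the quartile-based CV grouping can be sketched as follows. The use of the population standard deviation, the default quartile method of Python's `statistics.quantiles`, and the function names are assumptions of this sketch; the source specifies only the RPM ≥ 1 rule and the quartile cutoffs.

```python
import statistics

def is_housekeeping(tissue_mean_rpm, min_rpm=1.0):
    """Expressed (RPM >= 1) in every tissue."""
    return all(v >= min_rpm for v in tissue_mean_rpm.values())

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD assumed)."""
    return statistics.pstdev(values) / statistics.fmean(values)

def classify_by_cv(cv_by_erna):
    """Label each HKeRNA as low, medium or high variability by CV quartiles."""
    q1, _median, q3 = statistics.quantiles(list(cv_by_erna.values()), n=4)
    groups = {}
    for erna, cv in cv_by_erna.items():
        if cv <= q1:
            groups[erna] = "low"
        elif cv >= q3:
            groups[erna] = "high"
        else:
            groups[erna] = "medium"
    return groups

# Hypothetical CVs for seven housekeeping eRNAs
cvs = {"e1": 0.1, "e2": 0.2, "e3": 0.3, "e4": 0.4,
       "e5": 0.5, "e6": 0.6, "e7": 0.7}
groups = classify_by_cv(cvs)
```

Different quartile conventions (for example, R's default `quantile` type) can move borderline eRNAs between groups, so the exact group sizes depend on the method chosen.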

Data Records

The dataset is available at Figshare:

File 1: This file contains the coordinate of enhancers. The column headings are chromosome number, start site, end site and enhancer ID of enhancer regions. This file can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 2: The file contains the eRNA regions of 15 tissues in pigs. The column headings are the eRNA id, chromosome number, start site, end site and enhancer id. This information can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 3: The file contains the sequence similarity analysis between pig eRNAs and human eRNAs. The column headings are pig species, pig eRNA id, reference species, reference eRNA id, identify, evalue, the chromosome number, middle. This information can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 4: A zip-file compressed tar archive contains the correlation between eRNAs and their target genes in pigs. The column headings are eRNA id, gene id, gene name, Rho and FDR. This information can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 5: A zip-file compressed archive contains the correlation between eRNAs and regulators in pigs. The column headings are eRNA id, TF id, TF name, Rho and FDR. This information can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 6: This file contains the tissue-specific eRNAs of 15 tissues in pigs. The column headings are tissue-specific eRNA id, chromosome number, start site and end site. The file can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 7: A zip-file compressed archive contains the correlation between the tissue-specific eRNAs and their target genes in 15 tissues of pigs. The column headings are eRNA id, gene id, gene name, Rho and FDR. This information can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 8: This file contains the list of Housekeeping eRNAs with CV. This file can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

File 9: The file contains the association analysis between eRNAs and QTLs. The column headings are enriched trait, eRNA id, P-value and estimate. This file can be found in (https://doi.org/10.6084/m9.figshare.22923353)⁴³.

Technical Validation

CNNEE implements two functions: enhancer prediction and eRNA identification in pigs

The enhancer prediction module

To accurately identify enhancers in pigs, a CNN strategy was used to construct a binary classifier that characterizes enhancers over a 4 kb sliding window. ATAC-seq peaks overlapping a transcript peak and the peaks of active histone marks (H3K27ac and H3K4me3) were defined as active enhancers and positive training regions. The remaining regions were considered negative training regions (see Methods). A total of six models with different numbers of convolutional and pooling layers were trained on the positive and negative regions (see Methods) (Fig. 2a). To further refine the boundaries of the candidate regions by adjusting the weights, the Grad-CAM method was added to CNNEE (Fig. 2b). To validate the results, we performed five-fold cross-validation by dividing the data into five folds and using each fold in turn as an out-of-sample validation set. Model 6 achieved the best metrics, with Accuracy: 0.9983, Precision: 0.9474, Recall: 0.9388 and F1 score: 0.9417 (Fig. 2c), and was selected to predict enhancers in pigs. The accuracy of CNNEE was higher than that of REPTILE¹⁴ (0.9550) and DECODE¹⁶ (0.9897). Moreover, when the transcriptomic data were removed from CNNEE, the F1 score decreased to 0.9032 and the recall decreased to 0.8786. This observation indicated that the transcriptomic data improved the accuracy of enhancer prediction. To further investigate the accuracy and resolution of the CNNEE pipeline, we applied it to human and mouse liver data from ENCODE⁴⁴ (Supplemental Table S2), and found that CNNEE achieved Accuracy: 0.9976 and Precision: 0.9437 for mice, and Accuracy: 0.9966 and Precision: 0.9435 for humans. Notably, 70.7% (humans) and 71.3% (mice) of the enhancers predicted by CNNEE (Figshare File 1) overlapped and were conserved with the enhancers reported in ENCODE for humans and mice, respectively. These results indicated that the CNNEE pipeline is broadly applicable in mammals, suggesting that the enhancers predicted in pigs were reasonable and plausible.
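For readers who want to relate the reported metrics to confusion-matrix counts, the sketch below gives the standard definitions and a simple five-fold index split. The counts are hypothetical and the function names are illustrative; the fold assignment used by CNNEE is not described in the text.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists; each fold is validated once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Hypothetical counts for one validation fold
metrics = classification_metrics(tp=8, fp=2, tn=88, fn=2)
splits = list(k_fold_indices(10, k=5))
```

With a 1:10 ratio of positive to negative regions, accuracy is dominated by the negative class, which is why precision, recall and F1 are more informative when comparing models.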

The eRNA identification module

To depict the atlas of eRNAs in pigs, we collected 607 RNA-seq samples from 15 tissues including hearts, livers, spleens, backfat, gallbladder, jejunums, kidneys, lungs, ovaries, pituitaries, muscles, Sk-Muscles, Ld-Muscles, pineal and ileum. We defined the eRNAs with average RPM ≥ 1 in at least one tissue as the detectable eRNAs (Figshare File 2). In addition, about 81.4% of pig eRNAs were sequence-conserved with human eRNAs in HeRA²³ (Figshare File 3).

Validation of the enhancers and eRNAs functional characteristics

The CpG density of the central region of enhancers was markedly higher than that of the flanking regions (Fig. 3a), and 84.7% of the enhancers were less than 50 kb away from the nearest TSS (Fig. 3b). The average length of candidate enhancers was refined to 539 bp, which was 31% shorter than that reported by Zhao et al.³⁹ and 19% shorter than that reported by Pan et al.⁴⁵. To further validate the enhancers, we downloaded the H3K4me1 peak files of porcine livers from Pan et al.⁴⁵, which were not used to train CNNEE, and found that 83.5% of the enhancers predicted by CNNEE overlapped H3K4me1 marks. To verify the conservation of the pig enhancers, we compared pig and human enhancers. The results showed that 83.8% of the enhancer regions were sequence-conserved with the human genome (hg38) (LiftOver, minimum match = 0.5), and 82.0% of the sequence-conserved enhancers were functionally conserved with human enhancers of ENCODE, supporting the reliability of the enhancers characterized in pigs.

We then analyzed the differences between conserved and non-conserved enhancers in terms of sequence and function. The CpG densities of conserved enhancers were significantly higher than those of non-conserved enhancers (P-value = 0.0020) (Supplemental Figure S1). The average number of target genes (Figshare File 4) regulated by conserved enhancers was significantly lower than that of non-conserved enhancers (P-value < 2.2e-16), and the average number of TFs (Figshare File 5) that regulated the expression of eRNAs was significantly higher for conserved enhancers than for non-conserved enhancers (P-value = 0.026). Moreover, we found that the tissue-specific eRNAs were significantly enriched in non-conserved enhancers (P-value < 2.2e-16, relative enrichment = 1.17).

To validate the biological functions of eRNAs across pig tissues, we determined the tissue-specific eRNAs in 15 tissues (Figshare File 6) and performed GO and KEGG enrichment analyses of the target genes (Figshare File 7) of tissue-specific eRNAs. Tissue-specific eRNAs from tissues with similar physiological functions were more likely to cluster together, as revealed by tSNE and heatmap cluster analysis (Fig. 4a,b, respectively). Supplemental Figure S2 displays several significantly enriched GO terms (P-value ≤ 0.05) for tissue-specific eRNAs that correspond to the known biological functions of the respective tissues. For example, the ovary-specific eRNAs were enriched in female gonad development (GO: 0008585), development of primary female sexual characteristics (GO: 0046545), female sex differentiation (GO: 0046660), ovulation cycle process (GO: 0022602) and development of primary male sexual characteristics (GO: 0046546). Notably, we found that the conserved enhancers were likely to maintain fundamental biological functions, whereas the non-conserved enhancers appeared to maintain the tissue-specific functions of livers (Supplemental Figure S3). Moreover, we classified HKeRNAs into three groups with low, medium and high expression variability using the first and third quartiles of CV as thresholds (Figshare File 8). Supplemental Figure S4 summarizes the GO and KEGG enrichment analyses of HKeRNAs with low and medium expression variability; the results show that the target genes are mainly involved in the basic biological activities of the organism, e.g., chromatin silencing (GO:0006342), regulation of gene expression, epigenetic (GO:0040029) and the Notch signaling pathway (ssc04330). We used PigGTEx (http://piggtex.farmgtex.org) to query the expression of the target genes, and 63.2% of the target genes of HKeRNAs were stably expressed in PigGTEx. For example, SNRPC, IDH2, EMC1 and POP7 were target genes of HKeRNAs, and Supplemental Figure S5 shows their expression in 34 tissues of PigGTEx.

To validate the relationship between tissue-specific eRNAs and phenotypes, we downloaded publicly available pig QTLs from AnimalQTLdb (https://www.animalgenome.org/cgi-bin/QTLdb/SS/index). The eRNAs located within 2 Mb of QTL regions were denoted as QTL-associated eRNAs for 670 traits. We found that the eRNAs were significantly enriched in the QTL regions (P-value < 2.2e-16, relative enrichment = 3.36), suggesting that eRNAs play strong regulatory roles. In total, 652 traits were associated with these tissue-specific eRNAs (Figshare File 9).

Code availability

All CNNEE code for enhancer prediction and eRNA identification is publicly available at https://github.com/WangYF33/CNNEE.

References

Cramer, P. Organization and regulation of gene transcription. Nature 573, 45–54 (2019).
Article ADS CAS PubMed Google Scholar
Lee, T. I. & Young, R. A. Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251 (2013).
Article CAS PubMed PubMed Central Google Scholar
Andersson, R. & Sandelin, A. Determinants of enhancer and promoter activities of regulatory elements. Nat Rev Genet 21, 71–87 (2020).
Article CAS PubMed Google Scholar
Banerji, J., Rusconi, S. & Schaffner, W. Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences. Cell 27(22 Pt 21), 299 (1981).
Article CAS PubMed Google Scholar
Dao, L. T. M. & Spicuglia, S. Transcriptional regulation by promoters with enhancer function. Transcription 9, 307–314 (2018).
Article CAS PubMed PubMed Central Google Scholar
Birnbaum, R. Y. et al. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 22, 1059–1068 (2012).
Article CAS PubMed PubMed Central Google Scholar
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Article ADS Google Scholar
Kawaji, H., Kasukawa, T., Forrest, A., Carninci, P. & Hayashizaki, Y. The FANTOM5 collection, a data series underpinning mammalian transcriptome atlases in diverse cell types. Sci Data 4, 170113 (2017).
Article CAS PubMed PubMed Central Google Scholar
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Article Google Scholar
Ramisch, A. et al. CRUP: a comprehensive framework to predict condition-specific regulatory units. Genome Biol 20, 227 (2019).
Article PubMed PubMed Central Google Scholar
Liu, F., Li, H., Ren, C., Bo, X. & Shu, W. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Sci Rep 6, 28517 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
He, Y. et al. Improved regulatory element prediction based on tissue-specific local epigenomic signatures. Proc Natl Acad Sci USA 114, E1633–E1640 (2017).
Article CAS PubMed PubMed Central Google Scholar
Lu, Y., Qu, W., Shan, G. & Zhang, C. DELTA: A Distal Enhancer Locating Tool Based on AdaBoost Algorithm and Shape Features of Chromatin Modifications. PLoS One 10, e0130622 (2015).
Article PubMed PubMed Central Google Scholar
Chen, Z. et al. DECODE: a Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays. Bioinformatics 37, i280–i288 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sethi, A. et al. Supervised enhancer prediction with epigenetic pattern recognition and targeted validation. Nat Methods 17, 807–814 (2020).
Article CAS PubMed PubMed Central Google Scholar
Kim, T. K. et al. Widespread transcription at neuronal activity-regulated enhancers. Nature 465, 182–187 (2010).
Article ADS CAS PubMed PubMed Central Google Scholar
Arner, E. et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014 (2015).
Article ADS CAS PubMed PubMed Central Google Scholar
Wu, H. et al. Tissue-specific RNA expression marks distant-acting developmental enhancers. PLoS Genet 10, e1004610 (2014).
Article PubMed PubMed Central Google Scholar
Lee, J. H. et al. Enhancer RNA m6A methylation facilitates transcriptional condensate formation and gene activation. Mol Cell 81, 3368–3385 e3369 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zhou, Y., Xu, S., Zhang, M. & Wu, Q. Systematic functional characterization of antisense eRNA of protocadherin alpha composite enhancer. Genes Dev 35, 1383–1394 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zhang, Z. et al. HeRA: an atlas of enhancer RNAs across human tissues. Nucleic Acids Res 49, D932–D938 (2021).
Article ADS CAS PubMed Google Scholar
Zhang, Z. et al. Transcriptional landscape and clinical utility of enhancer RNAs for eRNA-targeted therapy in cancer. Nat Commun 10, 4562 (2019).
Article ADS PubMed PubMed Central Google Scholar
Jin, W. et al. Animal-eRNAdb: a comprehensive animal enhancer RNA database. Nucleic Acids Res 50, D46–D53 (2022).
Article CAS PubMed Google Scholar
Zhang, Z. et al. Genetic, Pharmacogenomic, and Immune Landscapes of Enhancer RNAs Across Human Cancers. Cancer Res 82, 785–790 (2022).
Article CAS PubMed Google Scholar
Joan, K. et al. Importance of the pig as a human biomedical model. Sci Transl Med 24(621), 13 (2021).
Google Scholar
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Article PubMed PubMed Central Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357–360 (2015).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016).
Hu, J., Shen, L., Albanie, S., Sun, G. & Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018).
Selvaraju, R. R. et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. International Journal of Computer Vision 128, 336–359 (2019).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–628 (2008).
She, X. et al. Definition, conservation and epigenetics of housekeeping and tissue-enriched genes. BMC Genomics 10, 269 (2009).
Zhang, T. et al. Transcriptional atlas analysis from multiple tissues reveals the expression specificity patterns in beef cattle. BMC Biol 20, 79 (2022).
Zhao, Y. et al. A compendium and comparative epigenomics analysis of cis-regulatory elements in the pig genome. Nat Commun 12, 2217 (2021).
Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res 47, D853–D858 (2019).
Hu, H. et al. AnimalTFDB 3.0: a comprehensive resource for annotation and prediction of animal transcription factors. Nucleic Acids Res 47, D33–D38 (2019).
Zhu, J., He, F., Hu, S. & Yu, J. On the nature of human housekeeping genes. Trends Genet 24, 481–484 (2008).
Wang, Y., Jin, W., Pan, X. & Yuan, X. Pig-eRNAdb: a comprehensive enhancer and eRNA dataset of pigs. figshare https://doi.org/10.6084/m9.figshare.22923353 (2023).
ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Pan, Z. et al. Pig genome functional annotation enhances the biological interpretation of complex traits and human disease. Nat Commun 12, 5848 (2021).

Acknowledgements

This research was funded by the National Key R&D Program of China (2022YFF1000900), the Local Innovative and Research Teams Project of Guangdong Province (2019BT02N630), the earmarked fund for the China Agriculture Research System (CARS-35), the Key R&D Program of Guangdong Province Project (2022B0202090002), and the Breed Industry Innovation Park of Guangdong Xiaoerhua Pig (2022‐4408X1-43010402-0019). We thank the National Supercomputer Center in Guangzhou for providing its computing platform.

Author information

These authors contributed equally: Yifei Wang, Weiwei Jin, Xiangchun Pan.

Authors and Affiliations

Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, Guangdong Laboratory of Lingnan Modern Agriculture, National Engineering Research Center for Breeding Swine Industry, State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science, South China Agricultural University, Guangzhou, 510642, China
Yifei Wang, Xiangchun Pan, Weili Liao, Qingpeng Shen, Jiali Cai, Wentao Gong, Yuhan Tian, Dantong Xu, Yipeng Li, Jiaqi Li, Zhe Zhang & Xiaolong Yuan
Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
Weiwei Jin & Jing Gong

Authors

Yifei Wang
Weiwei Jin
Xiangchun Pan
Weili Liao
Qingpeng Shen
Jiali Cai
Wentao Gong
Yuhan Tian
Dantong Xu
Yipeng Li
Jiaqi Li
Jing Gong
Zhe Zhang
Xiaolong Yuan

Contributions

X.Y. and Z.Z. conceived the study. Y.W. constructed the database. W.J. developed the website. X.P. and Y.W. performed the data analysis. W.L., Q.S. and J.C. collected the data. Y.W., Q.S. and Y.T. prepared the figures. Y.W. wrote the original manuscript. Y.W., X.P. and W.G. revised the original manuscript. J.C., Y.L. and D.X. supplied the Supplementary Information. J.L. and J.G. managed laboratory work and supervised the project.

Corresponding authors

Correspondence to Zhe Zhang or Xiaolong Yuan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figure

Supplemental Table S1

Supplemental Table S2

Supplemental Table S3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, Y., Jin, W., Pan, X. et al. Pig-eRNAdb: a comprehensive enhancer and eRNA dataset of pigs. Sci Data 11, 157 (2024). https://doi.org/10.1038/s41597-024-02960-7

Received: 31 May 2023
Accepted: 11 January 2024
Published: 01 February 2024
DOI: https://doi.org/10.1038/s41597-024-02960-7