PEGS: An efficient tool for gene set enrichment within defined sets of genomic intervals

Many biological studies of transcriptional control mechanisms produce lists of genes and non-coding genomic intervals from corresponding gene expression and epigenomic assays. In higher organisms, such as eukaryotes, genes may be regulated by distal elements, with these elements lying 10s–100s of kilobases away from a gene transcription start site. To gain insight into these distal regulatory mechanisms, it is important to determine comparative enrichment of genes of interest in relation to genomic regions of interest, and to be able to do so at a range of distances. Existing bioinformatics tools can annotate genomic regions to nearest known genes, or look for transcription factor binding sites in relation to gene transcription start sites. Here, we present PEGS ( Peak set Enrichment in Gene Sets). This tool efficiently provides an exploratory analysis by calculating enrichment of multiple gene sets, associated with multiple non-coding elements (peak sets), at multiple genomic distances, and within topologically associated domains. We apply PEGS to gene sets derived from gene expression studies, and genomic intervals from corresponding ChIP-seq and ATAC-seq experiments to derive biologically meaningful results. We also demonstrate an extended application to tissue-specific gene sets and publicly available GWAS data, to find enrichment of sleep trait associated SNPs in relation to tissue-specific gene expression profiles.


Introduction
Gene expression control in higher organisms is achieved through a complex hierarchical process involving opening of chromatin, histone modifications, and binding of transcription factors (TFs). Experimental approaches to understand transcriptional regulatory mechanisms in a biological context involve large-scale measurement of gene expression. Depending on the design of the experiment, these analyses produce differentially expressed gene sets or clusters for further analysis. These studies are often complemented by assays which map, on a genome-wide scale, TF binding sites (ChIP-seq) or regions of chromatin accessibility (DNase-seq, ATAC-seq). Analyses of these data produce a collection of genomic intervals (peak sets). An important computational task is then to integrate these data to produce meaningful results; i.e. to relate gene sets to peak sets to aid functional interpretation. Bearing in mind distal regulation, an important consideration here is to be able to calculate gene set enrichment at multiple genomic distances from peak sets, and to be able to do this efficiently within the same analysis.
We present a new tool -PEGS (Peak set Enrichment in Gene Sets) 1 -which calculates mutual enrichment of multiple gene sets associated with multiple peak sets, simultaneously and efficiently. This can be at user-defined peak-to-TSS (transcription start site) distances, as well as constraining to topologically associated domains (TADs). Thus, PEGS quickly produces an overall picture of gene set enrichment in relation to peaks, and shows at what genomic distances this is most pronounced. It is applicable to gene sets derived from any source, and peak sets derived from different epigenomic assays, as well as single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWAS).

Architecture and implementation
In PEGS, input peaks are extended in both directions using user-provided genomic distances or constrained within known TAD boundaries, if provided ( Figure 1). Subsequently, the enrichment of the input gene set is calculated among the genes whose TSSs overlap with the extended peaks, separately for each genomic distance and/or TADs. These tasks are performed in PEGS as follows: 1. Creating a gene interval file in BED (Browser Extensible Data) format for all TSSs in the given genome using refGene from UCSC Table Browser. This reference TSSs BED file only needs to be created once (human hg38 and mouse mm10 are provided with the tool; a utility is provided to create these for other genomes).
2. For a given peak set, peaks are extended to a specified genomic distance in both directions (and up to overlapping TAD boundaries, if provided). Intersection of these extended peaks with the gene intervals BED file from step 1 is calculated using BEDTools (RRID:SCR_006646) 2 . This leads to a gene set with TSSs overlapping with extended peaks.
3. Using the intersection of the input gene set, and unique genes from step 2 (thus removing genes with multiple TSSs), a Hypergeometric test is performed to calculate the p-value using Equation 1, similar to GREAT (RRID:SCR_00580) 3 . Here, M is the total number of genes in the genome, N c is the number of genes in the input cluster/set c, N p is the number of unique genes overlapping the peaks for given distance and n pc is the intersection of two gene sets.

Amendments from Version 1
In this version, in response to reviewer's feedback, we have made following changes: We have updated our software package to a newer version (0.6.3) with improved functionality in terms of command line options for input and output files. We have updated the documentation accordingly. We have also uploaded updated versions of all three figures and improved the text throughout the manuscript.
Any further responses from the reviewers can be found at the end of the article Step 2 and 3 are repeated for every combination of gene cluster, peak set and genomic distance and/or TADs. The final combined heatmap shows −log 10 of the resulting p-values.
PEGS is implemented in Python 3, where we have reused functions from existing Python packages included with Python distributions, or available from the Python Package Index (PyPI). We also make use of BEDTools 2 for working with genomic intervals. We provide online documentation (https://pegs.readthedocs.io/en/latest/), and an example analysis with input data at the PEGS GitHub repository.
Operation PEGS works with Python >= 3.6 and, when installed through pip, automatically installs all the dependencies. These are listed in requirements.txt file in our PEGS GitHub repository. We provide extensive documentation online at https://pegs. readthedocs.io which includes easy-to-follow instructions about: •

Use cases
Here, we present three use cases where we apply PEGS to different publicly available data sets. The format of input files is the same for all use cases below. Gene clusters are provided as text files with one gene symbol on each line; genomic region coordinates are provided in standard BED format. These input files for Use Case 1, as well as example analysis reproducing Figure 2A, are provided in our GitHub repository (https://github. com/fls-bioinformatics-core/pegs).

Use case 1: Application of PEGS provides insight into glucocorticoid-mediated gene regulation in mouse liver
The first application ( Figure 2A) uses the gene sets consisting of putative targets of glucocorticoid receptor (GR) in mouse liver. These are up-and down-regulated genes obtained by an RNA-seq study of liver samples from mice treated acutely with synthetic glucocorticoid dexamethasone or vehicle 4 . Corresponding GR ChIP-seq and chromatin accessibility data (DNase I hypersensitive (DHS) regions) were obtained from 5, and 6 respectively, whilst the mouse liver TAD boundaries were obtained from 7. Raw published datasets were downloaded from GEO Sequence Read Archive (RRID:SCR_005012) using Figure 2. PEGS applications: (A) gene expression, ChIP-seq and DNase I data on mouse liver; upper two panels correspond to GR ChIP-seq and DNase peaks expanded to different genomic distances while the bottom panel shows both GR ChIP-seq and DNase peaks expanded to overlapping TAD boundaries (B) gene clusters derived from scRNA-seq and intergenic putative enhancer clusters from bulk ATAC-seq from three matching early stem cell differentiation time-points. In both plots, numbers in the cells show common genes among the input genes (x-axis) and genes overlapping with expanded peaks (y-axis) and the colour shows −log 10 of p-value (Hypergeometric test). sratoolkit v2.9.2 (http://ncbi.github.io/sra-tools/). Reads were aligned to the reference genome (mouse mm10), sorted and indexed using Bowtie2 (v2.3.4.3, RRID:SCR_005476, 8 ) and SAMtools (v1.9, RRID:SCR_002105, 9 ). MACS2 (v2.1.1.20160309, RRID:SCR_013291, 10 ) was used to call peaks, using default settings. Using these GR ChIP-seq peaks and DHSs as peak sets, PEGS analysis shows strong association of dexamethasone up-regulated genes with dexamethasone-induced GR peaks at distances up to 500kbp from these peaks, but no enrichment of down-regulated genes ( Use case 2: PEGS demonstrates association of differential chromatin accessibility and gene expression during embryonic stem cell differentiation Next, using PEGS, we calculated enrichment of gene clusters derived from single-cell RNA-seq and open chromatin regions defined by bulk ATAC-seq at three matching time points (ESCsembryonic stem cells, day1 EpiLCs -epiblast-like cells, day2 EpiLCs) during early embryonic stem cell differentiation 11 . Early embryonic development (naïve mouse ESCs to EpiLCs) involves large changes in the chromatin landscape through the action of many transcription factors and chromatin regulators leading to specific gene expression programs. Using our publicly available data 11 , we defined our open chromatin peak sets as the intergenic regions with differential accessibility between any two time points. These were clustered into four profiles based on z-score of tag densities, as described in 11. Similarly, differentially expressed genes were identified from pseudobulk gene expression data at each time point, and were similarly clustered into four patterns across three time points. These constituted our gene sets for PEGS analysis. As shown in Figure 2B, application of PEGS to these data shows strong association between the matching gene expression (x-axis) and chromatin opening profiles (y-axis) at intergenic enhancers, reflecting correspondence between differential accessibility and gene expression changes at corresponding time points. This application shows the utility of PEGS for integrating chromatin accessibility and gene expression data, leading to biologically meaningful association of enhancer and gene clusters.
Use case 3: Extended application: PEGS detects enrichment of sleep trait SNPs in tissue-specific genes Genome-wide association studies (GWAS) are commonly employed to study genotype-disease associations. Here, we present an extended application of PEGS to GWAS data and find associations of SNPs for different sleep phenotypes with sets of tissue-specific genes from the Genotype-Tissue Expression (GTEx) Portal, RRID:SCR_013042). For this purpose, we downloaded GWAS data from the Sleep Disorder Knowledge Portal (RRID:SCR_016611) which provides data, as well as analysis and visualisation resources for human genetic information regarding sleep and related traits. We extracted SNPs (single nucleotide polymorphisms) for certain sleep associated phenotypes (with genome wide p-value cutoff <=5 e − 8). These SNPs constitute our input peak sets to PEGS, while we defined corresponding gene sets as tissue-specific genes from GTEx portal. These were created as following; we obtained median transcripts per million (TPM) data for different tissues in GTEx, and a gene list for a specific tissue was defined as genes with greater than 5x median TPM compared to the average in the remaining tissues.
In Figure 3, using PEGS, we show enrichment of SNPs from three sleep related phenotypes, namely chronotype, daytime sleepiness adjusted for BMI, and sleep duration. These enrichments are calculated for tissue-specific genes lists created from GTEx for 22 tissues, the majority of them from the brain. Application of PEGS to these data reveals some strong associations, e.g. chronotype SNPs strongly enriched for genes expressed in liver and blood, while daytime sleepiness SNPs are enriched in gene sets for different brain tissues. Some of these associations are reported in the literature, e.g. daytime sleepiness SNPs in brain tissue 12 , others may warrant further investigation.

Conclusions
Through the three different applications above, we demonstrate that PEGS is a versatile and highly efficient tool to integrate different genomic data, and is able to generate hypotheses for further analysis. The implementation of PEGS is highly efficient and as an example of computational efficiency, with pre-created reference TSS files, it only took 7.6 seconds to produce the output for Figure 2A on a laptop with Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor with 16GB RAM.
Furthermore, the user can adjust the background population and control for bias. For example, depending on the scientific question at hand, the background population could be limited to include only those genes known to be expressed in the tissue of interest. The efficiency of PEGS allows multiple gene and peak input files (e.g. with varying false discovery rate or fold-change cut-offs) to be tested quickly.
PEGS analysis is limited to enhancer-genes associations based on genomic proximity. It builds on some aspects of, and is complementary to, GREAT 3 , an existing tool, which performs functional enrichment of regulatory regions using annotations of nearby genes. PEGS could also be used in conjunction with other tools to gain further mechanistic understanding (e.g. by finding enriched transcription factors with TFEA.ChIP 13 , ranking of their target genes with Cistrome-GO 14 or BETA 15 , or predicting which TFs might regulate differentially expressed gene sets with Lisa 16 ).

Data availability
All data underlying the results are available as part of the article or available publicly.

Version 1
The tool itself is very useful but it lacks several key options to give users the flexibility to customize the input data and also the output heatmap.
I have the following comments for the authors to address: It is useful to restrict peak-gene association within the TAD boundaries, but it is not the case that all the interactions, such as enhancer-gene interactions occur within the TAD boundaries. The enhancer-gene communication can also occur outside topological domains or in-between TADs. Do authors plan to provide an optional feature to integrate chromatin interaction data, such as HI-C? 1.
The command-line tool can be further improved by providing additional options to improve user experience and its usage. Below are some recommendations. Currently, the peaks sets and gene lists inputs arguments are positional and the tools can only scan files available in the provided folders. Instead of looking into provided folders for BEDs files and gene lists, the argument should also allow chaining a list of bed files with a path. This is because in real analysis scenarios BEDs can be spread across multiple folders or a single folder can have other visible/hidden files. For example, I was testing the tool on a Mac machine, and PEGS started processing peaks for .DS_Store, which is the default directory structure and a hidden file.

○
The tool arguments could be: pegs --peaks peaks/*.bed --genes genes/*.txt and also pegs --peaks A.bed B.bed --genes A.txt B.txt ○ The output heatmap should also have an option to generate vector-based plots, such ○ 2.
as PDF or SVG.
Authors may consider adding additional options to adjust the heatmap, such as setting labels, dimensions, colors, and gene/peak set names.

Figures can be further improved. 3.
Please highlight the limitations of the tool such as the enhancer-gene associations are solely based on proximity.

4.
Providing an installation option through Conda using the bioconda channel (https://bioconda.github.io/) will be useful and it will increase the usage/availability of the tool.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

-It is useful to restrict peak-gene association within the TAD boundaries, but it is not the case that all the interactions, such as enhancer-gene interactions occur within the TAD boundaries. The enhancer-gene communication can also occur outside topological domains or in-between TADs.
Do authors plan to provide an optional feature to integrate chromatin interaction data, such as HI-C?
We would like to emphasise that the genomic distances are not restricted to TAD boundaries. PEGS provides the user with the option of supplying any distances. In addition, we also provide the user with the option to restrict peak expansion to TADs boundaries (if available), as a separate analysis (Fig 2A). We have revised the relevant text and we hope this will address reviewer's main concerns and clarify any confusion. Integration of HiC data is beyond the scope of this work, but we will think of ways to incorporate that in future developments of PEGS. We thank the reviewer as these are very useful suggestions and we have updated PEGS to a new version (please see latest version 0.6.2), which includes all of the above command line options. We also output the heatmap data as an excel file, so the user can customise their heatmaps/plots according to their choice/requirements. We have updated the documentation accordingly.

-Figures can be further improved.
We have improved and updated all of the figures (please see new version of the manuscript) 4 -Please highlight the limitations of the tool such as the enhancer-gene associations are solely based on proximity.
We have updated the manuscript text making it clear that our enrichment calculations are Is the rationale for developing the new software tool clearly explained? Yes

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes multiple distances within and beyond TAD boundaries.
3 -While readable, the resolution of figure 2 is low. I would advise the authors to upload a higher resolution figure.
We agree, and have added a high-resolution version of Fig. 2 4 -For case1, maybe I missed it, but I think it would be interesting to interpret the results with or without TADs. This would allow us to see the impact of TAD annotation on the analysis.
This is related to point 2. In Fig.2A, we do provide the analysis with and without TADs for case 1. Enrichment for multiple distances is at the top two panels, and enrichment calculations using TADs are at the bottom. We have improved Fig 2 to make this clear.

-I think the authors should add more explanations in the text about the results of their three cases.
We have added more text in the manuscript, further explaining the three cases.
Competing Interests: No competing interests were disclosed.
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com