Functional effects of variation in transcription factor binding highlight long-range gene regulation by epromoters

Abstract Identifying DNA cis-regulatory modules (CRMs) that control the expression of specific genes is crucial for deciphering the logic of transcriptional control. Natural genetic variation can point to the possible gene regulatory function of specific sequences through their allelic associations with gene expression. However, comprehensive identification of causal regulatory sequences in brute-force association testing without incorporating prior knowledge is challenging due to limited statistical power and effects of linkage disequilibrium. Sequence variants affecting transcription factor (TF) binding at CRMs have a strong potential to influence gene regulatory function, which provides a motivation for prioritizing such variants in association testing. Here, we generate an atlas of CRMs showing predicted allelic variation in TF binding affinity in human lymphoblastoid cell lines and test their association with the expression of their putative target genes inferred from Promoter Capture Hi-C and immediate linear proximity. We reveal >1300 CRM TF-binding variants associated with target gene expression, the majority of them undetected with standard association testing. A large proportion of CRMs showing associations with the expression of genes they contact in 3D localize to the promoter regions of other genes, supporting the notion of ‘epromoters’: dual-action CRMs with promoter and distal enhancer activity.


INTRODUCTION
Identifying DNA cis-regulatory modules (CRMs) that control the expression of specific genes is crucial for deciphering the logic of transcriptional control and its aberrations. Advances of the last decade have made it possible to predict active CRMs based on chromatin features (1,2) and detect the binding of dozens of transcription factors (TFs) to these regions (3,4). However, deletion of known or predicted CRMs often shows no observable phenotype, suggesting that some CRMs either lack appreciable gene regulatory function or are efficiently buffered by other sequences, at least under normal conditions (5)(6)(7)(8)(9). In addition, the sequence, chromatin state and genomic location of CRMs do not immediately provide information on their target genes (10). Therefore, evidence from complementary approaches is required to establish the function of specific CRMs in transcriptional control.
Natural genetic variation can theoretically provide a direct indication of gene regulatory function by revealing the allelic associations between specific variants and gene expression (11,12). While expression quantitative trait loci (eQTLs) identified this way have provided important insights into gene control and the mechanisms of specific diseases (13,14), a number of challenges hamper comprehensive detection of functional sequences in 'brute-force' eQTL testing (15,16). In particular, the immense search space leads to a heavy multiple testing burden resulting in reduced sensitivity. This problem is typically mitigated in part by testing for 'cis-eQTLs' separately within a limited distance window (∼100 kb); this distance range is, however, an order of magnitude shorter than that of known distal CRM activity (17)(18)(19). In addition, correlation structure arising from linkage disequilibrium (LD) requires dis-entangling causal from spurious associations, which is particularly challenging in the likely scenario, whereby multiple functional variants with modest effects co-exist within the same LD block (20). These challenges provide a strong motivation for incorporating prior knowledge into association testing for identifying causal regulatory variants.
The recruitment of TFs to CRMs plays a key role in the regulatory function of these elements (21,22), and mutations leading to perturbed TF binding are known to underpin developmental abnormalities and disease susceptibility (18,23,24). Therefore, sequence variation affecting TF binding affinity at CRMs has a strong potential to have causal influence on their function and can provide insights into the logic of gene control. Variation in TF binding across multiple individuals has been assessed directly for several TFs (25)(26)(27)(28)(29)(30), but high resource requirements of these analyses limit the number of TFs and individuals profiled this way. Alternatively, the effects of local sequence variation on TF binding can be predicted, at least in part, based on prior information regarding the TFs' DNA binding preferences. The representation of such preferences in the form of position weight matrices (PWMs) (31) has proven particularly useful, as it provides a quantitative measure of how much a given sequence substitution is likely to perturb TF binding consensus. Consistent with this, we and others have previously shown that the specificity of TF binding preferences to a given motif position correlates with the functional constraint of the underlying DNA sequences, both within and across species (32)(33)(34). Classic PWM-based approaches to TF binding prediction focused on identifying short sequences showing a non-random fit to the PWM model compared with background (35,36). More recently, biophysical modelling of TF binding affinity (37,38) has provided a natural framework to extend this analysis by integrating over all PWM match signals within a DNA region (39,40), including those from lower affinity sites that are a known feature of many functional CRMs (41)(42)(43).
Long-range CRMs such as gene enhancers commonly act on their target promoters through DNA looping interactions (44,45). Therefore, information on 3D chromosomal organization enables predicting the putative target genes of these elements (46,47) and thus has the potential to significantly improve the functional interpretation of regulatory variation. Approaches that couple chromosome conformation capture with target sequence enrichment such as Promoter Capture Hi-C (PCHi-C) (48)(49)(50) are particularly useful in this regard, as they make it possible to detect regulatory interactions globally and at high resolution with reasonable amounts of sequencing (51)(52)(53)(54)(55)(56)(57)(58)(59).
Here, we integrate TF binding profiles in a human lymphoblastoid cell line (LCL) (4) with patterns of natural sequence variation (60) to generate an atlas of CRMs predicted to show significant TF binding variability across LCLs derived from multiple individuals. We delineate the putative target genes of these CRMs from their interactions with gene promoters based on PCHi-C and linear proximity (49,61) and test for associations between the CRMs' TF binding affinity and target gene expression using transcriptomics data for hundreds of LCLs (62). Prioritizing CRMs that show predicted variation in TF binding affinity based on a biophysical model (39,40) makes it feasible to perform association analysis in a manner that accounts for multiple variants affecting the binding of the same TF, as well as for multiple CRMs targeting the same gene. Using this approach, we reveal >1300 CRM variants associated with expression of specific genes, the majority of them undetected with conventional eQTL testing at a standard false discovery rate (FDR) threshold. We find that a large proportion of CRMs showing associations with the expression of distal genes localize in the immediate vicinity of the TSSs of other genes and connect to their targets via DNA looping interactions, suggesting their role as 'epromoters': the recently identified dual-action regulatory regions with promoter and distal enhancer activity (63)(64)(65).

CRM definition
ChIP-seq narrow peak files for 52 TFs in GM12878 were downloaded from the UCSC ENCODE portal (4). Where multiple datasets were available for the same TF, the intersect of the ChIP-seq peaks was taken for all TFs except ERG1, for which we took the union of the two datasets available, since one of them had substantially fewer peaks than the other. CRMs were defined by taking the union of the peaks for the 52 TFs with a minimum overlap of one base pair. For each TF detected as bound at a given CRM in GM12878 (based on ChIP-seq data), we computed the affinity for each haplotype and each PWM for this TF available from ENCODE (66). The library of ENCODE motifs was imported from the R package atSNP (67), and 41/52 TFs for which there was an exact match between TF name and motif name were taken forward to the analysis. TF affinities were computed using the TRAP biophysical model (39) as implemented in the R package tRap (https://github. com/matthuska/tRap). Default parameters were used, with the exception of setting pseudocount to zero, since we were using frequency as opposed to count matrices. We chose TRAP over approaches based on individual motif hits, as it naturally incorporates the effects of multiple low-affinity sites and multiple variants per CRM.

Detection of TF binding affinity variants
CRM binding affinities were normalized using a method proposed by Manke et al. (40), such that changes in them could be compared between different PWMs. Briefly, CRM affinities are converted to statistical scores (A) representing the probability of observing a given or higher affinity for a given TF in the background sequence (note that lower values of A therefore reflect higher affinities). Binding affinities are parameterized using the extreme value distribution whose parameters are estimated for a range of background sequences encompassing the lengths of all CRMs (40,100,200,250, 300, 400, 500, 800, 1000, 2000 and 3000) using the fit.gev function in the R package tRap. CRMs not bound by a given TF are cut/extended to the required length and used as background sequences.
For all CRM TF/PWM combinations with A < 0.1 in the highest affinity allele of GM12878, we computed the log fold change in affinity between all observed haplotypes and the highest affinity allele of GM12878: log FCA = log 10 (A ALT ) − log 10 (min(A GM12878 )), where min(A GM12878 ) is the normalized affinity of the highest affinity allele in GM12878 cells and A ALT is the normalized affinity of the alternative haplotype. For instances where A ALT or A GM12878 for a given PWM was zero, the lowest observed non-zero normalized affinity for that PWM across all CRMs was used instead. The log FCA values for multiple PWMs of the same TF were then combined by taking the median. Overall, this approach produced a single log FCA for each TF binding affinity haplotype at each CRM. We shall refer to this quantity as the 'log ratio' in the 'Results' section.

DeepSea analysis
For all SNPs at CRMs, DeepSea (68) predictions were obtained using the online tool (http://deepsea.princeton.edu/ job/analysis/create) with the SNPs in VCF files provided as input, in seven batches. Since the predictions of log fold change in signal generated by DeepSea can be noisy when probabilities are small, we used 'chromatin feature probability differences' (.diff files) as robust predictors. DeepSea predictions available for 33/41 TFs analysed in our study, as well as for DNase-seq signals, were used for comparison with our biophysical model predictions of TF binding affinity effects at SNP level.

DNase I sensitivity QTL analysis
The DNase I sensitivity QTL (dsQTL) dataset from (69) lists significant associations between normalized DNaseseq read depth (binned in 100 bp non-overlapping windows) and the genotypes of SNPs/indels within 1 kb of the DNase hypersensitivity sites (DHS) in 70 Yoruban LCLs. We downloaded this dataset from Gene Expression Omnibus (accession number GSE31388), and converted it to GRCh37 using liftOver (70). For all CRMs with a predicted log FCA > 0 for at least one TF, the individual effect of all SNPs at the CRM on TF affinity was calculated. CRMs were then filtered for those where the SNP causing the largest change in TF affinity ('driver SNP') had a minor allele frequency (MAF) below 0.05 in the 70 individuals from (69). We then counted the number of overlaps between these CRMs and the 100 bp DHS windows (minimum overlap 1 bp), repeating this for CRMs filtered according to successively larger log FCA thresholds. To estimate expected overlap, for each threshold, we randomly sampled a control set of CRMs 1000 times, matching the sample size and 'driver' SNP allele frequency distribution to the test set at a given threshold, and overlapped this set with DNase HS windows in the same way as the test set.

Comparison with ATAC-QTLs
ATAC-QTLs from (71) detected in at least two populations at P < 0.005 were used for analysis. For all SNPs at CRMs with a predicted log FCA > 0, we calculated the proportion of overlapping ATAC-QTLs over the exceeding thresholds of the maximum log FCA across all analysed TFs for each SNP. To estimate the expected overlap, we randomly sampled a control set of CRMs 100 times, matching the sample size and minor allele frequency distribution to those in the test set at a given threshold.

Comparison with MPRA data
MPRA results were downloaded from (72). The effects of SNPs on reporter expression (combined log 2 skew over two LCLs tested) were used for comparison with their maximum predicted effects on TF binding affinity obtained from the biophysical model in our study.

Linking of CRMs with target genes
PCHi-C data for GM12878 were obtained from Mifsud et al. (49). Significant interactions were re-called at a HindIII restriction fragment level using the CHiCAGO pipeline (61), with a CHiCAGO score cut-off of 5 (CHiCAGO scores correspond to soft-thresholded, logweighted P-values against the background model). Baits were annotated for transcriptional start sites (TSSs) using the bioMart package in R (73) based on Ensembl TSS data for GRCh37 reference assembly. Baits containing TSSs for more than one gene were excluded (4178 out of 22 076), leaving 17 898 baits in the analysis. CRMs were assigned to target promoters by overlapping with the promoter-interacting regions of significant interactions ('distal' CRMs). Restriction fragments immediately flanking the promoter fragment are excluded from PCHi-C analysis, creating a 'blind window'. Therefore, we additionally called 'proximal' CRMs using a window-based approach, assigning all CRMs located within 9 kb of the midpoint of the promoter-containing fragment to the respective promoter.

Gene expression data processing
We downloaded PEER-normalized (74) gene-level RPKMs for 359 EUR LCLs profiled in the GEUVADIS project (62) from ArrayExpress (75) (accession E-GEUV-3). The data were filtered to expressed genes by removing genes with zero read counts in >50% of samples. For expression association testing by linear regression, the PEER-normalized residuals for each gene were further rank-transformed to standard normal distribution, using the rntransform function in the R package GenABEL (76).

Association between TF binding affinity variants and gene expression: thresholded approach
In this approach, we classified each predicted TF binding affinity CRM haplotype as either 'high' or 'low' affinity based on a threshold. In some instances, however, using a hard threshold to classify alleles can result in alleles with very similar log fold affinity changes being differentially classified, which can obscure true affinity-expression associations. To avoid this, we used a dynamic thresholding approach, where for each affinity variant we set the threshold log FCA 0 to 80% of the value of the 85th percentile of all log FCA values less than or equal to the hard threshold of −0.3. All alleles with log FCA ≤ log FCA 0 were taken as low affinity. Alleles with either log FCA > log FCA 0 /4 (for log FCA 0 /4 > −0.3) or log FCA > −0.3 were taken as high affinity. Note that this resulted in some alleles classified as neither high nor low affinity. Individuals containing at least one unclassified allele for a given TF/CRM were excluded from the testing for the respective association (the number of individuals tested for each association is listed in Supplementary Table S1).
A regression model was then fitted using TF binding affinity CRM haplotypes as predictors of the expression level of their target genes (presented in terms of normalized PEER residuals). Suppose that a gene is targeted by K predicted TF affinity CRM variants, denoted as X = (X 1 , X 2 , . . . , X K ), which are encoded as the number of copies of the low-affinity allele carried by each individual. The regression model is fitted as follows: is the expected value of the normalized PEER residuals Y. Where multiple predicted TF affinity CRM variants targeting a given gene were in perfect correlation (|β| > 0.99), they were collapsed into a single predictor.
ANOVA was used to test the overall significance of each regression model, with multiple testing correction performed on the gene-level P-values by FDR estimation. For genes showing significant associations at 10% FDR in models with multiple TF binding affinity variants as predictors, t-tests were performed to identify variants with regression coefficients significantly different from zero. Variants with unadjusted coefficient-level P-values <0.05 were taken to be significantly associated with target gene expression, conditional on significant gene-level association at 10% FDR.

Association between TF binding affinity variants and gene expression: threshold-free approach
In this approach, we performed multiple regression using PEER expression residuals for each gene as the response variable, this time using the sum of log FCA across both alleles for each individual for each TF affinity CRM variant as predictors instead of thresholded CRM haplotypes. For each gene, all distal and proximal CRMs with log FCA > 0 were included. As with the thresholded approach, ANOVA was used to test the significance of each gene model, and genes showing associations at 10% FDR were considered significant.
Due to high collinearity among the predicted affinity changes, to identify specific CRM variants signifi-cantly associated with target gene expression we used elastic net regression for each significantly associated gene (λ 2 = 0.5). The significance of each predictor as it entered the model was then tested using a method by Lockhart et al. (77) and implemented in the R package covTest (https://cran.r-project.org/src/contrib/Archive/ covTest/covTest 1.02.tar.gz). Variants that entered the model with P < 0.05 and remained in the model were taken as significant.

eQTL fine mapping
We fine-mapped eQTL causal variants in the LCL expression data within a window of ±200 kb of each CRM, using a Bayesian stochastic search fine-mapping method that allows for multiple causal variants, GUESSFM (https:// github.com/chr1swallace/GUESSFM) (78). This requires a prior on the number of causal variants per region, which we set as Bin(n, 2/n) where n is the number of variants in the fine-mapping window. This setting gives a prior expectation of two causal variants per region but allows all values from 0 to n. We visually checked traces to ensure the Markov chain Monte Carlo (MCMC) samples had converged. Raw GUESSFM data have been uploaded to the Open Science Framework (OSF; https://osf.io/e5vsh/).
To estimate the proportion of possibly causal eQTLs identified by GUESSFM (marginal posterior probability of inclusion [mppi] 0.001) among the TF binding affinity variants showing the strongest eQTL signal per CRM ('test SNPs'), we compared it with the same proportion obtained for 'random SNPs'. The 'random SNPs' were sampled from the same ±200 kb windows around CRMs, matching the distribution of their minor allele frequencies to that across the 'test SNPs'.

Causal variant colocalization analysis
An association between an epromoter variant and the expression of both a proximal and a distal gene may indicate that this variant is causal for the expression of both genes. However, the same association may arise from distinct causal variants for each gene that are in LD with each other and are tagged by the same epromoter variant. To differentiate between these situations, we used the Bayesian colocalization technique coloc (79). Coloc evaluates the posterior probabilities of five mutually exclusive hypotheses: no association of any variant in the region with either trait (H0), association with first trait but not the second (H1), association with second trait but not the first (H2), two separate causal variants (H3) and finally a unique shared causal variant (H4). Coloc assumes at most one causal variant per locus. To mitigate this limitation, where there was evidence for multiple causal variants, we tested for colocalization between all pairs of signals for each gene by conditioning out the other signals. Coloc has also been originally designed for testing two sets of associations measured on different individuals. Therefore, before running it on the data measured in the same individuals (i.e. the expression of the proximal and distal gene across the 359 CEU LCLs), we confirmed by simulation that for a quantitative trait the results appear robust to correlated errors (Supplementary Figure S1).

An atlas of CRMs with predicted variation in TF binding affinity in LCLs
We used the ChIP-seq binding profiles of 52 TFs profiled by the ENCODE project (4) in GM12878 LCL to define 128 766 CRMs in these cells, merging across overlapping ChIP regions for multiple TFs (Figure 1). Just over half (55%) of CRMs defined this way were bound by more than a single TF. For 41/52 TFs with known PWMs, we then used a biophysical model (39) to estimate their binding affinity to each allele of each CRM in GM12878, pooling information across multiple PWMs for the same TF where available (see 'Materials and Methods' section). To enable the comparison of binding affinities between different TFs, we expressed them relative to the respective 'background' affinities using an approach based on the generalized extreme value distribution (40) (see 'Materials and Methods' section for details).
We next asked how natural genetic variation at CRMs affects their TF binding affinity. For this, we took advantage of the genotypes of an additional 358 LCLs also derived from European-ancestry individuals that are available from the 1000 Genomes Project (60). These LCLs showed sequence variation at 98 918 (79%) of the CRMs relative to GM12878. We then calculated a TF affinity log-ratio between each alternative haplotype and the highest-affinity haplotype of GM12878 (Figure 1; see 'Materials and Methods' section). SNP-level effects on TF affinity predicted by the biophysical model showed a significant correlation with those predicted by a deep learning algorithm DeepSea (68) trained on epigenomic data across tissues (r = 0.36, corr test P < 2.2e−16, Supplementary Figure S2A). Overall, 38 804 CRMs had one or more alternative haplotypes with predicted changes in binding affinity for at least one TF (affinity log ratios ranging between −12.9 and 13.17). We have made the full atlas of TF-binding CRM variants publicly available at https://osf.io/fa4u7.

TF-binding variants are enriched for associations with chromatin accessibility and effects on reporter gene expression
TF binding is known to be associated with increased chromatin accessibility. Consistent with this, variant effects on TF affinity predicted by the biophysical model correlated with DeepSea-predicted effects on DNase I signal (r = 0.33, corr test P < 2.2e−16, Supplementary Figure S2B). To validate these effects more directly, we took advantage of a published study (69) that profiled chromatin accessibility across 70 LCLs using DNase-seq and identified ∼9000 significant associations between DNase-seq signal and genotype (dsQTLs). If our predicted TF affinity variants reflected real changes in binding affinity, we would expect them to be enriched at regions of differential chromatin accessibility (see Figure 2A for an example). To verify this, we quantified enrichment of differential chromatin accessibility at sets of CRMs showing predicted TF affinity variation above successively larger thresholds. As can be seen from Figure 2B, CRMs with non-zero differences in TF binding affinity across LCLs showed a significant enrichment at differential DNase I sensitivity regions compared with a matched random set of CRMs (permutation test P < 0.001, see 'Materials and Methods' section for details). Moreover, this enrichment increased with the magnitude of the predicted affinity change ( Figure 2B).
ATAC-seq provides another readout of chromatin accessibility. Consistent with the findings from DNase-seq analysis, we observed that the magnitude of variant effects on TF affinity positively associated with an enrichment for ATAC-QTLs from a recent study using a much larger cohort of LCLs across populations (71) (Supplementary Figure S3).
Finally, we assessed the effects of TF-binding variants on reporter gene expression using data from a massively parallel reporter assay in LCLs (MPRA) (72), which included results for 1519 variants mapping to the CRMs from our study. Variant effects on reporter activity showed a significant correlation with those on TF affinity (r = 0.11, corr test P = 0.005, Supplementary Figure S4).
Jointly, these results provide evidence that our approach adequately predicts functionally relevant variant effects on TF binding.

Variation in TF binding affinity at CRMs associates with target gene expression
To identify quantitative associations between TF binding variation at CRMs and the expression of their target genes, we used genome-wide gene expression data from the GEU-VADIS project (62) that included 358/359 of the LCLs used in our analysis (with the exception of GM12878). In contrast to traditional eQTL testing, here we devised an approach that prioritizes TF-binding variants and their putative target genes a priori and performs testing at the CRM level. In total, we selected 3285 CRMs with predicted variation in the binding for at least one TF (log ratio >0.3). We then tested the association of each CRM haplotype with the expression levels of their target genes defined on the basis of 3D interactions or close spatial proximity (within 9 kb; see 'Materials and Methods' section). As evidence of 3D promoter-CRM interactions, we used high-resolution PCHi-C data in GM12878 cells (49,61). The highly reduced search space has enabled testing for associations at the gene level, with all CRMs targeting the same gene and showing TF binding variation included into the regression model (see 'Materials and Methods' section). This approach identified 245 'eGenes' with significant associations between predicted TF binding affinity at CRMs and gene expression (16% of 1530 genes tested, at 10% FDR; Supplementary Table S1). In total, 161 'proximal' (within 9 kb) and 101 'distal' TF-CRM affinity variants (with contacts detected by PCHi-C) were found to underlie these associations, corresponding to 26% and 6% of all variants tested, respectively (t-test Pvalue <0.05; Supplementary Table S1). Figure 3 shows an example of the detected association between the expression of KLF6 and variation in the binding affinity of BATF at a distal CRM that is located 88 kb away from KLF6 promoter and contacts it in 3D according to PCHi-C (genelevel FDR = 1.21 × 10 −2 , BATF variant P-value = 5.16 × 10 −4 , effect size = 0.26; the genome segmentation profile shown is based on chromHMM (80)). Individuals homozygous for the high-affinity BATF binding allele showed the lowest levels of KLF6 expression, while those homozy- gous for the low-affinity BATF binding alleles showed the highest levels ( Figure 3). This suggests that BATF acts as a negative regulator of KLF6 expression, consistent with its known role as a repressor of AP-1-dependent transcriptional activity (81).
A total of 420/1530 genes (27%) were linked with multiple predicted TF-binding variants (either for different TFs bound at the same CRM or at different CRMs). For 16 of these genes, we detected significant associations between more than one such variant and the expression level. One example is the nuclear receptor gene NR2F6 whose expression significantly associated with predicted variation in the binding affinities of SMC3 and SRF to distal CRMs located, respectively, 41 and 19 kb away (Figure 4; gene-level FDR = 4.06 × 10 −7 , SMC3 effect size = 0.26, P-value = 3 × 10 −4 ; SRF effect size = 0.61, P-value = 1.19 × 10 −7 ).
Owing to the a priori prioritization of variants for association testing in our approach (i.e. testing only variants predicted to impact TF binding), we carried out far fewer association tests than in a standard eQTL analysis, thus reducing the multiple testing burden and increasing sensitivity. We therefore asked whether we were able to detect additional associations compared with those reported for a standard eQTL analysis performed by the GEUVADIS project (note that this analysis also used an additional 103 LCLs not included in our study, which were either of non-European ancestry or not genotyped in 1000 Genomes project). To compare our CRM-based association results  (49). Two out of the three fragments interacting with KLF6 promoter are shown; the third fragment, which is located 850 kb away from the KLF6 promoter and contains the gene LINC00705, was omitted due to space constraints. The chromHMM genome segmentation tracks for GM12878 are shown immediately below (80). CRMs at the two distally interacting fragments and the TSS-proximal window are depicted in azure blue. to GEUVADIS eQTL SNPs, we identified the SNP causing the largest change in affinity for the respective TF at each CRM (192 eQTL SNPs in total at 5% FDR to match the FDR level used by GEUVADIS). Of these, 78 SNPs (42%) were detected as significant by GEUVADIS. Therefore, the remaining 114/192 (58%) eQTL SNPs identified in our approach corresponded to not previously reported associations.

Threshold-free testing based on TF binding affinities reveals further expression associations
The above-mentioned analysis was performed broadly within the conventional paradigm of eQTL testing, whereby expression was compared across three diploid genotypes (two homozygous and one heterozygous), except that these genotypes corresponded to cases whereby variation was predicted to appreciably disrupt TF binding based on a pre- defined threshold (we shall refer to this approach as 'thresholded'), and the gene-CRM combinations were selected for association testing based on PCHi-C data. However, since TF binding affinity haplotypes were defined at the CRM level, more than two haplotypes were commonly observed per CRM with respect to a given TF (in 12-100% cases depending on the TF). In the thresholded approach, we pooled multiple alleles into either 'high-affinity' or 'lowaffinity' haplotypes and disregarded outliers (see 'Materials and Methods' section). We reasoned, however, that it is also possible to regress gene expression against normalized TF binding affinities directly without thresholding and haplotype pooling, leading to increased precision and sensitivity of association testing. As expected, this 'threshold-free' approach revealed a considerably larger number of genes significantly associated with CRM affinity variants (1033 eGenes at 10% FDR compared with 245 detected in the 'thresholded' approach mentioned earlier). One challenge arising in the threshold-free approach is that it leads to many more TF affinity CRM variants tested for each gene. Since the same SNPs or those in LD with each other can impact CRM affinity for multiple TFs, the explanatory variables in the regression models are often correlated, posing challenges for the standard ordinary least squares (OLS)-based association testing. Therefore, to detect significant associations in the unthresholded setting, we performed elastic net regression for each of the 895/1033 identified eGenes that were targeted by multiple TF affinity CRM variants. To ascertain the significance of regression coefficients in elastic net regression, we used a covariance test for adaptive linear models (77), identifying 1328 significant CRM-gene associations for the 895 eGenes tested (Supplementary Table S2; see 'Materials and Methods' section for details). One example of a newly identified association is between a nucleotide transporter gene SLC29A3 and the binding affinity of SIN3A at a CRM overlapping with the TSS of SLC29A3 (gene-level FDR = 1.60 × 10 −4 ). Five alternative SIN3A binding affinity haplotypes were observed across the 358 LCLs ( Figure 5A), with log-fold changes in affinity for SIN3A (relative to the highest affinity allele of GM12878) ranging from −0.037 to 0.001 (elastic net effect size = −0.14, P-value ∼0; Figure 5B). In total, 72% of the TF-CRM variants showing significant associations with gene expression had three or more TF binding affinity haplotypes.

TF binding affinity variants are highly enriched for causal eQTLs
We asked what proportion of TF-binding variants showing association with target gene expression in our analysis could be fine-mapped as causal purely based on the pattern of association signals in their vicinity, without a priori prioritization and pooling of variants per CRM. To this end, we supplied genotype information for ±200 kb windows around the CRMs with detected associations and the respective gene expression data to GUESSFM, a Bayesian fine-mapping approach that accounts for possible multiple causal variants per locus (78). GUESSFM identified at least one causal variant in ∼38% of the analysed CRMs (1807/4718); associations in the remaining CRMs likely could not be fine-mapped due to a lack of statistical power. In ∼30% (548/1807) of CRMs with successful fine mapping, the TF-binding variant showing the strongest association per CRM was ranked as possibly causal (mppi > 0.001), and in the majority of such cases (477/548) this vari-ant was also ranked by GUESSFM among the top five highest scoring variants in the window (see Supplementary Table S3 and Figure 5C and D for examples). In contrast, just 2.6% (48/1807) random variants within the same windows (matched by allele frequency) were detected as potentially causal by GUESSFM, corresponding to a very significant enrichment of fine-mapped variants for those affecting TF binding (Fisher test P = 10 −126 ).

Many CRMs associated with distal gene expression show features of epromoters
We noted that a large number of distal CRMs showing association between TF binding affinity and target gene expression (224 CRMs, 243 TF-CRM variants; Supplementary Table S4) and connecting to the distal gene promoters in 3D based on PCHi-C also mapped in close proximity (within 200 bp) of the TSS of either one or more other genes (165 and 59 CRMs, respectively, and 284 eGenes; note that the number of eGenes is greater than that of CRMs due to some CRMs mapping in close proximity of multiple TSSs). The absolute majority (87%) of these CRMs localized within chromatin segments with the characteristic features of gene promoters ( Figure 6A). Taken together, Nucleic Acids Research, 2020, Vol. 48, No. 6 2875 Figure 4A for the colour key). Inset: Enlarged view of an interacting fragment containing three CRMs, one of which harbours variants predicted to impact ELF1 binding affinity and overlaps with the CLOCK promoter. (C) Colocalization analysis showing shared association between epromoter-located SNP rs12889775 and the expression of both its distal and proximal genes (IRF2BPL, top, and lncRNA RP11-7F17.7, bottom, respectively). Posterior probability of shared association estimated by the coloc software P H4 = 0.997. This SNP is predicted to affect the epromoter's binding affinity for EGR1 (see inset). this suggested that some promoter regions might act as distal regulatory regions of other genes, whose promoters they physically contact. This class of CRMs with dual promoter and enhancer activity were independently identified in two recent studies (63,64). We shall follow Dao et al. (63) in referring to these CRMs as 'epromoters'.
Most genes located in the immediate vicinity of the identified epromoters were appreciably expressed in LCLs (232/284, 82%). However, TF binding variation at nearly two-thirds of epromoters whose proximal gene was expressed (139 variants, 64.7%; see Supplementary Table S2) showed detectable association with a distal gene alone in independent tests (assessed with the threshold-free approach). For example, variation in ELF1 binding affinity at a CRM that shows promoter-associated chromatin marks and localizes within 200 bp from the TSS of CLOCK gene does not affect CLOCK expression. Instead, it associates with expression of SRD5A3 located 198 kb away, whose promoter it contacts in 3D as detected by PCHi-C ( Figure 6B The remaining 76 TF-epromoter CRM variants showed associations between with the expression levels of both distal and proximal genes. To obtain formal evidence that these associations were indeed driven by the same variant and not by different variants in LD with each other, we used colocalization analysis (79), while accounting for multiple independent associations (see 'Materials and Methods' section). We submitted to this analysis the most tractable subset of seven epromoters, for which the association of the respective TF-binding variant with distal gene expression was independently confirmed by fine mapping (GUESSFM mppi > 0.001). At 6/7 analysed epromoters, we found prevailing evidence of shared association signals for both the proximal and distal genes (P H4 > 0.66; Supplementary Table S5). An example of such high-confidence shared signal is variation in EGR1 binding affinity at the epromoter of lncRNA RP11-71F7.7 that associates with the expression of both RP11-71F7.7 and another gene, IRF2BPL ( Figure 6C). The promoters of these two genes, transcribed in a convergent orientation, are ∼69 kb apart and contact each other in 3D as detected by PCHi-C.
Taken together, our findings confirm long-range transcriptional regulation by epromoters and suggest that regulatory variants within these elements may have both shared and independent effects on the expression of their proximal and distal target genes.

DISCUSSION
In this study, we have generated an atlas of CRM variants predicted to affect TF binding in LCLs and established their associations with the expression of their putative target genes. The key methodological innovations of our work are the prioritization and pooling of variants at CRM level using a biophysical model of TF binding affinity, as well as the prioritization of CRM target genes based on highresolution PCHi-C data. We perform variant and target gene prioritization a priori of eQTL testing to increase detection sensitivity and the likelihood of revealing causal associations. Using this strategy, we have detected ∼1300 associations between CRM variants and target gene expression in LCLs. Our approach reveals eQTLs detected at high sensitivity, whose enrichment for causal variants is validated by statistical fine-mapping analysis and by comparison with independently generated MPRA data. Notably, we find that many TF-binding variants showing associations with distal gene expression localize to the promoters of other genes, in support of the recently characterized class of 'epromoter' regulatory elements (63,64).
The atlas of binding variants generated in this study is based on EUR individuals from 1000 Genomes Project release and extends our earlier work using the pilot data from the same project (32). Importantly, unlike in our earlier work (32) and other published resources (82,83), here we have used a biophysical model (39) that aggregates TF binding affinities across the whole CRM to increase sensitivity. This model has been used successfully in previous studies of cis-regulatory control (84)(85)(86)(87). The relevance of integrating information at CRM level is further highlighted by recent studies showing the importance of weak TF binding events in gene regulation (42,88,89). Therefore, our approach provides a biologically meaningful paradigm for variant pooling at CRM level.
In choosing to quantify variant effects on TF binding in terms of affinity changes, we were attracted by the direct biological interpretability of this metric. A complementary strategy to score TF affinity at CRM level is provided by hidden Markov models (HMMs) (90)(91)(92). HMM-based frameworks can be useful, for example, for modelling effects of TF cooperativity (90,91), which could be incorporated into future variant prioritization frameworks. Machine learning algorithms, and particularly deep neural networks, may potentially model even more complex relationships between DNA sequence and TF binding (68,(93)(94)(95), although typically at the expense of direct biological interpretability. Reassuringly, our predicted variant effects on TF binding affinity are generally correlated with the predictions of the well-established deep-learning model DeepSea (68). Notably, the biophysical model used in our study con-stitutes one of the layers in a recently proposed fully interpretable deep learning model of Drosophila transcriptional control (96), highlighting the continued relevance of this approach.
Predicting the effects of genetic variants on the expression of distal genes is a highly challenging task. To our knowledge, no machine learning model currently generates such predictions for CRM-promoter interaction distances beyond ∼50 kb, reinforcing the importance of evidence from functional genomics, chromosomal conformation and population genetics studies for understanding long-range variant effects. Here, to prioritize the target genes of distal regulatory variants at high sensitivity and resolution, we have taken advantage of PCHi-C data. PCHi-C provides a 15-20-fold enrichment of promoter interactions over the conventional Hi-C technology (48)(49)(50) that was previously used in variant effect analyses (95,97). Theoretically, the effects of nucleotide variants on TF binding can also be incorporated as a prior in global association analyses such as fgwas (98), and have already been used in eQTL fine mapping (99). A formal eQTL testing framework using 3D interaction data as a prior is, however, yet to be established.
Our finding that polymorphic TF binding sites at distal CRMs show gene expression associations less frequently compared with proximal regions is consistent with the high degree of redundancy of long-range regulatory elements (5)(6)(7)100,101). Predicting the extent of buffering of regulatory variation for a given CRM with a reasonable precision is an important problem that is currently highly challenging due to the sheer number of parameters and the relatively small sample sizes of multi-individual expression datasets. Profiling gene expression in the emerging much larger genotype panels such as UK10K (102) and UK Biobank (103) may provide opportunities for addressing this question.
We observe that a large proportion of CRMs showing associations with the expression of physically connected distal genes are located in the promoter regions of other genes. This finding provides support to the recently characterized class of 'epromoters': elements with a dual proximal and distal activity that were discovered on the large scale using high-throughput reporter and CRISPR knockout screens (63)(64)(65). Empirically, chromosomal interactions between epromoter CRMs and their distal targets fall into the category of promoter-promoter interactions. Until recently, these interactions have been considered primarily in the context of coordinated gene activation or repression (104)(105)(106), such as that observed in Hox and histone clusters (104,107). That some promoterpromoter contacts reflect relationships between epromoters and their distal target genes suggests that these contacts may show functionally and possibly even structurally distinct properties.
We show that TF binding variation at epromoters may or may not co-associate with the expression of both proximal and distal genes at the same time. Shared association is consistent with the findings from massively parallel reporter assays that the same sequences are often involved in mediating both promoter and enhancer activity in vitro (108). It is possible that some non-shared effects observed in our study in vivo are underpinned by the role of the affected TFs in mediating long-range contacts. Additionally, epromoter el-ements may show different degrees of redundancy with respect to the proximal and distal target genes.
Overall, our analysis demonstrates the potential of model-based prioritization and pooling of variants a priori of testing for increasing the sensitivity of identifying individual associations and revealing their shared biological properties.

DATA AVAILABILITY
The list of the detected TF affinity CRM variants, the full data on CRM variant-gene expression associations and the raw output of GUESSFM fine mapping have been uploaded to OSF (https://osf.io/fa4u7/). The scripts used to generate TF binding affinity variants and perform expression association testing have been uploaded to the same OSF repository. Scripts used for running GUESSFM and coloc are available from https://github.com/chr1swallace/ eqtlfm-mikhail/.