Whole-genome functional characterization of RE1 silencers using a modified massively parallel reporter assay

Summary Both upregulation and downregulation by cis-regulatory elements help modulate precise gene expression. However, our understanding of repressive elements is far more limited than activating elements. To address this gap, we characterized RE1, a group of transcriptional silencers bound by REST, at genome-wide scale using a modified massively parallel reporter assay (MPRAduo). MPRAduo empirically defined a minimal binding strength of REST (REST motif-intrinsic value [m-value]), above which cofactors colocalize and silence transcription. We identified 1,500 human variants that alter RE1 silencing and found that their effect sizes are predictable when they overlap with REST-binding sites above the m-value. Additionally, we demonstrate that non-canonical REST-binding motifs exhibit silencer function only if they precisely align half sites with specific spacer lengths. Our results show mechanistic insights into RE1, which allow us to predict its activity and effect of variants on RE1, providing a paradigm for performing genome-wide functional characterization of transcription-factor-binding sites.


In brief
Transcriptional silencers are understudied compared with activating elements. By using MPRAduo, Mouri et al. perform a whole-genome functional characterization screen of RE1 silencers and identify REST-binding motif characteristics and cofactor localization required for a functional silencer. They also identify human genetic variants that impact RE1 activity.

INTRODUCTION
Both inducing and repressing transcription by cis-regulatory elements (CREs) are crucial for the spatiotemporal responses controlling cell identity and function. 1 More than a half century after the discovery of a repressive element acting on the lac operon, 2 the rapid development of approaches to characterize CREs has revealed a multilayered epigenetic landscape, highlighting a dynamic network of gene regulation responsible for multicellular control and environmental response. 3,4 Detailed maps of histone modifications and chromatin accessibility have allowed us to annotate more than 800,000 candidate elements in the human genome that regulate gene expression with a focus toward elements that activate or promote transcription (i.e., enhancers and promoters). 5 This bias is driven by both biological and technical factors, partially due to our knowledge of chromatin marks that demarcate active enhancers and functional validation being more robust for sequences that increase transcription. This has resulted in fewer large-scale functional studies of repressive elements (i.e., silencers) and, as a result, a more limited understanding of the scale at which repressor elements function across the genome. 6,7 Thus, an increased focus of repressive elements is critical for understanding the full gene regulatory landscape. [8][9][10] Though most repressive elements remain poorly characterized, repressor element 1/neuron-restrictive silencer element (RE1/NRSE) is a well-defined group of silencers. 11,12 RE1 is bound by RE1-silencing transcription factor (REST), also known as NRSE factor (NRSF), which is a zinc finger transcriptional factor (TF) conserved through chordates. 13,14 Although RE1 was initially discovered as a silencer for neuron-specific genes in non-neuronal cells, REST has also been found to have crucial roles in the brain and in the repression of non-neuronal genes. 15,16 REST recruits histone deacetylase (HDAC) complexes and histone methyltransferase EHMT2/G9A to RE1 for silencing, [17][18][19] mediated by cofactors including RCOR1/ CoREST and SIN3A. 17,[20][21][22] Localization of these cofactors has been demonstrated to vary among RE1s, suggesting that different characteristics of RE1 are dependent on the localization of cofactors. 23 Our knowledge of TF interactions at RE1 can aid the understanding of silencer mechanisms but requires the systematic measurement of RE1 activities, which has not been done yet.
Massively parallel reporter assays (MPRAs) are a highthroughput functional genomic platform designed to directly measure the activity of millions of CREs and can identify genetic variants that modulate their regulatory activity. 24-27 Unbiased approaches using MPRA, such as characterizing disease-associated variants and locus tiling, indicate that an MPRA using a promoter with minimal activity has a preference toward detecting activation compared with repression. 27,28 Several examples have demonstrated that using stronger promoters in MPRA helps to detect repressive elements and that a significant portion (41%) of CREs unattributed to classical functional groups act as silencers when tested by an MPRA using a single promoter type. 9,28,29 It has also been shown that interactions between CREs can exhibit specificity depending on context including the cell type and TFs involved, emphasizing the importance of the CRE used for basal expression in a reporter assay when studying repressor function. 30,31 Thus, a systematic evaluation and an appropriate selection of transcription-activating CREs are essential for testing repressive elements at scale in reporter assays for the purpose of understanding their basic mechanisms and impact on disease. To accomplish this, we sought to use MPRAs to characterize the interaction between CREs to detect silencer activity at scale.

MPRAduo
To optimize the ability of MPRA to characterize silencer activity through the pairing of silencers with appropriate CREs, we developed MPRAduo, which tests two CREs located in cis on the same reporter vector ( Figures 1A and S1A). To associate tested elements with transcribed mRNA, we added 10-and 20-nucleotide barcodes to the activating CRE (E) and silencer (S) modules, respectively, and then cloned them into two standard MPRA plasmid vectors using unique linker sequences (pDGFP). As with standard MPRA libraries, we inserted a GFP open reading frame (ORF) with a minimal promoter between the test sequence and barcode (we label the single libraries E and S). Next, we amplified the oligonucleotide (oligo)-GFP-barcode cassette and cloned it into the reciprocal pDGFP library, resulting in combined E and S libraries for both alignments, which enables us to test the effect of the relative position of CREs (duo libraries: SE and ES, according to the alignment of two elements from 5 0 to 3 0 upstream of a minimal promoter). Following transfection of libraries into cultured cells, we isolated the GFP mRNA and performed sequencing of the two barcodes in the 3 0 UTR to recover the elements and their orientation for downstream analysis to quantify the expression level (STAR Methods).

MPRAduo benchmarking
We used MPRAduo to identify CREs that, when tested in combination with putative silencers, respond to repressive effects. We selected 19 activating CREs, 150-bp in length, previously tested by MPRA for evaluation in library E (E element). 27 E elements were chosen to represent a range of activity levels and TF binding. In addition, 2 negative controls with no reporter activity were included for a total of 21 unique sequences (STAR Methods). For library S, we used TF chromatin immunoprecipitation sequencing (ChIP-seq) peaks that included a binding motif for the ChIP target to select 509 candidate silencer elements from REST, CTCF (originating from topologically associating domain [TAD] boundaries or enhancer-like loci marked by H3K27ac), and YY1. For GFI1, we selected 292 sites based on the presence of a binding motif due to a lack of ChIP-seq data in our target cell type. We also included the chicken HS4 sequence, which is a well-known enhancer blocker (EB), and 5 human CTCF-binding sites from previously validated EBs. 32,33 In addition to the silencer candidates, we included 74 controls that have activity in standard MPRAs 27 and 806 matched background control sequences selected at random from the genome. Libraries E and S were constructed as single libraries and together in both orientations, resulting in libraries ES and SE, which contained, in total, 72,562 unique constructs. We created two pools containing both single libraries and one duo library and transfected each into GM12878 cells. Libraries were normalized both within and between pools by shifting the modal activity of negative controls for each library to zero (STAR Methods). Elements that had an activating effect in the single libraries (E and S) showed high correlation between the two pools, indicating that the system is highly reproducible across experiments (r = 0.77 for CREs, r = 0.65 for active control; Figures S1B and S1C). Duo libraries (ES and SE) showed similar agreement between orientations (r = 0.77) as well as to the single libraries when using a log-additive model (R 2 = 0.79 and 0.68) of the single library activity measurements ( Figures 1C, 1D, and S1D), which agrees with previous observations of how elements interact when tested by MPRAs. 31,34 MPRAduo showed significant repression by CTCF and RE1 ( Figure 1E; Table S5). CTCF-binding sites at TAD boundaries significantly repressed the basal expression level of 4 E elements in the ES library and 9 E elements in the SE library ( Figure S2A). RE1 showed the most significant repression in MPRAduo with 15 E elements in ES library and 13 E elements in SE library ( Figures 1F and S2B). Overall, RE1 repressed activity of 12 E elements in both alignments, with repression strength correlating with the basal level expression of E elements, although a few E elements, notably En11 and En15, were non-responsive to RE1 despite their medium to strong basal activities. These results demonstrate that MPRAduo can detect repression by RE1and CTCF-binding sites and identifies CREs that improve the signal-to-noise ratio of silencers within a reporter assay.
Whole-genome RE1 screening Encouraged by the ability of MPRAduo to characterize RE1 repression, we sought to comprehensively understand the mechanisms of RE1 activity genome wide. We selected 8,436 RE1 sites containing a canonical REST-binding motif and overlapping with a REST ChIP-seq peak in at least one of four human cell types: GM12878, K562, HepG2, and SK-N-SH ( Figure 2A). We also included 4,430 genomic sequences that overlap with a REST ChIP-seq peak in one or all of the four cell types but do not contain a canonical REST-binding motif. To avoid evaluating promoters, which are likely to increase expression in the reporter assay, we excluded all loci within 5-kb upstream of a transcription start site. Genomic sequences 200 bp in length containing the REST-binding motif in its center (from 91st to 111th nucleotides) were synthesized and assembled in an ES library alongside 5 E elements tested from the benchmark set. We selected E elements to cover a range of activity levels based on our benchmarking results, including one negative control (En02), one RE1 non-responsive CRE (En11), and 3 CREs that demonstrated significant repression by RE1 (En09, En19, and En21). We confirmed that all four CREs are active as marked by the presence of H3K27ac in the four cell types used in our screen. 35 We tested the whole-genome RE1 ES library in GM12878, HepG2, K562, and SK-N-SH cells, which, as a whole, represent the three germ layers ( Figure S3). Using t-stochastic neighbor embedding (t-SNE) to assess the relationship between cell-type (B) Proportion of RE1 that showed repression with 1% of false discovery rate (FDR) in at least one cell/enhancer combination. (C) Log 2 (odds ratio) for enrichment of epigenetic marks in the strong silencers shown in (B). **adjp < 0.01, ***adjp < 0.001 by Fisher's exact test corrected using BH. (D) Correlation between predicted binding score of REST and motif contribution (log 2 (scrambled/native)) with En19 in K562. The purple line indicates linear regression, and the orange line indicates piecewise linear regression. The dashed line indicates the change point determined by the piecewise linear regression. (E) MPRAduo activity with En19 for each cell type. Genomic REST ChIP+ indicates RE1s bound by REST in each cell type and genomic REST ChIPÀ indicates RE1s not bound by REST in the observing cell type but bound in other cell type(s). Gray line indicates the median of the negative control. *adjp < 0.05, **adjp < 0.01, ***adjp < 0.001 by U-test compared with negative controls corrected using BH.
4 Cell Genomics 3, 100234, January 11,2023 Article ll OPEN ACCESS specificity and E-element specificity, we observed the silencing activity of RE1s with a canonical REST motif for each E element clustered together across cell types, suggesting that, generally, REST activity within MPRAduo is more dependent on the genomic context than the cell type ( Figure S3C). In total, 2,657 REST-binding sites with a canonical motif (31%) and 486 without a canonical motif (10%) showed strong repression at a 1% false discovery rate (FDR) in at least one cell type and with one E element ( Figure 2B). These 3,143 sequences were depleted of epigenetic marks for enhancers (H3K27ac and H3K4me1) in the native genomic context of all four cell types tested with MPRAduo ( Figure 2C). We did not observe consensus enrichment for H3K27me3, a marker of Polycomb, in the four cell types.
To measure the precise contribution the 21-bp REST-binding motifs have on the repression by each 200-bp RE1 sequence, we removed the REST motif for 5,866 RE1 sequences by scrambling the motif. An effect score for each REST motif was calculated by taking the difference of expression between the scrambled and native sequence (motif contribution) and comparing it with the predicted binding score of a native sequence scored by FIMO 36 ( Figure 2D). While weak REST motifs did not contribute to repression, stronger motifs showed a clear contribution when paired with all E elements except for En02 and En11 in GM12878, the two elements that did not respond to RE1 in the pilot result. However, strong motifs contributed to the repression by RE1 when combined with the two E elements in the other three cell types, indicating the non-responsiveness of the two E elements for RE1 is celltype specific. To determine the boundary between weak and strong motifs, we modeled the correlation between binding score and repressive activity using piecewise linear regression with a change-point estimation. The 18 of 20 E element-cell combinations with an obvious shift of the correlation had an average change point of 20.86 (ranged from 19.0 to 21.9), above which the slope of the regression dramatically increased ( Figure S4; Table S10); we used this average change point as the boundary to delineate weak and strong REST-binding motifs. The estimated values of the change point were close to each other (SD = 0.805), indicating that the change point is independent of cell types. Expanding the view to the whole 200 bp of the silencer elements, the RE1 sequences that have strong motifs and are bound by REST in the tested cell (specific-ON) showed significant repression, while elements with weak motifs did not, and even showed a strong increase in expression for some sequences ( Figures 2E and S6). These results demonstrate a boundary of the REST-binding motif score (REST motif-intrinsic value [m-value]), which determines RE1 silencer function in MPRAduo.
Non-canonical binding motif requires precise arrangement and spacing of half sites We next focused on exploring the 10% of the REST-binding sites without canonical motifs that show repression by MPRAduo. The canonical 21-bp binding motif of REST consists of two half sites spaced 2 bp apart. Previous work has shown that non-canonical motifs containing both half sites of the canonical motif with different combinations, orientations, and spacer lengths are found in REST ChIP-seq peaks 37,38 ( Figure 3A). However, it is not clear how these arrangements of the half sites affect the silencer activity. To assess this, we identified half sites in the 4,430 REST-binding sites without a canonical motif and annotated non-canonical pairs of half sites with summary binding score above the REST m-value determined by canonical motif. Our library includes 204 sequences containing an atypically spaced motif and 57 flipped, 52 convergent, and 54 divergent motifs as well as 509 sequences with a single half site. Atypically spaced motifs were the only configuration to show significant repression in MPRAduo ( Figures 3B and S6A). Next, we compared the repression in MPRAduo of the different spacer lengths of atypically spaced motifs and found that sequences with 8-and 9-bp spacers repressed expression while other distances showed no repressive effect, concordant with observations seen in REST ChIP-seq signals ( Figures 3C, 3D, and S6B). The significant repression with an 8-or 9-bp gap was shown in all cell types with En19 and in some cell types with other E elements ( Figure S7).
To further evaluate spacer requirements of the REST-binding motif, we performed an additional MPRA where we modified the gap sequence of 8-or 9-bp spaced motifs and tested them in K562 cells. When the 8-or 9-bp gap sequence was scrambled, the expression level significantly increased, indicating that there is nucleotide constraint within the 8-or 9-bp gap sequence mediating silencing activity ( Figure 3E). However, we were unable to recover a distinct motif in the gap sequence associated with silencing (see STAR Methods). Furthermore, changing the gap sequences to 2 bp derived from canonical motifs further decreased the expression level, confirming that an 8-or 9-bp linker is sufficient for silencing but weaker than the canonical 2-bp gap sequence. These results provide direct functional support that the non-canonical REST-binding motif requires precise alignment of two half sites with a specific spacer length of 8 or 9 bp for repressive activity.
Group of TFs colocalize with REST to facilitate silencer function To classify additional cofactors of RE1, we sought to find TFs that may operate at RE1 in addition to REST. Using TF ChIPseq data, we identified 329 TFs that are colocalized at our RE1 sequences with canonical REST motifs in K562 cells; we chose K562 due to it having the lowest experimental noise in MPRAduo and the greatest abundance of ChIP-seq data. For each TF, we separated RE1 sequences into four groups based on the binding of TF and REST as determined by ChIP-seq (e.g., TF+/REST+, TF+/RESTÀ, TFÀ/REST+, and TFÀ/RESTÀ). We calculated the difference of the median expression levels between these groups as measured by MPRAduo in K562 cells (Dmedian) ( Figure 4A). Using the direction of Dmedians between TF+ and TFÀ groups with or without REST, we confidently placed 281 of 329 TFs into two categories: (1) 267 TFs were associated with positive expression activity when colocalized with REST, and (2) 14 TFs were associated with repressive activity when colocalized with REST ( Figure 4B, red and blue dots in the first and second quadrants, respectively). When not colocalized with REST, none of category 2 TFs were significantly associated with repressive activity, and some were instead associated with positive activity. Category 2 includes AFF1, Cell Genomics 3, 100234, January 11, 2023 5 Article ll OPEN ACCESS CHAMP1, CREB3, HINFP, MIER1, NCOA6, PTRF, PTTG1, TEAD2, TRIP13, ZNF197, ZNF644, and ZNF766 as well as EHMT2, a known cofactor of REST, while category 1 includes two known cofactors: RCOR1 and HDAC2. All 14 TFs in category 2 were significantly enriched at RE1 sequences with strong REST motifs compared with sites with weak motifs, while 191 of the category 1 TFs were significantly depleted, indicating that active RE1 recruits REST-dependent repressors and excludes REST-independent activators ( Figure 4C). We counted the number of category 2 TFs localized at each RE1 sequence and observed a correlation with repressive activity that was dependent on REST, suggesting that their recruitment is important for repression ( Figures 4D and 4E).
We next focused on the two category 2 TFs with the highest enrichment with strong REST motifs in our RE1 sequences; EHMT2 is recruited at RE1 by REST to suppress gene expression, 18 and MIER1 interacts with EHMT2 and HDACs through its ELM2 and SANT domains. 22 To evaluate the effect of MIER1 and EHMT2 colocalization on silencer function, we compared reporter activity and localizations of MIER1, EHMT2, and/or REST in the genomic context. RE1s associated with REST, MIER1, and EHMT2 showed significant repression in MPRAduo, while the RE1s associated with REST and either MIER1 or EHMT2 showed no silencing effect ( Figure 4F). Furthermore, RE1s associated with MIER1 and/or EHMT2 but not REST increased reporter expression. The REST motif contribution measured by comparing scrambled and native motifs is highest in the RE1s associated with all three TFs, suggesting that MIER1 and EHMT2 require a REST motif to facilitate repression by RE1 ( Figure 4G). We recapitulated the correlation between repression and colocalization of MIER2 (a paralog of MIER1), EHMT2, and REST in HepG2 cells, which did not contain MIER1 ChIP-seq data, suggesting the redundant function of MIER proteins on RE1 ( Figure S8).
Notably, we did not observe significant repression by RE1 sequences associated with HDAC2 and RCOR1, which were identified as category 1 TFs ( Figure 4H). However, REST motifs associated with REST and HDAC2 demonstrated a significant motif contribution, indicating that HDAC2 localized at functional RE1 silencers in the genome but that it is not a sufficient marker for silencers ( Figure 4I). Indeed, RE1s having strong REST motifs significantly reduced the expression level regardless of the association with RCOR1 and HDAC2 but, when localized with the two cofactors, showed stronger repression ( Figure S9A). To confirm the difference of the cofactors' localization in the native genomic context, we compared the ChIP-seq peaks in K562 and found that the weak motifs showed less occupancy by EHMT2 and MIER1 but a broad occupancy by RCOR1 and HDAC2, while all four cofactors showed higher occupancies at RE1s with strong motifs ( Figure S9B). This preference of EHMT2 and  Figure S9D). We also assessed gene expression of RE1 targets in K562 (defined as the nearest neighbor of the RE1 site) and found that the genes adjacent to an RE1 with EHMT2 and REST binding in K562 showed significantly lower expression than the genes adjacent to an RE1 without REST (these RE1s are defined as REST+ in a non-K562 cell type). This observation was independent of the motif strength, while the repression by RE1, when associated with HDAC2 and REST, was shown only with strong motifs ( Figure S9C). Overall, these results indicate that MIERs, EHMT2, and presumably other category 2 TFs have a crucial role in the repression by RE1.

Strong RE1 has a silencer function in human genome
To further validate the function of RE1 sites, we knocked out 5 loci that have REST-binding motifs with binding scores greater than the REST m-value, enriched category 2 TFs including EHMT2 and MIER1, and showed significant repression as measured by MPRAduo in all four cell types. We knocked out targets in HCT116 cells using CRISPR-Cas9 with cutting and insertion/ deletion (indel) efficiencies ranging from 31.9% to 94.5% (Figure S11A; Table S12). We measured the differential gene expression between edited and non-targeting negative controls using 3 0 tag sequencing of the bulk RNA (BRB-seq). 39 After knock out of the REST motif, 3 loci showed increased expression of at least 1 nearby gene, which is concordant with the release of REST-  Table S13). Interestingly, 2 of 3 validated edits had expression changes distal to the RE1 site. One locus had increased expression of two genes (MYL9 and RAB5IF), which were both distal to the RE1 site within the intron of DLGAP4 ( Figures 5A and 5B). The second locus, an RE1 site within the intron of SLC5A5, increased the expression level of the neighboring gene, CCDC124, after Cas9-mediated deletion ( Figures 5C and 5D). At the third validated locus, we observed short-range regulation of EPHA10 by RE1, which is located approximately 500-bp downstream of a promoter within the first intron of EPHA10 (Figures S11D and S11E). These results confirm that RE1 sites with strong REST-binding motifs act as a silencer to repress gene expression and that MPRAduo can identify and recapitulate the endogenous function of RE1s.

RE1-modulating variants in human genome
TF-binding motifs enrich for fine-mapped variants associated with human disease. 40,41 In order to understand the effect size and distribution of variants around REST-binding motifs, we tested 1,450 variants of various allele frequencies located in the REST motif using MPRAduo as well as 2,348 variants located within 25 bp from the REST motif ( Figure 6A). We compared the expression level between alleles (''allelic skew'') and identified variants that showed significant differential expression between alleles (''expression-modulating variants [emVars]''). 642 emVars inside the REST motif and 858 outside the REST motif were detected, with the majority identified from K562 and SK-N-SH (FDR & 0.01; Figure S12; Table S14). em-Vars were enriched inside the strong binding motif compared with outside (odds ratio 2.89, p = 5.32 3 10 À12 by Fisher's exact test) but not enriched inside of the weak binding motif (odds ratio = 1.12, p = 0.171). In addition, variants falling within strong motifs showed greater allelic skew compared with weak motifs or variants falling outside a motif ( Figures 6B and S13A). Allelic skew, as measured by MPRAduo, agreed with orthogonal measures of allelic activity, showing a strong correlation with the Article ll OPEN ACCESS difference of the predicted binding score (delta binding score) between alleles at strong motif sites, while those at weak motifs did not correlate ( Figures 6C and S13B). These results demonstrate the importance of first identifying variants that fall into strong motifs with a binding score above the REST m-value prior to considering the effect of the variant on REST binding. One example of an emVAR impacting a strong REST motif is rs28396985, which is in the intron of TSNARE1 (Figures 6D and  S13C). rs28396985 is located at a CRE bound by CTCF without any active marks as measured by H3K27ac or H3K4me3. 5 There are two CTCF-binding motifs 17-and 119-bp downstream of the variant that do not overlap with any known common variant. The G allele showed a significant increase of the expression relative to the A allele ( Figure 6E), which is in agreement with the predicted REST delta binding score (34.1 for the A allele and 30.4 for the G allele). We confirmed that REST preferentially binds the A allele as measured by ChIP-seq in GM12878 and K562, which are heterozygous for the variant ( Figure S13D). Expression quantitative trait locus (eQTL) results by GTEx showed that expression of TSNARE1 is increased by the G allele in multiple tissues including cerebellum and spleen, indicating a significant effect of the variant in vivo that is in agreement with the effect measured using MPRAduo ( Figure 6F).

RE1-modulating variants show populational difference
To understand how genetic variants in REST-binding sites impacted function, we interrogated the relationship between allele frequency and REST activity. For the majority of the major alleles, as estimated by global allele frequency, we observe stronger predicted REST-binding scores and repressive effects by MPRAduo than the corresponding minor allele ( Figure 6C). As we selected variants in REST ChIP-seq peaks from only four cell types, this effect may be due to ascertainment bias against rare or low-frequency alleles that create a REST-binding site. However, even after binning variants based on low (<1%) and moderate to high (>5%) allele frequency, we still observed a noticeable effect that was strongest at low frequency (Figure 6G). To perform an unbiased and exhaustive assessment, we next identified all known variants, both common and rare, that are predicted to create REST-binding motifs in the human genome regardless of their overlap with an existing REST ChIP-seq peak. We identified 10,069 variants where either allele contributes to a strong REST-binding motif and then compared the predicted binding score between the major and minor alleles. We used the delta binding score as a proxy for altered REST activity due to the strong correlations we observed with MPRAduo.
In agreement with the imbalance observed by MPRAduo, lower minor allele frequency (MAF) variants had significantly lower delta binding scores (which correspond to high allelic skew by MPRAduo) than higher MAF variants, which were more evenly distributed between positive and negative values ( Figure 6H). To identify a baseline expectation, we created random de novo variants that exhibited negative delta binding scores similar to the lowest MAF group (<0.001%). These findings suggest that low-frequency variants represent a random process and that alleles at increased allele frequency in the pop-ulation may undergo selective pressures against the disruption of established REST-binding sites.

DISCUSSION
In this study, we characterized whole-genome RE1 silencers using MPRAduo, which enables us to aid our ability to detect repressive effects. We empirically identified an m-value for REST, RE1, where a binding score above this m-value establishes an effective silencer likely through the recruitment of cofactors including MIERs and EHMT2. We note that weak binding motifs below the REST m-value overlap with REST ChIP-seq peaks. MPRAduo may have a limit of detection that excludes our ability to detect repressive effects from weak binding sites, or the loss of local and distal sequences required from the assay may impact local REST-binding kinetics. 42 Alternatively, REST may have a binding threshold that must be overcome for the interaction with its cofactors and successful silencing. Interestingly, many category 1 activators localize at weak REST-binding sites in the genome; these activators likely explain why RE1 sequences with weak binding sites more often increased reporter expression in our study and may also influence REST binding at weak sites. Together, observations suggest that weak REST motifs alone are less likely to play a role in actively silencing gene expression and may only have weak effects when it comes to fine-tuning CRE function. TFs in category 2 are repressive when localized with REST but are associated with an increase of expression or a negligible effect when they appear without REST, implying that some TFs are capable of both repressive and activating functions that are dependent on their surrounding context. 43,44 Such a dual role has been demonstrated for EHMT2, with the binding site and phosphorylation playing key roles. 45,46 Our results suggest that REST, when bound to a strong motif, may assist in switching the function of TFs to facilitate silencer activity.
In our initial screen for repressive elements responsive in MPRAs, we identified a subset of CTCF sites that were capable of decreasing reporter activity. CTCF is a key factor for maintaining TADs by facilitating three-dimensional chromatin loops and can be associated with increased transcription. 47,48 On the other hand, CTCF was originally discovered as a transcriptional repressor for c-Myc, 49 and CTCF-binding sites show insulator/ enhancer-blocker function, which inhibits the interaction between CREs, 48,50 supporting a multifunctional role of CTCF. MPRAduo showed significant repression by CTCF-binding sites located at TAD boundaries when tested alongside some specific E elements. In addition, these CTCF-binding sites showed stronger repression when located upstream of the E elements, and sometimes they increased expression levels when they were located between the E element and promoter. Further exploration using MPRAduo is required to understand molecular mechanisms underlying the repression by CTCF-binding sites and how adjacent TFs may impact function.
Our results show that a spacer length of 8 or 9 bp is required for silencing activity in non-canonical REST. These findings are concordant with enrichments observed in REST ChIP-seq 37,38 where REST binds stably to an 8-or 9-bp gapped motif through a conformational change that has the sixth zinc finger unbound to the DNA and stretched between half sites. 51 Despite exhaustive testing of half-site combinations, we did not observe silencing by any other configurations and instead detected increased expression for some, which was likely an effect of the flanking sequences, similar to our observation for the weak canonical binding sites. We did not find a consensus sequence for the 8-to 9-bp spacers; however, by scrambling the spacer, we did observe slight increases in activity, suggesting that there is some degree of cryptic constraint encoded within the spacer. Although the majority of TF-binding motifs are described using a positional weight matrix (PWM) with fixed length, some TF-binding motifs with variable spacer lengths have been discovered. 52,53 Our results emphasize the importance of motif discovery with variable gap length and the utility of performing functional validation.
We determined that accumulation of category 2 TFs at RE1 in the genome correlates with repression measured by MPRAduo, suggesting that category 2 TFs associate with REST at RE1 to facilitate silencer function. The MIER family has been demonstrated to physically interact with REST as well as EHMT2, suggesting that MIERs help the interaction between REST and EHMT2 in a protein complex. 54 ZNF644 is a zinc finger TF that also binds to EHMT2, providing additional support for a role of category 2 TFs as mediators at RE1. 55 Category 2 also includes proteins not previously associated with the RE1 complex such as TRIP13, an AAA+ ATPase that plays a role during meiosis. 56 We confirmed that the localization of ZNF644 and TRIP13 is associated with repression by RE1 in MPRAduo and repression of target gene expression in their genomic context ( Figure S10). However, direct evidence of interactions between REST and category 2 TFs within the genome are required for further verification.

Conclusion
Although genome-wide association studies are a powerful tool to find variants associated with traits, it is a challenge to isolate causal variants from other variants in tight linkage disequilibrium. 40,57 Chromatin marks and functional assays, including MPRAs, provide strong evidence for nominating causal variants; however, intersecting variants only with TF-binding motifs typically does not enrich for causal variants as effectively as those methods. 58 This work demonstrates that empirically estimating an m-value for each TF as a stringency filter could play an important role for identifying sequences that have clear functional activity and, furthermore, aid in the prioritization of variants that impact human health.

Limitations of the study
Our results using MPRAduo correlates well with endogenous TF-binding and chromatin profiles; however, MPRAduo uses synthesized fragments cloned into episomal plasmids, which may not fully recapitulate their native genomic context. This discrepancy may explain the poor correlation between RCOR1/HDAC2-binding and silencer activity measured by MPRAduo where additional cofactors or context is not captured within the 200-bp test sequence, which are both required for effective silencing. The limited length of the tested elements may also explain why we observed a small proportion of strong RE1s that increase expression when tested by MPRAduo.
Furthermore, the combinations of two elements on the MPRAduo vectors are independent from their original locations where the repertoire of local CREs to interact with and their physical distances apart are lacking. Indeed, our KO experiment of RE1s demonstrated the ability for long-range regulation of gene expression, highlighting the need to integrate results from MPRAduo with endogenous chromatin profiles to fully understand their genomic roles.
REST shows a different genomic binding profile in neurons than non-neuronal cells, which may limit our finding that RE1 function is less cell specific. 23,59 Regardless of their origin as neuroblastoma, the REST ChIP-seq profile of SK-N-SH cells is closer to non-neuronal cell types than neurons. 23 In addition, a neuronal-specific truncated isoform of REST (REST4) can drive differences between neuronal and non-neuronal function of RE1. 60,61 As a result, further evaluation is required to understand RE1 function at scale in neuronal cells.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Kousuke Mouri (kousuke.mouri@jax.org).

Materials availability
Plasmids generated in this study have been deposited to Addgene, 193,739 for pMPRAduo:minP:GFP and 193,740 for pMPRAduo:Dorf.
Data and code availability Datasets supporting this manuscript are available at NCBI GEO (Accession ID: GSE196171 for MPRAduo data and GSE212253 for BRB-seq data). Detailed protocol of MPRAduo and code supporting this manuscript is available on GitHub (general processing pipeline for MPRAduo: https://github.com/tewhey-lab/MPRAduo, protocol and data analysis: https://github.com/tewhey-lab/duoREST). TGGCCGACCTGG3 0 ) and have AsiSI or PmeI recognition sequences, respectively, inserted between the oligo and barcode to insert GFP and the other library. In the benchmarking library, vector A was used to clone library S with 1,687 sequences and vector P was used to clone library E with 21 sequences. In the whole genome RE1 library, vector P was used for library S with 24,000 sequences and vector A was used for library E with 5 sequences. In the non-canonical motif library, vector P was used for library S with 1,000 sequences and vector A was used for library E with 5 sequences. The structure of the vectors and Illumina reads are illustrated in Figure S15A.  Table S1) cycled with the following conditions: 98 C for 30s, 4 or 6 cycles of (98 C for 10 s, 60 C for 15 s, 65 C 45 s), 72 C for 5 min. Chicken HS4 sequence was amplified from pBluescriptII[attB/Ins1] (addgene #74100), adding 20 bp barcodes and adapter sequences by 6 cycles of PCR reactions 50 mL in volume in the same method as the benchmarking library. Oligos for library E were synthesized as 180 bp sequences containing 150 bp of genomic context and 15 bp of adapter sequences (IDT) and cloned into the pMPRAduo:Dorf vector without barcodes, sequence verified, and individual clones were selected. Individual plasmids were linearized by PmeI (NEB, R0560), 180 bp sequence were amplified to add 10 bp barcodes and additional adapter sequences using a 6 cycles of PCR reaction in 20 mL volume containing 10 mL of Q5 NEBNext Master Mix and 0.5 mM forward and reverse primers (IDT) (primers 4 and 5, Table S1) cycled with the following conditions: 98 C for 30s, 6 cycles of (98 C for 10 s, 60 C for 15 s, 65 C 45 s), 72 C for 5 min. The PCR products were purified using 13 volume of AMpure XP (Beckman Coulter, A63881) and eluted with water. The purified amplicons were equalmolar pooled and assembled into vector P for benchmarking library or separately assembled into A-vector for Whole-genome REST library.

Vector assembly of single libraries
To assemble the single Dorf library, 1 mg of PCR product containing barcoded oligos were inserted into 1 mg SfiI (NEB, R0123) digested empty vector A or P by gibson assembly NEBuilder HiFi DNA Assembly Master Mix (NEB, E2621) in an 80 mL reaction. After 1 h of incubation at 50 C, DNA was purified using a Monarch PCR & DNA Clean up kit (NEB, T1030) and eluted in 12 mL of water.
Assembled vectors of library S and cHS4 were mixed with a 1500:1 ratio of molarity. The mixture electroporated into 100 mL of 10-beta E.coli (NEB, C3020K, 2kV, 200 ohm, 25 mF) in order to achieve a transformation efficiency equal to 300-times the number of unique oligos sequences (target CFU: 500K). The electroporated bacteria was immediately split into ten 1 mL aliquots of outgrowth medium (NEB, B9035S) and incubated at 37 C for 1 h then independently scaled up in 20 mL of LB supplemented with 100 mg/mL of carbenicillin (Teknova, C8001) and incubated on a shaker at 37 C for 9.5 h; after outgrowth, the cultured bacteria was pooled prior plasmid purification (Qiagen, 12,943 or 12,963). Four of the aliquots were sampled and plated with serial dilutions after 1 h recovery to estimate transformation efficiency.
Library E was transformed into 5-alpha E.coli (NEB, C2987H), recovered in SOC medium (NEB, B9020S) and incubated at 37 C for 1 h then diluted and plated onto LB agar supplemented with 100 mg/mL of carbenicillin (Teknova, L1010) and incubated overnight at 37 C. Approximately 2100 colonies were harvested by washing plates with LB, followed by plasmid purification (Qiagen, 12,943). To construct the whole genome RE1 binding library, individual clones for each of the 5 library E sequences were cultured in 20 mL of LB with 100 mg/mL of carbenicillin overnight at 37 C, equal volumes of the clones were combined together and the pool expanded in 20 mL of LB with 100 mg/mL carbenicillin for 9 h followed by plasmid purification (Qiagen, 12,943). Approximately 250 colonies in total were harvested.
To construct the mpra:gfp single library, 10 mg of Dorf library plasmid was digested with 100 units of AsiSI for vector A or PmeI for vector P (NEB, R0630 or R0560) at 37 C overnight. A GFP open reading frame (ORF) with a minimal promoter and partial 3 0 UTR was amplified from pMPRAduo:minP:GFP (addgene pending) in 1600 mL volume containing 800 mL of Q5 High-Fidelity 2X Master Mix (NEB, M0492L) and 0.5 mM forward and reverse primers (primers 7 and 8 for vector A and primers 9 and 10 for vector P, Table S1) cycled with the following conditions: 98 C for 30s, 20 cycles of (98 C for 10 s, 60 C for 15 s, 72 C 45 s), 72 C for 5 min. The PCR product was purified using 1.53 volume of AMpure XP and inserted by gibson assembly using 3 mg of linearized Dorf library and 9 mg of the GFP amplicon in 100 mL total volume for 90 min at 50 C. Assembled vectors were purified using 13 volume of AMpure XP and a secondary digest performed with 40 U of AsiSI or PmeI, 5 U of RecBCD (NEB, M0345), 10 mg BSA, 1 mM ATP, 13 NEB Buffer 4 at 37 C overnight followed by purification with a Monarch PCR & DNA Clean up kit using 12 mL of water for elution. 3 mL of mpra:gfp plasmid was transformed into 50 mL of 10-beta cell by electroporation (2 kV, 200 ohm 25 mF). The electroporated bacteria were recovered and cultured with 100 mL of LB supplemented with 100 mg/mL of carbenicillin in the same way of Dorf library.

Vector assembly of duo libraries
To assemble the duo library for the benchmarking set, oligos and barcodes including GFP ORF with minimal promoter were amplified from the single mpra:gfp library of plasmid A or P by PCR using 50 mL reaction volumes containing 1 ng of plasmid, 25 mL of Q5 NEBNext Master Mix and 0.5 mM forward and reverse primers (IDT) (primers 11 and 12 to amplify vector A and 13 and 14 to amplify vector P, Table S1) cycled with the following conditions: 98 C for 30s, 12 cycles of (98 C for 10 s, 60 C for 15 s, 65 C 1 min), 72 C for 5 min. The amplified cassettes were inserted by gibson assembly using 4 mg of the amplicons and 2 mg of Dorf library plasmid, linearized by AsiSI or PmeI, in a 200 mL reaction incubated for 90 min at 50 C followed by AMpure XP purification using 75 mL of water for elution. Total eluted volume was digested by incubation with 50 U of AsiSI, 50 U of PmeI, 5 U of RecBCD, 10 mg BSA, 1 mM ATP, 13 NEB Buffer 4 at 37 C overnight and purified by Monarch PCR & DNA Clean up kit using 12 mL of water for elution.
For the whole genome RE1 library and non-canonical motif library, 220 ng of barcoded oligos were directly inserted into 2 mg of AsiSI-digested single Dorf library using gibson assembly in a 200 mL reaction incubated for 60 min at 50 C and purified by Monarch PCR & DNA Clean up kit using 12 mL of water for elution. The purified ligation product was electroporated into 100 mL of 10-beta E.coli (NEB, C3020K, 2kV, 200 ohm, 25 mF) in order to achieve a transformation efficiency equal to 200-times the number of unique oligos combinations (target CFU: 4.8 M for whole genome RE1 library and 1 M for non-canonical motif library). The electroporated bacteria was immediately split into ten 1 mL aliquots of outgrowth medium and incubated at 37 C for 1 h then independently scaled up in 20 mL of LB supplemented with 100 mg/mL of carbenicillin and incubated on a shaker at 37 C for 9.5 h; after outgrowth, the cultured Cell Genomics 3, 100234, January 11, 2023 e3 Article ll OPEN ACCESS bacteria was pooled prior plasmid purification (Qiagen,12,963). 20 mg of Dorf duo library plasmid was digested with180 U of PmeI and 20 U of AsiSI at 37 C overnight. A GFP open reading frame (ORF) with a minimal promoter and partial 3 0 UTR was amplified and inserted by gibson assembly using 1.6 mg of linearized Dorf duo library and 5.3 mg of the GFP amplicon in 250 mL total volume for 90 min at 50 C. Assembled vectors were purified using 13 volume of AMpure XP and a secondary digest performed with 40 U of AsiSI or PmeI, 5 U of RecBCD (NEB, M0345), 10 mg BSA, 1 mM ATP, 13 NEB Buffer 4 at 37 C overnight followed by purification with AMpureXP using 40 mL of water for elution. 6 mL of mpra:gfp plasmid was transformed into 200 mL of 10-beta cell by electroporation (2 kV, 200 ohm 25 mF). The electroporated bacteria were recovered and expanded in 3000 mL of TB (Teknova, T0315) for whole genome RE1 library and 1000 mL of LB for non-canonical motif library supplemented with 100 mg/mL of carbenicillin for 16 h at 30 C followed by plasmid purification (Qiagen, 12,991).
MPRA transfections 10 7 of GM12878 cells were mixed with 10 mg of plasmid and electroporated in a 100 mL volumes of RPMI with the Neon transfection system (Thermo Fisher Scientific, MPK5000) using 3 pulses of 1200 V for 20 ms. In total, 10 3 10 7 cells were used for each replicate of the benchmark libraries and 50 3 10 7 cells were used for each replicate of the whole genome RE1 library. 10 7 of HepG2 cells mixed with 5 mg of plasmid were electroporated in a 100 mL volumes of Resuspension Buffer R with the Neon transfection system using 1 pulse at 1200 V for 50 ms each. In total, 15 3 10 7 cells were used for each replicate. 10 7 K562 cells mixed with 5 mg of plasmid were electroporated in a 100 mL volumes of Resuspension Buffer R with the Neon transfection system using 3 pulses of 1450 V for 10 ms. In total, 15 3 10 7 cells were used for each replicate of the whole genome RE1 library and 5 3 10 7 cells were used for each replicate of the non-canonical motif library. 10 7 SK-N-SH cells mixed with 10 mg of plasmid were electroporated in a 100 mL volumes of Resuspension Buffer R with the Neon transfection system using 3 pulses of 1200 V for 20 ms. In total, 15 3 10 7 cells were used for each replicate. All cell lines were recovered from the culture medium 24 h post-transfection by centrifugation, washed 3 times with PBS, and frozen at À80 C in Buffer RLT supplemented 40 mM of DTT.

RNA extraction and cDNA synthesis
Total RNA of the transfected cells was extracted using RNeasy Maxi (Qiagen, 75,162) following the manufacturer's protocol including the on-column DNase treatment. Total RNA was secondarily digested by 20 U of Turbo DNase (Thermo Fisher Scientific, AM2238) in 1.65 mL of total volume for 1 h at 37 C. The digestion was stopped by the addition of 15 mL of 10% SDS and 150 mL of 0.5 M EDTA, followed by incubation at 70 C for 5 min. To capture GFP mRNA,1200 mL of Formamide (Thermo Fisher, 4,311,320), 600 mL of 203 SSC (Thermo Fisher, 15,557,044) and 2 mL of Biotin-labeled GFP probe (primers 15-17, Table S1) were added directly to the stopped DNase reaction mixture and incubated for 2.5 h at 65 C with rotation. 400 mL of Dynabeads Streptavidin C1 (Thermo Fisher, 65,002) was prewashed, eluted to 500 mL of 203 SSC and added to the reaction followed by agitation on a HulaMixer (Thermo Fisher, 15920D) at room temperature for 15 min. The magnetic beads were then washed once with 13 SSC and twice with 0.13 SSC and 50 mL of water was added along with 1 U of SUPERase In (Thermo Fisher, AM2694). Beads were treated with 2 U of Turbo DNase at 37 C overnight and the digestion was stopped with the addition of 1 mL of 10% SDS followed by purification with RNA clean XP purification beads (Bechman Coulter, A63987) using 37 mL of water for elution. cDNA was synthesized from the purified DNasetreated GFP mRNA using Super-Script III with a 1 mM final concentration of primer specific to the 3 0 UTR of GFP (primer 18, Table S1) at 47 C for 80 min. Synthesized cDNA was purified by AMpure XP and eluted in 30 mL of EB (Qiagen, 19,086).

Sequencing library construction and Illumina sequencing
To pair barcodes with oligo sequences Dorf plasmid was amplified by PCR in a total reaction volume of 200 mL containing 400 ng of plasmid DNA, 100 mL of Q5 NEBNext Master Mix and 0.5 mM of forward and reverse primers (primers 19 and 20 for plasmid A and 21 and 22 for plasmid P, Table S1) cycled with the following conditions: 98 C for 30s, 5 cycles of (98 C for 10 s, 62 C for 15 s, 72 C 30 s), 72 C for 2 min. PCR products were purified using 13 volume of AMpure XP and eluted in 30 mL of EB. Illumina indices were added to each sample by amplifying 20 mL of the elution in a 100 mL of PCR reaction with 50 mL of Q5 NEBNext Master mix and 0.5 mM of forward and reverse primers (primers 25 and 26, Table S1) cycled with the following conditions: 98 C for 30s,6 cycles of (98 C for 10 s, 62 C for 15 s, 72 C 30 s), 72 C for 2 min. Indexed samples were purified using 13 volume of AMpure XP, eluted in 30 mL of EB and sequenced using 2 3 250 bp chemistry on an Illumina MiSeq instrument at the Jackson Laboratory.
cDNA tag sequencing libraries were amplified by PCR each 100 mL in volume containing 10 mL of cDNA, 50 mL of Q5 NEBNext Master Mix and 0.5 mM of forward and reverse primers (primers 23 and 24 in Table S1) cycled with the following conditions: 98 C for 30s, 6-13 cycles of (98 C for 10 s, 62 C for 15 s, 72 C 30 s), 72 C for 2 min. The cycle number was estimated by qPCR with 10 mL of the same reaction and 1:60,000 diluted SYBR Green I (Thermo Fisher, S7563) and 1 mL of cDNA or 1 mL of diluted plasmid DNA used as a standard curve for the qPCR. To construct tag sequencing libraries of the plasmid pools, plasmids were diluted based on the qPCR results to mirror the cDNA samples, and amplified using the same PCR conditions and cycles as the cDNA. PCR products of the cDNA and plasmid libraries were purified using 13 volume of AMpure XP and eluted in 30 mL of EB. Illumina indices were added to each sample by amplifying 20 mL of the elution in a 100 mL of PCR reaction with 50 mL of Q5 NEBNext Master mix and 0.5 mM of forward and reverse primers (primers 25 and 26, Table S1) cycled with the following conditions: 98 C for 30s,6 cycles of (98 C for e4 Cell Genomics 3, 100234, January 11,2023 Article ll OPEN ACCESS 10 s, 62 C for 15 s, 72 C 30 s), 72 C for 2 min. Indexed samples were purified using 13 volume of AMpure XP, eluted in 30 mL of EB and sequenced using 2 3 150 bp chemistry on an Illumina NextSeq 550 or 1 3 150 bp S1 chemistry on an Illumina NovaSeq instrument at the Jackson Laboratory.
CRISPR genome editing and BRB-seq Guide sequences were cloned in px459v2 (addgene 62,988) using BbsI sites. 73 1.5 3 10 6 of HCT116 cells were mixed with 15 mg of plasmid and electroporated in a 10 mL volume of R buffer with the Neon transfection system (Thermo Fisher Scientific, MPK5000) using 1 pulse of 1530 V for 20 ms. Cells with two independent rounds of electroporation were mixed and immediately separated to 5 replicates. Cells were recovered in 1 mL of culture medium for 24 h, followed by selection under the culture medium supplemented with 1 mg/mL of Puromycin for two days. Selected cells were recovered for 8-9 days and harvested for genomic DNA and RNA prep.
Total RNA of the 1 3 10 6 cells was extracted using RNeasy Mini (Qiagen, 74,104) following the manufacturer's protocol including the on-column DNase treatment. 500 ng of total RNA for each replicate was reverse-transcribed, pooled, and purified in two columns of Monarch PCR & DNA Clean up kit, followed by second-strand synthesis and tagmentation as the original BRB-seq protocol. 39 Tagmented cDNA pool was amplified with P5_BRB and BRB_Idx7N5 primers (5 mL, primers 27 and 28, Table S1) using Q5 NEBNext Master Mix heat-activated at 98 C for 30 s before adding DNA with the following conditions: 72 C 3 min, 98 C for 30 s and 10 cycles of (98 C for 10 s, 63 C for 30 s, 72 C 60 s), 72 C for 5 min. Library was sequenced using High Output chemistry on an Illumina NextSeq 550 instrument at the Jackson Laboratory with 21 bp for read 1 and 72 bp for read 2 using a custom read1 primer (primer 29, Figure S1). Sequenced reads were aligned using STAR (v2.7.6a,-outFilterMultimapNmax 1) 74 and demultiplexed using BRB-seq Tools (v1.6). 39 Degenerated counts of UMIs were analyzed using DESeq2 (v1.26.0) using default parameters.
The indel efficiency of CRISPR-KO is measured by PCR from genomic DNA followed by Illumina sequencing. Genomic DNA was prepped from 10 5 cells, eluted in 30 uL EB and amplified with primer pairs (5 mL, primers 30-39, Table S1) using Q5 NEBNext Master Mix with the following conditions: 72 C 3 min, 98 C for 30 s and 20 cycles of (98 C for 10 s, 62 C for 30 s, 72 C 60 s), 72 C for 5 min. PCR products were purified using 13 volume of AMpure XP and eluted in 30 mL of water. Illumina indices were added to each sample by amplifying 2 mL of the elution in a 50 mL of PCR reaction with 25 mL of Q5 NEBNext Master mix and 0.5 mM of forward and reverse primers (primers 25 and 26, Table S1) cycled with the following conditions: 98 C for 30s,6 cycles of (98 C for 10 s, 62 C for 15 s, 72 C 30 s), 72 C for 2 min. Indexed samples were purified using 13 volume of AMpure XP, eluted in 30 mL of EB.

QUANTIFICATION AND STATISTICAL ANALYSIS
All statistical details for the experimental results including the statistical tests and (adj)p values can be found in the figures and figure legends.
Pre-processing of reads The first of the pre-processing steps works to create a map between the barcodes and oligos. Paired-end 250 bp reads from the sequencing of both single libraries were merged into single amplicons using Flash2 62 (v2.2.00, flags: -M 200), then, the positions of the UMI, barcode, and oligo from each amplicon were identified by using the 3' linker sequences of the barcode/oligo and the 5 0 linker sequences of the UMI/barcode/oligo ( Figure S15B). The barcode-oligo pairs were then aligned to the original oligo sequences using minimap2, 63 and the resulting SAM output was filtered for mapping quality. Barcodes were sorted and parsed to remove barcodes mapping to multiple oligos and organized into a dictionary of barcode-oligo pairs.
The second pre-processing pipeline extracts counts from the replicated tag sequences. In the benchmark set, the tag sequences were based on paired-end 150-bp reads; reads were first merged into single amplicons using Flash2 for each replicate. In the whole genome RE1 set, the tag sequences were 150-bp single-end reads, so this step was unnecessary. In both sequencing sets the reads were sorted into single and duo libraries based on the number of bases between the tail of the GFP and 3 0 end of the sequence. Here, regardless of whether or not there were single barcodes in the replicate sequences, any duo barcode sets that had less than 110 bp from the GFP tail to the 3 0 end were classified as singles. The barcodes were extracted and the ones that were present in the dictionary, or in the case of duo barcodes in both dictionaries, were included in the count table.

Full analysis
Datasets were filtered based on the number of barcodes observed for an oligo's count (S20 benchmark, S10 whole genome RE1) and the average number of DNA counts across replicates (S100 benchmark, S20 whole genome RE1). Since the benchmark set included single libraries, and utilized two sequencing runs, the oligos were additionally filtered to ensure the single libraries contained the same oligos across all libraries, and that the duo oligos were only made up of oligos that were present in the filtered single oligo libraries. After filtering the oligos, a DE-Seq-based analysis 64 followed by a summit-shift normalization was performed ( Figure S15B); the summit-shift normalization, which consists of shifting the mode of the negative control combinations to zero, was performed on the benchmark library, the whole genome RE1 library as well as the non-canonical library to normalize between libraries and cell types. On the whole genome RE1 library, this was expanded with a cell-type specific analysis that was adjusted based on the summit shift normalization. To identify ''expression-modulating variants'' (emVars) in the whole genome RE1 library, the difference between Cell Genomics 3, 100234, January 11, 2023 e5 Article ll OPEN ACCESS the log2FoldChange of the alternate and reference allele for each replicate within a cell type were compared using the Student's t test corrected using Benjamini Hochberg procedure (FDR >0.01) similar to previous approaches. 27 In order to effectively compare between the two runs in the benchmark set, a quantile normalization 65 of the log2FoldChanges was performed across libraries.
For the benchmarking library, 69,456 constructs were recovered from all four libraries (95.7%), 55,077 passed the filters (75.9% of all constructs). After filtering, 28,297 and 25,455 constructs were included from SE and ES libraries respectively, with 25,222 combinations captured in both alignments and used in the downstream analysis (Table S4). For the whole genome RE1 library, 115,830 (96.52% of all constructs) constructs on average were recovered across cell types. After filtering, 105,794 (88.2% of all constructs) constructs on average across cell types were used in the downstream analysis (Table S6).

Silencer selection of Benchmark library
To generate silencer candidates for the benchmark set, ChIP-seq peaks from ENCODE overlapping with TF binding motif were selected with a cut off binding score for REST (ENCFF048JKT, Factorbook score 5.95) and YY1 (ENCFF967ACD, Factorbook score 3.9). CTCF sites (ENCFF002DAJ, Factorbook score 4.95) were intersected by Factorbook motifs and ChIP-seq peaks of H3K27ac (ENCFF411MHX) or 40kb from TAD boundaries. 47 Putative GFI1 binding sites were selected for inclusion based only on having a Factorbook score higher than 0.9. Selected binding sites were expanded to 200 bp by centering the TF motif in the test sequence and extending the genomic sequence on either end. TF binding sites which were located 5 kb from transcription start sites (TSS) annotated by Ensembl were removed. Fifty thousand random genomic sequences were selected by using bedtools (version 2.29.2). 66 After removing sequences located 5 kb from TSS, 806 random sequences matching the distribution of GC content for the TF binding sequences were selected as random negative controls.

E elements selection
To select the 19 E elements for the benchmarking set, non-coding human elements which were previously tested for activity in GM12878 and HepG2 cells by MPRA were used. 27 602 elements which had significant activity (-log10Padj >4) in both cells and also overlapped at least one transcription factor binding annotation in ENCODE for GM12878 or HepG2 were selected. These elements were separated into 24 clusters using k-means (Scikit learn, version 0.24.2). From 12 out of 24 clusters which had more than 20 elements, 19 E elements were picked up to have diverse expression levels in MPRA and diverse TF binding based on ChIP-seq peaks in GM12878 from ENCODE. Two negative controls which did not show expression in MPRA or do not have active marks (H3K27ac, H3K4me1, H3K4me3) were also selected from Tewhey et al.

2016.
Whole genome RE1 library To select sequences for the whole genome RE1 library, narrow peak datasets of ChIP-seq for REST in 4 cell types (GM12878: ENCFF048JKT, ENVFF677KJB, HepG2: ENCFF153JLK, ENCFF854KPC, K562: ENCFF895QLA, ENCFF558VPP, SK-N-SH: ENCFF781PAL, ENCFF946MYA from ENCODE) were combined together. A 21 bp REST binding motif (Factorbook from ENCODE) was intersected with the ChIP-seq peaks and expanded 90 bp to 5 0 and 89 bp to 3 0 by bedtools. Elements located within 5 kb upstream from a TSS were removed. To identify reference sequences without canonical motifs, 1000 cell specific ChIP-seq peaks for each cell type were randomly selected from the narrow peaks which did not intersect with the REST binding motif or reside within 5 kb upstream of a TSS. For GM12878, only 934 peaks fit this criteria and all were included in the library. In addition, all 496 sequences without canonical motifs which overlap ChIP-seq peaks in all 4 cell types were added to the library. These peaks were trimmed or expanded to 200 bp, keeping the center of the peak at the center of oligos. Alternate alleles of 2,865 human variants with 1% or higher MAF and randomly selected 1,000 human variants with less than 1% of MAF in at least one of three populations (eastern Asia, Europe, and Africa) from 1000 Genome Project were included in the library. 3,074 reference sequences which have a variant for test and randomly selected 2,792 reference sequences with canonical REST motifs were selected and scrambled motif sequences. The 21-bp motifs for these sequences were randomly shuffled and then their binding score of REST checked by using SPRy-SARUS (ver2.0.1) in HOCOMOCO v11 by using a cut-off score of 5. Scrambled sequences which had a REST binding motif detected by SPRy-SARAS starting from 75 to 110th nucleotides were randomly scrambled again. The random genomic controls from the benchmarking set were also included in the whole genome RE1 library.
Log additive modeling Log additive model and its score were calculated by using LinearRegression from Scikit learn (version 0.24.2). The coefficients and fitness scores were resulted as below: ChIP-seq dataset All analysis with ChIP-seq data was done with the GRCh38 human genome. Reference sequences in the tested library were converted to GRCh38 using liftOver (ENCODE). Accession numbers of ChIP-seq data in ENCODE database are listed in Table S11. Overlap between tested elements and ChIP-seq peaks is detected by bedtools (options: -F 0.5 -f 0.5 -e). Heatmap was plotted by using deep-Tools 68 (version 3.5.1, options: -colorMap RdBu -whatToShow 'heatmap and colorbar' -zMin À4 -zMax 4 -heatmapWidth 10) after making matrix (options: -referencePoint center -b 1000 -a 1000 -skipZeros -p 4) from bigwig files with accession number ENCFF407OAJ, ENCFF090ZAX, ENCFF313DTO, ENCFF857APX and ENCFF065PDS. To analyze the REST ChiP-seq signal in K562, ENCFF558VPP was used. For allele specific TF binding assay, ENCFF887ZNY, ENFCC750IJA, ENCFF116CTI, and ENCFF191OSK were used for REST and ENCFF747TJH, ENCFF430XCG, ENCFF172KOJ, and ENCFF172KOJ were used for CTCF. BAM files were analyzed by Integrative Genomics Viewer. 69 Identifying half sites and non-canonical motif Half sites were identified using FIMO. The first-9th and 12th-21st nucleotides of REST binding motif (JASPAR MA0138.2) were used for left and right half sites respectively. REST binding sites with two-half sites separated with a spacer under 100 bp were categorized according to orientations of the two-half sites.

Searching consensus sequence in atypically gap
Whole 200 bp tested element or gap sequence of 8 or 9 bp atypically spaced motifs are analyzed using MEME 70 with three conditions: all native genomic sequences, native genomic sequences with the Z score under À1, and native or scrambled sequences which had lower expression level than the other of the pair. The 8 and 9 bp spaced motifs are analyzed separately in all conditions.

Piecewise regression
For piecewise regression between predicted binding score and log skew for reference and scrambled, segmented 71 was used (ver. 1.3-4, psi = c(20)). The results of the regression and R 2 are in Table S10. The average score of junctions, 20.86, was calculated without the two outliers (En02 and En11 for GM12878) and used to separate weak and strong motifs.
Genome wide variant overlapping to RE1 All single-nucleotide substitutions from gnomAD v3.1.1 were checked to determine whether both alleles overlap with a REST binding motif (MA0138.2) by using FIMO with a ±20 bp window with a threshold score of 20.89 in either allele. Frequency of variants were called using MafDb.gnomAD.r3.0.GRCh38. Random variants were generated using 5 runs of Mutation-simulator 72 (Ver 2.0.3, options: -sn 0.001) followed by the collection of the percentage of nucleotide substitutions by referring to the proportion of all gnomAD variants from chromosome 1.