Screening for functional regulatory variants in open chromatin using GenIE-ATAC

Abstract Understanding the effects of genetic variation in gene regulatory elements is crucial to interpreting genome function. This is particularly pertinent for the hundreds of thousands of disease-associated variants identified by GWAS, which frequently sit within gene regulatory elements but whose functional effects are often unknown. Current methods are limited in their scalability and ability to assay regulatory variants in their endogenous context, independently of other tightly linked variants. Here, we present a new medium-throughput screening system: genome engineering based interrogation of enhancers assay for transposase accessible chromatin (GenIE-ATAC), that measures the effect of individual variants on chromatin accessibility in their endogenous genomic and chromatin context. We employ this assay to screen for the effects of regulatory variants in human induced pluripotent stem cells, validating a subset of causal variants, and extend our software package (rgenie) to analyse these new data. We demonstrate that this methodology can be used to understand the impact of defined deletions and point mutations within transcription factor binding sites. We thus establish GenIE-ATAC as a method to screen for the effect of gene regulatory element variation, allowing identification and prioritisation of causal variants from GWAS for functional follow-up and understanding the mechanisms of regulatory element function.


INTRODUCTION
Genome-wide association studies (GWAS) of complex diseases and traits have identified an ever-growing list of associated genetic loci currently numbering 515 302 ( 1 ). Howe v er, it remains hugely challenging to identify specific causal variants and genes, since the majority of these loci lie in noncoding regions of the genome and there are often many candidate variants in linkage disequilibrium (LD). The identification of causal genes has been aided by expression quantitati v e trait loci (eQTL) datasets from relevant cell types, since colocalization between eQTL and GWAS results can indicate a shared genetic cause of disease and gene expression change ( 2 , 3 ). Whilst useful, these datasets are curr ently underpower ed, r esulting in many disease associated loci lacking an eQTL association to a specific gene, and in other cases, multiple genes being implicated at the same locus ( 4 ). Another method to link a disease association to a gene is through annotation of enhancer-promoter loops using chromatin conformation capture (3C, HiC, pchiC, cap-tureC) methodologies ( 5 ), but these have limitations in resolution, frequently missing short-range interactions ( 6 ), and often implicating multiple genes ( 7 ).
Establishing causal variants is perhaps even more challenging. Statistical fine-mapping methods can be used to narro w do wn the list of putati v e causati v e SNPs from human genetic data ( 4 , 8 ), but their resolution is limited by the recombination structure in human populations. Sites of open chromatin, marked by accessibility to DNA modifying enzymes (e.g. DNAseI hypersensitivity ( 9 ) or assa y f or transposase accessible chromatin, ATAC ( 10 )), are bound by transcription factors (TF) in place of canonical nucleosomes. Variants residing at these genomic locations could ther efor e ex ert their effects by alter ed TF binding and gene regula tion. Thus, annota tion of chroma tin features, such as open chromatin domains and histone modifications, can also help to identify likely causati v e SNPs and provide an insight into the molecular mechanism ( 11 ).
Howe v er, such chroma tin fea tur es ar e highly cell type and stimulation specific, making it difficult to rule out particular variants, and frequently implicate multiple candidate variants. Chromatin QTL studies are able to identify regula tory DNA varia tion tha t directly impacts chromatin architecture ( 12 ) but similarly to eQTL analysis, these studies ar e often underpower ed and do not necessarily implicate individual variants ( 6 ). Importantly, none of these methods are able to directly demonstrate the causality of a specific variant. Massi v ely parallel reporter assays can be used to understand the effect of specific variants on enhancer activity ( 13 , 14 ), but enhancers are removed from their nati v e genomic and chromatin context, and thus are unable to demonstra te tha t a specific variant truly has an effect in vivo . Use of genome engineering technologies to introduce specific SNPs is an attracti v e way to causally link a variant to a change in enhancer or gene function ( 15 ), but these experiments are time-consuming and difficult to scale. It is also possible to use CRISPR inhibition or activation techniques targeted to presumed enhancer regions to screen for those that are acti v e in a particular cell type (16)(17)(18), but the resolution of these methods is limited to around 1 kb, and they do not show an effect of a specific genetic variant(s).
To circumvent some of these limitations, we have previously de v eloped a medium throughput arr ay ed screening method we termed genome engineering based interrogation of enhancers (GenIE) that allows screening for the effect of a variant of interest on transcription of the gene in which it resides ( 19 ). Howe v er, this is limited to those variants present within transcribed regions of the genome, which does not allow analysis of the large number of intergenic enhancers, in which GWAS variants are selecti v ely enriched ( 20 ). Here we hav e de v eloped a method, GenIE-ATAC, that allows screening for the effect of specific genetic variants on the open chromatin region in which they are located, in their endogenous context.

Experimental design
The GenIE-ATAC method r equir es the SNP of interest to be located within an ATAC peak in the assayed cell type. For each SNP, three primers were designed: an upstream biotin ylated f orward primer f or the linear PCR enrichment step, a nested forward primer for the ATAC PCR step, and a re v erse primer for gDNA amplification of the region (Supplementary Figure S1). All primers bound within the ATAC peak, the forward primers did not overlap with each other and were located close to the SNP but outside the HDR template for editing experiments. The nested forward primer and reverse primer contained adaptor sequence for the addition of barcodes for Miseq ( 31 ) and amplicons were < 295 bp to allow sequencing using a 150 PE Miseq run. All primers were unique in BLAT searches. For re v erse e xperiments, equi valent primers were designed but in the opposite directions. For editing experiments, we chose the guide with a cut site closest to the SNP of interest and off-target cutting of guides was checked using WGE ( https://www.sanger.ac.uk/htgt/wge/ ). All guide sequences, HDR oligo sequences, and primer sequences used are detailed in Supplementary Tables S1 and S2.

Tn5 production
GenIE-ATAC r equir es the Tn5 protein to be loaded with a single oligonucleotide (MEDS-A) which differs from the commercially produced Tn5 which is loaded with MEDS-A and B oligonucleotides. We purified Tn5 using a C-terminal intein tag and a chitin-binding domain as previously published ( 32 ), with some adaptations including the use of chitin magnetic beads. We transformed pTBX1-Tn5 plasmid (Addgene 60240) into C3013 cells (NEB) and expressed Tn5 as previously described. Bacterial pellets (from 500 ml cultur e) wer e r esuspended in 12 ml cold HEGX (20 mM HEPES-KOH at pH 7.2, 0.8 M NaCl, 1 mM EDTA, 10% glycerol, 0.2% Triton X-100) plus proteinase inhibitors (PI) (Complete, Roche), sonicated on ice at 80% power for 8 × 30 s on / 30 s off with a probe sonicator and centrifuged (13 000 rpm for 20 min 4 • C) to produce 12.5 ml cleared lysate. Neutralised PEI (1 ml 10%, final concentration 0.8%) was added dropwise to the lysate to precipitate the DNA, and the lysate was centrifuged to clear (13 000 rpm for 20 min 4 • C). The supernatant ( ∼12 ml) was added to 1 ml HEGX washed chitin magnetic resin (NEB) and left to bind for 1 h at 4 • C. Beads were washed 4 times with HEGX buffer before adding 250 nmol (50 l) of annealed MEDS-A oligo in 3 ml HEGX buffer plus PI and rocked overnight at RT.
Annealed MEDS: Oligos wer e r esuspended in 10 mM Tris-HCl (pH 8.5) to a concentration of 10 nmol / l and 25 l of each were mixed, heated to 95 • C and left to cool slowly to RT. Importantly, annealed oligos were used on the same day.
After incubation with oligos, the beads were washed 4 times with HEGX buffer, and Tn5 protein was released from the beads by incubation with 50 mM fresh DTT in 4 ml HEGX for 48 h at 4 • C. Eluted protein was collected and dialysed overnight in 2L 2 × Tn5 dialysis buffer (100 HEPES-KOH at pH 7.2, 0.2 M NaCl, 0.2 mM EDTA, 2 mM DTT, 0.2% Triton X-100, 20% glycerol) a t 4 • C . The Tn5 was concentrated using 10K MWCO spin protein concentrator (Pierce) to 24 uM (1.32 mg / ml) as determined by Bradford. Then an equal volume of 100% glycerol was added and carefully mixed before aliquoting and storing PAGE 3 OF 13 Nucleic Acids Research, 2023, Vol. 51, No. 11 e64 a t −20 • C . Tn5 activity was titrated using gDNA in tagmenta tion buf fer (5 × buf fer: 50 mM TAPS-NaOH, 25 mM MgCl 2 , 50% v / v DMF pH 8.5). The amount of Tn5 that was needed to tagment 1 g gDNA to around 400 bp was found to be equivalent to the amount of Tn5 needed to be added to 0.5 × 10 6 nuclei to tagment to the optimal size of fragments (around 400-800 bp) for GenIE-ATAC. This amount was between 1-5 l of Tn5 produced as described above.

Arra y ed CRISPR-cas9 editing
hiPSCs were edited by nucleofection of ribonucleoprotein (RNP) complex containing full-length chemically modified synthetic guide RNA and SpCas9, along with a ssODN repair template ( 19 , 21 ). Briefly, SpCas9 was expressed and purified from Esc heric hia coli using a His-tag. Diluted Sp-Cas9 (5 l, 20 g) was mixed with full-length guide RNA (Synthego) (5 l, 225 pmol) at RT for 20 min for RNP complexes to f orm, f ollowed by addition of the ssODN repair template (5 l, 500 pmol) just before the nucleofection. Cells were dissociated using accutase and 1 × 10 6 cells in P3 buffer wer e mix ed with the RNP / template complex and nucleofected in large cuvettes (V4XP-3024 Lonza) using 4D-Nucleofector on program CA137. After nucleofection cells were plated onto a 10 cm dishes coated with Synthemax (5 g / cm 2 ) with TeSR-E8 supplemented with Rock inhibitor. After 24 h, the media was exchanged for TeSR-E8 and after ∼10 days the cells, when they had grown to a pproximatel y 80% confluence, were harvested by accutase and processed for ATAC (0.5 × 10 6 cells) or pellets (2 × 10 6 cells) were frozen for gDN A extraction. Routinel y at least three ATAC samples and three gDNA samples were analysed per pool of cells.

A T A C assay, linear PCR and nested PCR
hiPSCs grown on Synthemax were washed with PBS and 6 ml accutase was added for 5 min a t 37 • C . The accutase was aspirated and the cells were gently resuspended in 10 ml TeSR-E8. Cells were pelleted (300 g 3 min) and resuspended in 1 ml cold PBS. After counting, 0.5 × 10 6 cells wer e transferr ed to an eppendorf (not loBind) and pelleted (300 g 3 min). The cells were then resuspended in 500 l fresh ice cold sucrose buffer (10 mM Tris-HCl pH 7.5, 3 mM CaCl 2 , 2 mM MgCl 2 , 0.32 M sucrose) and incubated on ice for 12 min. Triton-X-100 was added to a final concentration of 0.5% (25 l of a fresh 10% solution), mixed gently and incubated on ice for 6 min. The w hole l ysis solution was transferred to a new eppendorf (not loBind) and spun 450g for 5 min at 4 • C to collect the nuclei. All traces of the lysis buffer wer e r emoved from the nuclei and 50 l tagmentation mixture (10 l 5 × tagmentation buffer, Tn5 (amount from titration above), water to 50 l) was added to the nuclei, gently resuspended and transferred to a Lobind eppendorf. Nuclei were tagmented for 30 min at 37 • C and the DNA was cleaned up using a MinElute PCR cleanup kit (Qiagen) and eluted in 21 l EB. The DNA was quantified using Qubit and was typically 30-60 ng / l. For each SNP, 3-4 independent tagmentations were carried out from the same pool of cells.
Linear PCR was carried out on the purified DNA for each ATAC repeat: Biotinylated DNA pulldown was performed using Dynabeads MyOne Streptavidin C1 and 20 l slurry was used per PCR. The beads were washed twice in 40 l PBS with 0.1% BSA and once with 2xBW buffer (10 mM Tris-HCl, pH 7.5, 1 mM EDTA, 2 mM NaCl) befor e being r esuspended in 50 l 2 × BW buffer. The whole linear PCR (50 l) was added to the bead suspension and rotated overnight at RT. The beads were collected on the magnet, and the supernatant removed. The beads were washed once with 100 l water, and resuspended in 30 l EB.
The subsequent amplification of DNA used the beads as template with the nested forward primer specific for the SNP of interest and the Tn5A primer (T CGT CG-GCA GCGTCA G) specific for the oligonucleotide loaded into the Tn5 and integrated into the DNA during tagmentation. The amount of beads was tested (Supplementary Figure S4) and routinely we use 10 l allowing up to three technical repeats to be performed per ATAC. PCR was carried out using PowerUp SYBR Green Master Mix (Applied Biosystems) with 10 l bead template and 0.4 M final concentration forward and reverse primers in a 50 l reaction.
The PCR products were analysed on a 2% agarose gel, stained with ethidium bromide, and should be a smear of DNA sizes ranging from ∼400-800 bp (similar to the tagmented DNA). gDNA was pr epar ed using the MagAttract HWM kit (Qiagen) ( 19 ). PCR was carried out using PowerUp SYBR Green Master Mix (Applied Biosystems) with 5 l gDNA (250-500 ng) template and 0.4 M final concentration forwar d and re v erse primers in a 50 l reaction. Typically 3-4 PCRs were carried out from one gDNA preparation. We also tested using tagmented gDNA as the template for PCR (Supplementary Figure S2) and in these experiments 250-500 ng DNA was tagmented as described in the ATAC experiment, cleaned up using the MinElute PCR cleanup kit (Qiagen) and used as PCR template.

Barcoding and sequencing
In order to add the Illumina indices ( 31 ) we performed a second PCR using 1 l PCR1 (from ATAC or gDNA samples), PowerUp SYBR Green Master Mix (Applied Biosystems), and 0.4 M final concentration forward and re v erse primers in a 25 l reaction.

FGFR OCT4 binding site experiment
To carry out editing, we mixed 2 HDR oligos equally (for 1bp edit or 3 bp edit at the OCT4 binding site in the FGFR intronic super enhancer) and performed a nucleofection with RNP as described above. We performed three biological repeats for ATAC, gDNA and RNA samples with three technical repeats for each at the PCR stage. The RNA sample was processed as previously published ( 19 ).

CTCF saturation mutagenesis experiment
To carry out the editing for the sa tura tion mutagenesis of CTCF binding site, we mixed 9 HDR oligos equally and performed 8 nucleofections with RNP as described above. The cells were pooled and grown over 5 × 10 cm dishes and nine ATAC and nine gDNA independent samples were harvested. For the linear PCR we inputted 20 l tagmented DNA, and after biotin pulldown, we used all 30 l beads into the PCR (split over 4 PCR reactions). These PCRs were pooled before adding the Miseq barcodes. For the gDNA PCRs, we inputted 4x more gDNA and split over four PCR r eactions befor e pooling for Miseq bar coding. This was to make sure we had good coverage of all of the editing e v ents that had taken place in the pool of cells.

Sequencing and read alignment
PCR amplicons were sequenced using 2 × 150 bp reads on an Illumina MiSeq instrument. Since amplicons were all smaller than 295 bp, we merged the overlapping 150bp pair ed-end r eads using FLASH v1.2.11 ( 33 ) to improve alignment of Cas9-induced deletions. As input to FLASH we specified a minimum overlap of 10 bp, fragment size as the amplicon size, fragment standar d de viation of 40, and maximum mismatch density of 10%, and used the -allowouties parameter. A mean of 98.6% of reads could be successfully merged, with standar d de viation of 1.5%. For all regions except rs141252451 in MIR8070, we aligned merged reads to the GRCh38 human r efer ence sequence using bwa mem v0.7.17 ( 34 ), with lenient parameters to allow aligning . For rs141252451, we aligned to an amplicon sequence that included the 2-bp insertion defined by rs141252451, since rgenie excludes insertion reads by default.

Analysis with rgenie
The core analysis performed in rgenie is the same as previously described ( 19 ). Briefly, for each replicate (ATAC or gDNA) at each locus, rgenie extracts reads mapping to the targeted region from the aligned BAM file. To quantify the effect of an allele X on the ATAC-seq readout, we first determine for each replicate the ratio of the read count of X to that of the WT allele, r = (read count X / read count WT). We separately compute the mean of this ratio for ATAC replicates, r A , and for gDNA replicates, r G . If allele X alters chromatin accessibility, then the ratios will differ between ATAC and gDNA, i.e. r A = r G . We use a two-tailed unequal variances t-test to test for a difference in this ratio, and we report p-values from this test. Although the statistical analysis is identical for an RNA or an ATAC readout, we updated rgenie to provide plots and statistical summaries tailored to an ATAC anal ysis. Specificall y, an ATAC anal ysis by default reports only deletions w holl y contained within a ±6 bp window around the targeted SNP. This is because with ATAC-seq the reads span a range of sizes depending upon the ATAC peak shape, with most reads tending to be short. As a result, longer deletions are not reliably captured; this differs from RNA, where all reads span the full amplicon. We also highlight allele ef fect sta tistics for the two most prevalent deletions within this window. In Supplementary  Tables S3 and S4, we provide a statistical analysis summary from a ppl ying rgenie to the eight SNPs and the CTCF mutagenesis e xperiments, respecti v ely. See the Data Availability section for links to rgenie input file details, amplicon sequences, and HDR / WT alleles for these analyses.

RESULTS
We extended our existing GenIE method ( 19 ) to assay the effect of variants lying within gene regulatory elements on the chromatin accessibility of that r egion. This incr eases the proportion of GWAS-associated variants ( 20 ) we can assay by a further 23.7% to a total of 92.7% across one or other assay (Figure 1 A). It also allows us to measure the effect of a variant on both transcription (GenIE) and chromatin accessibility (GenIE-ATAC) for up to 52.9% of GWAS associated variants. We introduced a variant of interest by genome editing, and used a locus-specific ATAC to measure the effect of each allele on chromatin accessibility. Thus, we could directly demonstrate the causality of a specific variant  in altering chromatin at its endogenous location within the genome. This will allow for instance to distinguish between se v eral causal variants in high LD, or to establish cause and effect relationships for co-regulated enhancers. In contrast to the alternati v e approach of making edited isogenic cell lines, we were able to avoid the time-consuming step of clonal isolation ( 21 ) and the problem of clone-to-clone variability ( 22 ) by analysing the pool of edited cells. This made it possible to perform an arr ay ed screen of dozens of variants, and will, in the future, allow the use of more complex and disease relevant cellular models such as differentiated induced pluripotent stem cells (iPSCs) and primary cell types. Since there is no necessity for extended culture or clonal growth, this should be possible in most cell types, making it possible to assay the variants that are in chromatin accessible regions specific to a particular cell-type or 'state', which are frequently critical in disease. We used human iPSCs to establish our method, since these cells can be used to model regulatory element variation in a wide variety of disease-relevant cell types and are amenable to CRISPR-mediated genome editing. We first deli v ered Cas9 and a guide RNA as a ribonucleoprotein along with a 100 nt single-stranded DNA oligonucleotide homology directed repair (HDR) template ( 21 ). This generated a population of cells containing the WT allele (unedited) and the HDR allele (SNP introduced) along with a large number of indels (small insertions / deletions) (Figure 1 B).
We harvested genomic DN A (gDN A) from the mixed population of cells, performed PCR across the edited site and assessed the frequency of different alleles within the population using high throughput sequencing of the amplicon. Using exactly the same population of cells we performed ATAC, which selecti v ely integrates adaptor sequences into open chromatin using the Tn5 transposase. We then carried out a nested PCR using a target-specific primer near to the edited site and a uni v ersal primer that binds the Tn5 adaptor sequence. This amplicon was also sequenced, and the effect on chromatin accessibility for the edited allele was then calculated as the ratio of sequencing reads in the ATAC relati v e to the gDNA for the edited allele, normalised to the same ratio for the unedited (WT) allele (Figure 1 B).
Two de v elopments were necessary to successfully establish a site-specific ATAC protocol. Firstly, we purified our own Tn5 and loaded it with a single oligonucleotide sequence (MEDS-A sequence) to allow a single PCR primer binding site for the subsequent PCR of the ATAC sample. Our method also r equir es a substantial input of ATAC DNA into the PCR to avoid any PCR bias, and the cost of commercial Tn5 protein would limit the number of variants that could be assayed in a single experiment. Secondly, we needed to carry out a linear PCR using a biotinylated primer, followed by a biotin pulldown of the amplified DNA to enrich for the region of interest prior to a semi-nested PCR (Supplementary Figure S1). This is because the Tn5 MEDS-A sequence is integrated at tens or hundreds of thousands of sites in the genome, whereas the gene specific primers only have one binding site, and thus a semi-nested PCR alone was unable to give sufficient enrichment of the desir ed r egion.
We first tested GenIE-ATAC on three heterozygous SNPs in KOLF 2 iPSCs that overlapped with accessible chromatin regions defined by ATAC-seq in these cells. These variants were selected due to their high probability of affecting chromatin accessibility based on fine-mapping of chroma tin QTL da ta fr om sensory neur ons ( 23 ). As chr oma tin QTL da ta was not av ailable from hiPSCs, the v ariants were cross referenced with hiPSC ATACseq to make sure that the variants were within chromatin accessible regions, but it is still possible that they may not have an effect in this cell type. As expected for heterozygous alleles, when we performed amplicon sequencing over the variant of interest from gDNA we observed an equal proportion of the two alleles (ratio of 1) (Figure 2 , red bars).
For each SNP tested, when we analysed amplicon sequencing using GenIE-ATAC, there was an imbalance of reads between the two alleles (ratio skewed away from 1), with the alt alleles either decreasing chromatin accessibility (for SNP1) or increasing it (for SNPs 2 and 3). These r esults wer e r ecapitulated using primers designed in the r everse direction with the locus specific primer on the opposite side of the variant (Supplementary Figure S2). Importantly, the effects of SNPs on chromatin accessibility measured by GenIE-ATAC were consistent with the expectations from the chromatin QTL data and confirmed independently by analysis of allele-specific reads from genome-wide ATACseq in hiPSCs ( 19 ). As negati v e controls for GenIE-ATAC we also performed the experiment using three heterozygous SNPs located in ATAC peaks that had no strong allelic imbalance in ATAC-seq (Figure 2 , bottom panel), and once again the data fr om GenIE-ATAC corr obora ted tha t seen with genome-wide allele-specific ATAC-seq. We used the standar d de via tion between replica tes (0.0477) to estima te the sensitivity of the assay. With three replica tes, a t a Pvalue of 0.01 we can detect a 21.9% change in accessibility, or a 13.3% change with a P -value of 0.05.
The control in these experiments was a simple PCR from gDNA, but we wanted to validate that we obtained the same results with tagmented gDNA that had been through the linear PCR and enrichment process. We found that tagmentation and enrichment did not cause an allelic bias, and we still observed a ratio of 1 for heterozygous alleles. Howe v er it did increase the variation we observed (Supplementary Figure S3) when compared to simple gDNA PCR, likely due to sampling bias. Thus, we used gDNA PCR as the control for subsequent experiments to increase consistency and the throughput of the assay. In order to increase the number of replicates that could be achie v ed per SNP per experiment, we also tested the effect of titrating the amount of ATAC DNA to be added into the linear PCR. The results and variation were similar with 3 times less input material (Supplementary Figure S4), which we used for subsequent experiments.
We next used GenIE-ATAC to screen 8 naturally occurring SNPs for their causal effect on accessibility of chromatin after editing with CRISPR. These SNPs were selected based on being homozygous, located in ATAC peaks in our iPSC line, and having a high probability of affecting chromatin accessibility from chromatin QTL data ( 23 ). The details of these regions and their predicted effects on transcription factor binding using the deep learning method DeepSEA ( 24 ) and manual curation are indicated in Table  1 .
The editing efficiency of each of these SNPs in our WT iPSC line (KOLF 2) was variable but generally high (Figure  3 A, bottom panel). This may be because they are located in regions of accessible chroma tin tha t hav e pre viously been shown to have particularly high editing rates ( 25 ). This is advantageous for GenIE-ATAC since all variants will be located within ATAC peaks. Even in the cases with the lowest HDR rates (1%), using 1 million cells and assuming a 50% survival, ther e ar e around 5000 independent editing e v ents, w hich essentiall y eliminates clonal artefacts. Data was processed using an R package we have developed (rgenie) which automates the analysis of experiments, and reports quality control and statistical analysis of the effect of both HDR alleles and defined small deletions over the variant of interest.
The HDR effect size after editing the desired SNPs to the alternati v e allele (Figure 3 A, upper panel, and Supplementary Table S3) showed that three of the edited SNPs did not have any significant effect on chromatin accessibility (regions 1, 4 and 5). We have previously shown that the SNP in region 4 has an effect on transcription of the MUL1 gene ( 19 ), so a lack of change in chromatin accessibility suggests its effect is mediated post-transcriptionall y, likel y at the RNA le v el. SNPs edited in regions 2, 3 and 8 all showed a decrease in chromatin accessibility consistent with a reduction in transcription factor binding. The Region 2 SNP is within a CTCF binding motif and the editing of a G > A within the motif is consistent with a predicted reduction of CTCF binding and chromatin accessibility upon editing. Region 8 had the largest effect size in the screen, and in this case the HDR e v ent created a 2 bp insertion (CA) within a CTCF binding site, which was predicted to result in the complete loss of CT CF binding. Inter estingly, the SNPs in regions 6 and 7 showed an increase in chromatin accessibility, potentially due to the introduction of a new binding site for a transcription factor.
As well as introducing the SNP of interest at each site by HDR, the GenIE-ATAC method also creates a large population of deletion and insertion alleles around the cut site of the Cas9 / guide. Since insertions are generally of lower fr equency, and ar e mor e difficult to interpr et, we excluded these from our analysis, but the deletion r epertoir e provides information about the effect of these mutations on chromatin accessibility. Accordingly, we analysed the effect of The most frequent deletion types by read count from GenIE-ATAC targeting of SNP 2 (rs12269414). The effect size of ATAC for each allele relati v e to the WT is indicated on the right hand side. The HDR allele is highlighted in green. Del 1 and Del 2 (highlighted in blue and orange r espectively) ar e small defined deletions around SNP of interest. Deletion alleles which have no effect change are indicated by red boxes, and recapitulate the CTCF motif (left). all deletions w holl y contained within a 6 bp window around the SNP of interest (Figure 3 A, middle panel, and Supplementary Table S3). As expected, deletion of bases within an ATAC peak very often results in a reduction of chromatin accessibility due to disruption of transcription factor binding, and we observed this in six of our eight edited locations. Howe v er, deletions around the SNP in region 5 did not have any effect and deletions in region 6 had an opposite effect to the introduction of the SNP, demonstrating the importance of being able to introduce and assay specific desired variants. Closer inspection of individual deletions in region 4 (Supplementary Figure S5) showed that those that removed a poly-C region had a strong effect on chromatin accessibility, but those that did not had a weak or no effect. This suggests that these C bases are important for binding of an unknown factor, and highlights how analysis of specific deletions can provide information about mechanism of action. Analysis of region 2 (Figure  3 B), demonstra ted tha t whilst introduction of the SNP led to a loss of accessibility, the deletions fell into two classes where accessibility was virtually abrogated, or was retained (r ed box es). Closer inspection showed that in the case of the specific deletions which maintained chromatin accessibility, the adjacent sequences were such that the CTCF consensus site was recapitulated upon deletion. Thus, it is highl y likel y that CT CF binding was r esponsible for the accessible chromatin peak at this site. Deletions in region 8 showed a similar but smaller effect than HDR-generated 2 bp insertion, but closer analysis showed that the three most frequent deletions had only very small effects whereas most other deletions had strong effects. This is probably due to the repetiti v e nature of this region which likely recapitulated a binding site in these three specific examples. All plots were generated using the rgenie R package and the full output for all SNPs in this experiment is shown in supplementary note.
Since the majority of SNPs in chromatin accessible regions are also transcribed (Figure 1 ), there is an opportunity to measure the effect of variants on both transcription using GenIE ( 19 ) and chromatin accessibility using GenIE-ATAC . To demonstra te this, we chose a transcribed intronic region that is part of a super enhancer of the FGFR1 gene that had been shown to have activity in a reporter assay ( 26 ), overlaps with a chromatin accessible region in hiPSCs and contains a consensus OCT4-SOX2 binding site (Figure 4 A). We performed an experiment where we analysed the effect of making a single or triple mutation in the OCT4 binding site using both GenIE and GenIE-ATAC. Editing was effecti v e with HDR rates of 4.9% and 2.7% for single and triple mutations respecti v ely (Figure 4 A). This showed that as predicted, both single and triple mutations and deletions reduced chromatin accessibility and transcription (Figure 4 B, supplementary note). The triple mutation showed a more substantial decrease in chromatin accessibility than the single mutant, but there was no difference in their effect on transcription. This may be due to a loss of binding of the neighbouring SOX2 protein, which may affect chromatin accessibility but not further decrease the activity of the enhancer. It highlights how chromatin accessibility is not perfectly correlated with transcription, and how the two assays can complement each other and help to understand the mechanisms of action of specific variants in vivo .   ( 19 )), the r egion of DNA that gave luciferase activity in reporter assay ( 26 ), and the POU5F1:SOX2 binding site along with CRISPR edits that were introduced to abolish OCT4 binding (either T-G HDR 1 or AT G-T GC HDR 3). Pie chart shows editing r ates. ( B ) Gr aphs showing effect size of HDR or deletions in GenIE-ATAC (left) or GenIE (right) assay. Data processed using rgenie. All error bars r epr esent 95% confidence intervals, n = 9.

CCATTAAAATGTAAAAA
The effects of point mutations within transcription factor binding sites are often difficult to predict based on consensus sequences alone ( 27 ). We ther efor e performed GenIE-ATAC in a multiplex format to simultaneously introduce multiple SNPs across a defined region in one experiment and obtain nucleotide-le v el resolution of the determinants of transcription factor binding. We targeted the CTCF binding site in region 2 and designed HDR oligo templates to mutate 9 bases in the binding site individually into every other possible base (Figure 5 A).
We performed GenIE-ATAC using a pool of all HDR oligos. It was not possible to introduce mutations at the N in the NGG PAM due to the fact that Cas9 would continue to introduce indels e v en after successful HDR, and bases further from the Cas9 cut site could not be evaluated due to low HDR rates. We increased the amount of ATAC DNA material and gDNA into each PCR proportionally for the number of HDR e v ents to maintain r epr esentation (see methods). Figure 5 B shows the effect size of each substitution (see also Supplementary Table S4). For example, a substi-tution of an A to G in position 4 of the CTCF site had no effect on binding, whereas mutation to any other base drama tically decreased chroma tin accessibility. The inverse of this effect size reflects the importance of each base for CT CF binding, and broadly r ecapitulates the CT CF binding motif (Figure 5 C, D). There are some differences, howe v er, which could be due to se v eral reasons. Firstly, the consensus binding motif measures the frequency of each base across all binding sites, and the nucleotide determinants of binding at a specific locus may be different (e.g. due to competiti v e or cooperati v e binding interactions, effects of DNA major / minor groove structure, or chroma tin modifica tions or structure). Secondl y, DN A binding factors other than CTCF could be contributing to the changes in chromatin accessibility. Finally, the effect of mutation has been shown to be highly context-specific, such that mutation of a consensus base can be rescued or buffered by a second mutation elsewhere in the binding site ( 27 ). This result illustrates the context-dependence of transcription factor binding and the difficulty in predicting the effect of a mutation from binding site consensus alone. Bottom panel shows editing ra tes. Da ta was pr ocessed using rgenie. All err or bars r epr esent 95% confidence interv als, and P v alues are as indicated. ( C ) Sequence logo showing the consensus binding site as determined by GenIE-ATAC, with letter size proportional to the square of the relati v e GenIE-ATAC effect size when mutated to that nucleotide (relati v e to the WT / consensus, set to 1). ( D ) Scatter plot showing correlation between GenIE-ATAC motif proportion size and Jasper CTCF motif proportion size. Points r epr esent the GenIE effect size for each nucleotide divided by the sum over all four nucleotides at the same position; error bars are the 95% confidence intervals for a gi v en nucleotide di vided by the sum of these over all 4 nucleotides at the same position.

DISCUSSION
We belie v e that GenIE-ATAC will be a generally useful screening method to identify causal variants from GWAS studies and to understand transcription factor binding in the context of chromatin. Importantly, it is able to functionally link specific genetic variants to changes in chromatin accessibility in their endogenous locus and genomic context. W hilst chroma tin accessibility is a dir ect r eadout of the effect of variants on transcription factor binding, it is not always dir ectly corr elated to downstr eam effects on transcription. We show that GenIE-ATAC can be combined with GenIE to sim ultaneousl y measure effects on transcription in many cases. This can help to identify alternati v e mechanisms of action, for example transcriptional r epr essor binding and post-transcriptional RNA regulatory elements. Our new methodology can thus validate causal variants, and provide mechanistic information about how they could in-