High-throughput sequencing SELEX for the determination of DNA-binding protein specificities in vitro

SUMMARY
High-throughput sequencing SELEX (HT-SELEX) is a powerful technique for unbiased determination of preferred target motifs of DNA-binding proteins in vitro. The procedure depends upon selection of DNA binding sites from a random library of oligonucleotides by purifying protein-DNA complexes and amplifying bound DNA using the polymerase chain reaction. Here, we describe an optimized step-by-step protocol for HT-SELEX compatible with Illumina sequencing. We also introduce a bioinformatic pipeline (eme_selex) facilitating the detection of promiscuous DNA binding by analyzing the enrichment of all possible k-mers. For complete details on the use and execution of this protocol, please refer to Pantier et al. (2021).

BEFORE YOU BEGIN
Systematic evolution of ligands by exponential enrichment (SELEX) is a molecular biology technique allowing the in vitro selection of DNA oligonucleotide duplexes with high affinity for a target ligand (Ellington and Szostak, 1990; Tuerk and Gold, 1990). This technology can be coupled with high-throughput sequencing (HT-SELEX) to determine transcription factor binding specificities (Roulet et al., 2002; Jolma et al., 2010; Slattery et al., 2011).
Here, we describe the stepwise performance and analysis of HT-SELEX using purified SALL4 C2H2 zinc-finger clusters as "bait" (Pantier et al., 2021). However, this protocol can be applied to a wide range of DNA-binding proteins or DNA-binding domains (see limitations). Two critical reagents are required to initiate HT-SELEX: a library of random oligonucleotides; and a purified DNA-binding protein fused with an affinity tag.
4. To verify the generation of double-stranded DNA libraries, run a small amount of PCR reaction (5 µL) on a 10% polyacrylamide gel and stain with a 0.5 µg/mL ethidium bromide solution. You should observe a single band at 83 bp and no detectable heteroduplexes (see Figure 2).
5. Purify SELEX libraries using the Qiagen MinElute PCR purification kit, following the manufacturer's protocol. To obtain high concentrations, pool 8× identical PCR reactions into 1× MinElute column and elute with 20 µL of EB Buffer (included in the kit, 10 mM Tris-HCl pH 8.5) or H2O.
6. Evaluate DNA concentration and integrity of purified SELEX libraries using a Nanodrop spectrophotometer.
Purify DNA-binding proteins fused with an affinity tag
Timing: 1-2 weeks
Here, we describe the HT-SELEX protocol using histidine-tagged SALL4 C2H2 zinc-finger cluster 4 (ZFC4). We do not provide a generic protocol for protein expression and purification, as this process should be optimized for each individual protein. The choice of expression systems and purification strategies is discussed extensively in the literature (Gräslund et al., 2008; Kielkopf et al., 2021).
For more information regarding the purification of SALL4 ZFC4, please refer to our previously published manuscript (Pantier et al., 2021). Recombinant proteins were diluted to a concentration of 0.5 mg/mL in protein buffer (20 mM Tris-HCl pH 7.5, 150 mM NaCl), and aliquots were stored at −80 °C.
Note: The addition of an affinity tag is critical both for purifying proteins from bacterial extracts and for purifying protein-DNA complexes during the SELEX protocol. We prefer the hexahistidine tag as it is small (6 residues) and allows for cost-efficient purification by immobilized metal affinity chromatography (IMAC). Other tags can be used to facilitate protein expression and solubilization (e.g., GST, MBP), but their larger size might impact the DNA binding capacity of fusion proteins.

MATERIALS AND EQUIPMENT
Alternative choices of reagents.
Alternatives: Here, we used the Phusion DNA polymerase (NEB, Cat#M0530L) to PCR amplify SELEX libraries. Other high-fidelity DNA polymerases can be used for this purpose.
Alternatives: We used Ni Sepharose 6 Fast Flow resin (Cytiva, Cat#17531806) corresponding to nickel-charged agarose beads for the purification of histidine-tagged proteins. If a different affinity tag was used, choose the appropriate reagent (e.g., glutathione resin for the purification of GST-tagged proteins).
Alternatives: Here, we used the MinElute PCR purification kit (Qiagen, Cat#28004). If using an alternative kit, check that the minimum size of purified products is compatible with the purification of SELEX libraries (83 bp).
Alternatives: Here, we used KAPA Pure beads (Roche, Cat#07983271001) to clean up high-throughput sequencing libraries. Alternative reagents can be used, such as AMPure XP beads (Beckman Coulter, Cat#A63880).
Oligonucleotides for the generation and amplification of SELEX libraries.
CRITICAL: "N" refers to random nucleotides (25% A, 25% T, 25% G, 25% C). It is important to order oligonucleotides with standard desalting only, and no extra purification step (e.g., PAGE/HPLC purification), which risks excluding some DNA sequences and biasing the randomness of libraries.
Note: HT-SELEX has been validated with random inserts ranging from 14 bp to 40 bp (Jolma et al., 2010, 2013; Nitta et al., 2015). In this protocol we chose a 20 bp insert, which covers motifs for the vast majority of sequence-specific DNA-binding proteins (i.e., those with a binding site ≤20 bp).
Oligonucleotides for the generation of high-throughput sequencing libraries.
Order the following oligonucleotides (see generation of cycle 0 libraries and SELEX protocol):

Name    Sequence
Random library 1    TACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNNNNNAGATCGGAAGAGCACACGTCTG

Note: Each "Seqlib RV" primer contains a unique 8 bp barcode (underlined) which will be used to tag HT-SELEX samples. This will allow the pooling of multiple libraries for high-throughput sequencing and their subsequent de-multiplexing. If designing additional "Seqlib RV" primers, make sure that every pair of barcodes differs by at least two mismatches, and that the base composition of barcodes is homogeneous at every position.
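The two barcode design rules above (at least two mismatches between any pair, balanced base composition at every position) can be checked programmatically before ordering primers. The following is a minimal sketch only: the function names and the example barcodes are hypothetical and are not the barcodes used in this protocol.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def too_similar_pairs(barcodes, min_mismatches=2):
    """Return barcode pairs that differ at fewer than `min_mismatches` positions."""
    return [(a, b) for a, b in combinations(barcodes, 2)
            if hamming(a, b) < min_mismatches]


def base_composition(barcodes):
    """Per-position base counts, to spot positions dominated by a single base."""
    return [Counter(column) for column in zip(*barcodes)]


# Hypothetical 8 bp barcodes, for illustration only (not the protocol's primers).
example_barcodes = ["ACGTACGT", "CGTACGTA", "GTACGTAC", "TACGTACG"]

print(too_similar_pairs(example_barcodes))  # an empty list means every pair passes
for position, counts in enumerate(base_composition(example_barcodes), start=1):
    print(position, dict(counts))
```

Any pair reported by `too_similar_pairs` should be redesigned, and a position where one base accounts for most of the counts suggests that the barcode set is unbalanced.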
Preparation of buffers for the HT-SELEX protocol.
Note: The molecular mass of poly(dI-dC) (Merck Life Science, Cat#P4929) is lot-dependent. Calculate the precise amount of water to add each time.
Preparation of 10% polyacrylamide gels for electrophoresis.
CRITICAL: Add Ammonium persulfate and Tetramethylethylenediamine (TEMED) last to induce polymerization. Quickly cast gels following the addition of these reagents.

STEP-BY-STEP METHOD DETAILS
Perform SELEX (repeat these steps 2-6 times)
Timing: 1.5 days (×2-6)
During SELEX, a library of random oligonucleotides is mixed with a DNA-binding protein of interest fused with an affinity tag. Protein-DNA complexes are purified and bound sequences are amplified by the polymerase chain reaction (PCR). This material is re-used for successive cycles of SELEX until most of the library contains high affinity binding sites. For transcription factors, 2-3 cycles are usually sufficient for successful HT-SELEX (Jolma et al., 2010, 2013). However, we performed up to 6 SELEX cycles to characterize SALL4 ZFC4, which promiscuously binds to multiple AT-rich sequences (Pantier et al., 2021).

1. Prepare buffers.
On the day of the experiment, prepare a mastermix of "SELEX binding buffer" (SELEX buffer supplemented with 5 µg/mL poly(dI-dC) and 0.5 mM DTT) and "SELEX wash buffer" (SELEX buffer supplemented with 0.5 mM DTT).
2. Equilibrate Ni Sepharose 6 Fast Flow beads in SELEX binding buffer.
a. Take out the required amount of Ni Sepharose 6 Fast Flow resin (55 µL × number of samples) and transfer it into a 1.5 mL tube (e.g., for 6 samples, take out 330 µL of Ni Sepharose 6 Fast Flow resin).
Note: The total amount of resin includes a 10% excess to account for small inaccuracies when pipetting multiple samples.
Note: If a large volume of Ni Sepharose 6 Fast Flow resin is required, split into several 1.5 mL tubes (maximum 500 µL resin/tube) and prepare additional SELEX binding buffer accordingly.
b. Add 1 mL of SELEX binding buffer and resuspend beads thoroughly by inverting the tube multiple times.
c. Centrifuge for 1 min at 400 × g. Discard the supernatant without disturbing the bead pellet.
Note: It is important to include a negative control SELEX reaction, without addition of proteins, to control for any sequence bias that could be associated with repeated PCR cycling. It is also advised to perform SELEX with independent libraries, which are used as technical replicates (see generation of cycle 0 libraries).
b. Incubate on a rotating wheel for 10 min at room temperature.
4. Purify protein-DNA complexes.
a. To capture protein-DNA complexes, add 50 µL of equilibrated Ni Sepharose 6 Fast Flow resin (from step 2) to each SELEX sample.
b. Incubate for 20 min on a rotating wheel at room temperature.
c. To remove non-specifically bound DNA-protein complexes, add 1 mL of SELEX wash buffer and resuspend beads thoroughly by inverting the tube multiple times.
d. Centrifuge for 1 min at 400 × g. Discard the supernatant without disturbing the bead pellet.
e. Wash beads 4 more times (steps 4c-d).
f. Resuspend the resin in 100 µL H2O.
Note: Elution of DNA from the beads is not necessary, as this material can be directly used as a template for PCR amplification of SELEX libraries.
Pause point: The resin (protein-DNA complexes) can be stored at −20 °C (long term). This material can be used at a later time for PCR amplification.
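The resin volumes in step 2 follow simple arithmetic: 55 µL per sample (50 µL used per sample in step 4a plus the 10% excess), split across tubes holding at most 500 µL of resin each. The sketch below only illustrates that calculation; the function name is hypothetical, and splitting the resin evenly between tubes is an assumption rather than a requirement of the protocol.

```python
import math

RESIN_PER_SAMPLE_UL = 55      # 50 µL per sample plus the 10% pipetting excess
MAX_RESIN_PER_TUBE_UL = 500   # maximum resin volume per 1.5 mL tube


def plan_resin(n_samples: int):
    """Return (total resin in µL, number of 1.5 mL tubes, resin per tube in µL)."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    total_ul = RESIN_PER_SAMPLE_UL * n_samples
    n_tubes = math.ceil(total_ul / MAX_RESIN_PER_TUBE_UL)
    # Assumption: the resin is divided evenly between tubes.
    return total_ul, n_tubes, total_ul / n_tubes


for n in (6, 12):
    total, tubes, per_tube = plan_resin(n)
    print(f"{n} samples: {total} µL resin in {tubes} tube(s), {per_tube:.0f} µL each")
```

For 6 samples this reproduces the 330 µL given in step 2a; for 12 samples the 660 µL total exceeds the per-tube limit and is split into two tubes.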

5. PCR-amplify enriched DNA.
CRITICAL: The amount of DNA bound to the resin is unknown and usually varies between SELEX samples. Therefore, it is important to empirically determine the optimal number of PCR cycles to amplify each SELEX library (see the following steps).

Reagent    Final concentration    Amount
Histidine-tagged DNA-binding protein    10 µg/mL    1 µg
SELEX DNA library (cycle N-1)    1 µg/mL (15 µg/mL for the first cycle)

d. To control the amplification of libraries, run a small amount of PCR reaction (5 µL) on a 10% polyacrylamide gel and stain with a 0.5 µg/mL ethidium bromide solution (see Figure 3).
e. For each SELEX sample, select the optimal PCR reaction and discard other tubes (see Figure 3).
f. Purify DNA using the Qiagen MinElute PCR purification kit, following the manufacturer's protocol. Elute with 20 µL of EB Buffer (included in the kit, 10 mM Tris-HCl pH 8.5) or H2O.
Note: A single PCR reaction will yield enough DNA to proceed with the protocol.
g. Evaluate DNA concentration and integrity of purified SELEX libraries using a Nanodrop spectrophotometer.
Pause point: Store purified SELEX libraries at −20 °C (long term).
h. Use DNA as an input to repeat an additional cycle of SELEX (N+1).
CRITICAL: Remember to save an aliquot of purified SELEX library (~20 ng) for high-throughput sequencing (see generation of HT-SELEX libraries for Illumina sequencing).

Generate HT-SELEX libraries for Illumina sequencing
Timing: 1.5 days
After multiple SELEX cycles, DNA libraries contain a significant proportion of high affinity DNA binding sites for the target protein. This step describes the conversion of SELEX libraries into HT-SELEX libraries containing Illumina adapters and unique barcodes (see Figure 4). These samples are subsequently pooled and submitted to high-throughput sequencing to reveal preferred DNA motifs.
6. Select SELEX samples to submit to high-throughput sequencing.

PCR cycling conditions (20 cycles)
Steps    Temperature    Time    Cycles
Initial

c. To control the amplification of libraries, run a small amount of PCR reaction (5 µL) on a 10% polyacrylamide gel and stain with a 0.5 µg/mL ethidium bromide solution (see Figure 5).
d. Purify HT-SELEX libraries using the Qiagen MinElute PCR purification kit, following the manufacturer's protocol. Elute with 20 µL of EB Buffer (included in the kit, 10 mM Tris-HCl pH 8.5) or H2O.
Note: For each SELEX sample, a single PCR reaction will yield enough DNA to proceed with high-throughput sequencing.
Note: Long PCR primers were used to generate HT-SELEX libraries, and these oligonucleotides are not completely eliminated following PCR purification with the Qiagen MinElute column.
e. Evaluate DNA concentration and integrity of purified HT-SELEX libraries using a Nanodrop spectrophotometer.
Pause point: Store purified HT-SELEX libraries at −20 °C (long term). These samples can be pooled and submitted to high-throughput sequencing at a later time.
8. Prepare a sequencing library pool and submit to high-throughput sequencing.
a. Use Nanodrop quantification to pool all HT-SELEX libraries in equimolar amounts in a 1.5 mL tube.
CRITICAL: Make sure that all libraries in the pool contain unique indexes, so that each library can be de-multiplexed following high-throughput sequencing.
b. To ensure complete removal of leftover PCR primers contaminating libraries, perform a cleanup with KAPA Pure beads following the manufacturer's protocol. Use a 3× bead-to-sample ratio (e.g., add 150 µL of beads to 50 µL of HT-SELEX pool) to eliminate oligonucleotides below 100 bp (see Figure 6).
Pause point: Store purified HT-SELEX library pool at −20 °C (long term). This material can be submitted to high-throughput sequencing at a later time.
c. Perform a final quality control on the library pool using the Agilent High Sensitivity DNA Kit and the 2100 Bioanalyzer instrument (following manufacturer's protocol) (see Figure 6).
Alternatives: Run the library pool on a 10% polyacrylamide gel and stain with a 0.5 µg/mL ethidium bromide solution, as previously described.
d. Submit the HT-SELEX library pool to high-throughput sequencing using an Illumina instrument (e.g., MiSeq/NextSeq/NovaSeq). Single-end sequencing is sufficient to cover the 20 bp insert containing putative DNA binding motifs (see Figure 4). A sequencing depth of 10,000-50,000 reads per sample should be sufficient to obtain robust quantification of DNA motifs for HT-SELEX (see troubleshooting 2).
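For the equimolar pooling in step 8a, the volume to take from each library can be derived from its Nanodrop concentration. The sketch below assumes that all HT-SELEX libraries have the same length (they share one design and differ only in barcode sequence), so that equal mass corresponds to equal molarity. The function name, sample names and concentrations are hypothetical.

```python
def pooling_volumes(conc_ng_per_ul: dict, mass_per_library_ng: float) -> dict:
    """Volume (µL) of each library that contributes the same DNA mass to the pool.

    Assumption: all libraries have the same length, so equal mass means
    equal molarity.
    """
    volumes = {}
    for name, conc in conc_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"{name}: concentration must be positive")
        volumes[name] = mass_per_library_ng / conc
    return volumes


# Hypothetical Nanodrop readings in ng/µL, for illustration only.
readings = {"cycle0": 25.0, "cycle3": 50.0, "cycle6": 40.0}
volumes = pooling_volumes(readings, mass_per_library_ng=100.0)

for name, volume in volumes.items():
    print(f"{name}: {volume:.1f} µL")
print(f"pool volume: {sum(volumes.values()):.1f} µL")
```

Because leftover primers are still present at this stage and contribute to the absorbance reading, the computed volumes are best treated as approximate.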

EXPECTED OUTCOMES
The final output of the HT-SELEX protocol is the library pool subjected to Illumina sequencing (see Figure 6). Intermediate material corresponding to protein-DNA complexes (bead suspension) and purified SELEX libraries without Illumina adapters can be stored long term at −20 °C (see Pause points during the SELEX protocol).
The section below describes a complete bioinformatic workflow to process sequencing data and quantify the enrichment of DNA motifs. SALL4 ZFC4 HT-SELEX dataset (including processed files) is available in ArrayExpress: E-MTAB-9236. Additionally, we sequenced the same libraries at higher throughput to determine the minimal sequencing depth for HT-SELEX analysis (see troubleshooting 2). This new dataset is also available in ArrayExpress: E-MTAB-11484.

QUANTIFICATION AND STATISTICAL ANALYSIS
Bioinformatic analysis
Timing: 1 day
Note: Analysis time will vary depending on the sequencing depth of HT-SELEX datasets and the length of DNA motifs (k-mers) to analyze.
1. Set up the package management system "conda" following the instructions available here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html.
2. Install all required software inside a conda environment from your command line:
3. Generate a tab-separated values (TSV) file containing metadata of your HT-SELEX samples using the following format:
4. Pre-process and quality-trim sequencing reads.
CRITICAL: Trim sequencing reads to the exact size of the library insert (in our case 20 bp). For more information regarding library design, see the materials and equipment section.
a. Execute flexbar for each individual sample using the following parameters:
Note: Use a workflow manager such as Snakemake (https://snakemake.readthedocs.io) to automate this step for all samples.
5. Calculate k-mer frequency using the Python package eme_selex (tested on Python version 3.10).
a. Calculate the abundance of 5-mer motifs for all samples using the following Python code:
Note: The choice of k-mer length (up to 10 bp, see limitations) depends on the DNA-binding protein of interest. In our case, we determined that SALL4 binds to short DNA motifs of 3-5 bp (Pantier et al., 2021).
Note: To observe DNA binding of ZFC4 according to DNA base composition, we divided all 5-mer motifs into different categories depending on their proportion of A/T nucleotides (see Figure 7). Please refer to our bioinformatic workflow documentation (https://eme-selex.readthedocs.io) for source code.
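As a rough, library-independent illustration of what this step computes, the sketch below counts all possible 5-mers in a set of trimmed reads and groups them by A/T content. It does not use or reproduce the eme_selex API; the function names and example reads are hypothetical. It counts every sliding window on the sequenced strand only, which may differ from how eme_selex defines abundance.

```python
from itertools import product


def count_kmers(reads, k=5):
    """Count each of the 4**k possible k-mers over all sliding windows."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if kmer in counts:  # skips windows containing N or other symbols
                counts[kmer] += 1
    return counts


def at_category(kmer):
    """Number of A/T bases in a k-mer (0 to k, i.e., six categories for k=5)."""
    return sum(base in "AT" for base in kmer)


def fraction_by_at_category(counts):
    """Fraction of all counted windows falling into each A/T category."""
    total = sum(counts.values())
    by_category = {}
    for kmer, n in counts.items():
        category = at_category(kmer)
        by_category[category] = by_category.get(category, 0) + n
    return {c: (n / total if total else 0.0) for c, n in sorted(by_category.items())}


# Hypothetical trimmed 20 bp reads, for illustration only.
example_reads = ["AATATTAATATTGCGCATAT", "GCGCGGCCGCATATTAATAT"]
counts = count_kmers(example_reads, k=5)
print(fraction_by_at_category(counts))
```

Initialising the dictionary with every possible k-mer keeps motifs with zero counts in the output, which matters when comparing libraries in which a motif is absent from one sample.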
CRITICAL: Always compare the enrichment of k-mers (DNA motifs) with the initial random library (cycle 0) and negative control (see Figure 7). These controls will confirm the specific enrichment of DNA motifs during the SELEX protocol.
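One simple way to express this comparison is a log2 ratio of k-mer fractions between a selected library and the cycle 0 library, computed in the same way for the negative control. The sketch below is an assumption-laden illustration, not the eme_selex implementation: the function name, the pseudocount of 1 (used to avoid division by zero) and the example counts are all hypothetical.

```python
import math


def log2_enrichment(selected_counts, cycle0_counts, pseudocount=1.0):
    """log2 ratio of k-mer fractions (selected library / cycle 0 library)."""
    kmers = set(selected_counts) | set(cycle0_counts)
    # The pseudocount is added to every k-mer, so totals grow accordingly.
    sel_total = sum(selected_counts.values()) + pseudocount * len(kmers)
    ref_total = sum(cycle0_counts.values()) + pseudocount * len(kmers)
    result = {}
    for kmer in kmers:
        sel = (selected_counts.get(kmer, 0) + pseudocount) / sel_total
        ref = (cycle0_counts.get(kmer, 0) + pseudocount) / ref_total
        result[kmer] = math.log2(sel / ref)
    return result


# Hypothetical counts for two 5-mers, for illustration only.
cycle0 = {"AATAT": 100, "GCGCG": 100}
cycle6 = {"AATAT": 300, "GCGCG": 100}
negative_control = {"AATAT": 105, "GCGCG": 95}

print("protein:", log2_enrichment(cycle6, cycle0))
print("control:", log2_enrichment(negative_control, cycle0))
```

A motif that rises in the protein sample but stays close to zero in the negative control is more likely to reflect protein binding than PCR bias.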
Note: Here, we observed a progressive enrichment of a large number of AT-rich k-mers throughout the SELEX protocol (cycles 1/3/6), which confirmed promiscuous binding of SALL4 ZFC4. In the case of specific DNA binding, only a few DNA motifs would have been enriched, with high similarity to the most abundant k-mer (Jolma et al., 2010).

Figure 7. Enrichment of all possible 5-mer DNA motifs during the HT-SELEX protocol for SALL4 ZFC4 (blue) compared to negative control (gray)
DNA motifs (k-mers) were divided into six categories of increasing A/T content. Error bars indicate the technical variability with independent SELEX libraries.

OPEN ACCESS
STAR Protocols 3, 101490, September 16, 2022

LIMITATIONS
HT-SELEX relies on the detection of protein-DNA interactions in vitro. Alternative HT-SELEX protocols were developed to study binding to other substrates such as methylated DNA (Yin et al., 2017) and RNA (Jolma et al., 2020). However, this technique is not suitable for proteins binding indirectly to DNA, for example via interactions with histones or via protein-protein interactions with transcription factors.
It is often necessary to express small protein fragments (e.g., C2H2 zinc-fingers, Homeodomain) rather than full-length proteins. However, this strategy is not possible for proteins for which the DNA-binding domain has not yet been mapped.
Our Python package ''eme_selex'' is developed to analyze and quantify the abundance of k-mers up to 10 bp, which is sufficient for most transcription factors. Analyzing k-mers of length 11 bp or higher is computationally challenging for a personal computer, and is therefore not possible at this point using eme_selex.
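The computational limit follows from the size of the k-mer space, which grows as 4^k. The short sketch below tabulates this growth and, under the simplifying assumption of uniformly random 20 bp reads, the average number of sliding windows observed per k-mer at a given depth. The function names and the 10,000-read example are illustrative only.

```python
def n_kmers(k: int) -> int:
    """Number of distinct DNA k-mers."""
    return 4 ** k


def mean_observations_per_kmer(n_reads: int, k: int, insert_len: int = 20) -> float:
    """Average windows per k-mer, assuming uniformly random reads of insert_len bp."""
    windows_per_read = insert_len - k + 1
    return n_reads * windows_per_read / n_kmers(k)


for k in range(5, 12):
    print(k, n_kmers(k), round(mean_observations_per_kmer(10_000, k), 3))
```

There are 1,024 possible 5-mers but 1,048,576 possible 10-mers, and each additional base multiplies the table size by four.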

TROUBLESHOOTING
Problem 1
How to determine optimal PCR conditions to amplify SELEX libraries.
Over-amplification or excessive amounts of DNA template will result in the formation of heteroduplexes (also known as "bubble products") due to annealing of mismatched sequences (Thompson et al., 2002; Kanagawa, 2003). These unwanted products containing secondary structures can be detected by gel electrophoresis, as they run higher than their expected size (see Figures 3 and 8).

Potential solution
To determine optimal PCR conditions to amplify SELEX libraries (see generation of cycle 0 libraries and SELEX protocol), two strategies can be adopted: Perform the same PCR multiple times with increasing amounts of DNA template and a fixed number of PCR cycles (see Figure 8).

Alternatively, perform the same PCR multiple times with a fixed amount of DNA template and increasing numbers of PCR cycles (see Figure 3).

Problem 2
How to determine the optimal sequencing depth for HT-SELEX analysis.

Potential solution
In our previous study (Pantier et al., 2021), we sequenced SALL4 ZFC4 HT-SELEX libraries with an average sequencing depth of 20,000 reads per sample (ArrayExpress: E-MTAB-9236). In order to determine the optimal sequencing depth for HT-SELEX analysis (see generation of HT-SELEX libraries for Illumina sequencing), we re-sequenced the same libraries with a very high coverage of ~3,000,000 reads per sample (ArrayExpress: E-MTAB-11484). Using this new dataset, we simulated varying coverages by sub-sampling 500,000, 50,000 and 10,000 reads, respectively. For all conditions, we calculated the abundance of all k-mers (from 5 to 10 bp) using eme_selex, and compared their ranks with the highest coverage dataset (see Figure 9). We found a very high correlation (Spearman R²) between samples at all sequencing depths for short DNA motifs (k-mer length 5-6 bp), corresponding to ZFC4 binding sites. These results indicate that accurate quantification of k-mer abundance can still be obtained at low sequencing coverage (~10,000 reads per sample). Higher sequencing coverage (at least 500,000 reads per sample) would be recommended to investigate promiscuous binding to long DNA motifs (k-mer length >7 bp).
Note: The number of DNA motifs increases exponentially when k increases from 5 to 10 bp. Comparing the abundance of k-mers across varying sequencing depths is meaningful only for proteins binding promiscuously to a large number of DNA motifs. For more information regarding the overlap of top-ranking DNA motifs, please refer to our bioinformatic workflow documentation: https://eme-selex.readthedocs.io/en/latest/coverage.html.
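The sub-sampling comparison described above can be sketched in plain Python: draw a random subset of reads, count k-mers in the subset and in the full dataset, and compute the Spearman correlation of the two count vectors. This is an illustration under stated assumptions and not the analysis code behind Figure 9. All function names and the example reads are hypothetical, and the function returns the Spearman coefficient, which can be squared to obtain R².

```python
import random
from collections import Counter


def count_kmers(reads, k):
    """Count k-mers over all sliding windows of every read."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / (var_x * var_y) ** 0.5


def depth_correlation(reads, depth, k, seed=0):
    """Spearman correlation of k-mer counts: random sub-sample vs all reads."""
    full = count_kmers(reads, k)
    sub = count_kmers(random.Random(seed).sample(reads, depth), k)
    kmers = sorted(full)  # k-mers missing from the sub-sample count as 0
    return spearman([full[m] for m in kmers], [sub[m] for m in kmers])


# Hypothetical reads, for illustration only.
reads = ["AATATTAATA", "GCGCATATTA", "AATATAATAT", "TTAATATTGC"]
print(depth_correlation(reads, depth=2, k=3))
```

With real data, `depth` would be set to values such as 10,000, 50,000 and 500,000 and the correlation compared across k-mer lengths.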

RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Adrian Bird (a.bird@ed.ac.uk).