Efficient acquisition of tens of thousands of short tandem repeats in single-cell whole-genome-amplified DNA


SUMMARY
Short tandem repeats (STRs) are highly abundant in the human genome, but existing approaches for accurate genotyping of STRs are limited. Here, we describe a protocol for duplex molecular inversion probes for high-throughput and cost-effective STR enrichment. We have successfully tested panels targeting as many as 50K STRs in several thousand genomic samples (e.g., HeLa cells, Du145 cells, leukemia cells, melanoma cells). However, because the protocol is plate based, the sample size is limited to a few thousand. For complete details on the use and execution of this protocol, please refer to Tao et al. (2021).

BEFORE YOU BEGIN
The protocol below describes the specific steps for using whole-genome-amplified genomic DNA (REPLI-g Mini Kit, Qiagen) from Du145 single cells with the 12K OM6 STR panel (Custom Array) presented in our Cell Reports Methods paper (Tao et al., 2021). We have also used this protocol with primary cells, such as melanoma cells, leukemia cells, T cells, and macrophages, and with other whole genome amplification kits, such as the REPLI-g Single Cell Kit, the Ampli1 WGA Kit, and the MALBAC Single Cell WGA Kit.

Timing: [2 days]
Prepare the duplex molecular inversion probes for OM6, a 12K panel of selected human STRs, and use them to enrich these targets from the single-cell WGA DNA in the following steps.
b. Keep the mix at 56°C on a heat block.
c. When the hybridization step is finished, transfer the reaction plate from the PCR machine to a 56°C heat block.
d. Add 10 µL of Gap Filling Mix to each well, mix carefully by pipetting, seal tightly, and quickly return the plate to the PCR machine.
e. Incubate at 56°C for 4 h, inactivate at 68°C for 20 min, then hold at 4°C until the next step.
Pause point: After the gap filling step, the reaction plate can be stored at 4°C for up to two days.

Digestion of linear DNA:
a. Prepare Digestion Mix 15 min before gap filling ends.
b. Retrieve the reaction plate from the PCR machine. Note: Take care when removing the cover.
c. Add 2 µL of the Digestion Mix to each well and mix.
d. Spin down the reaction plate and seal.
e. Incubate at 37°C for 60 min, 80°C for 10 min, and 95°C for 5 min.
Pause point: The reactions can be stored at −20°C for at least 2 months after the digestion step.
CRITICAL: Seal the plate tightly to avoid evaporation.

Timing: [4 days]
Illumina sequencing adapters and a unique barcode per cell are added by a barcoding PCR. All samples are then pooled into one tube, first in equal volume and then in equal molar concentration. The pools are size selected by Blue Pippin to remove dimers and by-products. Library pools that pass quality control are sequenced on MiSeq or NextSeq with default Illumina sequencing primers. Figure 2, reused from panel 1 of Supplementary Figure 1 in our Cell Reports Methods paper (Tao et al., 2021), confirms a single peak around 300 bp. Troubleshooting 2
f. Dilute the size-selected pool to make 12 µL of 4 nM (4 fmol/µL) library for Illumina NGS, calculated from the concentration and average size reported by the Tape Station.
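The dilution in step 6f can be checked numerically. The following is a minimal sketch, not part of the original protocol: it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names and example values are hypothetical.

```python
# Assumption: average mass of one base pair of dsDNA, in g/mol.
AVG_BP_MASS = 660.0


def molarity_nM(conc_ng_per_ul, avg_size_bp):
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    if conc_ng_per_ul <= 0 or avg_size_bp <= 0:
        raise ValueError("concentration and size must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * avg_size_bp)


def dilution_volumes(conc_ng_per_ul, avg_size_bp, target_nM=4.0, final_ul=12.0):
    """Return (library_ul, diluent_ul) needed to reach target_nM in final_ul."""
    stock_nM = molarity_nM(conc_ng_per_ul, avg_size_bp)
    if stock_nM < target_nM:
        raise ValueError("pool is already below the target molarity")
    library_ul = target_nM * final_ul / stock_nM
    return library_ul, final_ul - library_ul


if __name__ == "__main__":
    # Hypothetical Tape Station readout: 1.98 ng/uL, 300 bp average size.
    lib, diluent = dilution_volumes(1.98, 300)
    print(f"library: {lib:.2f} uL, diluent: {diluent:.2f} uL")
```

If the pool is already below 4 nM, the sketch raises an error rather than returning a negative diluent volume; Troubleshooting 1 covers that situation.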
7. Diagnostic sequencing (~17 h for sequencing, ~2 h for analysis) Troubleshooting 5
a. Sequence at 10 pM loading concentration. We recommend running a 300-cycle MiSeq Nano flow cell in paired-end mode. Set Read1 and Read2 to 151 and both Index1 and Index2 reads to 8. The minimum read length we have tested is 125 × 2 paired-end, which allows sequencing through the repeat regions of most STRs in our design. Default sequencing primers suffice.
c. Map merged reads with bowtie2 against a customized STR reference (as shown in Figure 3) of all amplicons, in which each amplicon appears multiple times, once with every possible STR length.
d. For more details, parallel execution, and integration with the clineage analysis system, please see the code at: https://github.com/shapirolab/clineage/blob/master/sequencing/analysis/full_msv/full_msv.py
e. Extract the total number of reads per sample from "sorted_assignment_bam" with pysam.
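The customized STR reference used for mapping in step 7 lists each amplicon once for every candidate repeat length. The following is an illustrative sketch of that idea only; the helper names, header format, flank sequences, and repeat range are hypothetical, and the actual reference construction is the one in the clineage repository.

```python
def str_reference_entries(name, left_flank, unit, right_flank,
                          min_repeats, max_repeats):
    """Yield (header, sequence) pairs, one per candidate repeat count."""
    if min_repeats < 0 or max_repeats < min_repeats:
        raise ValueError("invalid repeat range")
    for n in range(min_repeats, max_repeats + 1):
        # Hypothetical header format: locus name, repeat unit, repeat count.
        yield f"{name}_{unit}x{n}", left_flank + unit * n + right_flank


def to_fasta(entries):
    """Render (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in entries)


if __name__ == "__main__":
    # Toy locus with made-up flanks and a CA repeat of 2 to 4 units.
    entries = str_reference_entries("locus1", "ACGT", "CA", "TTGA", 2, 4)
    print(to_fasta(entries), end="")
```

A FASTA file built this way could be indexed for bowtie2, so that the reference entry a read maps to indicates the repeat length it supports.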
8. Balancing reads per sample
a. Calculate the scaling volume for each sample from the total number of reads extracted from the diagnostic sequencing result, so as to equalize the read coverage per sample. For example, if sample A got 500 reads and sample B got 1000 reads in the diagnostic sequencing, pooling 2 µL of sample A with 1 µL of sample B equalizes the read coverage in the subsequent production sequencing.
b. According to the scaling volumes, pool purified samples from step (5a) manually or by Echo550, then concentrate by MinElute and elute in 35 µL ddH2O.
c. Prepare the production sequencing library for the pooled samples as in step (6).
9. Production sequencing (~29 h for sequencing)
The minimum number of reads per sample is 1M, and the minimum read length is 125 × 2 paired-end. We recommend sequencing up to 200 samples on one NextSeq500 high output flow cell with 151 × 2 paired-end run parameters, according to the manufacturer's manual and relying on default sequencing primers. Set both Index1 and Index2 to 8. Load at 1.8-2.2 pM concentration. (Figure 3)
Optional: If the production sequencing does not generate enough reads for some samples (i.e., over 1M reads for samples enriched with the OM6 panel), another round of NextSeq can be conducted using the same library for these samples. Consider HiSeq or NovaSeq platforms for large-scale projects.
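The read balancing in step 8a amounts to pooling each sample at a volume inversely proportional to its diagnostic read count. The following is a minimal sketch under that assumption; the function name and the choice to give the best-covered sample the minimum volume are hypothetical and not taken from the clineage code.

```python
def scaling_volumes(reads_per_sample, min_volume_ul=1.0):
    """Return a pooling volume (uL) per sample, inversely proportional to reads.

    The sample with the most diagnostic reads gets min_volume_ul; every
    other sample gets proportionally more.
    """
    if not reads_per_sample:
        raise ValueError("no samples given")
    if any(reads <= 0 for reads in reads_per_sample.values()):
        raise ValueError("every sample needs a positive read count")
    max_reads = max(reads_per_sample.values())
    return {
        sample: min_volume_ul * max_reads / reads
        for sample, reads in reads_per_sample.items()
    }


if __name__ == "__main__":
    # Example from step 8a: sample A got 500 reads, sample B got 1000 reads.
    print(scaling_volumes({"A": 500, "B": 1000}))
```

Samples with very few diagnostic reads would get very large volumes under this rule, so in practice an upper volume limit or exclusion of failed samples may be needed.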

EXPECTED OUTCOMES
We expect a precursor size of ~150 bp and a probe size of ~110 bp after digestion, as shown in Figure 1. The sequencing-ready library after size selection and purification should be ~300 bp as detected by Tape Station, with no or minimal primer dimers at 170-240 bp (see Figure 2).

LIMITATIONS
Poor quality of whole genome amplified genomic DNA may prevent hybridization, gap fill, and full library preparation. The protocol is plate-based, so the sample size is limited to a few thousand.

TROUBLESHOOTING

Problem 1
After size selection by Blue Pippin, the DNA concentration of the sequencing library is too low to load on the Illumina sequencer. [Step 6d]

Potential solution
Increase the pooling volume per sample from 2 µL to 5 µL for the Blue Pippin loading pool. Use the same elution volume of 40 µL, so that more DNA is loaded in Blue Pippin.

Problem 2
Primer dimers at 170-240 bp are still present at a significant ratio relative to the desired library peak around 300 bp in diagnostic libraries, as detected by Tape Station after size selection by Blue Pippin. [Step 6e]

Potential solution
Check the quality of the single-cell WGA DNA by size and concentration, and make sure to use good-quality WGA DNA for the majority of samples.

Problem 3
A significant by-product larger than 300 bp is detected by Tape Station in the probe production PCR. [Step 3]

Potential solution
Check the template concentration used in the production PCR and make sure to dilute it to 1 ng/µL; reduce the number of production PCR cycles to 10 or 11.

Problem 4
A significant amount of undigested probe (~150 bp) remains at the Tape Station quality control step. [Step 4]

Potential solution
Check the concentration of the input precursor again to make sure that a concentration of <30 ng/µL is used in the digestion reaction. With the same digestion settings, digest the probes again, purify by MinElute, and run quality control by Tape Station.

Problem 5
Low sequencing quality is reported by the Illumina sequencer, including low passing-filter clusters and low Q30. [Step 7]

Potential solution
Consider the sequencing complexity in both the amplicon region and the index region, especially when handling a small panel (<100 targets) and a small number of samples (<20). Spiking in 20% PhiX in such cases can help improve the overall sequencing quality.

RESOURCE AVAILABILITY
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Ehud Shapiro: ehud.shapiro@weizmann.ac.il

Materials availability
This study did not generate new unique reagents.

Data and code availability
The data supporting the current study are subject to the rules and regulations of the ethical committee of the Weizmann Institute of Science. Requests for data should be directed to the lead contact, Ehud Shapiro: ehud.shapiro@weizmann.ac.il
For further details regarding the computational analysis, parallel execution, and the cell lineage system, please see: https://github.com/shapirolab/clineage