Frequent sgRNA-barcode recombination in single-cell perturbation assays

Simultaneously detecting CRISPR-based perturbations and induced transcriptional changes in the same cell is a powerful approach to unraveling genome function. Several lentiviral approaches have been developed, some of which rely on the detection of distally located genetic barcodes as an indirect proxy of sgRNA identity. Since barcodes are often several kilobases from their corresponding sgRNAs, viral recombination-mediated swapping of barcodes and sgRNAs is feasible. Using a self-circularization-based sgRNA-barcode library preparation protocol, we estimate the recombination rate to be ~50% and we trace this phenomenon to the pooled viral packaging step. Recombination is random, and decreases the signal-to-noise ratio of the assay. Our results suggest that alternative approaches can increase the throughput and sensitivity of single-cell perturbation assays.


Introduction
Recently, single-cell RNA sequencing (scRNA-seq) has been coupled with CRISPR-mediated perturbations, allowing functional assessment of genes (Perturb-seq, CRISP-seq, CROP-seq) [1][2][3] and enhancers (Mosaic-seq) [4] with a transcriptomic readout. All of these techniques deliver CRISPR components to cells through a lentiviral system, and each has devised its own strategy to detect sgRNAs through scRNA-seq. Since the scRNA-seq strategies used are 3'-biased, most of these approaches insert a molecular barcode immediately before the poly(A) signal as an indirect proxy of sgRNA expression in each cell (Fig 1). Therefore, the accuracy and sensitivity of these approaches rely on pre-identification of sgRNA-barcode relationships and unambiguous recovery of barcode information in every cell assayed.
However, barcoding could introduce noise due to lentiviral recombination. Two viral genomes are packaged into each lentiviral/retroviral particle [5], and are non-covalently linked [6]. During viral genome replication, the reverse transcriptase can switch from one template to the other as it synthesizes a DNA provirus from the dimeric RNA genome, and this process happens most frequently at homologous regions [7][8][9]. The frequency of recombination depends on the distance between the two regions and has been estimated at 2% per kilobase [7,10]. Thus, when libraries of distinct sgRNA-barcode viruses are packaged together in single-cell perturbation assays, template switching could lead to barcode recombination that randomly shuffles sgRNA/barcode linkages. This event would interfere with the accurate detection of sgRNAs. A similar concern has also been raised recently for lentivirus-based genetic screening technologies [11].

Cell lines and culture
K562 cells were cultured in IMDM plus 10% FBS and Pen/Strep at 37˚C and 5% CO2. HEK293T cells were cultured in DMEM with 10% FBS and Pen/Strep. Both cell lines were obtained from ATCC (CCL-243 and CRL-3216, respectively).

Plasmids
The lenti-sgRNA(MS2)-puro plasmid (Addgene ID: 73795) was used for sgRNA expression. A 12-bp barcode region flanked by BsrGI and EcoRI cut sites was inserted into this plasmid using overlap PCR and Gibson assembly. Specifically, a 108-bp oligo containing a 12-bp random sequence was synthesized and amplified by PCR to yield double-stranded DNA. This fragment was then inserted into the linearized plasmid (cut with BsrGI and EcoRI) by Gibson assembly. After transformation, single clones were selected, and the barcode sequence of each clone was confirmed by Sanger sequencing. sgRNAs were inserted using BsmBI and T7 ligase, following the Golden Gate assembly protocol from the laboratory of Feng Zhang [12]. To minimize bacterial recombination, all plasmids were transformed into Stellar Competent Cells (Clontech) and grown at 30˚C.

American Heart Association fellow (heart.org, 16POST29910007). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests:
The authors have declared that no competing interests exist.

Virus packaging, titration and infection
For virus packaging, 293T cells were seeded in a 6-cm dish (3 × 10⁶ cells) one day before transfection. The indicated viral plasmid(s) were co-transfected with the lentiviral packaging plasmids pMD2.G and psPAX2 (Addgene ID 12259 and 12260) at a 4:2:3 ratio using linear polyethylenimine (PEI). Twelve hours after transfection, the medium was changed to fresh DMEM with 10% FBS plus Pen/Strep. Seventy-two hours after transfection, the virus-containing medium was collected, passed through a 45 μm filter, and aliquoted into 1.5 ml tubes. Viruses were stored at -80˚C before infection or titration. Viruses were then titrated and used for infection as described previously [4]. For infection of K562 cells, 2 × 10⁵ cells (in 500 μl medium, with 8 ng/μl polybrene) were used. After mixing with the indicated amount of virus stock, the cells were centrifuged at 1000g for 1 hour at 37˚C and then returned to the incubator. The following day, the medium was replaced with fresh medium containing 1 μg/μl puromycin. The cells were selected for 7 days, with the medium refreshed every two days, and then collected for genomic DNA extraction and downstream library preparation.

Construction of sequencing libraries
Library construction was performed as previously described [4], with some modifications. Briefly, a 3-kb amplicon flanked by the sgRNA and barcode sequences was amplified from plasmids or genomic DNA extracts. The fragment was then self-circularized, and a second round of PCR was performed to yield a 400-bp fragment with the sgRNA and barcode adjacent to each other. The detailed protocol is available through protocols.io (dx.doi.org/10.17504/protocols.io.pufdntn).

Analysis
The Illumina NextSeq500 bcl files were de-multiplexed using bcl2fastq (Illumina). The fastq files of the two reads were then combined, and reads with any base below a quality score of 10 were discarded. The sgRNA and barcode sequences were extracted and compared with the known list, allowing 2-bp mismatches. The total reads per barcode-sgRNA pair were summarized and used to plot the figures.
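The filtering and matching steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the sgRNA and barcode sequences have already been extracted from each combined read, that quality scores are available as integer Phred values, and that a read is kept only when it matches exactly one known sequence within the mismatch limit (the handling of ambiguous matches is our assumption). All function names are hypothetical.

```python
from collections import Counter


def passes_quality(qualities, min_q=10):
    """Keep a read only if every base has a quality score of at least min_q."""
    return all(q >= min_q for q in qualities)


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def best_match(seq, known, max_mismatches=2):
    """Return the name of the single known sequence within max_mismatches.

    Returns None when nothing matches or when the match is ambiguous
    (assumption: ambiguous reads are discarded).
    """
    hits = [
        name
        for name, ref in known.items()
        if len(ref) == len(seq) and hamming(seq, ref) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def count_pairs(reads, sgrnas, barcodes, min_q=10, max_mismatches=2):
    """Count reads per (barcode, sgRNA) pair.

    reads: iterable of (sgrna_seq, barcode_seq, qualities) tuples
    (hypothetical input format, already parsed from the fastq files).
    """
    counts = Counter()
    for sg_seq, bc_seq, quals in reads:
        if not passes_quality(quals, min_q):
            continue
        sg = best_match(sg_seq, sgrnas, max_mismatches)
        bc = best_match(bc_seq, barcodes, max_mismatches)
        if sg is not None and bc is not None:
            counts[(bc, sg)] += 1
    return counts
```

The resulting counter maps each (barcode, sgRNA) pair to its read count, which is the quantity summarized in the figures.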

Results
To systematically measure the noise introduced by viral recombination during Mosaic-seq, we individually cloned 20 unique sgRNAs into backbones with known barcode sequences (Fig 2A). We then monitored how pooling the samples at the transformation, viral packaging, or viral infection steps affected sgRNA-barcode recombination. To directly measure sgRNA-barcode pairs in each sample, we constructed deep sequencing libraries from plasmid pools and genomic DNA extracts. However, this measurement is complicated by the large distance (~3 kb in Mosaic-seq) separating each sgRNA from its barcode. Our strategy involves PCR amplification of ~3-kb sgRNA-barcode amplicons followed by a self-circularization step, which reduces the sgRNA/barcode distance to a sequenceable ~400 bp (Fig 2B).
Since self-circularization is mediated by ligation, the method we use to assess recombination rates could itself introduce noise. To quantify this noise, we first examined Plasmid Library 2 (PL2), in which every sgRNA-barcode plasmid was constructed, transformed and extracted separately. We observed that 74.0% of reads (median) for each barcode are correctly linked to the known sgRNA partner, while the remaining 26.0% of reads are randomly linked to other sgRNAs (Fig 3A). This random collision rate correlates with the total abundance of each sgRNA in the library. As PL2 plasmids were independently processed, sgRNA-barcode recombination should be negligible. Therefore, the 26.0% noise we observed is likely derived from our ligation-mediated method for detecting recombination.
We then examined recombination after pooled bacterial transformation (PL1). A median of 79.1% of reads exhibited correct sgRNA-barcode linkages, suggesting that pooled transformation does not significantly contribute to recombination in a library of sgRNA-barcode plasmids.
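The per-barcode statistic reported here (the fraction of a barcode's reads linked to its expected sgRNA, summarized as a median across barcodes) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the input is a mapping from (barcode, sgRNA) pairs to read counts, the expected pairing is supplied as a dictionary, and the function names are hypothetical.

```python
from statistics import median


def correct_fraction_per_barcode(pair_counts, expected):
    """Fraction of each barcode's reads that carry its expected sgRNA.

    pair_counts: dict mapping (barcode, sgrna) -> read count
    expected:    dict mapping barcode -> expected sgrna
    """
    totals = {}
    correct = {}
    for (bc, sg), n in pair_counts.items():
        totals[bc] = totals.get(bc, 0) + n
        if expected.get(bc) == sg:
            correct[bc] = correct.get(bc, 0) + n
    return {bc: correct.get(bc, 0) / totals[bc] for bc in totals}


def median_correct_fraction(pair_counts, expected):
    """Median across barcodes of the correctly linked read fraction."""
    return median(correct_fraction_per_barcode(pair_counts, expected).values())
```

Applied to a library in which plasmids were processed separately, one minus this median gives an estimate of the baseline noise contributed by the ligation step itself.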
Next, we examined sgRNA-barcode pairs after viral integration into the human genome. Samples were pooled at one of four stages (Golden Gate ligation, transformation, viral packaging, or infection), and sgRNA-barcode sequencing libraries were constructed from genomic DNA (Fig 2A). Two genomic DNA libraries in which sgRNA-barcode lentiviruses were individually packaged (GL3-4) maintained the correct sgRNA-barcode linkages (median of 83.6% and 84.4%, respectively) (Fig 3A), which is comparable to the plasmid libraries PL1-2.
In contrast, genomic DNA libraries in which plasmid libraries were pooled prior to viral packaging (GL1 and GL2) exhibited significant sgRNA-barcode recombination. The most abundant sgRNA of each barcode accounts for less than half of the reads (median of 42.2% and 41.3%, respectively), a loss of roughly 50% relative to GL3-4. Recombination is random, and none of the incorrect sgRNA-barcode pairs are dominant over the expected pairs (Fig 3B). These results suggest that, with a strategy in which sgRNAs are separated from barcodes by several kilobases, recombination will be frequent if plasmid libraries are pooled prior to viral packaging.
To further test whether recombination depends on viral titer, we infected cells at high and low multiplicity of infection (MOI = 2 and MOI = 0.2). Based on Poisson statistics, >90% of antibiotic-selected cells are expected to be infected by exactly one virus at MOI = 0.2, which we hypothesized would reduce observed recombination rates compared to cells infected at high MOI. However, we observed no significant difference in recombination between the high and low MOI samples (Fig 3A), suggesting that the observed sgRNA-barcode shuffling is not due to recombination between multiple viruses infecting a single cell.
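The Poisson estimate quoted for the MOI comparison can be reproduced with a short Python sketch. It assumes that the number of viruses per cell is Poisson-distributed with mean equal to the MOI and that antibiotic selection retains exactly those cells carrying at least one virus; the function name is hypothetical.

```python
import math


def single_infection_fraction(moi):
    """Among cells with at least one virus, the fraction with exactly one.

    Assumes virus counts per cell follow a Poisson distribution with
    mean `moi`, and that selection keeps all cells with >= 1 virus.
    """
    p_exactly_one = moi * math.exp(-moi)
    p_at_least_one = 1.0 - math.exp(-moi)
    return p_exactly_one / p_at_least_one


low = single_infection_fraction(0.2)   # low-MOI condition
high = single_infection_fraction(2.0)  # high-MOI condition
```

Under these assumptions, about 90% of selected cells carry a single virus at MOI = 0.2, compared with roughly 31% at MOI = 2, so the two conditions differ substantially in the opportunity for recombination between co-infecting viruses.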

Discussion and conclusions
Here we used a self-ligation-based method to assess the recombination between sgRNAs and barcodes during Mosaic-seq. While our method has a relatively high baseline level of noise (~20%), our data confirm sgRNA-barcode recombination during pooled preparation of Mosaic-seq libraries. Recombination is random and accounts for ~50% of reads. While this noise is unlikely to create false positive hits, it does reduce the overall signal-to-noise ratio of the assay, which we expect will decrease sensitivity. We postulate that similar recombination events will exist in other methods that rely on lentiviral/retroviral delivery systems coupled to indirect detection of DNA-based barcodes.
We observed that recombination only occurs when libraries are pooled before the virus production step, independent of the viral titer used during infection. This suggests that recombination predominantly occurs between two viral genomes packaged into the same virion, but not between distinct viruses infecting the same cell. Thus, at low throughput, this problem can be overcome by constructing and packaging each virus separately. However, for large-scale library preparation, the CROP-seq sgRNA plasmid is an improved solution [3]. In CROP-seq, the sgRNA cassette is inserted into the 3' LTR of the virus, and becomes part of the puromycin-resistance mRNA transcribed from the EF1α promoter. Therefore, the sgRNA can be directly detected by scRNA-seq without the use of indirect barcodes. Moreover, CROP-seq dramatically simplifies the construction of large-scale sgRNA libraries, since barcodes do not need to be constructed. By reducing sgRNA/barcode recombination, the sensitivity of single-cell perturbation assays could increase substantially. During preparation of this manuscript, similar observations were independently reported [13][14][15]. We believe that these improvements will significantly expand the application of single-cell perturbation assays, enabling the construction of large-scale libraries to systematically perturb and unravel transcriptional regulation from a systems perspective.