CASB: A concanavalin A-based sample barcoding strategy for single-cell sequencing

Sample multiplexing facilitates single cell sequencing by reducing costs, revealing subtle difference between similar samples, and identifying artifacts such as cell doublets. However, universal and cost-effective strategies are rather limited. Here, we reported a Concanavalin A-based Sample Barcoding strategy (CASB), which could be followed by both single-cell mRNA and ATAC (assay for transposase accessible chromatin) sequencing techniques. The method involves minimal sample processing, thereby preserving intact transcriptomic or epigenomic patterns. We demonstrated its high labeling efficiency, high accuracy in assigning cells/nuclei to samples regardless of cell type and genetic background, as well as high sensitivity in detecting doublets by two applications: 1) CASB followed by scRNA-seq to track the transcriptomic dynamics of a cancer cell line perturbed by multiple drugs, which revealed compound-specific heterogeneous response; 2) CASB together with both snATAC-seq and scRNA-seq to illustrate the IFN-γ-mediated dynamic changes on epigenome and transcriptome profile, which identified the transcription factor underlying heterogeneous IFN-γ response.


27
Single-cell mRNA sequencing (scRNA-seq) and single-nucleus assay for transposase-accessible 28 chromatin using sequencing (snATAC-seq) have emerged as powerful technologies for interrogating 29 the heterogeneous transcriptional profiles and chromatin landscapes of multicellular subjects 30 (Buenrostro,  endeavor, which introduce sample barcodes using either genetic or non-genetic mechanisms. 51 Genetically, researchers have used various strategies to express an exogenous gene with sample-52 specific barcodes at its 3' UTR, which can be captured similarly as endogenous genes (Hurley et al.,53 2020, Weinreb et al., 2020); non-genetically, people have used oligonucleotide containing a sample 54 barcode followed by a poly-A sequences, which can be immobilized on the cell or nuclear membrane 55 through anchoring molecules (e.g. antibody and lipid) ( (Srivatsan, McFaline-Figueroa et al., 2020), and then captured during 58 reverse transcription. Although being already quite powerful, each of these methods has still its own 59 liabilities, including issues with scalability, universality or the potential to introduce artefactual 60 perturbations. Moreover, all of these methods have only been combined with scRNA/snRNA-seq, and 61 have not yet been applied and are likely incompatible with snATAC-seq. 62 Here, we developed a Concanavalin A-based Sample Barcoding strategy (CASB) that overcomes many 63 of these limitations. Taking advantage of the glycoprotein-binding ability of concanavalin A (ConA), 64 CASB was used to label cell or nucleus with biotinylated single-strand DNA (ssDNA) through a 65 streptavidin bridge. CASB could be easily adapted into scRNA/snATAC-seq workflows, and showed 66 high accuracy in assigning cells or nuclei regardless of genetic background as well as in resolving cell 67 doublets. The application of CASB in samples with time-series experiments, followed by scRNA-and/or 68 snATAC-seq, allows revealing diverse transcriptome/epigenome dynamics. 69

CASB enables cell and nucleus labeling with ssDNA 71
The CASB consists of three components: biotinylated ConA, streptavidin and biotinylated ssDNA as 72 barcoding molecules. Both ConA and streptavidin form homo-tetramer autonomously, allowing the 73 assembly of ConA-streptavidin-ssDNA complex (Fig 1A). Relying on the glycoprotein-binding ability of 74 ConA, such assembled complex can be immobilized on the cell or nuclear membrane (Fig 1A). To 75 measure how many ssDNA molecules can be immobilized on the cell membrane, a biotinylated ssDNA 76 with 5' and 3' PCR handles flanking an eight-nucleotide (N8) random sequence was used to label the 77 cells ( Fig 1A). After incubation with different quantity of preassembled ConA-streptavidin-ssDNA 78 complex in PBS at 4 °C (Methods), the number of ssDNA molecules immobilized on mouse embryonic 79 stem cells (mESC) was quantified using qPCR. As shown in Figure 1B, the amount of ssDNA 80 immobilized on cells increased with the increased usage of ConA-streptavidin-ssDNA complex, and 81 could reach as many as 50,000 molecules per cell. To test whether ssDNA may fall off from labeled 82 cells and cause cross-contamination during sample pooling, a mouse embryonic fibroblast (MEF) cell 83 population expressing mCherry fluorescent proteins was labeled with the ssDNA and then mixed with 84 another MEF cell population expressing GFP fluorescent proteins, which was only coated with "empty" 85 ConA (Methods). After 30 min incubation in PBS at 4 °C, mCherry and GFP positive cells were 86 separated using FACS and subjected to qPCR measurement. As shown in Figure 1C, the ssDNA 87 immobilized on mCherry + cells was not detectable from GFP + cells, demonstrating the stability of CASB 88 labeling. In addition to labeling the whole cell, we also measured the labeling efficiency of CASB for cell 89 nucleus, in which nuclei were labeled with preassembled ConA-streptavidin-ssDNA complex in nuclear 90 extraction buffer at 4 °C (Methods). As shown in Figure EV1A, the amount of ssDNA immobilized on 91 nuclei increased with the increased usage of ConA-streptavidin-ssDNA complex, and reached at least 92 120,000 molecules per nucleus. Taken together, these results demonstrated that CASB is able to stably 93 label both cell and nucleus with ssDNA. 94 95 CASB enables scRNA-seq sample multiplexing 96 In scRNA-seq, cell specific barcodes are attached to the cDNA during reverse transcription (RT) by 97 using primers consisting of a cell barcode sequence, a unique molecular identifier (UMI) sequence and 98 a poly-T sequence that anchors to the poly-A tail of mRNA molecule. To make our CASB compatible 99 with the standard scRNA-seq workflow, we designed a biotinylated barcoding ssDNA with a 5' PCR 100 handle followed by a N8 barcode and a 30 nt poly-A tail, which can be captured by a RT primer consisting 101 of a PCR handle followed by a 30 nt poly-T tail (Fig 2A). After CASB labeling, MEF cells were directly 102 lysed and subjected to RT reaction (Methods). The barcoding ssDNA immobilized on cell membrane 103 was quantified together with the endogenous housekeeping gene ActB using qPCR. As shown in Figure  104 EV1B, both barcoding ssDNA and ActB gene can be efficiently capture by RT primer. Therefore, CASB 105 could be easily adapted into scRNA-seq workflow with high efficiency. 106

107
To demonstrate the strength of CASB in scRNA-seq, a breast cancer cell line MDA-MB-231 was 108 perturbed with 5 different compounds, collected at 3 different time points after treatment and pooled with 109 3 other breast cancer cell lines as well as MEF cells after separate sample labelling using CASB (Fig  110   2B). Unlabeled MDA-MB-231 cells were also added into the sample pool to measure the potential 111 influence of CASB on transcriptome profile. Sample pool was then subjected to scRNA-seq using the 112 10x Genomics Chromium system with minor modifications: 1) in order to examine the efficiency of CASB 113 to detect doublets, we intentionally overloaded the system (~20,000 instead of ~10,000 cells 114 recommended by the manufacturer) to create more cell doublets; 2) CASB barcode and transcriptome 115 library were separated by size selection before next-generation sequencing library construction, 116 enabling pooled sequencing at user-defined proportions (Methods). in the same droplet, as they contained two or three major barcodes ( Fig EV1C). Indeed, the doublets 124 consisting of both mouse and human cells, which could be unambiguously detected based on their 125 mapping results, could also be efficiently identified based on the mixture of CASB barcodes. As shown 126 in Figure 2C, out of 110 mouse-human doublets, 107 (97.3%) were defined as doublets based on our 127 CASB data. When compared with singlets, more UMI derived from both CASB barcode and mRNA 128 transcripts were detected in doublets (Fig EV1D), further validating the correct assignment of cell 129 doublets. Within 7623 singlets, the number of detected UMI from CASB per cell ranged from 245 to 2134 (5-95 percentile), and significantly correlated with UMI detected for endogenous transcripts in the 131 same cell (Fig EV1E), suggesting a similar cell-specific capture efficiency between CASB barcode and 132 endogenous transcripts, and that CASB did not impair mRNA capture. Based on the qPCR quantification, 133 the same amount of CASB mixture could label cells with ~20,000 ssDNA (Fig 1B), mean UMI (1051) 134 detected in the scRNA-seq indicates a ~5% capture efficiency at current sequencing depth (25 million 135 total sequencing reads). To determine the variation of labeling efficiency among different cells, given 136 the cell-specific capture efficiency, we first normalized the number of UMI numbers from CASB by that 137 of UMI derived from the endogenous transcripts in the same cell. As shown in Figure EV1F For the 7623 cells with unambiguously assigned sample origin, we then clustered them based on their 143 scRNA-seq profiles. As shown in Figure 2D, different human and mouse cells formed 5 distinct cell 144 clusters, respectively. Each cluster was composed of cells from individual cell line labeled with distinct 145 CASB indices (Fig 2D). We also compared the untreated MDA-MB-231 cells with to those without CASB 146 labelling. As shown in Figure EV1G  intratumoral heterogeneity has been associated with therapy resistance, we investigated whether drug 156 treatments could lead to heterogeneous response in the MDA-MB-231 cells. We focused on compound 157 OSI-027, as it induced the largest transcriptomic changes ( Fig EV1J). As shown in Figure 2E  the latter were less sensitive to the OSI-027. Neighbor proportion analysis also confirmed that, untreated 163 cells were well separated from treated cells in cluster 0, while it is not the case for cluster 1 and 2 (Fig  164   EV2B). As cellular heterogeneity existed already in the untreated MDA-MB-231 cells (Fig EV2C), we 165 sought to further check whether the insensitive cell populations were also resistant to other effective 166 compounds. Indeed, as shown in Figure 2F and EV2D, the same two cell clusters also appeared less 167 sensitive to Niraparib and Rucaparib, suggesting the intrinsic multidrug insensitivity. To explore the 168 underlying factors, we perform function enrichment analysis on genes that were commonly up-or 169 downregulated in cluster 1 and 2 compared to cluster 0 using IPA software (Methods). Interestingly, 170 these genes were highly enriched in the cellular compromise and movement pathways (Table EV1 & 171 Fig EV2E). Importantly, many of them (e.g. VIM) were upregulated in cluster 1 and 2, and known to 172 promote cellular movement ( Fig EV2F). In tumor cells, increased cell motility mediated by epithelial-173 mesenchymal transition (EMT) is also highly associated with drug resistance (Singh & Settleman, 2010, 174 Zhang & Weinberg, 2018). Our results suggested that the intrinsic multidrug insensitivity of MDA-MB-175 231 cells may result from the activated EMT. More interestingly, when overlap the potential insensitivity-176 causing genes in cluster 1 and 2 with OSI-27-regulated genes, we observe that many genes, including 177 VIM, SQSTM1, NPM1 and RACK1, were also upregulated by OSI-027 in sensitive cells ( Fig 2G). Given 178 these genes are involved in promoting EMT and potentially also drug resistance (Fig EV2F), this 179 observation indicated the potential of OSI-027 treatment in inducing acquired therapy resistance. 180 181 CASB enables snATAC-seq sample multiplexing 182 So far, no sample multiplexing method has been developed for droplet-based snATAC-seq. In droplet-183 based snATAC-seq, cell nuclei are firstly incubated with transposase in bulk, where genomic DNA is 184 fragmented and tagged with adapter sequences. Afterwards, single cells are encapsulated, and cell 185 barcodes are added to DNA fragments during PCR in individual droplets using primers targeting the 186 adapter sequences. To adapt CASB into snATAC-seq workflow, we designed a 222 nt barcoding ssDNA 187 with S5-ME and S7-ME adapter sequences flanking a sequence containing sample barcodes ( Fig 3A). 188 S5-ME and S7-ME adapter sequences were used as primer anchoring sites during snATAC-seq library 189 amplification (Methods). The labeling efficiency using such ssDNA was measured similarly as before, in 190 which nuclei were directly labeled with preassembled ConA-streptavidin-ssDNA complex in nuclear 191 extraction buffer at 4 °C (Methods). As shown in Figure  STAT peaks showed continuous activation ( Fig 3E&F). This is expected as IFN-γ is able to activate 212 JAK/STAT signaling through binding to its receptor, which in term activates the expression of IFN-γ-213 responsive genes, including transcription factors IRFs (Leonard & O'Shea, 1998). Interestingly, the large 214 variation of NF-kB peak intensity did not result from IFN-γ treatment, but was instead largely due to the 215 heterogeneity of HAP1 cells (Fig 3F). It is known that NF-kB can be activated by IFN-γ and able to 216 facilitate the transcription activation of IFN-γ targets, including CXCLs (Pfeffer,  To evaluate whether heterogeneous NF-kB activity causes heterogeneous IFN-γ response, IFN-γ 221 treated samples from the same time points were also analyzed using CASB followed by scRNA-seq. A 222 total of 3407 cells were captured, 294 of which were identified as cell doublets and 9 cells were 223 unlabeled ( Fig EV4A&B). As shown in Figure 4A, HAP1 cells showed a continuous shift in the 224 transcriptome profile from 0 to 12 h on UMAP projection. To globally evaluate the correlation between 225 chromatin accessibility and gene expression, we analyzed the dynamic expression patterns of predicted 226 IRF and STAT target genes. In consistent with the activity of IRF and STAT observed in snATAC-seq 227 (Fig 3C), the expression of their target genes also exhibited continuous upregulation (Fig 4B). heterogeneous NF-kB activity identified in snATAC-seq, the expression of predicted NF-kB target genes 233 was compared between the two cell populations. Consistent with the heterogeneous NF-kB activity, its 234 target genes also exhibit heterogeneous expression, and were expressed at a higher level in cluster 2 235 at later time points (Fig 4D). The high induction of CXCL10 and 11, well-known targets of IFN-γ, only in 236 cluster 2 cells further corroborate that the heterogeneous NF-kB activity indeed results in differential 237 responses to IFN-γ in HAP1 cells (Fig 4E&S4C). 238

239
CASB is a flexible sample barcoding approach, ready to be prepared in an average molecular biology 240 laboratory. CASB could be used to label cells and nuclei of different cell types and from different species. 241 Moreover, the binding of CASB molecules to the subject is fast and stable, and takes place even at low 242 temperature, which is critical to preserve sample integrity. Importantly, the design of CASB barcoding 243 ssDNA is extremely flexible, which can be easily adapted to different single cell sequencing workflows. were used to amplify 222 bp fragments from CROPseq-Guide-Puro plasmids (#86708, Addgene) which 314 have been inserted with different gRNA sequences. To generate ssDNAs, purified PCR products were 315 denatured at 95 °C for 2 mins and immediately put on ice. Information of CASB barcode sequences and 316 their corresponding samples can be found in Table EV2. 317

Assembly of CASB barcode 318
Biotinylated ConA (C2272, Sigma-Aldrich) and streptavidin (CS10471, Coolaber) were dissolved in 50% 319 glycerol at concentration of 1.6 µM and store in -20 °C, while different biotinylated ssDNAs were diluted 320 at concentration of 100 nM in nuclease-free water and stored in -20 °C. To assemble CASB barcode, 321 streptavidin was firstly mixed with biotinylated ssDNA at molar ratio of 4:1 and incubated for 10 min at 322 room temperature. Afterwards, biotinylated ConA was added to the streptavidin-ssDNA mix at molar 323 ratio of 1:1 and incubated for 10 min at room temperature. 324

Cell labeling with CASB 325
Half a million cells were collected and suspended in 0.5 mL PBS. Indicated amount of CASB barcode 326 was added to the cells and incubated for 10 mins on ice after thorough mixing. For labeling mCherry+ 327 MEF cells, 2.5 µL assembled CASB was used. For RT-qPCR and scRNA-seq, 5 µL assembled CASB 328 barcode was used. Afterwards, cells were washed once with PBS and then subjected to direct qPCR, 329 direct RT reaction or mixed for scRNA-seq. For qPCR quantification, three independent biological 330 replications were performed for each experiment. 331

Nuclei labeling with CASB 332
Cells were thawed by adding 800 µL warm culture medium and collected by centrifugation. Afterwards, 333 cells were resuspended in 0.5 mL nuclei extraction buffer (NUC101-1KT, Sigma-Aldrich), incubated for 334 5 min on ice, and collected by centrifugation (500 g). Extracted nuclei were incubated again with 0.5 mL 335 nuclei extraction buffer, in which indicated amount of CASB barcode was added, for another 5 min on 336 ice. For snATAC-seq, 2.5 µL assembled CASB barcode was used. Nuclei were then washed once with 337 nuclei wash buffer (10 mM Tris 7.4, 10 mM NaCl, 3mM MgCl 2 , 1% BSA, 0.1% Tween-20) and then 338 subjected to direct qPCR or mixed for snATAC-seq. For qPCR quantification, three independent 339 biological replications were performed for each experiment.  After the pre-processing, RNA UMI count matrices were prepared for scRNA-seq analysis using the 393 'Seurat' R package (Butler, Hoffman et al., 2018). Cells with no more than 4000 reads or 2000 expressed 394 genes were removed. Outlier cells with elevated mitochondrial gene expression were visually defined 395 and discarded. Ribosomal genes and mitochondrial genes were then filtered out. 396 'sctransform' R package (Hafemeister & Satija, 2019) was used to normalize the RNA UMI count data 397 and find highly variable genes. These variable genes were then used during principal component 398 analysis (PCA). Elbow plot was used to select the top principal components. Then these principal 399 components were used for dimensionality reduction with UMAP and unsupervised clustering with 400 Louvain method. Differential gene expression analysis was performed using the 'FindMarkers' function 401 in Seurat with 'MAST' method (Finak, McDavid et al., 2015). To quantify the magnitude of perturbation 402 induced by drug on gene expression, we compared the proportion of each cell's k (k=9) nearest neighbors in principal component space with the 'knn.covertree' R package. The proportion was 404 normalized by the cell numbers of different groups. 405

snATAC-seq data analysis 406
After filtering out the reads with sample barcodes, Cellranger-atac (version 1.2.0) was used to process 407 the raw FASTQ files. The reads were aligned to hg38 reference using BWA-MEM . 408 The filtered peak-by-cell matrix (h5 file) obtained after cellranger-atac processing was used in the 409 subsequent analysis. The matrix was first binarized. Cells of low quality (no more than 2000 peaks or 410 more than 500000 peaks, percent of reads in peaks <= 30%, percent of peaks in ENCODE black list >5%) 411 were filtered out. Only cells defined by HTODemux as 'singlet' were used for subsequent analysis. ATAC 412 peaks with low coverage (less than 50 cells) or ultra-high coverage (more than 2000 cells) were also 413 removed. The binarized count matrix was normalized by term frequency inverse document frequency 414 (TF-IDF). Latent semantic indexing analysis was performed as applying singular value decomposition 415 (SVD) on the normalized count matrix. Only the 2nd-50th dimensions after the SVD were passed to 416 UMAP for 2D visualization. Motif analysis was performed using chromVAR (Schep, Wu et al., 2017). 417 The predicted target genes of TF were defined by the nearest genes within 100 kb of the activated ATAC 418 peaks with the TF motif at 12h. The activated ATAC peaks were called by using FindMarkers function 419 with parameter test.use='LR'. The average expression levels of these predicted target genes were 420 calculated using 'AddModuleScore' function in Seurat package.   Table EV1