Single-Cell DNA Sequencing: From Analog to Digital

Single-cell DNA sequencing is emerging as a powerful and essential tool for interrogating heterogeneous tissues such as cancer samples. Its full potential in cancer research and diagnostics will only be realized when digital methodology is implemented and the cost is significantly reduced.


Introduction
Single-cell DNA sequencing refers to sequencing of genomic DNA isolated from single cells. This DNA source contrasts with that used for bulk DNA sequencing, in which genomic DNA is isolated from hundreds to millions of cells. Bulk DNA sequencing includes Sanger sequencing and next-generation sequencing (NGS) methods and has been widely used in genomic studies of human diseases. It has been effective in studies of homogeneous systems but has proven inadequate for the analysis of solid tumors, which contain cancer cells with varying degrees of heterogeneity as well as noncancerous fibroblasts, endothelial cells, lymphocytes, and macrophages. Noncancerous cells can contribute more than 50% of the total DNA extracted from tumors, potentially masking important genetic aberrations (1). Even when normal cells are removed, bulk sequencing still averages out both the heterogeneity of cancerous cells in a tumor tissue and genomic instability over time (2,3). Single-cell DNA sequencing can reveal the dynamics of mutations of tumor cell subpopulations in unprecedented detail (3-9). Accurate and sensitive detection of copy number variations (CNVs) should be an important aspect of single-cell DNA sequencing.
Another overlooked structural change is the ploidy change. In a cancer case study, Gray et al. reported that the entire cancer genome of a patient was amplified uniformly to 4n without significant single-nucleotide variants (SNVs) in cancer-related genes, resulting in an NGS result almost identical to that of normal 2n cells. Gray concluded that an NGS-based approach alone is insufficient for cancer diagnostics (12).
Given that cancer mutations span a full range of mutation types, and that any one or any combination of them can lead to cancer, an ideal single-cell DNA sequencing method should be capable of capturing mutations of all kinds. This seems to be one of the most demanding requirements for single-cell approaches. For example, in Gray's case, the sample was split, with one portion subjected to NGS interrogation for SNVs and the other investigated by FISH to examine ploidy. However, these two methods cannot be carried out on the same single cell.

Many single-cell interrogations require preparation of a single-cell suspension.
A recent review covers singulating, sorting, and enrichment of targeted populations (13).
Because the singulating process destroys the spatial information of single cells in the tissue environment, ideally each single cell in the suspension would carry this information (14) (WO/2017/075293), which assists in understanding and interpreting the roles of various cells and their interactions in 3D space.
Once a single-cell suspension is prepared, assays for single-cell RNA detection (15), single-cell protein detection (16), single-cell DNA detection (17), or co-detection (18) can be carried out to address the needs of investigators. This review deals with only one detection method, DNA sequencing.
Single-cell DNA sequencing requires the construction of DNA libraries from single cells. The quantity of genomic DNA in a regular single cell is about 6 pg, which is inadequate for a complete NGS library with full coverage. The single-cell genome therefore requires amplification by many thousand-fold before sequencing. This amplification process must generate a sufficient quantity of DNA in an unbiased and uniform way with minimal dropout. Because amplification is the critical step of single-cell DNA library preparation, in most cases the name of the amplification method is used to represent the library preparation method as a whole. DOP-PCR (19,20), MDA (21-24), and MALBAC (6) are cases in point. The same convention is followed in this article. Likewise, what are described as single-cell sequencing methods are actually library preparation methods.

Shallow sequencing libraries
Over the past several years, several labs have reported various single-cell library preparation methods (see Table 1).
One method that has gained favor is degenerate oligonucleotide-primed PCR (DOP-PCR) (19). As the name implies, the primer in DOP-PCR is a random hexamer sequence flanked by one defined sequence at its 5′ end and another at its 3′ end. In the first exponential PCR cycles, annealing and extension are carried out at a low temperature to allow random priming at low stringency; amplification then shifts to a higher annealing temperature for greater stringency. DOP-PCR generates a pool of 200-1000 bp fragments that can be converted to an NGS library. By shallow sequencing of single-cell libraries generated by DOP-PCR (20), the Hicks lab demonstrated robust determination of genome-wide copy number profiles in single cells, in high concordance with the results of bulk analysis. They also examined breast cancer cells from human tissues and discovered extensive heterogeneity. The DOP-PCR single-cell method has been an economical choice for studying tumor clonality in terms of CNVs, but the lack of SNV information hinders its wider application.
An alternative way to detect CNVs by shallow sequencing was reported by Zahn et al., who constructed Nextera® libraries directly on unamplified single-cell genomes (7) and sequenced the libraries without further amplification. The quantity of DNA from a single cell is not sufficient for high coverage, but 2%-7% coverage was sufficient for their intended detection of CNVs. They demonstrated the presence of minor subpopulations of cells in cancer tissues that were missed by bulk sequencing. Interestingly, to compensate for the lack of comprehensive SNV information, the authors generated a "bulk-equivalent" SNV map by aggregating the shallow reads of single-cell genomes in silico. This approach cannot replace SNV detection at single-cell resolution.
The copy number call employed by both methods is based on the assumption that the number of reads mapping to a region is proportional to the number of times the region appears in the genome. A diploid region is expected to generate roughly twice as many reads as a haploid region, and two alleles are expected to generate roughly equal numbers of reads.
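The proportionality assumption can be illustrated with a minimal sketch. It is not the pipeline used in the cited studies: the function name, the use of the median bin as the diploid reference, and the read counts are all assumptions made for illustration.

```python
from statistics import median

def copy_number_from_depth(bin_counts, baseline_ploidy=2):
    """Relative (analog) copy number call from per-bin read counts.

    Assumes reads are proportional to copy number and that the median
    bin sits at `baseline_ploidy` (hypothetical default: diploid).
    """
    ref = median(bin_counts)
    if ref <= 0:
        raise ValueError("median read count must be positive")
    # Scale each bin against the reference bin and round to an integer.
    return [round(baseline_ploidy * count / ref) for count in bin_counts]

# Made-up read counts for seven equally sized genomic bins.
counts = [100, 98, 52, 101, 205, 99, 0]
print(copy_number_from_depth(counts))      # [2, 2, 1, 2, 4, 2, 0]

# A genome doubled uniformly yields the same ratios, so it is
# still called diploid by this kind of relative method.
print(copy_number_from_depth([200] * 5))   # [2, 2, 2, 2, 2]
```

Because every call is a ratio against a reference bin, a uniform ploidy change is invisible to this kind of analysis, which is consistent with the 4n case described by Gray.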
However, several unexplored factors may complicate this seemingly simple and straightforward copy number call. First, random primers may not be completely random, so some regions may be primed more frequently than others; second, random annealing is sometimes haphazard, making the method inconsistent; third, the defined sequence at the 3′ end is not uniformly distributed along the genome, and therefore some regions will be favored over others; fourth, similar to the priming issue in DOP-PCR, formation of clusters on the NGS platform may not be random but skewed in a different way, which could add noise to read counts; fifth, the chance of generating fragments of different sizes may not be equal, yet each fragment has an equal weight in the copy number call; sixth, the chance of generating fragments of different GC content may not be equal in a given experiment, yet the fragments have equal weights in copy number calls; seventh, it is not certain how fragment size or GC content affects the efficiency of cluster formation with different types of instruments or sequencing chemistry. All these concerns dissipate when a digital sequencing method is used (see the TnBC library below).

Notes:
1. Whether or not the absolute number of copies is determined dictates whether the SNV call is digital or analog.
2. CNVs can be detected in the TnBC library based on abundance in shallow sequencing, or detected in absolute number based on UFI at high coverage.
3. When the copy number of the mutant is determined as a digital, absolute number, the mutation call is digital. Otherwise it is analog.
4. The targeted region can be detected, but with unreliable results.

Deep analog sequencing libraries
Detecting SNVs has been the mainstay of many tumor studies and diagnostics. Two single-cell library preparation methods, multiple displacement amplification (MDA) (21-24) and multiple annealing and looping-based amplification cycles (MALBAC) (6), are frequently used to detect SNVs along with CNVs.
MDA uses a pool of degenerate oligonucleotide primers to initiate amplification by the highly processive Phi29 DNA polymerase. The process generates copious random genomic DNA (25). Amplification proceeds so fast that a large quantity of long amplified fragments is generated in a few hours. An NGS library is then constructed from each amplified single-cell genome using Nextera (25) or by ligation (26). Navin's lab at MD Anderson Cancer Center reported the use of MDA libraries to study an estrogen-receptor-positive (ER+) breast cancer and a triple-negative (TN) ductal carcinoma (8).
They showed that aneuploid rearrangements occurred early in tumor evolution and remained stable as tumor masses clonally expanded, whereas SNVs evolved gradually and generated extensive clonal diversity. TN tumor cells had an accelerated mutation rate compared with the ER+ tumor cells.
Another method, MALBAC, uses a pool of primers, each having eight variable nucleotides connected to a common 27-nucleotide tail at its 5′ end. The first amplification cycles are quasilinear, followed by an exponential phase in which the common 27-mer is used as the primer at a higher annealing temperature (6). Using MALBAC, Zong et al. identified a total of 2.2 × 10⁶ SNVs among all single cells, comparable to the number (2.8 × 10⁶) detected by the bulk method (6).
Because cancer genomes often experience copy number changes, and SNV detection is influenced by those changes, we used published data to analyze how reliably the reported CNV changes are detected. We found that the abundance of reads per bin from regions of the same assumed copy number in the MALBAC and DOP-PCR data sets follows a normal distribution, and these data are only slightly noisier than the bulk data set, with the majority of bins falling between 50% and 150% of the mean. In contrast, the read abundance per bin from a single-cell MDA library clearly did not follow a normal distribution.
The deviations were astonishingly large. A significant fraction of bins recovered 10 times more reads than the majority of bins (27). Unfortunately, neither the authors nor the manufacturers of the commercial kits providing DOP-PCR, MDA, or MALBAC have suggested any statistical tools to address the issue. It is desirable to have a confidence score, similar to the Phred scores used in sequencing, attached to every CNV call and the associated SNV determinations.
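As a rough illustration of the kind of check described above, the sketch below summarizes how tightly per-bin read counts cluster around the mean. It is not the analysis performed on the published data: the function name, the use of the median as a stand-in for "the majority of bins", and the example counts are assumptions.

```python
from statistics import mean, median

def bin_dispersion(bin_counts, low=0.5, high=1.5, outlier_fold=10.0):
    """Summarize read-count spread across bins of equal assumed copy number.

    Returns (fraction of bins within low..high of the mean,
             fraction of bins above outlier_fold times the median).
    """
    if not bin_counts:
        raise ValueError("bin_counts must not be empty")
    mu = mean(bin_counts)
    med = median(bin_counts)
    n = len(bin_counts)
    within = sum(low * mu <= c <= high * mu for c in bin_counts) / n
    outliers = sum(c > outlier_fold * med for c in bin_counts) / n
    return within, outliers

# Made-up counts: an evenly amplified library and a heavily skewed one.
even = [90, 100, 110, 95, 105, 100, 120, 80, 100, 100]
skewed = [100, 100, 100, 100, 100, 100, 100, 100, 1500, 2000]

print(bin_dispersion(even))    # (1.0, 0.0)
print(bin_dispersion(skewed))  # (0.0, 0.2)
```

In the skewed example, a few over-amplified bins drag the mean so far upward that no bin falls inside the 50%-150% band, which is why a median-based outlier count is reported alongside it.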

Targeted analog sequencing libraries
Pellegrino et al. reported a droplet-based method for constructing single-cell targeted analog sequencing libraries. The workflow consists of two steps: first, individual cells are encapsulated in droplets and lysed; then, the droplets containing the genomes of individual cells are paired with molecular barcodes and PCR amplification reagents, including a panel of primers targeting the genes to be amplified (28).
Using this single-cell approach, the authors identified TP53 as the more plausible founding mutation, whereas bulk sequencing had mistakenly inferred DNMT3A as the founding mutation. The biggest advantage of this method is that many cells can be interrogated at low sequencing cost. However, because it yields no CNV information, this method will miss a significant portion of oncogenic mutations (28).

Digital sequencing libraries
Last year, we and two other labs reported constructing single-cell DNA libraries by transposition directly on single-cell genomes (7,27,29). However, the similarity stops there. We took a further step by turning the methodology into a digital method, making it able to detect absolute ploidy, the absolute copy number of any gene, and point mutations at the single-cell level.
Transposon barcoded (TnBC) library preparation involves six steps:
(1) Isolation of single cells in a reaction vessel. The reaction vessel can be a PCR tube or a chamber in a microfluidic device, such as the Fluidigm® C1™ integrated fluidic circuit (IFC).
(2) Lysis of cells and removal of histones from chromosomal DNA. This is accomplished by proteinase K digestion. After digestion, proteinase K needs to be denatured, and residual activity is further minimized by adding a proteinase K inhibitor so that the transposase used in the next step is not compromised.
(3) Tagmentation of the single-cell genomic DNA with a saturating concentration of loaded transposase to make a primary library. The loaded transposase can be either Thermo Scientific™ MuSeek™ or Illumina® Nextera®; both are engineered for making NGS libraries.
(4) Removal of transposase from the tagmented DNA.
(5) Amplification of the primary library.
(6) Addition of a sample barcode to each single-cell library.
The combined start and end points of each fragment are used as a unique fragment index (UFI), which is represented as two numbers in parentheses in Figure 1. Although these two numbers are not known before sequencing, they are embedded in the fragment itself. As the figure demonstrates, each haploid generates a unique array of fragments, which can be represented mathematically. The absolute number of gene copies in a cell can therefore be back-calculated by determining the number of unique arrays of fragments. This is why a TnBC library can be called a digital library.
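The back-calculation can be sketched under idealized assumptions: every fragment is recovered, each genome copy is cut at different sites, and a UFI is simply a `(start, end)` pair of mapped coordinates treated as a half-open interval. The function name and all coordinates below are hypothetical.

```python
def copy_number_at(position, fragments):
    """Count distinct UFIs whose fragment covers `position`.

    `fragments` holds (start, end) pairs as half-open intervals
    [start, end). Reads with the same UFI are amplification copies
    of one original fragment, so they are collapsed first.
    """
    unique_ufis = set(fragments)
    return sum(start <= position < end for start, end in unique_ufis)

# Fictional mapped fragments from one cell at one locus.
haploid_1 = [(100, 250), (250, 400), (400, 620)]
haploid_2 = [(80, 310), (310, 500), (500, 700)]
extra     = [(150, 330), (330, 560)]

# Repeating some fragments mimics PCR duplicates in the amplified library.
reads = haploid_1 * 3 + haploid_2 * 2 + extra * 4

print(copy_number_at(300, reads))  # 3: both haploids plus the extra copy
print(copy_number_at(600, reads))  # 2: the extra copy does not reach this far
```

In real data, fragment dropout would lower these counts, and two copies cut at identical sites would be collapsed into one, so a practical caller would likely need to aggregate evidence across many positions.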
The preparation process for a primary TnBC library is very similar to the regular NGS library preparation method practiced by thousands of labs around the world (26), except that a TnBC library is made from a single-cell genome. Because the DNA comes from a defined single cell, the resulting primary TnBC library carries the digital property. If the starting DNA came from exactly 100 normal cells, saturated tagmentation would, on paper, lead to 200 distinct arrays of fragments, one for each haploid. Since recovering all 200 arrays of fragments by sequencing is impractical, the resulting library is practically non-digital. In common practice, DNA samples used for regular tagmentation libraries do not come from a finite number of cells, so the libraries can hardly retain the digital features.
One of the great potentials of a TnBC library is that a sub-library of targeted regions can be enriched from an amplified TnBC library using pull-down probes (30). Because sub-libraries retain the digital features, high-quality data for the targeted area can be obtained, paving the way to reduce the cost of single-cell analysis by orders of magnitude.
The ability to detect changes in absolute copy number and ploidy in the early stages of tumorigenesis will be beneficial because these changes could be harbingers of important states in cancer development (31,32). This will overcome the drawback of NGS noted by Gray (12). Liu's lab reported that polyploid giant cancer cells (PGCCs) are observed almost universally in cancer biopsies, especially those from late-stage tissues (33-35). Given that PGCCs exhibit high drug resistance and extremely high tumor-forming tendencies in mouse models, single-cell digital DNA sequencing will give researchers a tool for molecular characterization that no other method alone can offer.

Bioinformatics on single-cell digital sequencing libraries
Most existing bioinformatics tools have been developed for bulk sequencing, which is based on sampling from thousands to millions of identical copies. In addition, most of the tools rest on the hidden assumption that each cell has two copies of each gene (36). As such, these tools can work well for cells that are homogeneous and diploid but are of limited use for heterogeneous samples such as cancer cells (1). Mutations in genes with elevated copy numbers in tumor subclones are usually oversampled, and current bioinformatics tools do not distinguish CNVs present in all cells from CNVs present in only a percentage of cells. Point mutation calls by most current bioinformatics tools do not consider copy number changes of the target genes. These shortcomings can explain why bulk sequencing may misinterpret the clonal structure of a cancer and misidentify driver genes (28).
For most single-cell DNA sequencing methods, DNA from single cells is amplified to a "bulk" level, and libraries constructed from the amplified DNA are then sequenced. Bioinformatics tools for analyzing amplified single-cell genomes inherit similar issues because their assumptions do not match the realities of the genomic makeup of cancer cells.
Mutation calling on bulk libraries remains a difficult bioinformatic problem (2).
In contrast, mutation calls on a TnBC library will be easier, more accurate, more precise, and quantitative, because each fragment has a UFI and the copy number of any region is an absolute, digital value. This benefit is not shared by other single-cell approaches.
Since amplification of DNA inevitably introduces errors, the variations detected in single-cell sequencing consist of both biological mutations of the cancer tissue and amplification errors. While no current bioinformatics tools were designed to tell the difference, the UFI can help to differentiate the two.
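One way the UFI might be used for this purpose is sketched below. The text does not specify an algorithm, so the approach is an assumption borrowed from consensus calling on molecular barcodes: all reads sharing a UFI descend from one original fragment, so a variant carried by nearly all of them was probably present in that fragment, while a variant seen in only a minority probably arose during amplification. The function name, the threshold, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def classify_site(reads, ref_base, min_fraction=0.9):
    """Classify the evidence at one genomic site, per UFI family.

    `reads` is an iterable of (ufi, base) observations, where `ufi`
    is a (start, end) pair. `min_fraction` is an assumed threshold,
    not a published value.
    """
    families = defaultdict(Counter)
    for ufi, base in reads:
        families[ufi][base] += 1

    result = {}
    for ufi, counts in families.items():
        total = sum(counts.values())
        alt = total - counts[ref_base]  # reads disagreeing with the reference
        if alt == 0:
            result[ufi] = "reference"
        elif alt / total >= min_fraction:
            result[ufi] = "variant in original fragment"
        else:
            result[ufi] = "likely amplification error"
    return result

# Fictional observations at a site whose reference base is C.
observations = (
    [((100, 250), "T")] * 5                           # every copy carries T
    + [((80, 310), "C")] * 4 + [((80, 310), "T")]     # one stray T among five
    + [((150, 330), "C")] * 5                         # all reference
)
for ufi, label in classify_site(observations, "C").items():
    print(ufi, label)
```

An error introduced in the very first amplification cycle could still reach a large fraction of a family, so a fixed threshold like this one is only a starting point.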
For every two neighboring fragments from one haploid, the 3′ part of the UFI of the upstream fragment coincides with the 5′ part of the UFI of the downstream fragment. Neighboring fragments can therefore be linked and haplotyped. Since a cancerous cell may carry no copy or multiple copies of a gene, single-cell digital sequencing will give a full picture of all haploids within a functioning unit, information that is lost in other approaches (21,37,38).
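The linking of neighboring fragments can be sketched as a simple chaining step. This is an idealized illustration, not a published implementation: it assumes that the end coordinate of an upstream fragment equals the start coordinate of its downstream neighbor exactly, that no fragments are missing, and that coordinates are unambiguous. A real pipeline might need to allow for a small fixed offset at the junction. The function name and coordinates are hypothetical.

```python
def chain_fragments(fragments):
    """Group fragments into contiguous arrays, one per genome copy.

    A fragment (start, end) is appended to an existing array whose last
    fragment ends exactly at `start`; otherwise it opens a new array.
    """
    open_by_end = {}  # end coordinate -> arrays currently ending there
    arrays = []
    for start, end in sorted(set(fragments)):  # upstream fragments first
        waiting = open_by_end.get(start)
        if waiting:
            array = waiting.pop()
        else:
            array = []
            arrays.append(array)
        array.append((start, end))
        open_by_end.setdefault(end, []).append(array)
    return arrays

# Fictional fragments from a locus present in three copies.
fragments = [
    (100, 250), (250, 400), (400, 620),   # one copy
    (80, 310), (310, 500), (500, 700),    # a second copy
    (150, 330), (330, 560),               # a third copy
]
arrays = chain_fragments(fragments)
print(len(arrays))  # 3
for array in arrays:
    print(array)
```

Each recovered array corresponds to one physical copy of the region, so variants found on fragments within the same array can be assigned to the same haplotype.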

Promises and challenges
Single-cell digital sequencing holds promise and poses challenges. By using TnBC libraries, the genomes of cancer tissues will be delineated in unprecedented detail. Even for homogeneous normal tissues, we expect that new findings may challenge old theories. We may find that homogeneous cells are not as homogeneous as we believed them to be (39). The fluidity of mammalian genome structure will be revealed in new detail.
Although single-cell DNA sequencing has yielded more insightful information on cancer tissues than bulk sequencing, the cost associated with the single-cell approach is a major obstacle to wider application. Even if the sequencing cost of one genome drops to $100, the cost for hundreds of single-cell genomes would be tens of thousands of dollars. By using pull-down probes to extract targeted regions of TnBC libraries, the sequencing cost for the majority of useful and actionable information may drop by orders of magnitude, making the single-cell approach affordable in routine cancer diagnostics.
Conducting single-cell digital sequencing also brings challenges.
For example, TnBC library construction will require molecular biological tools such as transposons, primers, proteases, and polymerases to meet very stringent requirements. All reagents should be nuclease-free, since any nuclease contamination will destroy the tiny quantity of unique genetic material from each cell.
Secondary and tertiary bioinformatics tools will be needed to handle the simple but unique data structure of TnBC libraries. Developing ways to combine single-cell analysis with pathological observation and clinical practice will require personnel armed with new knowledge, as well as new infrastructure.
Single-cell DNA sequencing will provide a powerful tool for diagnosing cancer and for guiding treatment. A sample obtained by biopsy provides genetic data on only one point in space and time. Once surgery or therapy starts, ctDNA detection using NGS or digital PCR, based on the mutation spectrum obtained from single-cell DNA sequencing, will be a good complementary tool to monitor treatment progress and extend the use of the mutation data. One can foresee that single-cell DNA sequencing will be at center stage in the fight against cancer. I am optimistic that we will see ever wider application of single-cell digital sequencing in cancer research and diagnostics. The method will have numerous translational applications in the clinic, particularly in molecular pathology and the guidance of therapies.

Figure 1
Figure 1 explains how digitization is achieved during tagmentation to generate a primary TnBC library. Tagmentation starts with single-cell genomic DNA. A normal cell typically has two identical haploids, Haploid 1 and Haploid 2, with their coordinates represented by the reference genome, shown as the gray line under the box. A cancer cell may have an extra copy of the gene, which is represented by the dark line marked "Extra". Tagmentation breaks Haploid 1 into an array of contiguous fragments, … H1i, H1j, H1k, H1l, H1m …, Haploid 2 into another array of contiguous fragments, … H2i, H2j, H2k, H2l …, and the extra copy into a third array, … Exi, Exj, Exk …. Because tagmentation is a random process, Haploid 1, Haploid 2, and the extra copy generate different sets of fragments even if they have exactly the same sequence. Each fragment is amplified, sequenced, and mapped to the reference genome. The combined start and end points of each fragment serve as its unique fragment index (UFI).

Figure 1. Generating a primary TnBC library with a UFI for each fragment by saturated transposition of a single-cell genome. All numbers are fictional. "xxx" is a number that is not defined in the figure.