Rapid DNA Re-Identification for Cell Line Authentication and Forensics

DNA re-identification is used in a broad range of applications, from cell line authentication to crime scene sample identification. However, current re-identification schemes suffer from high latency. Here, we describe a rapid, inexpensive, and portable strategy to re-identify human DNA called MinION sketching. Using data from Oxford Nanopore Technologies’ sequencer, MinION sketching requires only 3 min of sequencing and ∼91 random SNPs to identify a sample, enabling near real-time applications of DNA re-identification. This method capitalizes on the rapidly growing availability of genomic reference data for individuals and cancer cell lines. Hands-on preparation of the samples can be reduced to <1 hour. This empowers the application of MinION sketching in research settings for routine cell line authentication or in forensics. Software is available at https://github.com/TeamErlich/personal-identification-pipeline.


Here, we report a portable, rapid, robust and inexpensive strategy for SNP-based human DNA re-identification using a MinION sequencer (produced by Oxford Nanopore Technologies, ONT), a cheap and portable DNA sequencer that weighs only 100 grams and can be plugged into a laptop computer. This device can be adopted easily in a standard laboratory. Our strategy, termed 'MinION sketching', exploits real-time data generation by sequentially analyzing extremely low coverage shotgun-sequencing data from a sample of interest and comparing observed variants to a reference database of common SNPs (Figure 1). We specifically sought a strategy that does not require PCR, in order to eliminate the latency introduced by DNA amplification and to increase portability and miniaturization. However, this poses two technical challenges. First, MinION sequencing exhibits a high error rate of 5-15% (Ip et al., 2015), which is two orders of magnitude beyond the expected differences between any two individuals. Second, MinION sketching produces shotgun-sequencing data that only covers a fraction of the human genome due to the limited capacity of a MinION flow-cell. As such, the extremely low coverage dictates that each locus is covered by at most one sequence read, which nullifies the ability to enhance the signal by integrating multiple reads or observing both alleles at heterozygous loci. Taken together, these challenges translate to a noisy identification task where the available genotype data only provide a mere sketch of the actual genomic data.

To address these challenges, we developed a Bayesian algorithm that computes a posterior probability that the sketch matches an entry in the reference database (H_exact), or has no match in the database, taking into account each marker's allele frequency and the prior probability that a sample matches an entry in the reference database. The Bayesian approach sequentially updates the posterior probability with every new marker that is observed until a match is found. Collectively, our method can identify a sample without PCR amplification, yet with very high probability, despite the low coverage and the high error rate of the MinION.

We sought to test our strategy using a large-scale reference database and in various technical scenarios in order to benchmark our re-identification method for real-life scenarios. To this end, we first constructed a large-scale reference database of genomic datasets to stress the specificity of our method. This reference database comes from the DNA.Land project (Erlich, 2015) and contains 31,000 genome-wide genotyping array files of individuals tested by Direct-to-Consumer companies such as 23andMe, AncestryDNA, and FamilyTreeDNA (Figure 2A).

These scenarios included extracting the DNA from either a spit kit or tissue culture, testing either the R7 chemistry or the newer R9 chemistry, and re-identifying samples that were derived from different ethnic backgrounds. The genetic reference file for each of these samples was included in our database.

We found that the MinION sketching procedure re-identified human DNA with high accuracy after minutes of operation. After only 13 minutes of sketching using the R7 chemistry, the [...] tested various prior probabilities of identifying the YE001 sketch. We found that the initial selection of the prior probability had no effect on the matching ability and only slightly increased the time required to achieve a high-confidence match. Even with a prior probability that considers a database around a million times bigger than the world's population (10⁻¹⁵), the posterior probability reached 99.9% after only 25 minutes of sketching YE001 (Figure supplement 2).
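The weak dependence on the prior can be made explicit by writing the Bayesian update in odds form. The following is a sketch under stated assumptions rather than a derivation taken from the paper: markers are treated as independent, and the symbols $\Lambda_k$ (per-marker likelihood ratio), $p^{*}$ (target posterior, for example 0.999) and $\overline{\ln\Lambda}$ (average log likelihood ratio per marker for a true match, assumed positive) are introduced here for illustration only.

```latex
\frac{P(\text{match}\mid D)}{P(\text{no match}\mid D)}
  = \frac{\pi}{1-\pi}\,\prod_{k=1}^{n}\Lambda_k,
\qquad
\Lambda_k = \frac{P(d_k\mid \text{match})}{P(d_k\mid \text{no match})}

n \;\approx\;
  \frac{\ln\dfrac{p^{*}}{1-p^{*}} \;-\; \ln\dfrac{\pi}{1-\pi}}
       {\overline{\ln\Lambda}}
```

Under these assumptions, lowering the prior $\pi$ from 10⁻⁵ to 10⁻¹⁵ adds only about $10\ln 10 \approx 23$ to the numerator, so the number of markers needed grows with the logarithm of the prior rather than with the prior itself.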

Moving to the new R9 chemistry provided even faster re-identification results. We sketched samples of a Northern European female (SZ001) and a Northern-Italian-Ashkenazi male (JP001) using the R9 chemistry. We were able to re-identify these two samples using only 98-134 SNPs [...] is <56% it is considered unrelated or contaminated (Reid et al., 2013). To this end, we re-analyzed the data from the THP1 experiment but without resolving the barcodes, essentially reflecting 50% contamination. The algorithm correctly showed a 0% match probability to the THP1 reference file or any other cell line in the database (Figure 3B). We further explored the effect of the fraction of contamination on matching the THP1 reference file. By sampling from the above data in different proportions, we found that the algorithm correctly rejects a match for contamination levels above 20% (Figure supplement 3). This shows that the algorithm will [...]

Lastly, we aimed to explore a sample preparation strategy that requires minimal hands-on time. To this end, we utilized a simple protocol to extract DNA using the rapid transposase-mediated fragmentation and adaptor ligation kit provided by ONT. This method generates 1D reads, where only one of the two strands passes through the nanopore, resulting in reads with a higher error rate (Table supplement 5). The advantage of this method is the speed and convenience of the preparation protocol. In only 55 minutes, we were able to extract DNA and produce a ready-to-[...]

[...] nanopore-based sequencer that will be plugged into a cellphone (Yong, 2016). With this invention, MinION sketching can eventually promote a range of futuristic Internet of (living) Things applications that will use DNA as a means for biometric authentication.

MinION sketching provides a rapid method for cell authentication and sample re-identification.

We developed and implemented a Bayesian method that allows matching error-prone MinION

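The posterior probability of the matching outcome for the i-th sample is:

$$P(\theta_i \mid D) = \frac{P(D \mid \theta_i)\,\pi_i(\theta_i)}{P(D)}$$

where $\pi_i$ is the prior probability for the matching status of the i-th sample and is specified by the user.

The likelihood is approximated using the following equation:

$$P(D \mid \theta_i) \approx \prod_{k} P(d_k \mid \theta_i)$$

The likelihood of an exact match given the data of the k-th marker, $P(d_k \mid \theta_i = \text{exact})$, is given by the following matrix:

$$\begin{array}{c|cc}
 & A & B \\ \hline
AA & 1-\varepsilon & \varepsilon \\
AB & 1/2 & 1/2 \\
BB & \varepsilon & 1-\varepsilon
\end{array}$$

where the rows denote the genotype of the i-th sample for the k-th marker as observed in the reference database, the columns denote the allele observed in the MinION sketch, and $\varepsilon$ denotes the sequencing error rate.

The likelihood of a mismatch given the data of the k-th marker, $P(d_k \mid \theta_i = \text{mismatch})$, corresponds to observing the allele $d_k$ in a random person from the population. This probability is the sum of two processes: (i) the random person has the same allele as $d_k$ and the observation is errorless or (ii) the random person does not have the same allele as $d_k$ but a sequencing error flipped the observed allele. Therefore:

$$P(d_k \mid \theta_i = \text{mismatch}) = f(d_k)\,(1-\varepsilon) + \bigl(1 - f(d_k)\bigr)\,\varepsilon$$

where $f(d_k)$ denotes the frequency of the observed allele in the population.

Finally, the evidence, $P(D)$, is given by:

$$P(D) = \sum_{\theta_i \in \{\text{exact},\,\text{mismatch}\}} P(D \mid \theta_i)\,\pi_i(\theta_i)$$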
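The sequential update described in this section could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code from the published pipeline: the function names and the tuple layout of `markers` are hypothetical, markers are assumed biallelic and independent, the error rate `eps` is assumed to be strictly between 0 and 1, and the exact-match likelihood (1 − eps, 1/2, or eps for two, one, or zero copies of the observed allele) is an assumption that follows from a symmetric allele-flip error model.

```python
import math


def likelihood_exact(genotype, observed, eps):
    """P(d_k | exact match): genotype is a 2-character string such as "AA" or "AG"."""
    frac = genotype.count(observed) / 2.0  # share of reference alleles equal to the read base
    return frac * (1.0 - eps) + (1.0 - frac) * eps


def likelihood_mismatch(freq_observed, eps):
    """P(d_k | mismatch): the observed allele drawn from a random person."""
    return freq_observed * (1.0 - eps) + (1.0 - freq_observed) * eps


def _sigmoid(x):
    """Numerically stable conversion from log-odds to probability."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sequential_posterior(markers, prior, eps):
    """Yield the posterior of an exact match after each new marker.

    markers: iterable of (reference_genotype, observed_allele, allele_frequency).
    """
    log_odds = math.log(prior) - math.log1p(-prior)
    for genotype, observed, freq in markers:
        log_odds += math.log(likelihood_exact(genotype, observed, eps))
        log_odds -= math.log(likelihood_mismatch(freq, eps))
        yield _sigmoid(log_odds)


if __name__ == "__main__":
    demo = [("AA", "A", 0.3)] * 20  # hypothetical markers, all consistent with a match
    for n, p in enumerate(sequential_posterior(demo, prior=1e-5, eps=0.1), start=1):
        print(n, p)
```

Working in log-odds avoids numerical underflow when many markers are multiplied together, and the evidence term cancels because only two hypotheses (exact match and mismatch) are compared.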

Availability of the data: The code for our method is available at https://github.com/TeamErlich/personal-identification-pipeline. We also include a reference database for the CCLE cell line repository for fast re-identification.

The prior probability for a match was set to 10⁻⁵. The rapid library protocol was tested in the lab. The

MinION sketch generated from sample SZ001. The library was prepared in 55 minutes in the laboratory.

We used the SQK-RAD001 kit to prepare the DNA library. FRM (2.5 µl, ONT) was added to the DNA sample (20 µl) and incubated for 1 min at 30 °C. Then, 1 µl RAD (ONT) plus 0.2 µl ligase was added and the mixture was incubated for 10 minutes.

The R9 flowcell was prepared by applying 500 µl of priming mix (RBF 1x) twice. The library was then added to the flowcell without a purification step.

The barcoding protocol was executed according to the manufacturer's instructions for native barcoding kit I (EXP-NBD002) in conjunction with the Nanopore Sequencing kit (SQK-NSK007), with some modifications (Table S1,

We used Poretools (Loman & Quinlan, 2014) to extract the FASTQ data and time stamps from the local files. Only reads with an average base quality greater than 9 were used for the downstream analysis. Next, we aligned the files to hg19 with bwa-mem (v0.7.14) (Li, 2013) using the command "bwa mem -V -x [...] minimize the effects of sequencing error, we considered only MinION read bases that matched the common SNP alleles in dbSNP. For example, if at position chr1:10,000 the MinION reported "A" and dbSNP reported a variant "C/G", then we treated this position as a sequencing error. The R7 chemistry run with NA12890 generated 4920 variants after one hour of MinION sequencing, of which 7.7% were rejected after filtering for common SNPs. Intersecting these with the reference file and analyzing the true error from the matched SNPs resulted in 8.9% mismatches. This contrasts with the R9 chemistry, which only resulted in 2% true mismatches (Table S3-5).
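The two filters described above (a mean base quality above 9, and keeping only read bases that agree with a known dbSNP allele) could be sketched as follows. This is an illustration only and not the pipeline's code: the function names, the `(chrom, pos, base)` tuple layout, and the `known_alleles` dictionary are hypothetical, and taking the arithmetic mean of the Phred scores is an assumption about how the average quality is computed.

```python
def passes_quality(phred_scores, min_mean=9):
    """Keep a read only if its mean base quality is strictly greater than min_mean."""
    if not phred_scores:
        return False
    return sum(phred_scores) / len(phred_scores) > min_mean


def filter_known_alleles(calls, known_alleles):
    """Split base calls at common SNP positions into kept and rejected lists.

    calls: list of (chrom, pos, base) tuples taken from aligned reads.
    known_alleles: dict mapping (chrom, pos) to the set of catalogued alleles.
    A base that matches no catalogued allele is treated as a sequencing error.
    Positions absent from known_alleles are not common SNPs and are skipped.
    """
    kept, rejected = [], []
    for chrom, pos, base in calls:
        alleles = known_alleles.get((chrom, pos))
        if alleles is None:
            continue
        if base in alleles:
            kept.append((chrom, pos, base))
        else:
            rejected.append((chrom, pos, base))
    return kept, rejected


if __name__ == "__main__":
    # Mirrors the example in the text: "A" observed where the catalogue lists C/G.
    known = {("chr1", 10000): {"C", "G"}, ("chr1", 20000): {"C", "T"}}
    calls = [("chr1", 10000, "A"), ("chr1", 20000, "C")]
    print(filter_known_alleles(calls, known))
```

Restricting the sketch to bases that match a catalogued allele removes the errors that produce an unexpected base, but errors that happen to produce the other catalogued allele still pass, which is why the Bayesian model keeps an explicit error rate.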

The Bayesian model was integrated in a Python script in order to match the MinION data against each entry in the database. To accelerate the search, we implemented the following procedure: (i) if the posterior probability drops below 10⁻⁹, the script concludes that the database entry does not match and moves to the next entry; (ii) the script uses only up to one hour of data to determine the posterior of a sample.
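The two shortcuts described above (abandoning an entry once its posterior falls below 10⁻⁹, and using at most one hour of data) might look like the following sketch. It is not the published script: the function name `search`, the data layouts, and the default error rate are hypothetical, the sketch is assumed to be sorted by time stamp, and the per-marker likelihoods assume biallelic markers with a symmetric allele-flip error model and an error rate strictly between 0 and 1.

```python
import math


def _posterior(log_odds):
    """Numerically stable conversion from log-odds to probability."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


def search(sketch, database, allele_freq, prior=1e-5, eps=0.1,
           reject_below=1e-9, max_seconds=3600):
    """Return {entry_name: posterior} for every database entry.

    sketch: time-sorted list of (seconds, marker_id, observed_allele).
    database: dict entry_name -> dict marker_id -> genotype string, e.g. "AG".
    allele_freq: dict (marker_id, allele) -> population frequency.
    """
    results = {}
    for name, genotypes in database.items():
        log_odds = math.log(prior / (1.0 - prior))
        posterior = prior
        for seconds, marker, allele in sketch:
            if seconds > max_seconds:
                break  # rule (ii): use at most one hour of data
            genotype = genotypes.get(marker)
            if genotype is None:
                continue  # marker not genotyped for this entry
            frac = genotype.count(allele) / 2.0
            p_exact = frac * (1.0 - eps) + (1.0 - frac) * eps
            freq = allele_freq[(marker, allele)]
            p_mismatch = freq * (1.0 - eps) + (1.0 - freq) * eps
            log_odds += math.log(p_exact / p_mismatch)
            posterior = _posterior(log_odds)
            if posterior < reject_below:
                break  # rule (i): declare a non-match and move to the next entry
        results[name] = posterior
    return results
```

Because most database entries disagree with the sketch at a large share of markers, the rejection threshold is typically reached after a handful of markers, so the cost of scanning a large database is dominated by the few entries that remain plausible.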

As a default setting, we used a prior probability of 10⁻⁵ for exact matching. The only exception was Figure supplement 2 (YE001), where we employed a range of prior probabilities. As a default setting, we also used the computed error rate from each read as the error rate in our Bayesian model.

For the simulations, we took reads from experiments 4 and 5 (Table S1). The total number of reads was set to 3000, and a random set of reads representing each percentage proportion was selected. For example, for 50% contamination we took 1500 random reads from experiment 4 and 1500 random reads from experiment 5. These were pooled together and shuffled again to simulate a mix. This process was repeated five times for each contamination fraction. The resulting pooled file was processed using our pipeline and matched to the reference file of the corresponding MinION sketch (either THP1 or JP001).
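The mixing procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script used for the paper: the function names are hypothetical, reads are represented as plain list items, sampling is done without replacement (so each input must contain at least as many reads as are requested), and the seeds are only there to make the example reproducible.

```python
import random


def simulate_mix(reads_a, reads_b, fraction_b, total=3000, seed=None):
    """Pool `total` reads, a fraction `fraction_b` of which come from reads_b."""
    rng = random.Random(seed)
    n_b = round(total * fraction_b)
    n_a = total - n_b
    mix = rng.sample(reads_a, n_a) + rng.sample(reads_b, n_b)
    rng.shuffle(mix)  # shuffle the pooled reads to simulate a mixed sample
    return mix


def replicate_mixes(reads_a, reads_b, fraction_b, replicates=5, total=3000, seed=0):
    """Repeat the simulation several times for one contamination fraction."""
    return [
        simulate_mix(reads_a, reads_b, fraction_b, total=total, seed=seed + r)
        for r in range(replicates)
    ]


if __name__ == "__main__":
    exp4 = ["exp4_read_%d" % i for i in range(4000)]  # placeholder read identifiers
    exp5 = ["exp5_read_%d" % i for i in range(4000)]
    mixes = replicate_mixes(exp4, exp5, fraction_b=0.5)
    print(len(mixes), len(mixes[0]))
```

Each simulated mix would then be passed through the same matching pipeline as a real sketch.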