Mutation assay using single-molecule real-time (SMRT™) sequencing technology

We present here a simple, phenotype-independent mutation assay using a PacBio RS II DNA sequencer employing single-molecule real-time (SMRT) sequencing technology. Salmonella typhimurium YG7108 was treated with the alkylating agent N-ethyl-N-nitrosourea (ENU) and grown through several generations to fix the induced mutations; the DNA was then extracted and the mutations were analyzed with the SMRT DNA sequencer. The ENU-induced base-substitution frequency was 15.4 per megabase pair (Mbp), which is highly consistent with our previous results based on colony isolation and next-generation sequencing. The induced mutation spectrum (95% G:C → A:T, 5% A:T → G:C) is also consistent with the known ENU signature. The base-substitution frequency of the control was calculated to be less than 0.12 per Mbp. A current limitation of the approach is the high frequency of artifactual insertion and deletion mutations it detects. Ultra-low frequency base-substitution mutations can be detected directly with the SMRT DNA sequencer, and this technology provides a phenotype-independent mutation assay.


Introduction
Mutation assays capable of detecting somatic mutations at very low frequencies are important in the areas of environmental mutagenesis, carcinogenesis, epidemiology, and regulatory science. They are especially important in the safety evaluation of newly developed drugs and industrial chemicals. Although many mutation assays have been developed, most rely on some kind of phenotypic selection, which involves time-consuming procedures and is potentially biased. We previously reported a phenotype-free mutation assay using next-generation DNA sequencing [1]. In that study, we treated a Salmonella typhimurium strain with a mutagen to induce and fix mutations, followed by colony isolation and whole-genome sequencing of the colonies. The induced mutations were successfully detected in silico using bioinformatics software. That strategy is summarized in Fig. 1 and named the 'Colony-NGS method'. Although the approach is simple and reliable, difficulties remain when it is applied to mammalian cells, because: 1) the colony-isolation step is technically much more challenging for mammalian cells than for bacterial cells, and 2) the mammalian genome is diploid and hundreds of times larger than the bacterial genome, which limits deep coverage in sequencing. Furthermore, the Colony-NGS method is not applicable to bio-monitoring of somatic mutations in tissues of experimental animals or clinical specimens from patients, because colonies cannot be isolated from those sources.
Recently, 'Duplex Sequencing' methodologies, which enable detection of a single mutation among >1 × 10^7 nucleotides using general next-generation DNA sequencing (NGS) technology, have been developed [2,3]. This is a very promising strategy for bio-monitoring of somatic mutations. In this paper, however, we demonstrate an alternative approach using single-molecule real-time sequencing.
The PacBio RS II DNA sequencer (Pacific Biosciences, Inc.) is a recent innovation [4] based on single-molecule real-time (SMRT) technology. Since it is able to read the sequence of a single DNA molecule, it can in principle detect the mutations present in that molecule simply by sequencing it accurately, as summarized in Fig. 1 (named the 'SMRT method') [5]. A significant advantage of this strategy is that the colony-isolation step is unnecessary, so the approach should be applicable to any cell line and to specimens from experimental animals, patients and environmental animals.
However, a drawback of this technology is the accuracy of the sequencing data it generates. At present, the error rate in raw reads of the PacBio sequencer is exceedingly high (~15%). To help overcome this problem, the 'SMRTbell™ template', in which single-stranded DNA loops are ligated to both ends of a double-stranded DNA, is used to direct sequencing of the same DNA molecule repeatedly [6]. The greater the number of repeat reads used to generate a consensus read from multiple sub-reads of the same single circular DNA template (i.e., a circular consensus sequence (CCS) read), the more accurate the sequencing result [7]. In this study, we validated that ultra-low frequency mutations can be detected by using the SMRT method with the CCS strategy.
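To illustrate why repeated reads of the same molecule improve accuracy, the following is a minimal sketch of a simple majority-vote model. It assumes that each pass is wrong at a given position independently with the raw error rate (~15%) and that all erroneous passes agree on the same wrong call, which is a pessimistic simplification. The function name `majority_error` is hypothetical, and the model is not the consensus algorithm actually used by the PacBio software.

```python
from math import comb


def majority_error(per_read_error: float, n_passes: int) -> float:
    """Probability that more than half of n_passes independent reads
    are wrong at a single position (simple majority-vote model).

    Assumption: errors are independent between passes; ties (possible
    only for even n_passes) are not counted as consensus errors.
    """
    return sum(
        comb(n_passes, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (n_passes - k)
        for k in range(n_passes // 2 + 1, n_passes + 1)
    )


if __name__ == "__main__":
    # Raw error rate of ~15% taken from the text; odd pass numbers avoid ties.
    for n in (1, 3, 5, 11):
        print(f"{n:2d} passes: consensus error = {majority_error(0.15, n):.2e}")
```

Under this toy model the per-position consensus error falls steeply as the number of passes increases, which is the qualitative behavior the CCS strategy relies on.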

Mutagen exposure and mutation fixation
The exposure method followed the Ames test 20-min pre-incubation procedure [9]. The YG7108 strain was cultured overnight at 37°C in nutrient broth (No. 2, OXOID) containing 25 μg/mL kanamycin and 10 μg/mL chloramphenicol. Phosphate buffer (0.5 mL), DMSO or 2.5 mg/mL ENU (0.1 mL) and the overnight culture (0.1 mL) were mixed in a tube in that order and incubated for 20 min at 37°C with gentle shaking at 100 rpm. A 1-μL portion of the mixture was added to 10 mL of LB medium and cultured at 37°C for 13 h to fix mutations, after which DNA was extracted. The rest of the mixture was poured onto a minimal agar plate in 2 mL of 0.6% soft agar and incubated for two days at 37°C, after which the revertant colonies were counted.

Preparation of SMRTbell™ templates and sequencing
The genomic DNA samples (5 μg each) were sheared to 50-1000 bp (average 280 bp) fragments by using a Covaris Shearing Device, and used to construct a PacBio

Results
The test strain Salmonella typhimurium YG7108, which is highly sensitive to alkylating agents, was treated with ENU (Fig. 2a) or its solvent DMSO, followed by dilution and overnight growth in LB medium to fix mutations. Genomic DNA was extracted from the overnight culture. SMRTbell templates were prepared from the DNA samples, with an average insert size of 280 bp. Note that no PCR amplification step was carried out during preparation of the SMRTbell templates, which is essential to minimize the occurrence of artifactual mutations. The templates were subjected to the sequencing reaction on the PacBio RS II platform, and fastq files were generated from the raw data (containing all the sequence information of the multiple sub-reads) and from the CCS data (containing only the consensus sequence). The threshold for the CCS was a pass time (the number of times the same molecule was repeatedly read) of 10 and 99% accuracy. The CCS fastq files were imported into CLC Genomics Workbench software (ver. 7). In total, 8.09 and 8.56 Mbp of sequence data were obtained from the control and ENU-treated samples, respectively. The CCS reads were mapped to the reference sequence of Salmonella typhimurium and the point mutations were detected in silico.
Fig. 2 Detection of mutations, DNA damage and mismatches by mapping of raw reads of the SMRT sequencer. a Example of ENU induction of an alkylated base (O6-ethylguanine) in genomic DNA, which will induce a G to A mutation after the 2nd round of replication. b Examples of mapped reads. In cases of a real mutation, the same base is clearly called in both the forward and reverse reads. In cases of DNA damage, one strand is mapped clearly but the other strand is not. In cases of mismatch, both the forward and reverse reads are mapped clearly but different bases are called between the forward and reverse reads.
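The CCS threshold (at least 10 passes and at least 99% accuracy) can be pictured with a minimal sketch such as the following. It assumes that read accuracy is estimated as one minus the mean per-base error probability implied by a Phred+33 FASTQ quality string, and that the number of passes is supplied separately. Both function names are hypothetical, and the actual filter in the SMRT Analysis software may compute accuracy differently.

```python
def predicted_accuracy(quality: str, offset: int = 33) -> float:
    """Mean per-base accuracy implied by a Phred-encoded quality string.

    A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
    Assumption: qualities use the Phred+33 ASCII encoding.
    """
    if not quality:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return 1.0 - sum(error_probs) / len(error_probs)


def passes_ccs_filter(
    quality: str,
    n_passes: int,
    min_passes: int = 10,      # threshold stated in the text
    min_accuracy: float = 0.99,  # threshold stated in the text
) -> bool:
    """Return True if a CCS read meets both thresholds (illustrative only)."""
    return n_passes >= min_passes and predicted_accuracy(quality) >= min_accuracy


if __name__ == "__main__":
    # 'I' encodes Q40 (error 1e-4); '+' encodes Q10 (error 0.1).
    print(passes_ccs_filter("IIIIIIII", n_passes=12))
    print(passes_ccs_filter("++++++++", n_passes=12))
    print(passes_ccs_filter("IIIIIIII", n_passes=5))
```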
Improbably large numbers of insertions and deletions were called in both the control (405 insertions and 424 deletions) and ENU-treated (367 insertions and 1276 deletions) samples (Table 1). We had previously analyzed mutations induced in the same bacterial strain with the same exposure protocol by isolating colonies and carrying out whole-genome sequencing. In that previous study, we analyzed the entire genome of each of 4 clones (4.8 Mbp of Salmonella genome × 4 clones = 19.6 Mbp search region), but did not detect any insertions or deletions in either the control or ENU-treated samples (unpublished observations). We therefore concluded that the insertions and deletions called in the present study are not reliable and are most probably artifacts. In the case of base substitutions, 19 and 160 mutations were called in the control and ENU-treated samples, respectively (Table 1). Although these counts are broadly consistent with the results of our previous study, they are still higher than the estimated values. We therefore proceeded with a confirmation step for the base substitutions.
Next, we obtained the sequence IDs of the CCS reads in which base substitutions were called at the first screening. We then searched for these sequence IDs in the raw fastq files, extracted the corresponding records, and made new fastq files containing the raw repeated sequence data of the molecules in which a base substitution was possibly present. The newly edited fastq files were mapped to the same Salmonella reference sequence. Typical examples of mapped raw reads are shown in Fig. 2b. In the sequencing reaction using the SMRTbell template, the plus and minus strands of a double-stranded DNA molecule are read alternately, so almost equal numbers of forward and reverse reads were obtained. In cases of real mutations, the same base substitution will be called in both the forward and reverse reads. Cases in which different bases were called between the forward and reverse reads must be templates bearing a mismatch. Cases in which a specific base was clearly called on one strand but a variety of bases was called on the opposite strand may indicate the existence of persistent DNA damage. After carefully checking the raw data, the base-substitution mutations called in Table 1 were counted again and are shown in Tables 2, 3 and 4. After recalculation, the numbers of 'real' base-substitution mutations were 0 and 132 in the control and ENU-treated samples, respectively (Table 4). The rest were likely due to mismatches, DNA damage, SNPs that the strain originally possessed, calls at the edges of the mapped reads that did not have sufficient coverage, and so on.
We compared the mutation data obtained by this method (SMRT method) with our previous result from colony isolation and whole-genome sequencing (Colony-NGS method). In the ENU-treated samples, the mutation frequencies estimated by the SMRT method (15.4/Mbp) and the Colony-NGS method (12.7/Mbp) were very similar and not significantly different by the binomial test (Fig. 3a). The mutation spectrum obtained by the SMRT method showed that 95% were G:C → A:T transitions and 5% were A:T → G:C transitions (Table 3 and Fig. 3c). This mutation spectrum is consistent with the ENU signature shown in a previous report [10] and with our previous data obtained by the Colony-NGS method (unpublished observations). As for the control (DMSO-treated) samples, no mutation was observed with either the SMRT or the Colony-NGS method; the mutation frequency was therefore calculated as less than 0.12 per Mbp (1 mutation/8.09 Mbp) and less than 0.05 per Mbp (1 mutation/19.6 Mbp), respectively (Fig. 3a).
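The strand-based criteria described above can be sketched as a small decision rule. In this illustration the checking is automated, whereas in the study the raw data were inspected carefully by eye. The sketch assumes that all base calls are already expressed relative to the reference plus strand, and it treats a strand as 'clearly called' when one base accounts for at least a chosen fraction of its reads. The function names and the 0.8 cutoff are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def dominant_base(calls: Sequence[str], min_fraction: float) -> Optional[str]:
    """Return the base supported by at least min_fraction of the calls, else None."""
    if not calls:
        return None
    base, count = Counter(calls).most_common(1)[0]
    return base if count / len(calls) >= min_fraction else None


def classify_site(
    ref_base: str,
    forward_calls: Sequence[str],
    reverse_calls: Sequence[str],
    min_fraction: float = 0.8,  # hypothetical cutoff for a 'clear' call
) -> str:
    """Classify one candidate site from the raw sub-reads of a single molecule."""
    if not forward_calls or not reverse_calls:
        return "unclear"  # one strand was never read at this position
    fwd = dominant_base(forward_calls, min_fraction)
    rev = dominant_base(reverse_calls, min_fraction)
    if fwd is None and rev is None:
        return "unclear"
    if fwd is None or rev is None:
        # One strand is called clearly, the other gives a variety of bases.
        return "damage"
    if fwd == rev:
        return "mutation" if fwd != ref_base else "reference"
    # Both strands are clear but disagree with each other.
    return "mismatch"


if __name__ == "__main__":
    print(classify_site("G", "AAAAAAAA", "AAAAAAA"))  # same substitution on both strands
    print(classify_site("G", "AAAAAAAA", "GGGGGGG"))  # strands disagree
    print(classify_site("G", "AAAAAAAA", "ACGTCAG"))  # one strand ambiguous
```

A rule of this kind would not by itself exclude pre-existing SNPs or calls at poorly covered read edges, which would need separate filtering.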
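The frequencies quoted above follow directly from the mutation counts and the amount of sequence analyzed, as the sketch below shows. It also includes one possible form of a binomial comparison between the two methods: conditional on the total number of mutations, the SMRT count is tested against a binomial distribution whose success probability is the SMRT share of the total sequence. This is an assumption about how such a test could be set up, not necessarily the test used in the study. The Colony-NGS count of 249 is not reported in the text; it is back-calculated here from 12.7/Mbp × 19.6 Mbp purely for illustration.

```python
from math import exp, lgamma, log


def per_mbp(count: int, mbp: float) -> float:
    """Mutation frequency per Mbp of sequence analyzed."""
    return count / mbp


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed in log space for stability."""
    log_pmf = (
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1 - p)
    )
    return exp(log_pmf)


def conditional_binomial_p(count_a: int, mbp_a: float,
                           count_b: int, mbp_b: float) -> float:
    """Two-sided exact p-value for equal per-Mbp rates in samples a and b."""
    n = count_a + count_b
    p0 = mbp_a / (mbp_a + mbp_b)  # expected share of mutations in sample a
    observed = binom_pmf(count_a, n, p0)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    tail = sum(
        pm for pm in (binom_pmf(i, n, p0) for i in range(n + 1))
        if pm <= observed * (1 + 1e-7)
    )
    return min(1.0, tail)


if __name__ == "__main__":
    print("SMRT, ENU-treated:", round(per_mbp(132, 8.56), 1), "per Mbp")
    print("SMRT, control upper bound:", round(per_mbp(1, 8.09), 2), "per Mbp")
    # 249 is an illustrative, back-calculated count (assumption, see lead-in).
    print("p-value:", conditional_binomial_p(132, 8.56, 249, 19.6))
```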

Discussion
In this paper, we successfully detected ultra-low frequency base-substitution mutations by using a single-molecule real-time sequencer with the SMRTbell strategy. In principle, this strategy is applicable to any DNA sample, such as those from bacteria, cell lines, tissues of experimental animals and specimens from patients, and it enables us to quantify the mutation frequency and the mutation signature of such samples. A significant merit of the SMRTbell strategy is that both the plus and minus strands of a double-stranded DNA are sequenced, so we are able to distinguish 'real mutations' from 'mismatches' or 'DNA damage'. Intriguingly, we could detect not only fixed mutations but also mismatches in the Salmonella DNA. With the current procedure, half of the total mismatches are expected to be detected. From our data, the occurrence of the mismatches in the Salmonella genome was roughly estimated as 8 -10. However, to quantify mismatches absolutely, a new bioinformatics tool should be developed. We also detected 4 possible instances of 'DNA damage', only in the ENU-treated sample (Table 4). In Table 3, the raw reads judged as 'Damage' appear to have lower coverage than those judged as 'mutation' or 'mismatch'. This would reflect the presence of DNA damage in the SMRTbell templates. Note that the current procedure is not designed for detection of DNA damage, so the detected number would be far lower than the real number of damaged sites.
The background mutation frequency of the SMRT method in this study was less than 0.12 per Mbp, which is comparable to the background level of 'Duplex Sequencing' methodologies [2,3]. The background level would depend on the thresholds of pass time and accuracy of the CCS. The threshold values used in this study were the strictest values available in the current version of PacBio's instrument control and SMRT Analysis software. The real mutation frequency of the control sample was estimated by combining the Colony-NGS and Ames assay results. In the Ames assay using the same exposure procedure, the mutation frequency of the control sample was 1/685 of that of the ENU-treated sample (Fig. 3b); the mutation frequency of the control sample was therefore estimated as 12.7/685 = 0.02 per Mbp. At this frequency, one mutation is expected per 50 Mbp, so more sequencing data (at least 50 Mbp) are required to detect mutations in the control sample.
As for insertion- and deletion-type mutations, this strategy cannot be used at present because of the very high background level of indels. More deletions may have been observed in the ENU-treated sample because residual DNA damage influenced the sequencing reaction. Ongoing improvements to the hardware and software of the SMRT sequencer and to the bioinformatics of mutation detection will likely overcome this problem in the near future.
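The estimate of the control frequency and of the amount of sequence required can be reproduced with a short calculation. The sketch below additionally assumes that mutations occur as a Poisson process along the sequenced bases, an assumption not made in the text, in order to show how the probability of observing at least one mutation depends on sequencing yield. All function names are hypothetical.

```python
from math import exp, log


def control_frequency(treated_per_mbp: float, fold_difference: float) -> float:
    """Control frequency inferred from the treated frequency and the Ames ratio."""
    return treated_per_mbp / fold_difference


def prob_at_least_one(freq_per_mbp: float, mbp: float) -> float:
    """P(>= 1 mutation observed) under a Poisson model with mean freq * mbp."""
    return 1.0 - exp(-freq_per_mbp * mbp)


def mbp_for_detection(freq_per_mbp: float, probability: float) -> float:
    """Sequence (Mbp) needed to observe >= 1 mutation with the given probability."""
    return -log(1.0 - probability) / freq_per_mbp


if __name__ == "__main__":
    freq = control_frequency(12.7, 685)  # values taken from the text
    print("estimated control frequency:", round(freq, 2), "per Mbp")
    print("Mbp for one expected mutation at 0.02/Mbp:", 1 / 0.02)
    print("P(>=1 mutation) with 50 Mbp at 0.02/Mbp:", prob_at_least_one(0.02, 50))
    print("Mbp for 95% detection at 0.02/Mbp:", mbp_for_detection(0.02, 0.95))
```

Under this Poisson assumption, 50 Mbp corresponds to one expected mutation rather than to near-certain detection, so a several-fold larger yield would be needed for reliable detection.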