A new massively parallel nanoball sequencing platform for whole exome research

Background Whole exome sequencing (WES) has been widely used in human genetics research. BGISEQ-500 is a recently established next-generation sequencing platform. However, the performance of BGISEQ-500 on WES is not well studied. In this study, we evaluated the performance of BGISEQ-500 on WES by side-to-side comparison with Hiseq4000, on well-characterized human sample NA12878. Results BGISEQ demonstrated similarly high reproducibility as Hiseq for variation detection. Also, the SNVs from BGISEQ data is highly consistent with Hiseq results (concordance 96.5%~ 97%). Variation detection accuracy was subsequently evaluated with data from the genome in a bottle project as the benchmark. Both platforms showed similar sensitivity and precision in SNV detection. While in indel detection, BGISEQ showed slightly higher sensitivity and lower precision. The impact of sequence depth and read length on variation detection accuracy was further analyzed, and showed that variation detection sensitivity still increasing when the sequence depth is larger than 100x, and the impact of read length is minor when using 100x data. Conclusions This study suggested that BGISEQ-500 is a qualified sequencing platform for WES.


Background
The launch of the Roche 454 sequencer [1] opened the era of next-generation sequencing (NGS). Compared with the traditional Sanger sequencing technology [2], NGS has significantly larger throughput and lower per-base cost. Taking these advantages, researchers can analyze the information of all the genes in one project, rather than doing gene-by-gene studies. Researchers can obtain the information on the whole genome, proteincoding exons, or other specified regions by performing whole genome sequencing (WGS), whole exome sequencing (WES) [3] or target region sequencing (TRS), respectively. As an easy to interpret, known functional impacts, and relatively low-cost technology comparing with WGS, WES is widely used in human genetics research nowadays.
In the short history of NGS era, five major sequencing platforms have emerged: Roche 454, Illumina Hiseq series (GA, Hiseq, Miseq, X) [4], SOLiD [5], Complete Genomics [6], and Ion Torrent [7]. These platforms use different mechanics and have their specific advantages and disadvantages [8]. After years of technology evolution and competition, Hiseq becomes the most widely used sequencing platform. In 2015, BGI and Complete Genomics jointly announced a new next-generation sequencer, BGISEQ-500 [9]. However, its performance on WES has not yet been well evaluated by the scientific community.
We evaluated the performance of BGISEQ-500 on WES by parallel comparison with Hiseq 4000 on the well-characterized human sample NA12878. We compared the concordance of variation detected between the sequencing platforms, and their variation detection accuracy with the reference variation dataset from the genome in a bottle project (GIAB) [10]. We found that BGISEQ-500 has comparable reproducibility and competitive variation detection accuracy to Hiseq 4000.

Data production
The DNA of NA12878 was used in this study. Agilent SureSelect Kit v5 (50.4 Mb) was used for exome capture. The sequencing strategy was pair-end 150 bp for Hiseq4000 and pair-end 100 bp for BGISEQ-500. The DNA was sequenced to >100x on both Hiseq 4000 and BGISEQ-500 platform. (Methods) Each platform sequenced four replicable libraries, resulting in eight datasets in total. For comparison, each dataset was down-sampled to 100x. BGISEQ showed higher exome capture efficiency (faction of bases on target, BGISEQ 72% vs. Hiseq 58%, fraction of reads on target, BGISEQ 83% vs. Hiseq 71%), therefore it requires less sequencing data to reach the same sequencing depth. BGISEQ also showed slightly lower duplication rate than Hiseq (7.0% vs 7.6%). For each dataset, > 99.6% bases of the target region are covered by at least one read, and > 96% bases of the target region are covered with > = 20 reads, indicating that the whole target region is comprehensively and uniformly captured on both platforms (Table 1, Fig. 1).

Variation detection
The variation detection was processed under the guidelines from the Genome Analysis Toolkit (see Methods for details) [11,12]. Only the target region was used for variation detection. Roughly 41 thousand single nucleotide variants (SNVs) were detected from each dataset, including 19 thousand inside the protein coding region, and~9.4 thousand could lead to protein change ( Table 2). The BGI-SEQ datasets generated slightly fewer SNVs than Hiseq datasets. About 99.7% of detected SNVs could be found in dbSNP142 [13]. The transition/transversion ratio (Ti/Tv) on whole target region and on the exonic regions is 2.56 and 3.09, and the corresponding heterozygous to homozygous variation ratio (het/hom) is 1.64 and 1.52, respectively. (Table 2) These metrics from our datasets are comparable to other sources [10,14].
Roughly 3.5 thousand insertion/deletions (indels) were detected, approximately 470 out of which lie on coding regions. BGISEQ detected slightly more heterozygous indels than Hiseq, resulting in higher het/hom ratio (1.46 vs. 1.31). ( Table 2) Around 95% of indels have been previously reported in dbSNP142. The indels from these datasets also showed similar length distribution (Fig. 2).

Variation concordance
It has been noticed that the repetitive sequence in the genome could lead to ambiguity of short fragment alignment, which subsequently leads to false variation detection results. This could be a major cause of the SNV detection errors. [15] Using the genome mappability score [15],~2.3% of the target region was identified with alignment uncertainty. These regions were eliminated, and only the mappable regions were used hereafter.
The SNVs and indels from these four datasets were compared against each other separately, and Jaccard similarity was used to measure the concordance between datasets. It is showed that SNV results have 97.6% intra-platform concordance and 96.7% inter-platform concordance, and BGI-SEQ has slightly higher intra-platform concordance than Hiseq. (Figure 3) The high intra-platform concordance indicated qualified reproducibility of each platform. Moreover, BGISEQ has excellent inter-platform concordance with Hiseq, suggesting that BGISEQ could substitute Hiseq in many application fields where SNVs are the primary focus.
For indel, the intra-platform concordance is 82.3% for BGISEQ and 83.6% for Hiseq, and the inter-platform concordance is 81.7%. Indels with exact the same position and alternate alleles were considered as concordant,   regardless of their genotypes. This concordance level is lower than SNV's, as expected, because it is harder to detect indels from short reads using current methods [16]. It is possible that there could be different concordance metrics for indels, depending on how position and genotype concordance is defined. When only the location is considered, regardless of the genotypes, the intra-platform concordance increased to 87.7% for BGISEQ and 87.4% for Hiseq, and the inter-platform concordance increased to 85.1%. On the other hand, if we restrict concordance sites as exact genotype match, the intra-platform concordance decreased to 79.4% for BGISEQ and 81.3% for Hiseq, and the inter-platform concordance decreased to 76.5%. These results suggest that different datasets and platforms have better agreement on indel location, but diverge on zygosity and genotypes.

Variation accuracy
As the sequenced sample NA12878 had been well characterized by genome in the bottle project (GIAB) [10], the genotype result from GIAB was used as the reference to benchmark variants accuracy. Only the high confidence regions from the GIAB dataset was used for evaluation. Sensitivity and precision were used during the evaluation. Variation detected by both test data and GIAB were considered as true positive if they have exactly the same positions, regardless of whether they have the same genotypes. Overlapping indel positions were not considered as the same one unless they have the same start and end positions. Regarding SNV detection, 35,210 SNVs were found in the GIAB dataset, and the sensitivity and precision from datasets is 99.0 and 99.4%, respectively. Both platforms have excellent SNV detection accuracy. Furthermore, within 196 false positive and 356 false negative sites per sample, 126 (64%) and 204(57%) are concordant in all 8 samples, respectively, indicating errors in the reference set or systemic exome sequencing bias. Taking this into consideration, the actual accuracy from these platforms should be higher, and the difference between datasets on these loci is minimal.
For indel detection, 2650 indels are found in the GIAB dataset, and the sensitivity is 93.8 and 92.6%, and the precision 97.2 and 98.0% for BGISEQ and Hiseq, respectively. BGISEQ showed higher sensitivity but lower precision than Hiseq. Thirteen false positive and 42 false negative sites are concordant across samples, contributing to 22.9 and 23.6% total false positives and false negatives on average. To achieve better indel detection performance is still challenging for exome sequencing and GATK pipeline.

Impact of sequence depth and read length
To analyze the sequence depth effects to variation detection, the raw data were down-sampled to various sequence depths (20x, 30x, 50x, 70x, 100x, 150x). The SNV detection sensitivity increases with increased sequence depth, and it plateaus after the sequence depth exceeds 100x (Fig. 4). The increase in sensitivity may be due to better coverage of the target with increased sequence depth. On the other hand, the SNV detection precision stays constant while sequence depth increased, showing that the model has reached its limit when the depth is greater than 20x. For indel detection, the sensitivity increases while sequence depth increasing, as expected. But it does not reach a plateau even when the sequence depth is as high as 150x, showing that additional data is required for a better indel detection.
By truncating read length to 100 bp, four additional Hiseq pair-end 100 bp (PE100) datasets were generated from Hiseq datasets respectively, with the same sequence depth used in the above evaluation. Compared to Hiseq PE150 datasets, PE100 datasets showed similar precision and sensitivity on SNV detection. On indel detection, PE150 and PE100 have similar precision while PE100 showed slightly higher sensitivity ( Table 3). The result suggests that the read length has no significant Fig. 3 Concordance of variation detection. The Jaccard similarity for variation detection results from datasets was calculated for SNV (top-left triangle) and indel (bottom-right triangle) separately. SNV detection showed excellent intra-and inter-platform concordance, while indel detection showed inferior concordance. The inter-platform concordance is slightly lower than intra-platform concordance impact on variation detection accuracy at this sequencing depth level.

Discussion
The reproducibility of a method and its consistency with other available methods are essential for its application in academic and clinical scenarios. The aim of this study was to evaluate these characteristics of BGISEQ-500 in WES studies. Considering that WES is widely used in human genetics research and approaching clinical use, the validation of this newly established platform is crucial.
By comparison with Hiseq4000 data, this study showed BGISEQ could achieve comparable coverage to Hiseq for exome capture and sequencing procedure. For variation detection, both platforms have high and comparable reproducibility, although the reproducibility of indels is lower than of SNVs. Furthermore, the inter-platform concordance is commensurate with intra-platform concordance on SNV detection. This indicates that BGISEQ WES is capable of the applications which Hiseq WES SNV data has been tested and verified for.
For data usage, BGISEQ showed higher exome capture efficiency, and requires about 25% less data than Hiseq to reach the same sequencing depth. If the sequencing cost per gigabase is comparable between the two platforms, BGISEQ will have a lower cost per sample to obtain the same amount of effective data. Furthermore, smaller dataset also means less computational resources and runtime for bioinformatics analysis.
Unlike former Complete Genomics sequencers, the raw data generated from BGISEQ-500 is in fastq format, the de facto standard format for NGS data. As a result, the data is acceptable to most of the commonly used analysis software. This allows scientists to manipulate the data by themselves, to adapt the cutting edge analysis methods, and to compare results with other data easily. As an illustration, the commonly used bwa-GATK pipeline was applied to both Hiseq and BGISEQ data It is important to note that BGISEQ is still in its rapidly evolving phase and the amount of data generated from BGISEQ is limited. Therefore, the results showed in this study could be limited by its sample size and the current state of the instruments. The full evaluation and validation of this platform requires more data from research and clinical scenarios.

Conclusion
In this study, we evaluated the performance of BGISEQ-500 on WES and compared it with Hiseq4000 WES and GIAB high-confidence variation dataset. BGISEQ showed high reproducibility and concordance in intra-and inter-sequencing platforms. Both BGISEQ and Hiseq platforms demonstrated adequate variation detection accuracy on the benchmark region. These results suggest that BGISEQ-500 is a qualified sequencing platform for WES.

Data production
The DNA of NA12878 was acquired from Coriell Institute (Catalog ID NA12878). Agilent SureSelect Kit v5 was used for exome capture. The library construction and sequencing procedure on BGISEQ-500 were as described in BGI-SEQ sequencing section, with a~170 bp insert size and pair-end 100 bp sequencing strategy. The procedure on Hiseq4000 followed the manufacturer instructions with a 250~300 bp insert size and pair-end 150 bp sequencing strategy. The DNA was sequenced to >100x on both platforms. Each platform sequenced four replicable libraries. Each dataset was randomly down-sampled to 100x for comparison.
BGISEQ sequencing DNA preparation 1 μg DNA (Qubit quantified) was sheared by Covaris and double selected with Ampure XP beads to acquire Single strand DNA (ssDNA) circle construction 300 ng Post-PCR products were denatured at 95°C for 3 min (with heated lid at 105°C) and transferred to 4°C quickly to make a single strand DNA circle (ssDNA circle). After heat denaturation, the splint oligo binds to the adapters on both ends of a single Post-PCR product, guiding both ends of the single strand to adjacent positions. The following ligation reaction using T4 DNA ligase at 37°C for 30 min helps connect the adjacent bases on different ends of the single strand with a phosphodiester bond to complete the circularization. An enzyme digestion using Exo I and Exo III at 37°C, 30 min was implemented to eliminate uncirculated DNA. The libraries were purified with 168 μL of Ampure XP beads and quantified with Qubit BR ssDNA kit. The resulting ssDNA circle is the final library.

Make DNA nanoballs (DNBs)
DNBs were generated from the ssDNA circle using rolling circle amplification (RCA) to enhance the fluorescent signals in the sequencing process [6]. Primer mix bind to 6 ng ssDNA circles at 95°C for 1 min, 65°C for 1 min, and 40°C for 1 min. 40 μL Phi29 DNA polymerase and 4 μL SSB was added for RCA reaction at 30°C for 30 min, then quickly transferred to 4°C. The reaction was ended completely by 20 μl stop buffer, generating even-sized DNBs which will have similar fluorescent intensity in the sequencing process to ensure the signal chastity. Compared to PCR amplification, RCA has no PCR error accumulation and no PCR bias because the original ssDNA circle is the only template during the entire amplification process [6]. Unlike emulsion or bridge PCR, the rolling-circle amplification does not require precise titration of template concentrations in situ and circumvents stochastic inefficiencies.

Loading and sequencing
The DNBs were combined with 1/4 volume of DNB loading buffer and an appropriate amount of PBS buffer to a total volume of 140 μL, and placed on the loader machine. The DNBs were loaded onto the flow cell in which DNB binding sites are patterned nano-arrays. Sequencing data were generated with pair-end 100 bp sequencing strategy on the BGISEQ-500 platform.

Variation concordance analysis
Regions with genome mappability scores (GMS) [15] less than 1 were excluded from the evaluation. Jaccard similarity (number of sites where both datasets detected as SNV divided by the number of sites where at least one dataset is detected as SNV) was used to measure the concordance between datasets.

Variation accuracy estimation
The genotype result from GIAB was used as the reference to benchmark variation accuracy. Only the GIAB high confidence regions with GMS equal to 1 was used. During the evaluation, precision (true positive/(true positive + false positive)) instead of specificity (true negative/(true negative + false positive)) was used because true negative dominated the dataset. Variation loci detected in both test data and GIAB were considered as true positive.