Shotgun metagenome data of a defined mock community using Oxford Nanopore, PacBio and Illumina technologies

Sevim, Volkan; Lee, Juna; Egan, Robert; Clum, Alicia; Hundley, Hope; Lee, Janey; Everroad, R. Craig; Detweiler, Angela M.; Bebout, Brad M.; Pett-Ridge, Jennifer; Göker, Markus; Murray, Alison E.; Lindemann, Stephen R.; Klenk, Hans-Peter; O’Malley, Ronan; Zane, Matthew; Cheng, Jan-Fang; Copeland, Alex; Daum, Christopher; Singer, Esther; Woyke, Tanja

doi:10.1038/s41597-019-0287-z


Data Descriptor
Open access
Published: 26 November 2019


Scientific Data volume 6, Article number: 285 (2019)



Abstract

Metagenomic sequence data from defined mock communities are crucial for assessing sequencing platform performance and downstream analyses, including assembly, binning and taxonomic assignment. We report a comparison of shotgun metagenome sequencing and assembly metrics of a defined microbial mock community using the Oxford Nanopore Technologies (ONT) MinION, PacBio and Illumina sequencing platforms. Our synthetic microbial community BMock12 consists of 12 bacterial strains with genome sizes spanning 3.2–7.2 Mbp, 40–73% GC content, and 1.5–7.3% repeats. Size selection of both PacBio and ONT sequencing libraries prior to sequencing was essential to yield comparable relative abundances of organisms among all sequencing technologies. While the Illumina-based metagenome assembly yielded good coverage with few misassemblies, both the Illumina + ONT and the Illumina + PacBio hybrid assemblies greatly improved contiguity but also increased misassemblies, most notably in genomes with high sequence similarity to each other. Our resulting datasets allow evaluation and benchmarking of bioinformatics software on Illumina, PacBio and ONT platforms in parallel.

Measurement(s)	metagenomic data • sequence_assembly
Technology Type(s)	ONT MinION • Illumina sequencing • PacBio RS II
Factor Type(s)	sequencing platform
Sample Characteristic - Organism	Bacteria
Sample Characteristic - Environment	mock community

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.10260740


Background & Summary

Accurate microbial community representation based on cultivation-independent genome sequencing methods has been one of the major challenges in microbial ecology and genomics since the onset of shotgun metagenome sequencing. Existing sequencing technologies display platform-specific biases depending on run mode and chemistry. These biases affect read length, data throughput, GC coverage, error rates, and the ability to resolve repetitive genomic elements^1,2,3. The Oxford Nanopore Technologies (ONT) MinION is the first commercially available sequencer that uses nanopores. In the MinION, nanopore sequencing discriminates individual nucleotides by measuring the change in electrical conductivity as DNA molecules pass through a biological pore⁴. The ONT MinION is a portable sequencing device generating maximum read lengths in excess of 100 kb with the potential to span long repeats, at comparably low cost and high speed (our test runs yielded 10–50 Gb in 48 hours). To date, most published studies using the MinION technology focus on (i) whole genome sequencing (WGS) of organisms with existing reference genomes and on (ii) validating or resolving difficult regions or screens of target genes/gene regions in viral^{5,6,7,8,9,10,11,12}, bacterial^{5,6,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28}, and eukaryotic^{29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44} genomes. Laver et al. compared ONT performance for three bacterial strains with % GC of ~29–71% and showed that the strain with the highest % GC was underrepresented in the sequencing reads⁴⁵. Various genome assemblies were shown to improve in hybrid approaches with Illumina reads³⁰ and reached 99.5% nucleotide identity for a de novo assembly of E. coli¹³. To our knowledge, only two ONT shotgun metagenome studies exist, one of an environmental sample in which DNA was fragmented to ~510–840 bp and the resulting 2D reads (0–1200 bp) were mapped against a database of 400 bp gene fragments⁴⁶, and the other of various low complexity mock communities comparing different long read classification tools⁴⁷. To date, there has not been an ONT shotgun metagenome study that evaluates its long reads in the context of mapping accuracy, assembly contiguity, and overall community representation.

We used a defined community (composed of a pool of separately extracted DNAs), BMock12, that includes 12 bacterial strains belonging to two phyla (Actinobacteria and Flavobacteria) and two proteobacterial classes (Alpha- and Gammaproteobacteria). Genomes from these taxa represent a breadth of genome sizes and range from low to high % GC with variable repeat fractions. BMock12 includes three actinobacterial genomes of the genus Micromonospora characterized by high % GC content and high average nucleotide identity (ANI), which present challenges for assembly (Fig. 1, Table 1). Shotgun sequencing performance on the ONT MinION was compared to other state-of-the-art platforms, Pacific Biosciences RS-II and Illumina HiSeq 2500 (Table 2). Interestingly, we noticed a major impact of input DNA size selection during library preparation on the length distribution of mapped reads in ONT data: without size selection, sequencing favored shorter reads, which also resulted in a slightly skewed community structure (Figs. S1, S2). After size selection and removal of reads <10 kb, relative abundances of each organism were found to be comparable across all sequencing technologies, and equally correlated to molarity (Fig. 2, Tables S1, S2 and S3). Average % identity of both ONT and PacBio mapped reads was 85.9% (Figs. S3, S4). A negligible number of reads mapped to M. coxensis, likely due to low input DNA concentration or quality, or as a result of pipetting error and/or inaccuracies in DNA quantification, as was observed previously⁴⁸. Therefore, this organism was omitted from the remainder of the analysis. Other disagreements between the distributions of % mapped bases and DNA molarity are likely due to these same noise factors.

Table 1 All genomes are available as improved high-quality drafts in the IMG database. See Fig. S1 for detailed statistics.


Table 2 Run information and statistics for each sequencing platform. Average quality score for Illumina reads was 35.3. Percent identity was calculated as E/(E + I + D + S), where E, I, D, and S represent exact matches, insertions, deletions, and substitutions, respectively.
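The percent identity formula in the Table 2 caption can be illustrated with a minimal Python sketch. The function name and the counts in the example are hypothetical and are not taken from the dataset; the result is scaled by 100 on the assumption that the value is reported as a percentage.

```python
def percent_identity(exact: int, insertions: int, deletions: int, substitutions: int) -> float:
    """Percent identity as E / (E + I + D + S), scaled to a percentage."""
    total = exact + insertions + deletions + substitutions
    if total == 0:
        raise ValueError("alignment has no matches or errors to count")
    return 100.0 * exact / total


# Hypothetical counts for a single aligned long read (illustrative only).
example = percent_identity(exact=8590, insertions=300, deletions=560, substitutions=550)
print(f"{example:.1f}%")
```

Insertions, deletions, and substitutions all enter the denominator with equal weight, so a read dominated by indels is penalized in the same way as one dominated by substitutions.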


Although reads <10 kb were removed from ONT and PacBio datasets, the distribution of read lengths peaked at ~12 kb in ONT vs. ~5 kb in PacBio data, because PacBio sequencing generally tends to favor shorter DNA molecules⁴⁹ and likely because size selection for ONT was more successful (Fig. S5). The length distribution of reads mapped to each organism was found to be nearly the same within each sequencing platform (Fig. S6). PacBio and ONT reads displayed comparable distribution patterns of % genome coverage over sequencing depth (Figs. 3 and S7), and in contrast to Illumina reads, they did not show any notable GC bias (Fig. S8). Illumina sequences have previously been described to discriminate against GC-poor and GC-rich genomes and DNA regions^50,51,52. Read mapping errors for ONT were mostly substitutions and deletions and, to a lesser degree, insertions, whereas PacBio errors were dominated by insertions (Figs. S9, S10).

Metagenome assembly was performed using (1) only Illumina reads, (2) Illumina and PacBio reads, or (3) Illumina and ONT reads. Illumina-only assemblies performed well and yielded at least 92.6% reference coverage (Table 3). Six of the 11 Illumina-only genome assemblies displayed fewer misassemblies than the hybrid assemblies, which is likely due to the increased error rate in long reads. Misassemblies in hybrid assemblies were particularly high for the two Halomonas spp., which shared 99% ANI, indicating that hybrid assemblies might generally be challenged by the presence of strains of the same species, or more generally of genomes with high % ANI to each other. In the case of the two Marinobacter spp., which shared 85% ANI, only one of the two genomes generated few misassemblies in the hybrid assemblies (Tables 3 and S4). For all genomes, except that of Propionibacteriaceae bacterium, contiguity improved greatly in the hybrid assemblies. In some hybrid assemblies, the total number of contigs was reduced by an order of magnitude. Illumina + ONT assemblies were less fragmented than Illumina + PacBio assemblies due to the longer average read lengths of the ONT reads (Fig. S11). ANI between genome pairs was the main factor determining the assembly quality (Table S4). Genomes that are closely related to others (particularly the two Halomonas strains with 99% ANI) yielded lower quality assemblies (Table S5). This effect of strain heterogeneity on metagenome assembly has been previously reported through extensive benchmarking⁵³. Similarly, genomes with high repeat content (Psychrobacter, Cohaesibacter, and both Marinobacter species) resulted in more fragmented assemblies as compared to others. Reference coverage was the same or better in hybrid assemblies with the exception of Halomonas sp. HL-4 (Table 3). Total aligned length was comparable between all sequencing technologies (Table S4). Genome pairs with relatively high ANI (the two Halomonas strains, Marinobacter sp. LV10R510-8, Marinobacter sp. LV10MA510-1, M. echinaurantiaca and M. echinofusca) displayed assembly lengths larger than their references, which resulted from contigs that mapped to more than one reference genome.

Table 3 Assembly statistics. NGA50 is the length such that the set of aligned blocks of that length or longer covers at least 50% of the reference genome. Blocks are parts of contigs split at misassembly events.
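The NGA50 statistic named in the Table 3 caption can be sketched in a few lines of Python. This is an illustrative reimplementation under simple assumptions, not the metaquast code: the function name is hypothetical, the input is a plain list of aligned block lengths, and returning None when the blocks never reach half the reference is a convention of this sketch.

```python
from typing import Iterable, Optional


def nga50(aligned_block_lengths: Iterable[int], reference_length: int) -> Optional[int]:
    """Largest length L such that blocks of length >= L cover at least 50% of the reference.

    Returns None if the blocks together cover less than half of the reference.
    """
    covered = 0
    # Walk the blocks from longest to shortest, accumulating covered bases.
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        if 2 * covered >= reference_length:
            return length
    return None


# Hypothetical block lengths (bp) for a 1,200 bp reference.
print(nga50([400, 300, 200, 100], reference_length=1200))
```

Because blocks are contigs split at misassembly events, a long but misassembled contig contributes several shorter blocks, which lowers NGA50 relative to statistics computed on unsplit contigs.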


While arriving at the true community composition of complex microbiomes will remain challenging, current advancements in sequencing protocols have resulted in reduced bias, improved resolution, and more predictable error. Metagenomic sequence data from defined samples, such as MBARC-26⁵⁴, HMP⁵⁵, and the BMock12 data described here are critical to not only assess new or modified wet lab protocols⁵⁶ and performance of sequencing platforms⁵⁷, but also downstream analytical tools and pipelines used to derive biological insights from metagenome datasets^53,58. While ONT had been primarily used for WGS for organisms with existing reference genomes, and hybrid assemblies as well as diagnostics, our study shows that shotgun metagenome data generated on the MinION yields community representation and improved genome assembly contiguity that is comparable to that of the Illumina-PacBio hybrid assembly contiguity (Table 4). As sequencing accuracy and throughput reliability improve and with the development of long read assemblers, this platform is headed towards stand-alone long-read assemblies that are suitable for accurate representations of microbial community structure and predicted function in complex environmental samples.

Methods

Cultivation and DNA extraction

Cultures of Micromonospora coxensis DSM 45161, Micromonospora echinaurantiaca DSM 43904, and Micromonospora echinofusca DSM 43913 were grown aerobically in DSMZ medium 65 Gym Streptomyces Medium (https://www.dsmz.de/?id=441) (DSMZ, Braunschweig, Germany) at 28 °C. Genomic DNA was isolated using the MasterPure Gram Positive DNA Purification Kit (Epicentre, Madison, WI) following the standard protocol provided by the manufacturer but modified by incubating on ice overnight on a shaker and the use of an additional 1 µl proteinase K.

Cultures of Halomonas sp. HL-4 and Halomonas sp. HL-93 were grown aerobically in Hot Lake Heterotroph (HLH) medium⁵⁹ at 30 °C. Genomic DNA was isolated using phenol-chloroform extraction as previously described⁶⁰.

Cultures of Thioclava sp. ES.032, Propionibacteriaceae bacterium ES.041, Cohaesibacter sp. ES.047, and Muricauda sp. ES.050 were grown aerobically on modified PE agar plates⁶¹. Biomass from 1–2 plates was scraped and genomic DNA was isolated using the Qiagen bacterial extraction protocol for the Genomic-tip 500/G kit (Qiagen, Germantown, MD), with minor modifications. Briefly, in addition to the buffer B1, proteinase K and RNase additions, an enzyme cocktail composed of 500 ml achromopeptidase (10 U/ml), 500 ml lysostaphin (0.2 U/ml), 500 ml of lysozyme (100 mg/ml) and 1 ml mutanolysin (1 U/ml) was added to the samples. Samples were placed on a shaker and incubated at 37 °C overnight to lyse the cells. Genomic DNA was extracted the next day using the genomic-tips 500/G, as per the manufacturer’s instructions.

The Marinobacter and Psychrobacter strains isolated from Antarctic Lake Vida (Marinobacter sp. LV10R510-8, Marinobacter sp. LV10MA510-1, and Psychrobacter sp. LV10R520-6) were grown aerobically in R2A media (Difco) with 5% NaCl (25 mL each) under non-shaking conditions at 10 °C. Cells were pelleted by centrifuging for 5 minutes at 12,000 × g. High molecular weight (HMW) genomic DNA was isolated following Ausubel⁶². Briefly, cells were resuspended in TE buffer with 10% SDS and proteinase K (final concentration). Following a 1 h incubation at 37 °C, CTAB (hexadecyltrimethylammonium bromide)/NaCl was added to extract the nucleic acids, and chloroform:isoamyl alcohol was used to purify the preparation. The crude extract was digested with RNase, the HMW gDNA was precipitated in isopropanol, and, following drying, the pellet was resuspended in TE.

All DNA extracts were checked for quality and quantified using a Qubit fluorometer (Invitrogen, Carlsbad, CA) and visually by quantitative gel. Samples were pooled at varying ratios from 1.6–16.2% to generate the mock community (Table 1).

Library creation and sequencing

For Illumina library creation, 100 ng of genomic DNA, brought up to a total of 100 μl in TE, was sheared to 300 bp using the Covaris LE200 (Covaris, Inc. Woburn, MA, USA) and size-selected using SPRI beads (Roche Holding AG, Basel, Switzerland): 60 μl of beads were added to 100 μl of sample. The sample was then incubated at room temperature (RT) for 5 min. Beads were pelleted using a magnetic particle concentrator (MPC) (Thermo Fisher Scientific, South San Francisco, CA, USA) until liquid was clear. The supernatant was removed and transferred to a new tube. AMPure XP (30 μl) beads were then added for the second bead size selection. The mixture was pulse vortexed, quickly spun and incubated at RT for 5 min. Beads were pelleted using an MPC until liquid was clear. The supernatant was then discarded without disturbing the beads and 200 μl of freshly prepared 75% ethanol (EtOH) was added, followed by a 30 s incubation to wash the beads. EtOH was discarded before the EtOH wash step was repeated twice. Afterwards, the sample was placed on a thermocycler (Eppendorf, Hamburg, Germany) with the lid open and incubated at 37 °C until the beads were dry and residual EtOH had evaporated. The beads were re-suspended in 53 μl of EB buffer (Qiagen, Redwood City, CA, USA), vortexed, quickly spun and incubated at RT for 1 min. Beads were pelleted using an MPC until liquid was clear and then 50 μl of supernatant was transferred to a new tube. The fragments were treated with the Kapa Library Preparation Kit ORIGIN (Kapa Biosystems, Wilmington, MA, USA) for the following steps: For end-repair 26 μl MilliQ water, 9 μl 10X End Repair Buffer, and 5 μl End Repair Enzyme were combined in a 1.5 ml tube. The cocktail was vortexed and quickly spun, stored on ice, and then 40 μl was added to the 50 μl DNA sample. The mixture was vortexed and quickly spun, before incubation at 30 °C for 30 min in a thermocycler (Eppendorf, Hamburg, Germany). 
After incubation, 126 μl of AMPure XP beads (Beckman Coulter, Brea, CA, USA) were added to 90 μl of End Repair sample, pulse vortexed, quickly spun, and incubated at RT for 5 min. Beads were pelleted using an MPC until liquid was clear. The supernatant was then discarded without disturbing the beads. The beads were washed twice with 200 μl of freshly prepared 75% EtOH with an incubation time of 30 s. After washing, the sample was incubated at 37 °C in a thermocycler with the lid open until residual EtOH had evaporated. For DNA resuspension, 17.5 μl of EB buffer was added. The sample was vortexed, quickly spun, and incubated at RT for 1 min, before beads were pelleted on an MPC. 15 μl of supernatant was then transferred to a new tube.

For A-tailing, 9 μl of MilliQ water, 3 μl of 10X A-Tailing Buffer and 3 μl of A-Tailing Enzyme were combined in this order in a 1.5 ml tube. The cocktail was vortexed and quickly spun, then 15 μl of the A-Tailing cocktail was added to the 15 μl sample. The mixture was vortexed and quickly spun before incubating the samples in a thermocycler at 30 °C for 30 min, followed by 5 min at 70 °C.

Adapter ligation was performed immediately thereafter: 9 μl of 5X Ligation Buffer and 5 μl of ligase were combined in a 1.5 ml tube. The mixture was pulse vortexed and quickly spun before adding 14 μl of adapter ligation cocktail to the 30 μl sample; 1 μl of 18 μM adapter was then added to the ligation mixture for a final concentration of 400 nM. The mixture was incubated in a thermocycler at 20 °C for 15 min. After adapter ligation, 5 μl of EB Buffer was added to 45 μl of adapter-ligated sample. The sample was size-selected and washed twice with 45 μl of AMPure XP beads as described previously. After the first clean-up step, the sample was resuspended with 52 μl of EB Buffer and 45 μl of supernatant was transferred to a clean tube. After the second clean-up step, the sample was eluted with 25 μl of EB Buffer and 23 μl of supernatant was transferred to a clean tube. The sample was quality-controlled and quantified using an Agilent Bioanalyzer 2100 High Sensitivity Kit.
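The stated final adapter concentration can be checked with a short dilution calculation (C1·V1 = C2·V2). The sketch below assumes the ligation mixture consists only of the volumes named in the paragraph above (30 μl sample, 14 μl ligation cocktail, 1 μl adapter); the function name is hypothetical.

```python
def final_concentration_nM(stock_uM: float, stock_volume_ul: float, total_volume_ul: float) -> float:
    """Final concentration in nM after diluting a stock (in uM) into a total volume."""
    return 1000.0 * stock_uM * stock_volume_ul / total_volume_ul


# Volumes from the adapter ligation step: sample + ligation cocktail + adapter.
total_volume_ul = 30 + 14 + 1
adapter_nM = final_concentration_nM(stock_uM=18, stock_volume_ul=1, total_volume_ul=total_volume_ul)
print(total_volume_ul, adapter_nM)
```

Under these assumptions, the total volume also matches the 45 μl of adapter-ligated sample carried into the subsequent clean-up.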

The prepared Illumina library was further quantified using KAPA Biosystem’s next generation sequencing library qPCR kit (Roche Holding AG, Basel, Switzerland) and run on a Roche Light Cycler 480 real-time PCR instrument according to the manufacturer’s guidelines (Roche Holding AG, Basel, Switzerland). The quantified library was then prepared for sequencing on the Illumina HiSeq sequencing platform (Illumina, Inc., San Diego, CA, USA). First, the TruSeq paired-end cluster kit, v3, and Illumina’s cBot instrument were used to generate a clustered flowcell for sequencing (Illumina, Inc., San Diego, CA, USA). Sequencing of the flowcell was performed on the Illumina HiSeq 2500 sequencer using a TruSeq SBS sequencing kit 200 cycles, v4, following a 2 × 150 indexed run recipe (Illumina, Inc., San Diego, CA, USA) (Table 2). This resulted in 426,735,646 raw reads.

For PacBio library creation, an unamplified library was generated using Pacific Biosciences standard template preparation protocol for creating >10 kb libraries. gDNA (10 μg) was sheared using Covaris g-Tubes to generate >10 kb fragments (Covaris, Inc., Woburn, MA, USA). The sheared DNA fragments were then prepared according to the Pacific Biosciences SMRTbell template preparation kit guidelines (Pacific Biosciences, Menlo Park, CA, USA). Briefly, DNA fragments were treated with DNA damage repair mix, end-repaired, and 5′ phosphorylated. PacBio hairpin adapters were then ligated to the fragments to create SMRTbell templates for sequencing. The SMRTbell templates were purified using exonuclease treatments and size-selected using the Sage Science BluePippin instrument with a 10 kb lower cutoff depending on DNA quality.

PacBio sequencing primers were annealed and v. P6 sequencing polymerase was bound to the SMRTbell templates. The prepared SMRTbell template libraries were then sequenced on a Pacific Biosciences RSII sequencer using v. C4 chemistry and 1 × 240 min sequencing movie run times (Pacific Biosciences, Menlo Park, CA, USA).

For the size-selected ONT library, 10 µg of gDNA was used and quality controlled using FA12 DNA QC. The DNA was sheared using Covaris g-Tubes to generate >10 kb fragments (Covaris, Inc., Woburn, MA, USA). The sheared DNA fragments were then size selected using the Sage Science BluePippin instrument with a 10 kb lower cutoff. After clean-up, DNA was repaired and end-prepared using the NEBNext FFPE DNA Repair kit (New England BioLabs, Ipswich, MA, USA) with the following changes to the manufacturer’s protocol: The reaction volume was doubled to 120 µl, incubation was performed at 20 °C for 20 minutes and at 65 °C for 20 minutes. AMPure XP beads (120 µl) were added to the repaired DNA and incubated at RT for 30 minutes on a Hula mixer, followed by two washes with 70% EtOH. Beads were then resuspended with 61 µl of nuclease-free (NF) water and incubated at RT for 30 minutes on a Hula mixer; 61 µl of the eluate was then transferred into a clean 1.5 ml Eppendorf tube. The resulting DNA was quantified using the Qubit HS DNA kit.

Adapter ligation and clean-up was performed using the Ligation Sequencing Kit SQK-LSK109 (Oxford Nanopore Technologies, Oxford, United Kingdom) with a slightly changed protocol: Ligation buffer, NEBNext Quick T4 DNA ligase, and adapter mix were added to the repaired DNA and incubated at RT for 10 minutes and then overnight at 4 °C. The ligated sample was purified using 100 µl of AMPure XP beads during a 30 minute incubation at RT on the Hula mixer, two bead washing steps using the kit-provided wash buffer and resuspension of the beads in 40 µl of elution buffer at RT for 30 minutes on the Hula mixer; 40 µl of the eluate was then transferred into a clean 1.5 ml tube.

The library was then sequenced on a MinION using R9.4.1 flow cell sequencing chemistry (Table 2). This resulted in 187,507 Pass-1D reads that were processed using the MinKNOW software version 1.13.1.

For the non-size-selected ONT library, 5 μg of gDNA was used to create the ONT library. The DNA was sheared using Covaris g-tubes to generate >10 kb fragments (Covaris Inc., Woburn, MA USA). The sheared DNA was repaired using the NEBNext FFPE Repair Mix (New England BioLabs, Ipswich, MA USA) according to the manufacturer’s instructions. AMPure XP beads (62 μl) were added to the FFPE-repair reaction and incubated at RT for 30 minutes on a Hula mixer, followed by two washes with 70% EtOH. Beads were then resuspended with 93 μl of NF water and incubated for 30 minutes at room temperature on a Hula mixer; 90 μl of the eluate was then transferred to a clean 1.5 mL Eppendorf tube. The resulting DNA was quantified using the Qubit HS DNA kit.

The fragmented and repaired DNA underwent end repair and A-tailing using the NEBNext End Repair/dA-Tailing Module (New England BioLabs) with the following changes to the manufacturer’s protocol: The reaction volume was doubled to 120 μl, incubation was performed at 20 °C for 20 minutes and at 65 °C for 20 minutes. AMPure XP beads (120 μl) were added to the end-prep reaction and incubated for 30 minutes at room temperature on a Hula mixer, followed by two washes with 70% EtOH. Beads were then resuspended in 31 μl of NF water and incubated for 30 minutes at room temperature on a Hula mixer; 61 μl of the eluate was then transferred to a clean 1.5 mL Eppendorf tube. The resulting DNA was quantified using the Qubit HS DNA kit.

Adapter ligation and clean-up was performed using the SQK-LSK108 kit (Oxford Nanopore Technologies, Oxford, United Kingdom) with the following changes to the manufacturer’s protocol: The ligation reaction was incubated at room temperature for 10 minutes and then overnight at 4 °C. The ligated samples were purified using 40 μl of AMPure XP beads, incubated for 30 minutes at room temperature on a Hula mixer followed by two washes using the kit-provided wash buffer. The beads were resuspended in 15 μl of the kit-provided elution buffer and then incubated for 30 minutes at room temperature on a Hula mixer; 15 μl of the eluate was then transferred to a clean 1.5 mL tube and quantified using the Qubit HS DNA kit.

The library was then sequenced on a MinION using the R9.4 flow cell sequencing chemistry and resulted in 144,976 reads.

Sequence QC

BBDuk (filterk = 27 trimk = 27; https://sourceforge.net/projects/bbmap/) was used to remove Illumina adapters, known Illumina artifacts, and phiX, and to quality-trim both ends to Q12 from the Illumina library. Reads were discarded if they contained more than one ‘N’, or had quality scores (before trimming) averaging less than 8 over the read, or had a length under 40 bp after trimming. The remaining reads were mapped to a masked version of human HG19, dog, cat, and mouse with BBMap (https://sourceforge.net/projects/bbmap/), discarding all hits over 93% identity. This process yielded 422,896,888 filtered reads (Table 2). Quality filtering of PacBio sequences was performed using SMRT Portal v2.3.0, setting minimum subread length to 50, minimum polymerase read quality to 75, and minimum polymerase read length to 50; the control spike-in was removed using pbalign with parameters minAccuracy = 0.75 minLength = 50. Filtering yielded 389,806 subreads. ONT basecalling was performed using Albacore basecaller v2.3.1, selecting only the pass-1D reads.
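The three discard criteria for Illumina reads described above can be written as a small predicate. This is a simplified Python sketch and not BBDuk itself: the function name and arguments are hypothetical, the mean quality is taken as a plain arithmetic mean of Phred scores (BBDuk may compute its average differently), and the 'N' count is applied to the untrimmed sequence as an assumption.

```python
from typing import Sequence


def keep_read(sequence: str, phred_scores: Sequence[int], trimmed_length: int) -> bool:
    """Return True if a read passes the three filters described in the text.

    sequence, phred_scores: the read before trimming (assumption of this sketch)
    trimmed_length: read length in bp after quality trimming
    """
    if sequence.upper().count("N") > 1:       # more than one 'N'
        return False
    if not phred_scores:
        return False
    if sum(phred_scores) / len(phred_scores) < 8:  # mean quality below 8 before trimming
        return False
    if trimmed_length < 40:                   # shorter than 40 bp after trimming
        return False
    return True


# Hypothetical reads for illustration.
print(keep_read("ACGT" * 10, [30] * 40, trimmed_length=40))
print(keep_read("NN" + "A" * 48, [30] * 50, trimmed_length=50))
```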

Read Mapping and repeat region identification

Illumina, PacBio, and ONT reads were mapped to reference genomes using bwa v0.7.15 (http://bio-bwa.sourceforge.net/) with default parameters for Illumina. Parameters -x pacbio and -x ont2d were specified for PacBio and ONT reads, respectively. The number of reads that mapped to Micromonospora coxensis was negligible. The distribution of reads that mapped to each organism, as well as numbers of reads that did not map to any organism, are given in Table S1. Reference sequences were downloaded from IMG on June 27, 2017. IMG IDs for references are listed in Table 1. Repeats in genomes were found using repeat-match tool from MUMmer package v3.23⁶³, specifying parameter -n25.

Assembly and assembly quality assessment

For the assembly, we first performed error correction on Illumina reads using bfc version r181 with parameters -1 -s 10g -k 21 -t 10⁶⁴. Unpaired reads were subsequently removed from the library. Error-corrected reads were then assembled using SPAdes v3.12.0⁶⁵ with parameters -m 120 --only-assembler -k 33,55,77,99,127 --meta. For the hybrid assemblies, ONT and PacBio reads were supplied to the assembler via the --nanopore and --pacbio parameters, respectively. Long reads were not error corrected, as recommended in the SPAdes manual. Assembly statistics were generated using metaquast from the Quast 4.6.3⁶⁶ package with default parameters.
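The three assembly configurations differ only in whether a long-read option is added to the SPAdes call. The following Python sketch assembles the argument list from the parameters stated above; it only builds and prints the command and does not run SPAdes. The file and directory names are hypothetical, and passing paired reads with -1/-2 and the output directory with -o is an assumption about how the inputs were supplied. The authors' actual scripts are in the repository linked in the Methods.

```python
import shlex
from typing import List, Optional


def spades_command(reads_1: str, reads_2: str, out_dir: str,
                   long_reads: Optional[str] = None,
                   long_read_type: Optional[str] = None) -> List[str]:
    """Build a SPAdes argument list using the parameters reported in the text."""
    cmd = ["spades.py", "-m", "120", "--only-assembler",
           "-k", "33,55,77,99,127", "--meta",
           "-1", reads_1, "-2", reads_2, "-o", out_dir]
    if long_reads is not None:
        if long_read_type not in ("nanopore", "pacbio"):
            raise ValueError("long_read_type must be 'nanopore' or 'pacbio'")
        cmd += [f"--{long_read_type}", long_reads]  # hybrid assembly
    return cmd


# Hypothetical file names, for illustration only.
illumina_only = spades_command("bmock12_R1.fq.gz", "bmock12_R2.fq.gz", "asm_illumina")
hybrid_ont = spades_command("bmock12_R1.fq.gz", "bmock12_R2.fq.gz", "asm_ont",
                            long_reads="bmock12_ont.fq.gz", long_read_type="nanopore")
print(shlex.join(illumina_only))
print(shlex.join(hybrid_ont))
```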

Data post-processing

Depth of coverage plots in Figs. 3 and S7 were produced using bedtools genomecov⁶⁷. Illumina insert size distribution in Fig. S6 was obtained using picard CollectInsertSizeMetrics⁶⁸. We used jgi_summarize_bam_contig_depths (bitbucket.org/berkeleylab/metabat) with parameter --percentIdentity 70 to produce GC coverage plots in Fig. S8. Percent identity distributions in Figs. S3, S4, error rates in Fig. S9, and distributions in Fig. S10 were generated using jgi_summarize_bam_contig_depths (bitbucket.org/berkeleylab/metabat). Figures S11 and S12 were produced from Metaquast output.
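A curve of % genome coverage over sequencing depth, as plotted in Figs. 3 and S7, can be derived from a depth histogram. The Python sketch below assumes the histogram is available as a mapping from depth to number of reference bases at that depth (the kind of summary bedtools genomecov reports); the function name and the numbers are hypothetical and this is not the authors' plotting code.

```python
from typing import Dict


def percent_covered_at_least(depth_histogram: Dict[int, int], min_depth: int) -> float:
    """Percentage of reference bases with sequencing depth >= min_depth."""
    total_bases = sum(depth_histogram.values())
    if total_bases == 0:
        raise ValueError("empty depth histogram")
    covered = sum(bases for depth, bases in depth_histogram.items() if depth >= min_depth)
    return 100.0 * covered / total_bases


# Hypothetical histogram for a 1,000 bp genome: depth -> number of bases.
histogram = {0: 50, 1: 150, 5: 300, 10: 500}
for threshold in (1, 5, 10):
    print(threshold, percent_covered_at_least(histogram, threshold))
```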

The bash scripts used for QC, mapping, assembly and post-processing are available at https://bitbucket.org/volkansevim/bmock12/src/master/.

Data Records

Shotgun sequences generated on the Illumina, ONT, and PacBio platforms are publicly available through NCBI and details are listed in Supplementary Table 6: SRA Accessions SRX5161985⁶⁹ (ONT no size selection), SRX4901586⁷⁰ (ONT 10 kb size selection), SRX4901584⁷¹ & SRX4901585⁷² (PacBio 10 kb size selection; two libraries were combined for analysis), SRX4901583⁷³ (Illumina). Assemblies have been deposited at NCBI Assembly under the accessions GCA_003957615.1⁷⁴ (PacBio + Illumina hybrid), GCA_003957625.1⁷⁵ (ONT + Illumina hybrid), and GCA_003957645.1⁷⁶ (Illumina only).

Technical Validation

To assess the quality of genomic DNA received, we used the PicoGreen assay and the Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA, USA). Each sample was quantified in quadruplicate.

References

Roberts, R. J., Carneiro, M. O. & Schatz, M. C. The advantages of SMRT sequencing. Genome Biology 14, 405 (2013).
Minoche, A. E., Dohm, J. C. & Himmelbauer, H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biology 12, R112 (2011).
Laver, T. et al. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomolecular Detection and Quantification 3, 1–8 (2015).
Kasianowicz, J. J., Brandin, E., Branton, D. & Deamer, D. W. Characterization of individual polynucleotide molecules using a membrane channel. PNAS 93, 13770–13773 (1996).
Kilianski, A. et al. Bacterial and viral identification and differentiation by amplicon sequencing on the MinION nanopore sequencer. GigaSci 4(12), https://doi.org/10.1186/s13742-015-0051-z (2015).
Karamitros, T. & Magiorkinis, G. A novel method for the multiplexed target enrichment of MinION next generation sequencing libraries using PCR-generated baits. Nucleic Acids Res 43(22), e152, https://doi.org/10.1093/nar/gkv773 (2015).
Sauvage, V. et al. Early MinION™ nanopore single-molecule sequencing technology enables the characterization of hepatitis B virus genetic complexity in clinical samples. PLoS One 13 (2018).
Mikheyev, A. S. & Tin, M. M. Y. A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources 14, 1097–1102 (2014).
Theuns, S. et al. Nanopore sequencing as a revolutionary diagnostic tool for porcine viral enteric disease complexes identifies porcine kobuvirus as an important enteric virus. Sci Rep 8 (2018).
Yamagishi, J. et al. Serotyping dengue virus with isothermal amplification and a portable sequencer. Sci Rep 7 (2017).
Wang, J., Moore, N. E., Deng, Y.-M., Eccles, D. A. & Hall, R. J. MinION nanopore sequencing of an influenza genome. Front. Microbiol. 6 (2015).
Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).
Loman, N. J., Quick, J. & Simpson, J. T. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods 12, 733–735 (2015).
Li, C. et al. INC-Seq: accurate single molecule reads using nanopore sequencing. Gigascience 5 (2016).
Quick, J., Quinlan, A. R. & Loman, N. J. A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer. GigaScience 3, 22 (2014).
Ashton, P. M. et al. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nature Biotechnology 33, 296–300 (2015).
Ip, C. L. C. et al. MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Res 4, 1075, https://doi.org/10.12688/f1000research.7201.1 (2015).
Deschamps, S. et al. Characterization, correction and de novo assembly of an Oxford Nanopore genomic dataset from Agrobacterium tumefaciens. Scientific reports 6, 28625 (2016).
Mitsuhashi, S. et al. A portable system for rapid bacterial composition analysis using a nanopore-based sequencer and laptop computer. Scientific reports 7(1), 5657 (2017).
Xia, Y. et al. MinION Nanopore Sequencing Enables Correlation between Resistome Phenotype and Genotype of Coliform Bacteria in Municipal Sewage. Frontiers in microbiology 8, 2105 (2017).
Judge, K., Harris, S. R., Reuter, S., Parkhill, J. & Peacock, S. J. Early insights into the potential of the Oxford Nanopore MinION for the detection of antimicrobial resistance genes. J Antimicrob Chemother 70, 2775–2778 (2015).
Votintseva, A. A. et al. Same-Day Diagnostic and Surveillance Data for Tuberculosis via Whole-Genome Sequencing of Direct Respiratory Samples. J Clin Microbiol 55, 1285–1298 (2017).
Hyeon, J.-Y. et al. Quasimetagenomics-Based and Real-Time-Sequencing-Aided Detection and Subtyping of Salmonella enterica from Food Samples. Appl. Environ. Microbiol. 84(4), e02340-17 (2018).
Hu, J. et al. Diversified Microbiota of Meconium Is Affected by Maternal Diabetes Status. PloS one 8, e78257 (2013).
Lemon, J. K., Khil, P. P., Frank, K. M. & Dekker, J. P. Rapid Nanopore Sequencing of Plasmids and Resistance Gene Detection in Clinical Isolates. J Clin Microbiol 55, 3530–3543 (2017).
Sanderson, M. A., Adler, P. R., Boateng, A. A., Casler, M. D. & Sarath, G. Switchgrass as a biofuels feedstock in the USA. Canadian Journal of Plant Science 86, 1315–1325 (2006).
Quainoo, S. et al. Whole-Genome Sequencing of Bacterial Pathogens: the Future of Nosocomial Outbreak Analysis. Clin Microbiol Rev 30, 1015–1063 (2017).
Quick, J. et al. Rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of Salmonella. Genome Biol 16(1), 114 (2015).
Fraiture, M.-A. et al. Nanopore sequencing technology: a new route for the fast detection of unauthorized GMO. Scientific reports 8(1), 7903 (2018).
Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 25, 1750–1756 (2015).
Norris, A. L., Workman, R. E., Fan, Y., Eshleman, J. R. & Timp, W. Nanopore sequencing detects structural variants in cancer. Cancer Biology & Therapy 17, 246–253 (2016).
Hoang, P. N. T. et al. Generating a high-confidence reference genome map of the Greater Duckweed by integration of cytogenomic, optical mapping and Oxford Nanopore technologies. The Plant Journal 96, 670–684 (2018).
Tyson, J. R. et al. MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome. Genome Res 28, 266–274 (2018).
Wei, X., Shao, M., Gale, W. & Li, L. Global pattern of soil carbon losses due to the conversion of forests to agricultural land. Scientific reports 4, 4062 (2014).
Pomerantz, A. et al. Real-time DNA barcoding in a rainforest using nanopore sequencing: opportunities for rapid biodiversity assessments and local capacity building. GigaScience 7(4), giy033 (2018).
Runtuwene, L. R. et al. Nanopore sequencing of drug-resistance-associated genes in malaria parasites, Plasmodium falciparum. Scientific reports 8(1), 8286 (2018).
Hargreaves, A. D. & Mulley, J. F. Assessing the utility of the Oxford Nanopore MinION for snake venom gland cDNA sequencing. PeerJ 3, e1441 (2015).
Zaaijer, S. & Erlich, Y. Using mobile sequencers in an academic classroom. eLife 5 (2016).
Lindberg, M. R. et al. A Comparison and Integration of MiSeq and MinION Platforms for Sequencing Single Source and Mixed Mitochondrial Genomes. PLoS One 11(12), e0167600 (2016).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36, 338–345 (2018).
Jansen, H. J. et al. Rapid de novo assembly of the European eel genome from nanopore sequencing reads. Scientific reports 7(1), 7213 (2017).
Liem, M. et al. De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing. F1000Research 6 (2018).
Volden, R. et al. Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA. Proc Natl Acad Sci USA 115, 9726–9731 (2018).
Parker, J., Helmstetter, A. J., Devey, D., Wilkinson, T. & Papadopulos, A. S. T. Field-based species identification of closely-related plants using real-time nanopore sequencing. Scientific reports 7(1), 8345 (2017).
Laver, T. et al. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomolecular Detection and Quantification 3, 1–8 (2015).
Hu, Y. O. O. et al. Stationary and portable sequencing-based approaches for tracing wastewater contamination in urban stormwater systems. Scientific reports 8(1), 11907 (2018).
Brown, B. L., Watson, M., Minot, S. S., Rivera, M. C. & Franklin, R. B. MinION™ nanopore sequencing of environmental metagenomes: a synthetic approach. Gigascience 6, 1–10 (2017).
Nakayama, Y., Yamaguchi, H., Einaga, N. & Esumi, M. Pitfalls of DNA Quantification Using DNA-Binding Fluorescent Dyes and Suggested Solutions. PLoS One 11(3), e0150528 (2016).
Ardui, S., Ameur, A., Vermeesch, J. R. & Hestand, M. S. Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Res 46, 2159–2168 (2018).
Dohm, J. C., Lottaz, C., Borodina, T. & Himmelbauer, H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36, e105 (2008).
Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Hillier, L. W. et al. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5, 183–188 (2008).
Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software. Nature Methods 14, 1063–1071 (2017).
Singer, E. et al. Next generation sequencing data of a defined microbial mock community. Scientific Data 3, 160081 (2016).
The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215–221 (2012).
Bowers, R. M. et al. Impact of library preparation protocols and template quantity on the metagenomic reconstruction of a mock microbial community. BMC Genomics 16(1), 856 (2015).
Singer, E. et al. High-resolution phylogenetic microbial community profiling. The ISME Journal 10, 2020–2032 (2016).
Bushnell, B., Rood, J. & Singer, E. BBMerge – Accurate paired shotgun read merging via overlap. PloS one 12, e0185056 (2017).
Cole, J. K. et al. Phototrophic biofilm assembly in microbial-mat-derived unicyanobacterial consortia: model systems for the study of autotroph-heterotroph interactions. Front. Microbiol. 5 (2014).
Moore, D. D. & Dowhan, D. Preparation and Analysis of DNA. Current Protocols in Molecular Biology (1995).
Hanada, S., Hiraishi, A., Shimada, K. & Matsuura, K. Chloroflexus aggregans sp. nov., a Filamentous Phototrophic Bacterium Which Forms Dense Cell Aggregates by Active Gliding Movement. International Journal of Systematic and Evolutionary Microbiology 45, 676–681 (1995).
Ausubel, F. M. et al. Current Protocols in Molecular Biology. 1 (John Wiley & Sons, Inc, 1994).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biology 5, R12 (2004).
Li, H. BFC: correcting Illumina sequencing errors. Bioinformatics 31, 2885–2887 (2015).
Bankevich, A. et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19, 455–477 (2012).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Quinlan, A. R. BEDTools: the Swiss‐army tool for genome feature analysis. Current protocols in bioinformatics 47(1), 11–12 (2014).
Broad Institute. Picard Toolkit. http://broadinstitute.github.io/picard/ (GitHub Repository, Broad Institute, 2019).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRX5161985 (2019).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRX4901586 (2019).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRX4901584 (2019).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRX4901585 (2019).
NCBI Sequence Read Archive, https://identifiers.org/ncbi/insdc.sra:SRX4901583 (2019).
Sevim, V. et al. GenBank, https://identifiers.org/insdc:RKMI00000000 (2019).
Sevim, V. et al. GenBank, https://identifiers.org/insdc:RKMJ00000000 (2019).
Sevim, V. et al. GenBank, https://identifiers.org/insdc:RJWC00000000 (2019).


Acknowledgements

The authors gratefully acknowledge the help of Gabi Poetter, DSMZ, for growing cells of DSM 43904, DSM 43913 and DSM 45161 and of Meike Doeppner, DSMZ, for DNA extraction and quality control. Work conducted at LLNL was performed under DOE Award SCW1039 and Contract No. DE-AC52-07NA27344. This work was conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, and was supported under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, 94598, USA
Volkan Sevim, Juna Lee, Robert Egan, Alicia Clum, Hope Hundley, Janey Lee, Ronan O’Malley, Matthew Zane, Jan-Fang Cheng, Alex Copeland, Christopher Daum, Esther Singer & Tanja Woyke
NASA Ames Research Center, Exobiology Branch, Moffett Field, CA, 94035, USA
R. Craig Everroad, Angela M. Detweiler & Brad M. Bebout
Bay Area Environmental Research Institute, Moffett Field, CA, 94035, USA
Angela M. Detweiler
Lawrence Livermore National Laboratory, Nuclear and Chemical Science Division, 7000 East Ave, Livermore, CA, 94550-9234, USA
Jennifer Pett-Ridge
Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH, Inhoffenstraße 7B, 38124, Braunschweig, Germany
Markus Göker
Desert Research Institute, Division of Earth and Ecosystem Sciences, 2215 Raggio Pkwy, Reno, NV, 89512, USA
Alison E. Murray
Purdue University, 610 Purdue Mall, West Lafayette, IN, 47907, USA
Stephen R. Lindemann
Newcastle University, School of Natural and Environmental Sciences, Ridley Building 2, Newcastle upon Tyne, NE1 7RU, UK
Hans-Peter Klenk
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, 94720, USA
Esther Singer

Contributions

R.C.E., A.M.D., B.M.B., M.G., A.M., S.R.L. and H.-P.K. grew the isolates and extracted the DNA. Ja.L. created the mock community pool. Ju.L., H.H., R.O., M.Z. and C.D. generated the sequence data. V.S., R.E. and A.C. performed quality control and read mapping, and submitted the sequence data to the database. V.S. created the figures and tables. E.S., V.S. and T.W. wrote the manuscript.

Corresponding authors

Correspondence to Esther Singer or Tanja Woyke.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figures

Supplementary Tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.


About this article

Cite this article

Sevim, V., Lee, J., Egan, R. et al. Shotgun metagenome data of a defined mock community using Oxford Nanopore, PacBio and Illumina technologies. Sci Data 6, 285 (2019). https://doi.org/10.1038/s41597-019-0287-z


Received: 07 January 2019
Accepted: 31 October 2019
Published: 26 November 2019
DOI: https://doi.org/10.1038/s41597-019-0287-z
