Terabase Metagenome Sequencing of Grassland Soil Microbiomes

To enable an in-depth survey of the metabolic potential of complex soil microbiomes, we performed ultra-deep metagenome sequencing, collecting >1 Tb of sequence data from three grassland soils representing different precipitation regimes.

As part of the Pacific Northwest National Laboratory (PNNL) Science Focus Area program (1,2), we are investigating the impact of environmental change on microbial community function in grassland soils. Three grassland soils, representing different moisture regimes, were selected for ultra-deep metagenome sequencing, resulting in >1 Tb of sequence data per location. This data set serves as a resource for deep analysis of soil microbiome composition and metabolic potential.
Soils were collected from three grassland field site locations. Arid regime soil (irrigated agriculture), characterized as a coarse silty loam, was collected from the Washington State University Irrigated Agriculture Research and Extension Center (IAREC) (46.25°N, 119.73°W). Intermediate precipitation regime soil (rain-fed and irrigated agriculture), characterized as a fine clay loam, was collected from the Konza Prairie Biological Station (KPBS) (39.10°N, 96.61°W) (3,4). Frequent precipitation regime soil (rain-fed and tile-drained agriculture), characterized as a fine silty clay loam, was collected from the Iowa State University Comparison of Biofuel Systems (COBS) (41.92°N, 93.75°W) (5).
Surface soil samples (2 cm by 0 to 20 cm) were collected from three randomly selected field site block locations using a push corer (3 subsamples per block, 3 replicates per subsample). Replicate subsamples were sieved together, resulting in 9 independent samples per site. Samples were flash frozen and stored at −80°C until further processing.
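The sampling design above can be sketched as a short enumeration. This is a minimal illustration, assuming that each "replicate" corresponds to one core and that the three replicate cores of a subsample are pooled into one sample; the sample identifiers are hypothetical and do not come from the data packages.

```python
from itertools import product

# Design constants taken from the text.
BLOCKS = 3                    # randomly selected field site blocks
SUBSAMPLES_PER_BLOCK = 3      # subsamples per block
REPLICATES_PER_SUBSAMPLE = 3  # replicates sieved together per subsample


def sampling_design(site):
    """Map each pooled sample to the replicate cores assumed to compose it.

    The ID scheme (site_Bx_Sy_Rz) is hypothetical and for illustration only.
    """
    samples = {}
    for block, sub in product(range(1, BLOCKS + 1),
                              range(1, SUBSAMPLES_PER_BLOCK + 1)):
        sample_id = f"{site}_B{block}_S{sub}"
        samples[sample_id] = [
            f"{sample_id}_R{rep}"
            for rep in range(1, REPLICATES_PER_SUBSAMPLE + 1)
        ]
    return samples


design = sampling_design("KPBS")
n_samples = len(design)                                # pooled samples per site
n_cores = sum(len(cores) for cores in design.values())  # cores under the assumption above
print(n_samples, n_cores)
```

Under this reading, 3 blocks with 3 subsamples each give the 9 independent samples per site stated in the text.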
DNA was extracted from 3 × 0.25 g soil for each of the 9 field samples per site using the PowerSoil DNA extraction kit (Qiagen), with bead beating, and quantified. The extracted DNA samples from each site were combined to generate a pooled sample from each location (IAREC, COBS, and KPBS) for sequencing. Metagenomic libraries were prepared using the TruSeq PCR-free kit (Illumina) and a starting material of 1 µg DNA from the pooled DNA. Sequencing was performed on an Illumina HiSeq X system at Fulgent Genetics (Los Angeles, CA), generating 150-nucleotide paired-end reads to a final effort of at least 1 Tb of sequence per site (Table 1). BBDuk (BBTools package v38.38) (6) was used to trim adapter sequences from raw reads (adapters_no_transposase database), to perform quality filtering (parameters: int, ow; k, 27; hdist, 1; qtrim, f; minlen, 35), and to remove contaminants (sequencing_artifacts and phix174_ill reference database). Assembly was performed using the metaHipMer assembler (see MIMS metadata files for the specific developmental version used for each site) with kmer lengths of 21, 31, 55, and 71 (7) on the NERSC Cori platform (https://docs.nersc.gov/systems/cori). Scaffolds <2,500 bp long were omitted from further analysis. Quality-screened reads were mapped to scaffolds using the Burrows-Wheeler Aligner (v0.7.12) (8), and depth of coverage was determined across each scaffold using SAMtools (v1.9) (9).
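The scaffold length cutoff and the per-scaffold depth summary described above can be illustrated with a short sketch. This is not the pipeline used to build the data set; it assumes a plain FASTA assembly and three-column, tab-separated per-position depth records (scaffold, position, depth), as produced by `samtools depth` for a single alignment file. The function names and the choice of mean depth as the summary statistic are illustrative assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_SCAFFOLD_LEN = 2500  # scaffolds shorter than this were omitted (from the text)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_scaffolds(records: Iterable[Tuple[str, str]],
                     min_len: int = MIN_SCAFFOLD_LEN) -> Dict[str, str]:
    """Keep scaffolds whose length is at least min_len bp."""
    return {name: seq for name, seq in records if len(seq) >= min_len}


def mean_depth(depth_lines: Iterable[str],
               lengths: Dict[str, int]) -> Dict[str, float]:
    """Mean depth per retained scaffold from scaffold/position/depth records.

    Dividing by scaffold length (not by the number of reported positions)
    counts positions absent from the depth records as zero coverage.
    """
    totals = {name: 0 for name in lengths}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        name, depth = fields[0], fields[2]
        if name in totals:
            totals[name] += int(depth)
    return {name: totals[name] / lengths[name] for name in lengths}


# Toy input: one 2,000-bp scaffold (dropped) and one 2,800-bp scaffold (kept).
demo_fasta = [">s1 short", "A" * 2000, ">s2", "ACGT" * 500, "ACGT" * 200]
kept = filter_scaffolds(read_fasta(demo_fasta))
lengths = {name: len(seq) for name, seq in kept.items()}
depths = mean_depth(["s2\t1\t10", "s2\t2\t18", "s1\t1\t5"], lengths)
print(sorted(kept), depths)
```

On real data, the FASTA lines and depth records would be read from files rather than from in-memory lists.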
These metagenomes are intended as a resource for the scientific community and should facilitate understanding of the highly diverse and complex metabolic potential that is encoded in soil microbial genomes.
Data availability. Metagenomic sequence data have been deposited in the PNNL DataHub repository and are available for download under project DOI numbers WA-TmG.1.0, KS-TmG.1.0, and IA-TmG.1.0. The versions described in this paper are the first versions. Each package contains raw reads, assemblies, functional annotations, field site plot maps, MIMS.me.soil.5.0 metadata information, and a package "read me" file.