Seasonal Ely Copper Mine Superfund site shotgun metagenomic and metatranscriptomic data analysis

High throughput sequencing data collected from acid rock drainage (ARD) communities can reveal the active taxonomic and functional diversity of these extreme environments, which can be exploited for bioremediation, pharmaceutical, and industrial applications. Here, we report a seasonal comparison of a microbiome and transcriptome in Ely Brook (EB-90M), a confluence of clean water and upstream tributaries that drains the Ely Copper Mine Superfund site in Vershire, VT, USA. Nucleic acids were extracted from EB-90M water and sediment followed by shotgun sequencing using the Illumina NextSeq platform. Approximately 575,933 contigs with a total length of 1.54 Gbp were generated. Contigs of at least a size of 3264 (N50) or greater represented 50% of the sequences and the longest contig was 488,568 bp in length. Using Centrifuge against the NCBI “nt” database 141 phyla, including candidate phyla, were detected. Roughly 380,000 contigs were assembled and ∼1,000,000 DNA and ∼550,000 cDNA sequences were identified and functionally annotated using the Prokka pipeline. Most expressed KEGG-annotated microbial genes were involved in amino acid metabolism and several KEGG pathways were differentially expressed between seasons. Biosynthetic gene clusters involved in secondary metabolism as well as metal- and antibiotic-resistance genes were annotated, some of which were differentially expressed, colocalized, and coexpressed. These data can be used to show how ecological stimuli, such as seasonal variations and metal concentrations, affect the ARD microbiome and select taxa to produce novel natural products. The data reported herein is supporting information for the research article “Characterization of an acid rock drainage microbiome and transcriptome at the Ely Copper Mine Superfund site” by Giddings et al. [1].


a b s t r a c t
High throughput sequencing data collected from acid rock drainage (ARD) communities can reveal the active taxonomic and functional diversity of these extreme environments, which can be exploited for bioremediation, pharmaceutical, and industrial applications. Here, we report a seasonal comparison of a microbiome and transcriptome in Ely Brook (EB-90M), a confluence of clean water and upstream tributaries that drains the Ely Copper Mine Superfund site in Vershire, VT, USA. Nucleic acids were extracted from EB-90M water and sediment followed by shotgun sequencing using the Illumina NextSeq platform. Approximately 575,933 contigs with a total length of 1.54 Gbp were generated. Contigs of at least a size of 3264 (N50) or greater represented 50% of the sequences and the longest contig was 4 88,56 8 bp in length. Using Centrifuge against the NCBI "nt" database 141 phyla, including candidate phyla, were detected. Roughly 380,0 0 0 contigs were assembled and ∼1,0 0 0,0 0 0 DNA and ∼550,0 0 0 cDNA sequences were identified and function-ally annotated using the Prokka pipeline. Most expressed KEGG-annotated microbial genes were involved in amino acid metabolism and several KEGG pathways were differentially expressed between seasons. Biosynthetic gene clusters involved in secondary metabolism as well as metal-and antibiotic-resistance genes were annotated, some of which were differentially expressed, colocalized, and coexpressed. These data can be used to show how ecological stimuli, such as seasonal variations and metal concentrations, affect the ARD microbiome and select taxa to produce novel natural products. The data reported herein is supporting information for the research article "Characterization of an acid rock drainage microbiome and transcriptome at the Ely Copper Mine Superfund site" by Giddings et al. [1] .
© 2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ ) Table   Subject Microbial Ecology, Genomics and Molecular Biology Specific subject area Metagenomics Type of data Tables, figures, raw data How data were acquired Shotgun metagenomic and metatranscriptomic sequence data were acquired using an Illumina NextSeq500 instrument. Centrifuge was used to perform a read-based taxonomic analysis of metagenomic data. Prokka was used to detect and functionally annotated open reading frames. The predicted amino acid sequence was searched against Swiss-Prot database using DIAMOND. KEGG orthology annotations were predicted for open reading frames. All differential and statistical analyses on taxonomic summaries were performed in edgeR [2] . BacMet [3] , antiSMASH 5.0 [4] , ARTS version 2.0 [5] databases were used to annotate genes. Data format Annotated data, Bray-Curtis dissimilarity matrices, Non-metric multidimensional scaling (NMDS) plots, principal component analysis (PCA) plots, heat map and hierarchal clustering, raw count data, and gradient plots. Parameters for data collection Seasonal environmental water and sediment samples were collected and sequenced. Five water and three sediment samples from summer as well as three sediment samples from winter. Description of data collection Shotgun metagenomic and metatranscriptomic sequencing was performed using an Illumina NextSeq500 instrument. Data source location Sediment (July 28th, 2017 and January 14th, 2018) and water (July 14th, 2017 and July 28th, 2017) samples were collected 90 m upstream from the mouth of Ely

Value of the Data
• This is the first characterization of an acid rock drainage (ARD) metagenome and transcriptome within the Vermont copper belt region, USA, which is comprised of Ely Copper Mine, Elizabeth Mine, and Pike Hill Copper Mine. • The metagenomic data provide seasonal taxonomic profiles of the microbial diversity in the sediment and water of EB-90M. • Active taxa in ARD environments are understudied and the metagenomic and metatranscriptomic data provide insight into their seasonal functional roles within these acidic, metal-rich environments. • These data can be used to perform comparative taxonomic and functional analyses with other ARD metagenomes. • These data can be used to bioprospect enzymes that can be exploited for the bioremediation of metal polluted environments. • These data can be used to identify novel genes encoding proteins involved in the production of bioactive secondary metabolites, which can be used for pharmaceutical and industrial applications.

Data Description
Ten water and six sediment samples at Ely Brook (EB-90M) ( Fig. 1 ), Ely Copper Mine Superfund site were collected in July 2017 and January 2018. Shotgun metagenomic sequencing of nucleic acids extracted from water and sediment samples generated ∼31,545,991 reads with an average length of 147 bp and a total length of 1.54 Gb for 11 samples. Samples of the same sample type (i.e., water or sediment) or season (i.e., summer or winter) were treated as biological replicates. Summer water samples were denoted as July_Water1, July_Water2, July_Water3, July_Water4, July_Water5. Summer sediment samples were denoted as July_Sed1, July_Sed2, and July_Sed3. Winter sediment samples were denoted as Jan_Sed1, Jan_Sed2, and Jan_Sed3. All winter water samples (five samples) did not yield viable sequencing data. Of the remaining 11 samples, ∼12 Gb of data (50 M clusters) were produced per sample with an average of 25,181,359 reads per sample over a range of 8,657,966 and 44,323,783 reads for both metagenomic and metatranscriptomic data. Contigs of ≥ 3264 bp (N50) represented 50% of data and the longest contig was 4 88,56 8 bp in length. Using Centrifuge [6] to perform read-based taxonomic annotation, 141 distinct phyla were annotated, including candidate phyla ( Table 1 ). Taxonomic differences across season and sample type were observed by NMDS and PCA analyses of normalized count data (i.e., counts per million) between the bacteria, archaea, and fungi in samples as well as molecule types ( Figs. 2 -8 ). Differences between molecule type (i.e., DNA or RNA) across sample type and season were assessed by multivariate principal component analyses (PCA) ( Fig. 9 ). Using Prokka-annotated open reading frames [7] , Kyoto Encyclopedia of Genes and Genomes (KEGG) reference pathways [8] were annotated and quantified ( Table 2 ). Significantly differentially expressed KEGG pathways and genes in winter versus summer were defined as having winter/summer RNA p -values ≤ 0.05 for the interaction of season and molecule type followed by false discovery rate (FDR) corrections [9] ( q -values) ≤ 0.05 ( Figs. 10 -12 ). Secondary metabolite gene clusters ( Table 3 ), metal resistance genes ( Table 4 ), and antibiotic resistance genes were identified ( Table 5 ). Approximately 288 metal resistance genes were differentially expressed between winter and summer seasons ( Fig. 13 ). Furthermore, some of these genes were colocalized and coexpressed with genes involved in secondary metabolism ( Table 6 ; Figs. 14 -18 ).      Table 3 antiSMASH annotation. Summary of the number of genes and gene clusters annotated by antiSMASH 5.0 as well as those that match the Prokka-annotated data.
Total count of contigs 575,933 Total number of contigs annotated by antiSMASH 1589 Total number of contigs not annotated by antiSMASH 574,344 antiSMASH annotated genes 10,579 antiSMASH annotated genes that aligned with PROKKA analyzed data 4977 antiSMASH annotated genes that did not align with PROKKA analyzed data 5602 antiSMASH annotated gene clusters that align with PROKKA analyzed data 1349 antiSMASH annotated gene clusters that did not align with PROKKA analyzed data

Statistical comparison of microbial community, DNA, and RNA
EB-90M samples of the same sample type or season were treated as biological replicates. Subsets (i.e., season or sample type) of data were compared to each other in statistical analyses. Beta diversity was evaluated via Bray-Curtis measure of dissimilarity [10] using default parameters in R in the vegan library [11] . Prior to analysis, data were log 10 ( x + 1) transformed and the resulting dissimilarity indices were used to generate NMDS in R using the metaMDS functions in vegan and ggplot2 library [ 11 , 12 ]. Multivariate PCAs were performed in Partek Flow software v8.0 to assess sample group variation based on genera using normalized read counts from readbased taxonomic annotations and quantification. Feature counts (e.g., taxon) were standardized prior to the PCA so that the contribution of each feature did not depend on its variance. PCA plots were generated for DNA and RNA using 1) normalized read counts (i.e., fractions for relative abundance) from the metagenomic assembly and 2) normalized read counts from the metatranscriptome, respectively. Heat maps and hierarchal clusters were generated in Partek Flow v8.0 using the following, respectively: 1) normalized counts of taxa from the metagenome and predicted open reading frames (ORFs) across samples and 2) the Euclidean dissimilarity index and average linkage method to cluster similar expression patterns and taxon abundances. The normalized data were standardized to a mean of zero and a standard deviation of 1 prior to hierarchal clustering.

Differential expression and visualization of KEGG pathways
Differentially expressed KEGG pathways were represented by color gradation maps (Figs. S14-S15). Log 2 fold-changes from gene expression analysis results were converted to a color gradation using KEGG Mapper -Color Pathway tool ( https://www.genome.jp/kegg/tool/map _ pathway3.html ), where blue denotes decreased expression in the winter (RGB color code #6363F7) and red denotes increased expression in the winter (RGB color code #FF0 0 0). Genes with no change in expression are shaded in light gray (RGB color code #D3D3D3). Genes shaded in white indicates that the gene was undetected in the dataset. The numbers in boxes refer to enzyme nomenclature from the KEGG database. Expression data (i.e., normalized counts) for sediment were fit to a linear model, assuming a negative binomial distribution, that included season (i.e., winter versus summer), molecule type (i.e., RNA versus DNA), as well as the interaction of season and molecule type ( p -interaction). Pairwise comparison tests of season were performed within and between each data type and p-values were FDR-corrected [9] . Significant differentially expressed genes met the following criteria: a molecule type-season interaction term p -value of 0.05 or less in combination with an FDR-adjusted p -value ( q -value) of 0.05 or less for the pairwise comparison of winter RNA versus summer RNA. Significant data were indicated by an orange star; however, the overall expression of a node may include other genes.

Analysis of genes involved in natural product biosynthesis, metal resistance, and antibiotic resistance
Contigs were mined for secondary metabolite biosynthetic gene clusters (BGCs) in the bacterial and fungal antiSMASH 5.0 [4] database using default parameters. The BacMet database was used to mine DNA and RNA for experimentally validated metal resistance genes [3] . After filtering annotated-BGCs and BacMet genes that had ≥ 100 raw counts in each sample and at least 10 counts in three or more samples, relative BGC and BacMet gene expression was assessed by comparing the counts of Prokka-annotated transcripts to those of DNA using the criteria described by Giddings et al. [1] . Gradient plots were generated in Partek Flow v8.0 for differentially expressed BGCs and those co-expressed with metal resistance genes. Contigs were also mined for antibiotic resistance genes that were within close proximity or colocalized with BGCs using the Antibiotic Resistant Target Seeker (ARTS) version 2 [5] using default parameters. Duplication and BGC proximity, resistance model screens, and genomes that mapped to the following reference phyla were selected: Actinobacteria and Alphaproteobacteria. Fig. 8. Taxonomic differences. PCA plot demonstrating the differences between genera in summer water and sediment as well as summer (orange) and winter (blue) sediment. Plot is based on normalized read counts at the genus level from the taxonomic annotation and quantification of paired-end reads. The sample name notation is based on the month the sample was collected, the sample type (i.e., sediment or water), and individual sample number. 'Sed' = sediment. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 9. Differences in DNA and RNA. PCA plots of A) DNA in water and sediment and B) RNA present in summer and winter sediment based on normalized counts of all functionally annotated genes from the metagenomic assembly, demonstrating differences between sample type. Each gene's normalized read count contributes equally to the PCA. The sample name notation is based on the month the sample was collected, the sample type (i.e., sediment or water), and individual sample number. 'Sed' = sediment. Fig. 10. Significantly differentially xpressed KEGG pathways. Bar graph of select significantly differentially expressed KEGG pathways in winter versus summer. Differentially expressed pathways were defined based on an unadjusted p-value ≤ 0.05 for the interaction term (molecule type-season) in combination with a q-winter/summer RNA value ≤ 0.05, respectively. Red and blue represent increased and decreased expression in winter, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 11. Carbon fixation in photosynthetic organisms. Carbon metabolism KEGG reference pathway map ( https://www. kegg.jp/pathway/map00710 ) with color gradation highlighting KEGG genes that change significantly between seasons. Log 2 fold-changes from gene expression analyses were converted to a color gradation using the KEGG Mapper -Color Pathway tool, where blue denotes decreased expression in the winter (RGB color code #6363F7) and red denotes increased expression in the winter (RGB color code #FF0 0 0). The Log 2 fold-changes range from −2.33 (blue) to +1.88 (red). Genes with no change in expression are shaded in light gray (RGB color code #D3D3D3) and genes shaded white were undetected in the dataset. Significantly differentially expressed genes are indicated by a star and met the following criteria: p-interaction value ≤ 0.05 in combination with a q-winter/summer RNA value ≤ 0.05, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 12. Nitrogen metabolism gene expression. Nitrogen metabolism KEGG reference pathway map diagram ( https://www. kegg.jp/pathway/map00910 ) with color gradation highlighting KEGG genes that change significantly between seasons. Log 2 fold-changes from gene expression analyses were converted to a color gradation using the KEGG Mapper -Color Pathway tool, where blue denotes decreased expression in the winter (RGB color code #6363F7) and red denotes increased expression in the winter (RGB color code #FF0 0 0). The Log 2 fold-changes range from −3.92 (blue) to +1.91 (red). Genes with no change in expression are shaded in light gray (RGB color code #D3D3D3) and genes shaded white were undetected in the dataset. Significantly differentially expressed genes are indicated by a star and met the following criteria: p-interaction value ≤ 0.05 in combination with a q-winter/summer RNA value ≤ 0.05, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)  (288) genes (e.g., dnaK, copA, copB, copD, pst5, cusA, cusB, mdtA, mdtB, mdtC, actP, mco, ycnJ, corA, csoR , and copZ ) from the BacMet database across sediment samples. Increases or decreases in gene expression range from −2.04 (blue) to +2.04 (red). All data met the following criteria: p-interaction value ≤ 0.05 in combination with a q-winter/summer RNA value ≤ 0.05, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 14.
Colocalization and coexpression of metal resistance and secondary metabolite genes. Gradient plot demonstrating the differential coexpression of mdtA , a metal resistance gene encoding a multidrug resistance protein, with a gene ( ppsE ) annotated to be involved in phthiocerol/phenolphthiocerol polyketide biosynthesis in contig 4698 (20,390 nucleotides long) in summer (orange) and winter (blue). The lines on the y-axis represent the maximum, minimum, and mean of the standardized expression values (i.e., counts per million). All data met the following criteria: p-interaction ≤ 0.05 in combination with a p-winter/summer RNA ≤ 0.05, respectively. Nucleotide positions in contig are shown below gene IDs. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 15. Colocalization and coexpression of metal resistance and secondary metabolite genes. Gradient plot demonstrating the differential coexpression of mgtA , a metal resistance gene encoding a cation transport ATPase that mediates magnesium influx into the cytosol, with genes ( lgrD ) annotated to be involved in gramicidin biosynthesis in contig 80 (113,676 nucleotides long) in summer (orange) and winter (blue). The lines on the y-axis represent the maximum, minimum, and mean of the standardized expression values (i.e., counts per million). Only mgtA met the following criteria: p-interaction ≤ 0.05 in combination with a q-winter/summer RNA ≤ 0.05, respectively. Nucleotide positions in contig are shown below gene IDs. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 16. Colocalization and coexpression of metal resistance and secondary metabolite genes. Gradient plot demonstrating the differential coexpression of czcA , a metal resistance gene encoding a cobalt-zinc-cadmium resistance protein, with a ligase/MSMEI_5285 gene annotated to be involved in the biosynthesis of a polyketide in contig 185 (85,942 nucleotides long) in summer (orange) and winter (blue). The lines on the y-axis represent the maximum, minimum, and mean of the standardized expression values (i.e., counts per million). Only czcA met the following criteria: p-interaction ≤ 0.05 in combination with a q-winter/summer RNA ≤ 0.05, respectively. Nucleotide positions in contig are shown below gene IDs. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 17. Colocalization and coexpression of metal resistance and secondary metabolite genes. Gradient plot demonstrating the differential coexpression of smtB , a zinc-resistance gene encoding a repressor protein of the metallothionein gene smtA , with a gene annotated to be involved in the biosynthesis of a terpene in contig 214 (80,995 nucleotides long) in summer (orange) and winter (blue). The lines on the y-axis represent the maximum, minimum, and mean of the standardized expression values (i.e., counts per million). Only SmtB met the following criteria: p-interaction ≤ 0.05 in combination with a q-winter/summer RNA ≤ 0.05, respectively. Nucleotide positions in contig are shown below gene IDs. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) Fig. 18. Colocalization and coexpression of metal resistance and secondary metabolite genes. Gradient plot demonstrating the differential coexpression of mdtA , a metal resistance gene encoding multidrug resistance protein, with genes annotated to be involved in the biosynthesis of a terpene in contig 12,335 (11,958 nucleotides long) in summer (orange) and winter (blue). The lines on the y-axis represent the maximum, minimum, and mean of the standardized expression values (i.e., counts per million). Only mdtA met the following criteria: p-interaction ≤ 0.05 in combination with a q-winter/summer RNA ≤ 0.05, respectively. Nucleotide positions in contig are shown below gene IDs. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)