Datasets of surface water microbial populations from two anthropogenically impacted sites on the Bhagirathi-Hooghly River

The Bhagirathi-Hooghly River, part of the River Ganga, flows through densely urbanized areas of West Bengal, India. The river water is used extensively for household activities, human consumption, including bathing, social purposes and a range of industrial uses. Discharge of untreated municipal sewage and industrial effluents has led to evidence of heavy pollution in the river. Two urbanized sites on the Bhagirathi-Hooghly River, namely Kalyani and Kolkata, were sampled to elucidate the resident microbial communities in light of anthropogenic forcing with respect to pollution. The Kalyani station (Kal_Stn1) lies upstream of the Kolkata station (Kol_Stn7); the two stations are approximately 50 km apart and located along the bank of the Bhagirathi-Hooghly River. Sampling was undertaken during the monsoon (September 2018). In situ environmental parameters were measured during sampling, and dissolved nutrients were estimated from formalin-fixed, filtered surface water, along with pesticide analysis. One litre of surface water was collected from each station, and environmental DNA was sequenced to identify the resident microbial communities (bacterioplankton and oxygenic photoautotrophs, including phytoplankton). The bacterioplankton community structure was elucidated by sequencing the V4 region of the 16S rDNA on an Illumina MiSeq platform. Proteobacteria was the most abundant bacterioplankton phylum at both sampling stations. As with bacterioplankton, variation in the oxygenic photoautotrophic community structure, including phytoplankton forms, was found at phylum, class and family levels. The phytoplankton communities were elucidated by sequencing the V9 region of the 18S rDNA on an Illumina MiSeq platform. Chrysophyta was the most abundant phytoplankton phylum identified at both stations, followed by Chlorophyta and other groups. Variation in phytoplankton community structure between the stations was distinct at phylum, class and family levels.

Value of the Data
These datasets provide baseline information for tracking the pollution and health of the River Ganga, using indicator microbial groups as proxies for pollution. The datasets would be of importance to the scientific community, policy makers and ecosystem managers engaged in basin management of the Ganga River, and to those working to provide clean and safe drinking water sourced from the Ganga. These datasets will help to develop biological intervention strategies for cleaning the polluted water of the River Ganga. Further information added to the reported datasets can help to track the effects of anthropogenic forcing, such as climate change, on the functioning of microbial communities and the resulting changes in the Ganga River ecosystem.

Sampling sites
Two sites, namely Kalyani and Kolkata, on the lower stretch of the Bhagirathi-Hooghly River were selected to elucidate the resident microbial communities. The two sites are ~50 km apart, with Kalyani lying upstream of Kolkata. Both have considerably high concentrations of dissolved forms of nitrogen, consistently low dissolved oxygen profiles and different forms of pesticides, all of which are indicative of pollution driven by anthropogenic forcing [10].

Sampling
Bacterioplankton and oxygenic photoautotrophic communities, including phytoplankton, were elucidated by sequencing 16S rDNA and 18S rDNA, respectively. Sampling was conducted in Kolkata (Kol_Stn7; 22.56° N, 88.33° E) and Kalyani (Kal_Stn1; 22.99° N, 88.41° E) during the monsoon (September 2018). One-litre surface water samples were collected, immediately fixed with molecular grade absolute alcohol (Merck, India) and transferred to the laboratory. Separate one-litre surface water samples were collected, fixed with formalin (4%; Merck, India) and used for measurement of total hardness, total alkalinity and dissolved nutrients.
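
As a rough consistency check on the stated separation of about 50 km, the great-circle distance between the two stations can be computed from the reported coordinates. The sketch below uses the haversine formula and assumes a spherical Earth with a mean radius of 6371 km; the function and constant names are illustrative and not part of the original workflow.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Station coordinates as reported in the text (degrees N, degrees E)
KOL_STN7 = (22.56, 88.33)
KAL_STN1 = (22.99, 88.41)

distance = haversine_km(*KAL_STN1, *KOL_STN7)
print(f"Kal_Stn1 to Kol_Stn7: {distance:.1f} km (straight line)")
```

This is a straight-line distance computed from coordinates rounded to two decimal places; the distance along the river channel would be somewhat longer.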

Measurement of dissolved nutrients and pesticides detection
Dissolved nitrate [9] and o-phosphate [6] were analyzed following standard published protocols. All measurements were made in triplicate using a UV-Vis spectrophotometer (Hitachi U2900, Japan). Pesticides in the surface water samples were detected using a triple quadrupole GC-MS/MS (TSQ 8000 Evo, Thermo Fisher Scientific). To test for differences in the concentrations of detected pesticides between the two sampling stations, Student's t-test was performed in MS Office Excel 2010, with p-values of 0.1, 0.05 and 0.001 used as significance thresholds.
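
For readers who wish to reproduce this kind of comparison outside Excel, the following is a minimal sketch of a two-sample, equal-variance Student's t-test using only the Python standard library. It is not the authors' spreadsheet: the concentration values are hypothetical placeholders, the equal-variance form is an assumption (the text does not say which variant was used), and the two-sided p-value is obtained by numerically integrating the t density with Simpson's rule.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def students_t_test(a, b, steps=2000):
    """Equal-variance two-sample t-test; returns (t, df, two-sided p).

    Assumes at least one sample has non-zero variance; `steps` must be even.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (mean(a) - mean(b)) / se
    # Simpson's rule for the integral of the density from 0 to |t|
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    p = max(0.0, 1 - 2 * area)
    return t, df, p


# Hypothetical triplicate concentrations (arbitrary units), NOT measured data
kal_stn1 = [0.12, 0.15, 0.14]
kol_stn7 = [0.21, 0.25, 0.23]

t, df, p = students_t_test(kal_stn1, kol_stn7)
for alpha in (0.1, 0.05, 0.001):  # thresholds named in the text
    print(f"alpha={alpha}: significant={p < alpha}")
```

With triplicate measurements per station the test has only four degrees of freedom, so the result is sensitive to the variance of each triplicate.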

Environmental DNA extraction and sequencing
Biomass was concentrated by filtration through a 0.22 µm nitrocellulose filter paper of 47 mm diameter (Pall, USA) using standard methodology [1]. Environmental DNA (eDNA) was extracted from each filter in triplicate following a published protocol [7]. The bacterioplankton communities were elucidated by sequencing the V4 hypervariable region of the 16S rDNA using the 515F (5ʹ-GTGCCAGCMGCCGCGGTAA-3ʹ) and 806R (5ʹ-GGACTACHVGGGTWTCTAAT-3ʹ) primers [5]. The phytoplankton communities were elucidated by sequencing the V9 hypervariable region of the 18S rDNA using the 1391F (5ʹ-GTACACACCGCCCGTC-3ʹ) and EukBr (5ʹ-TGATCCTTCTGCAGGTTCACCTAC-3ʹ) primers ([8]; [4]). All PCR reactions were performed in triplicate and pooled. Amplicon libraries were prepared using the NEBNext Ultra DNA Library Preparation kit (NEB, USA). Following purification using 1X AmpureXP beads, the libraries were analyzed on an Agilent High Sensitivity (HS) chip on a Bioanalyzer 2100 and quantified using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific). Amplicon libraries were then sequenced on an Illumina MiSeq platform at a concentration of 10 to 20 pM.
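
Several of the primers above contain IUPAC ambiguity codes (M, H, V, W), so each written primer stands for a small pool of concrete oligonucleotides. The sketch below expands the standard IUPAC codes to count the variants per primer and to build a regular expression for matching against a sequence. The primer sequences are taken from the text; the helper names are illustrative only.

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Primer sequences as given in the text (5' to 3')
PRIMERS = {
    "515F": "GTGCCAGCMGCCGCGGTAA",
    "806R": "GGACTACHVGGGTWTCTAAT",
    "1391F": "GTACACACCGCCCGTC",
    "EukBr": "TGATCCTTCTGCAGGTTCACCTAC",
}


def degeneracy(primer):
    """Number of distinct unambiguous sequences encoded by a primer."""
    n = 1
    for base in primer:
        n *= len(IUPAC[base])
    return n


def to_regex(primer):
    """Regular expression pattern equivalent to a degenerate primer."""
    return "".join(b if len(IUPAC[b]) == 1 else f"[{IUPAC[b]}]" for b in primer)


def primer_matches(primer, sequence):
    """True if `sequence` is exactly one of the variants encoded by `primer`."""
    return re.fullmatch(to_regex(primer), sequence) is not None


for name, seq in PRIMERS.items():
    print(f"{name}: length={len(seq)}, variants={degeneracy(seq)}, regex={to_regex(seq)}")
```

Under this expansion 515F encodes two variants and 806R encodes eighteen, while 1391F and EukBr as written here are non-degenerate.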

Raw data processing
The generated sequences were processed using SILVAngs 1.3 (https://ngs.arb-silva.de/silvangs; [3]). Generated raw data sequences were aligned, quality filtered, dereplicated, clustered into OTUs and taxonomically classified. Sequences with less than 97% identity to any BLAST hit were marked as 'no relative' [2]. The bar plots were generated in MS Excel 2010 using the taxonomic assignment files obtained from SILVA. This allowed for the comparison of microbial community compositions between the two stations.
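
The step from taxonomic assignments to the bar plots amounts to converting read counts per taxon into percentages for each station. The sketch below illustrates that arithmetic, together with the 97% identity rule stated above. It does not reproduce SILVAngs internals: the record layout, function names and all counts are hypothetical.

```python
from collections import Counter

IDENTITY_THRESHOLD = 97.0  # percent identity, as stated in the text


def assign_label(taxon, identity):
    """Keep the taxon if identity meets the threshold, else 'no relative'."""
    return taxon if identity >= IDENTITY_THRESHOLD else "no relative"


def relative_abundance(records):
    """Percent of reads per label.

    `records` is an iterable of (taxon, percent identity, read count) tuples;
    this layout is an assumption for illustration.
    """
    totals = Counter()
    for taxon, identity, reads in records:
        totals[assign_label(taxon, identity)] += reads
    grand_total = sum(totals.values())
    return {label: 100.0 * n / grand_total for label, n in totals.items()}


# Hypothetical phylum-level records for one station, NOT values from the dataset
station_records = [
    ("Proteobacteria", 99.1, 600),
    ("Actinobacteria", 98.4, 250),
    ("Bacteroidetes", 97.5, 100),
    ("Unclassified hit", 91.0, 50),
]

for label, pct in sorted(relative_abundance(station_records).items(),
                         key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {pct:.1f}%")
```

Running the same function on each station's records gives two percentage profiles that can be plotted side by side as stacked bars.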

Data accessibility
All sequence data were submitted to the NCBI Sequence Read Archive (SRA) under Accession numbers SRR10430152, SRR10430151, SRR10430093 and SRR10430092. Datasets are available on Mendeley and can be accessed using the link https://data.mendeley.com/datasets/84crm633m2/1.