A total of 219 metagenome-assembled genomes of microorganisms from Icelandic marine waters

Marine microorganisms contribute to the health of the global ocean by supporting the marine food web and regulating biogeochemical cycles. Assessing marine microbial diversity is a crucial step towards understanding the global ocean. The waters surrounding Iceland are a complex environment where relatively warm, salty waters from the Atlantic cool and sink to depth. Microbial studies in this area have focused on photosynthetic micro- and nanoplankton, mainly using microscopy and chlorophyll measurements. However, the diversity and function of the bacterial and archaeal picoplankton remain unknown. Here, we used a co-assembly approach supported by a marine mock community to reconstruct metagenome-assembled genomes (MAGs) from 31 metagenomes from the sea surface and seafloor of four oceanographic sampling stations sampled between 2015 and 2018. The resulting 219 MAGs include 191 bacterial, 26 archaeal and two eukaryotic MAGs, which help bridge the gap in our current knowledge of the global marine microbiome.


INTRODUCTION
Marine microorganisms are crucial to the global ecosystem as they regulate the carbon cycle (Azam, 1998; Falkowski, Fenchel & Delong, 2008) and support the marine food web (Pomeroy, 1974; Azam et al., 1983). The study of microorganisms within complex environments, such as the ocean, was accelerated by the emergence of sequencing technologies. In particular, metagenomics (the study of the total genetic material recovered from an environmental sample) has provided previously unavailable information on the functional diversity and ecology of microbial communities within their environments (Hugenholtz & Tyson, 2008; Quince et al., 2017).
Large-scale metagenomics projects, such as the Global Ocean Sampling (Venter et al., 2004; Rusch et al., 2007), Ocean Sampling Day (Kopf et al., 2015) and Tara Oceans (Sunagawa et al., 2015; Sunagawa et al., 2020), have provided new insights, but have also revealed gaps in our knowledge of marine microbial species, their geographical distribution, and their organisation in complex and dynamic communities. These and other large-scale initiatives have so far not covered the oceanic regions around Iceland, a complex marine environment characterized by distinct water masses and powerful currents: the cold Polar Water of the East Greenland Current and the Arctic Water of the East Icelandic Current from the north, and the warm North Atlantic Water of the Irminger Current from the south (Malmberg, Valdimarsson & Mortensen, 1995; Valdimarsson & Malmberg, 1999). Most microbial studies in Icelandic waters have so far been conducted with traditional methods, such as chlorophyll measurements or microscopy, and have therefore mainly focused on larger heterotrophs and photosynthetic microorganisms (Thórdardóttir, 1986; Gudmundsson, 1998; Astthorsson, Gislason & Jonsson, 2007). To establish baseline knowledge of microbial ecology in Icelandic marine waters, we assembled metagenomic sequence data into draft microbial genomes, often called metagenome-assembled genomes (MAGs).
The recovery of MAGs opens the route to further analyses, such as comparative genomics, to understand the roles of these microorganisms within their community and ecosystem (Sangwan, Xia & Gilbert, 2016). MAGs are particularly valuable for as yet uncultured marine lineages, as they reveal the metabolic potential and environmental adaptation of these microorganisms and give clues about trophic interactions and ecology within the environment. Several metagenomic studies have recovered MAGs from marine environments, including 136 MAGs from the Red Sea (Haroon et al., 2016), 290 from the Mediterranean Sea (Tully et al., 2017), and 2,631 from the global oceans with data harvested by Tara Oceans (Tully, Graham & Heidelberg, 2018).
Here, we report 219 MAGs from 31 samples collected in the Arctic Ocean north of Iceland and in the warmer Atlantic waters south of Iceland. The samples were collected between 2015 and 2018 at four established oceanographic sampling stations visited during six research cruises with two depths sampled at each station. A set of metadata is available for these samples following the best practices recommended by Ten Hoopen et al. (2017), offering an opportunity to further understand the environmental conditions that shape the microbial communities in the waters off the Icelandic coasts.

Sampling
Seawater samples were collected between May 2015 and May 2018 from four stations: two in the North Atlantic Ocean, Selvogsbanki 2 and 5 (SB2 and SB5), and two in the Arctic Ocean, Siglunes 3 and 8 (SI3 and SI8) (Fig. 1A and Table 1). Sampling was conducted on board the oceanographic research vessel Bjarni Saemundsson RE 30, operated by the Icelandic Marine Research Institute (MRI), by collecting 5 L of seawater from the surface and the seafloor of the ocean using Niskin bottles on a CTD rosette sampler. Seawater samples were directly filtered onto 0.22 µm Sterivex filter units (Merck Millipore) and immediately flash frozen in liquid nitrogen before being stored at −80 °C until further processing (full workflow in Fig. 1B).

Mock community
A marine mock community, consisting of 20 bacterial and two archaeal species, was included in the analysis for quality control. Strains were cultivated according to Table 2. After 12 to 24 h of growth (to obtain 10e6 to 10e8 cells/ml), cells were counted in a Thoma counting chamber (BRAND, ref. 718020; 0.100 mm depth) and diluted to achieve a final concentration of 1.29 × 10e9 cells/L. Synthetic seawater was prepared by adding 150 g of sea salts (Sigma-Aldrich, S9883) and 17.25 g of PIPES (Sigma-Aldrich, P1851) to 5 L of autoclaved MilliQ water. The mock community was immediately treated in the same manner as the seawater samples and filtered onto Sterivex filters for DNA extraction.

DNA extractions
DNA was extracted from all samples using the QIAGEN AllPrep kit according to the manufacturer's instructions with modifications. Sterivex filters were aseptically removed from their plastic casing as described by Cruaud et al. (2017). Filters were transferred to tubes containing 600 µl of RLT buffer from the kit and 0.2 g of 0.1 mm zirconia/silica beads (BioSpec, cat. 11079101z) for mechanical disruption of the cells (bead-beating) using a Disrupt MixerMill MM400 by Retsch with the program P9 (300 Hz) three times for 10 s each, cooling the tubes in ice water between bead-beating steps. DNA quality was assessed with a NanoDrop 1000 Spectrophotometer (ThermoFisher) and DNA was quantified with a Qubit fluorometer (Qubit DNA BR assay, Invitrogen).

Library preparation and sequencing
High-throughput sequencing of the samples was performed by Genome Quebec using the HiSeq system (Illumina). Libraries were prepared using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs), followed by sequencing on two lanes of an Illumina HiSeq 4000 PE150 system (Illumina), allocating 1/20 and 1/25 of a lane to each sample. Demultiplexing and conversion to FASTQ files were performed using bcl2fastq Conversion Software v1.8.4 (Illumina), resulting in 32 metagenomic datasets.

Co-assembly and binning
The quality of the raw sequencing reads was assessed using FastQC v0.11.8 (Andrews et al., 2012) Lander et al., 2001; Schneider et al., 2017). The resulting quality-filtered metagenomic data were divided into surface and seafloor datasets, as the surface of the ocean can be considered a different environment from the seafloor (Fig. S2). Both datasets also included the mock community. After quality filtering, each dataset was co-assembled with MEGAHIT v1.2.9 (Li et al., 2015; Li et al., 2016) (parameters: --min-contig-len 1000 -m 0.85), that is, with a minimum contig length of 1,000 bp, resulting in two FASTA files of community contigs. Quality-filtered short reads from each sample were mapped back to the contigs of the corresponding co-assembly using Bowtie 2 (default parameters and the --no-unal flag) (Langmead & Salzberg, 2012). The resulting SAM files were indexed and converted to BAM files with SAMtools v0.3.3 (parameters: view -F 4 -bS) (Li et al., 2009 (Parks et al., 2015). Based on the comparison of the three binning algorithms, we selected the "good quality bins" from MetaBAT 2, with an estimated completion above 50% and an estimated redundancy below 10%, according to the standards suggested by Bowers et al. (2017). The relative proportion of good quality bins in the total number of bins was assessed with a χ² test. to construct two bacterial and two archaeal phylogenomic trees containing good quality MAGs (completeness ≥50%; contamination ≤10%) and Genome Taxonomy Database (GTDB) R95 (released in July 2020) reference genomes to confirm taxonomic assignments of the MAGs (Parks et al., 2018). The trees were reconstructed using ARB (Ludwig et al., 2004) for comprehensive visualisation.
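The bin selection rule described above (estimated completion above 50% and estimated redundancy below 10%) can be illustrated with a minimal Python sketch. The `Bin` class, the bin names and the example values are hypothetical and are not taken from the study; in practice the estimates would come from a dedicated quality-assessment tool. The sketch uses strict inequalities, which is an assumption, since the text states the thresholds both as "above/below" and as "≥/≤".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bin:
    """A genome bin with its quality estimates (all names are hypothetical)."""
    name: str
    completion: float   # estimated completion, in percent
    redundancy: float   # estimated redundancy (contamination), in percent


def is_good_quality(b: Bin, min_completion: float = 50.0,
                    max_redundancy: float = 10.0) -> bool:
    """Return True if the bin passes both thresholds (strict inequalities)."""
    return b.completion > min_completion and b.redundancy < max_redundancy


def split_bins(bins):
    """Split bins into (good, bad) lists according to is_good_quality."""
    good = [b for b in bins if is_good_quality(b)]
    bad = [b for b in bins if not is_good_quality(b)]
    return good, bad


# Illustrative values only; these are not results from the study.
example_bins = [
    Bin("bin_001", completion=92.3, redundancy=1.4),
    Bin("bin_002", completion=48.0, redundancy=0.5),   # too incomplete
    Bin("bin_003", completion=75.1, redundancy=12.7),  # too redundant
    Bin("bin_004", completion=50.0, redundancy=9.9),   # exactly 50% fails a strict test
]

good, bad = split_bins(example_bins)
print("good:", [b.name for b in good])
print("bad:", [b.name for b in bad])
```

Switching to inclusive thresholds only requires replacing `>` and `<` with `>=` and `<=` in `is_good_quality`.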

Data availability
The raw Illumina sequencing paired-end reads are available in the ENA under project accession number PRJEB41565 (ERP125360). MAGs are available under accession numbers ERS5621908 to ERS5622126. Code is available at https://github.com/clarajegousse/.

Co-assemblies
The co-assembly of the 16 samples from the surface of the ocean yielded 445,328 contigs with a minimal length of 1,000 bp, representing a total length of 1.06 Gb (1,060,942,783 nucleotides) with an N50 of 2,627 bp and 1,271,859 gene calls (Table 3). The co-assembly of the 17 samples from the seafloor yielded 554,104 contigs with a minimal length of 1,000 bp, representing a total length of 1.23 Gb (1,233,390,295 nucleotides) with an N50 of 2,327 bp and 1,532,800 gene calls (Table 3).
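For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows one common way to compute it: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy contig lengths are illustrative assumptions; the study's own values were presumably produced by the assembly or annotation software.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example (not data from the study): total = 13,700 bp, half = 6,850 bp.
toy_lengths = [1000, 1200, 1500, 2000, 3000, 5000]
print(n50(toy_lengths))  # 5000 + 3000 = 8000 >= 6850, so the N50 is 3000
```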

Binning
A comparison of the three binning algorithms (CONCOCT, MaxBin2 and MetaBAT 2) was conducted on the surface and seafloor co-assemblies based on the number of good quality bins (Fig. 2). Good quality bins have an estimated completion above 50% and an estimated redundancy (also called estimated contamination) below 10% (Bowers et al., 2017). The relative proportion of good quality bins differs significantly among the three binning methods (χ² = 135.23, df = 2, p-value < 2.2e−16). MetaBAT 2 produced fewer bins than CONCOCT and MaxBin2, yet the number of good quality bins was much higher with MetaBAT 2 than with CONCOCT and MaxBin2 (Table 4).

Figure 2 Binning comparison. Numbers of contigs binned and numbers of bad and good quality bins obtained with CONCOCT, MaxBin2 and MetaBAT 2 from the surface co-assembly (A) and the seafloor co-assembly (B). The number of contigs binned is represented by the size of the pie plots. Numbers and percentages of bad quality bins and good quality bins are shown within the grey and coloured slices of the chart, respectively. Good quality bins have an estimated completion above 50% and an estimated redundancy (also called estimated contamination) below 10% (Bowers et al., 2017).

MetaBAT 2 gave the best results, which were used for further analysis and are shown in more detail in Fig. 3. Out of the 279 bins identified by MetaBAT 2 for the surface samples, 42.3% (118) are good quality bins that can be considered draft MAGs according to Bowers et al. (2017). Of these 118 good quality MAGs (Fig. 3B), 16 represent genomes of organisms from the mock community and 102 were assembled from the surface seawater. Likewise, out of the 299 bins identified by MetaBAT 2 for the seafloor samples, 44.8% (134) can be considered good draft MAGs. Of these 134 good quality MAGs (Fig. 3D), 17 represent genomes of organisms from the mock community and 117 were assembled from the seawater at the seafloor. The proportion of MAGs among the total number of bins does not differ significantly between the two co-assemblies (χ² = 0.27784, df = 1, p-value = 0.5981), which suggests that the environment does not significantly affect the proportion of bins recovered as MAGs. Similarly, the proportion of MAGs associated with the mock community among the total number of MAGs does not differ significantly between the two co-assemblies (χ² = 0.0003, df = 1, p-value = 0.9858).
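The two df = 1 comparisons above can be checked from the reported counts alone. The sketch below is a minimal pure-Python implementation of Pearson's χ² test on a 2×2 table with Yates' continuity correction. The use of the continuity correction is an assumption (the text does not state which software or options were used), and the function name is illustrative.

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Chi-squared test with Yates' correction for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    # Continuity-corrected numerator, floored at zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1, the chi-squared survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Good quality bins vs. other bins: surface (118 of 279) and seafloor (134 of 299).
chi2_bins, p_bins = yates_chi2_2x2(118, 279 - 118, 134, 299 - 134)

# Mock-community MAGs vs. environmental MAGs: surface (16, 102) and seafloor (17, 117).
chi2_mock, p_mock = yates_chi2_2x2(16, 102, 17, 117)

print(f"bins: chi2 = {chi2_bins:.5f}, p = {p_bins:.4f}")
print(f"mock: chi2 = {chi2_mock:.4f}, p = {p_mock:.4f}")
```

Under this assumption the statistics agree with the reported values to the precision given in the text. The three-way comparison of binning methods (df = 2) is not reproduced here because it depends on the counts in Table 4.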

Taxonomy
After excluding members of the mock community based on taxonomic assignment and differential coverage, we identified 102 MAGs reconstructed from the surface co-assembly and 117 MAGs from the seafloor co-assembly. The surface MAGs include two eukaryotes (Bathycoccus and Micromonas), 92 bacteria and eight archaea, while the seafloor MAGs include 99 bacteria, 18 archaea and no eukaryotes. The surface co-assembly yielded a total of 92 bacterial MAGs (Fig. 4). These MAGs are members of seven phyla (number of MAGs in brackets): Proteobacteria (52), Bacteroidota (31), Actinobacteriota (2), Verrucomicrobiota (2), Planctomycetota (2), SAR324 (1) and Cyanobacteria (1). The MAG within the Cyanobacteria phylum belongs to the genus Synechococcus. Within the phylum Actinobacteriota, we retrieved two MAGs: one from a member of the genus Aquiluna and one from the genus Pontimonas. We reconstructed two MAGs within the phylum Planctomycetota. The two MAGs within the Verrucomicrobiota belong to the family Akkermansiaceae. The Bacteroidota phylum includes 31 MAGs reconstructed from the sea surface co-assembly. Most of these Bacteroidota MAGs belong to the Flavobacteriaceae family (18), including one representative of the genus Polaribacter. Many MAGs within the Flavobacteriaceae family are related to MAGs revealed by the Tara Oceans Consortium, such as Cryomorphaceae bacterium and Flavobacteriales bacterium (CFB group bacteria). We also reconstructed 52 MAGs belonging to the phylum Proteobacteria, including nine Rhodobacteraceae, ten SAR86 and ten Porticoccaceae. Of the three MAGs of the Burkholderiales order, one is within the Burkholderia genus and the other two belong to the Methylophilaceae family according to GTDB.
The seafloor co-assembly yielded a total of 99 bacterial MAGs spanning 12 phyla: Proteobacteria (46), Verrucomicrobiota (9), Bacteroidota (9), Marinisomatota (8), Actinobacteriota (5), Planctomycetota (5), Gemmatimonadota (4), Nitrospinota (3), Chloroflexota (2), SAR324 (2), Myxococcota (1) and Lactescibacterota (1). Six of these phyla include exclusively MAGs from the seafloor (Nitrospinota, Myxococcota, Gemmatimonadota, Marinisomatota, Chloroflexota and Lactescibacterota). Within the Proteobacteria, most of the MAGs belong to the Gammaproteobacteria class, with 32 MAGs, while the remaining 14 are part of the Alphaproteobacteria. Five orders within the Proteobacteria exclusively include MAGs reconstructed from the seafloor co-assembly.

Figure 3 Bad quality bins (completeness below 50% and redundancy above 10%) are shown in grey while good quality bins are in colour (green for surface, blue for seafloor samples). (A) A total of 279 bins obtained with MetaBAT 2 from the surface co-assembly, with 118 good quality bins. (B) Good quality bins from the surface co-assembly, with the identification of the bins corresponding to members of the mock community. (C) A total of 299 bins obtained with MetaBAT 2 from the seafloor co-assembly, with 134 good quality bins. (D) Good quality bins from the seafloor co-assembly, with the identification of the bins corresponding to members of the mock community.

Out of the 21 bacterial species of the mock community, 12 were re-assembled and given the correct taxonomic assignment down to species level (if available for the strain used): Alteromonas sp., Geobacillus marinus, Colwellia sp., Escherichia coli, Marinobacter sp., Photobacterium sp., Pseudoalteromonas sp., Reinekea marinisedimentorum, Sulfitobacter donghicola, Sulfitobacter guttiformis, Sulfitobacter pontiacus and Thermus thermophilus. However, some distinct species of the mock community that belong to the same genus do not match any specific MAG but seem to have been reassembled as a single MAG within the genus in question, such as Reinekea aestuarii and Reinekea sp. 84, as well as Sulfitobacter undariae and Sulfitobacter sp. 87. The genomes of Bacillus thermoleovorans, Dietzia sp., Halomonas sp. and Vibrio cyclitrophicus were not reassembled.
The surface co-assembly yielded only eight archaeal MAGs (Fig. 5), all within the Thermoplasmatota phylum, including three MAGs within the genus MGIIb-O2 of the Thalassarchaeaceae family and five within the Poseidoniaceae family. The seafloor co-assembly resulted in 18 archaeal MAGs, including one representative of the Thermoproteota phylum: this MAG belongs to the UBA57 lineage within the order Nitrososphaerales. The 17 other archaeal MAGs all belong to the Thermoplasmatota phylum, within the class Poseidoniia, including representatives of the Poseidoniaceae and Thalassarchaeaceae families. The two archaeal members of the mock community (Pyrococcus abyssi and Thermococcus barophilus) were successfully reconstructed in both co-assemblies.
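Per-phylum tallies such as those reported in this section can be derived from GTDB-style lineage strings, in which each rank carries a prefix (`d__` for domain, `p__` for phylum, and so on) and ranks are separated by semicolons. The sketch below is illustrative only: the MAG identifiers and lineages are hypothetical examples, not assignments from the study, and the helper names are assumptions.

```python
from collections import Counter


def phylum_of(lineage: str) -> str:
    """Extract the phylum name from a GTDB-style lineage string.

    Returns "unclassified" when no phylum rank (p__) is present or it is empty.
    """
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith("p__"):
            return rank[3:] or "unclassified"
    return "unclassified"


def count_phyla(assignments: dict) -> Counter:
    """Count MAGs per phylum from a {mag_id: lineage} mapping."""
    return Counter(phylum_of(lineage) for lineage in assignments.values())


# Hypothetical identifiers and lineages, for illustration only.
example_assignments = {
    "MAG_A": "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "MAG_B": "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
    "MAG_C": "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria",
    "MAG_D": "d__Archaea;p__Thermoplasmatota;c__Poseidoniia",
}

counts = count_phyla(example_assignments)
for phylum, n_mags in counts.most_common():
    print(f"{phylum} ({n_mags})")
```

Comparing the key sets of two such counters (one per co-assembly) would give the phyla found in only one of the two environments.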

DISCUSSION
Mock communities are used to quantify and characterise biases introduced in the sample processing pipeline (Brooks et al., 2015) and are indispensable for benchmarking sequencing methods and downstream analyses (Singer et al., 2016; Sevim et al., 2019). Mock communities can also be used as a positive control for metagenomic studies. Our mock community confirmed that MetaBAT 2 was able to resolve genomes of species within the same genus, making it the most suitable binning algorithm of the three tested in this study: CONCOCT, MaxBin2 and MetaBAT 2. This result is consistent with previous studies (Yue et al., 2020).
The ocean is a vast continuum, and the samples were taken within a relatively small section of the North Atlantic Ocean at two sampling depths: the surface and the seafloor (90 m, 470 m, 1,006 m or 1,060 m depending on the station). The difference in sampling depth implies differences in light, pressure and temperature between the seafloor and the surface of the ocean. While the surface of the ocean is subject to seasonal variations in daylight and temperature, the seafloor remains darker and colder than the surface, and such parameters drive microbial community structure and function. Therefore, we considered the surface and the seafloor of the ocean as two different types of environment, which justifies our approach of two co-assemblies rather than assembling all 32 samples together. The fact that a number of MAGs were found exclusively in one of the two environments supports this choice.

CONCLUSIONS
The goal of this study was to reconstruct MAGs from 31 samples from Icelandic sea waters. The 219 MAGs span 13 bacterial and two archaeal phyla and contribute to a more defined picture of the global marine microbiome. Moreover, thanks to the inclusion of a mock community in the analysis, this study confirms that the combination of co-assembly and binning with MetaBAT 2 allows the recovery of quality MAGs despite a relatively shallow sequencing depth. These MAGs are a valuable resource for further ecological and environmental studies.

DNA Deposition
The following information was supplied regarding the deposition of DNA sequences: Data are available at the ENA under project number PRJEB41565: all MAGs: ERS5621908 to ERS5622126; the surface and seafloor co-assemblies: ERS5565811 and ERS5565812.

Data Availability
The following information was supplied regarding data availability: Code is available at GitHub: https://github.com/clarajegousse/mime. The following data are available at ENA:
-Raw data, co-assemblies and MAGs: PRJEB41565.
-Raw sequence data for the mock community: ERS5472810 to ERS5472840, and ERS5475418.
-The surface and seafloor co-assemblies: ERS5565811 and ERS5565812 respectively.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj.11112#supplemental-information.