290 Metagenome-assembled Genomes from the Mediterranean Sea: Ongoing Effort to Generate Genomes from the Tara Oceans Dataset

The Tara Oceans Expedition has provided large, publicly-accessible microbial metagenomic datasets from a circumnavigation of the globe. Utilizing several size fractions from the samples originating in the Mediterranean Sea, we have used current assembly and binning techniques to reconstruct 290 putative high-quality metagenome-assembled bacterial and archaeal genomes, with an estimated completion of ≥50%, and an additional 2,786 bins, with estimated completion of 0-50%. We have submitted our results, including initial taxonomic and phylogenetic assignments for the putative high-quality genomes, to open-access repositories (iMicrobe and FigShare) for the scientific community to use in ongoing research.


Introduction 23
Microorganisms are a major constituent of the biology within the world's oceans and act as the 24 important linchpins in all major global biogeochemical cycles 1 . Marine microbiology is among 25 the disciplines at the forefront of pushing advancements in understanding how microorganisms 26 respond to and impact the local and large-scale environments. An estimated 10 29 Bacteria and 27 Archaea 2 reside in the oceans and an immense amount of poorly constrained, and ever evolving 28 genetic diversity. 29 The Tara Oceans Expedition (2003-2010) encompassed a major endeavor to add to the 30 body of knowledge collected during previous global ocean surveys to sample the genetic 31 potential of microorganisms 3 . To accomplish this goal, members of Tara Oceans sampled 32 planktonic organisms (viruses to fish larvae) at two major depths, the surface ocean and the 33 mesopelagic. The amount of data collected was expansive and included 35,000 samples from 34 210 ecosystems 3 . The Tara Oceans Expedition generated and publically released 7.2 Tbp of 35 metagenomic data from 243 ocean samples from throughout the global ocean, specifically 36 targeting the smallest members of the ocean biosphere, the viruses, Bacteria and Archaea, and 37 picoeukaryotes 4 . Initial work on these fractions produced a large protein database, totaling > 40 38 million nonredundant protein sequences and identified >35,000 microbial operational taxonomic 39 units (OTUs) 4 . 40 Leveraging the publically available metagenomic sequences from the "girus" (giant virus; 41 0.22-1.6 µm), "bacteria" (0.22-1.6 µm), and "protist" (0.8-5 µm) size fractions, we have 42 performed a new joint assembly of these samples using current sequence assemblers (Megahit 5 ) 43 and methods (combining assemblies from multiple sites using Minimus2 6 ). These metagenomic 44 assemblies were binned using a strictly coverage based binning algorithm 7 in to 290 high-quality 45 (low contamination) microbial genomes, ranging from 50-100% estimated completion. 46 Environmentally derived genomes representing the most abundant microorganisms are 47 imperative for a number of downstream applications, including comparative genomes, 48 metatranscriptomics, and metaproteomics. This series of genomic data can allow for the 49 recruitment of environmental "-omic" data and provide linkages between functions and 50 phylogenies. This method was initially performed on the seven sites from the Mediterranean Sea 51 containing microbial metagenomic samples (TARA007, -009, -018, -023, -025 and -030), but 52 will continue through the various Longhurst provinces 8,9 sampled during the Tara Oceans 53 project ( Figure 1). All of the assembly data is publically available, including the initial Megahit 54 assemblies for each site from the various size fractions and depths and putative (minimal quality 55 control) genomes within iMicrobe (http://imicrobe.us). 56 57 Materials and Methods 58 59 A generalized version of the following workflow is presented in Figure 2. 60 61 Sequence Retrieval and Assembly 62 All sequences for the reverse and forward reads from each sampled site and depth within the 63 Mediterranean Sea were accessed from European Molecular Biology Laboratory (EMBL) 64 utilizing their FTP service (Table 1). Paired-end reads from different filter sizes from each site 65 and depth (e.g., TARA0007, girus filter fraction, sampled at the deep chlorophyll maximum) 66 were assembled using Megahit 5 (v1.0.3; parameters: --preset, meta-sensitive). To keep consistent 67 with TARA sample nomenclature, "bacteria" or "BACT" will be used to encompass the size 68 fraction 0.22-1.6 µm. All of the Megahit assemblies were pooled in to two tranches based on 69 assembly size, ≤1,999bp, and ≥2,000bp. Longer assemblies (≥2kb) with ≥99% semi-global 70 identity were combined using CD-HIT-EST (v4.6; -T 90 -M 500000 -c 0.99 -n 10). The reduced 71 set of contiguous DNA fragments (contigs) was then cross-assembled using Minimus2 6 (AMOS 72 v3.1.0; parameters: -D OVERLAP=100 MINID=95 (2) bins with "high" contamination (≥50% complete and ≥10% cumulative redundancy); and (3) 88 low completion bins (<50% complete). The high contamination group were additionally binned 89 using the BinSanity refinement method (refine-contaminated-log.py; parameters: -p 'variable', -90 m 2000, -v 200, -d 0.9), which utilizes affinity propagation 13 to cluster contigs within a bin based 91 on tetranucleotide frequencies and %G+C. 92 To determine the preference values needed to successfully bin the high contamination 93 bins, 15 bins were assessed manually using the number of marker occurrences determined by 94 CheckM. Bins containing approximately two genomes, three genomes, and bins with more 95 genomes used a preference of -1000 (-p -1000), -500 (-p -500), and -100 (-p -100), respectively. 96 The 15 manually assessed bins were used to train a decision tree within scikit-learn 14 (default 97 parameters, DecisionTreeClassifier) to assign parameters to the other bins. The resulting bins 98 were added to one of the three categories: putative high quality genomes, high contamination 99 bins, and low completion bins. The high contamination bins were processed for a third time with 100 the BinSanity refinement step utilizing a preference of -100 (-p -100). These bins were given 101 final assignments to either the putative high quality genomes (some putative genomes had >10% 102 cumulative contamination, but have been designated) or low completion bins. Bins determined to 103 be low completion bins were reserved for an additional round of binning (see below). 104 After this initial round of binning, all contigs not assigned to putative high-quality 105 genomes were assessed using BinSanity using raw coverage values. Two additional rounds of 106 refinement were performed (as above) with the first round of refinement using the decision tree 107 to determine preference and the second round using a set preference of -10 (-p -10). Following 108 this binning phase, contigs were assigned to high quality bins (e.g., Tara Mediterranean genome 109 1, referred to as TMED1, etc.), low completion bins with at least five contigs (0-50% complete; 110 TMEDlc1, etc. lc, low completion), or were not placed in a bin (Supplemental Table 1  using DIAMOND 18 (v0.8.11.73; parameters: -f xml -k 5 --sensitive -e 1e-10) and the output was 122 processed using MEGAN 19 (v4; parameters: recompute toppercent = 5, recompute minsupport = 123 1, collapse rank = species, select nodes = all) to determine the last common ancestor for the top 124 five matches. Using a script from the Multi-Metagenome package (hmm.majority.vote.pl; 125 https://github.com/MadsAlbertsen/multimetagenome; parameters: -n -l 4 [-l 5, -l 6, or -l 7]), each 126 contig was assigned a consensus taxonomic identification at approximately the Phylum, Class, 127 Order, and Family levels. A consensus for all contigs at each taxonomic level was determined. If 128 at any level a tie was achieved between possible assignments, it has been denoted with a "T" in 129 the genome (positions with >50% gaps) and re-aligned with MUSCLE (parameters: -maxiters 8). An 143 approximate maximum likelihood (ML) tree with pseudo-bootstrapping was constructed using 144 FastTree 26 (v2.1.3; parameters: -nt -gtr -gamma; Figure 3). 145 High-quality genomes were assessed for the presence of the 16 ribosomal markers genes 146 used in Hug, et al. (2016) 27 . Putative CDSs were determined using Prodigal (v2.6.3; parameters: 147 -m -p meta) and were searched using HMMs for each marker using HMMER 28 (v3.1b2; 148 parameters: hmmsearch --cut_tc --notextw). If a genome had multiple copies of any single 149 marker gene, neither was considered, and only genomes with ≥8 markers were used to construct 150 a phylogenetic tree. Markers identified from the high quality genomes were combined with 151 markers from 1,729 reference genomes that represent the major bacterial phylogenetic groups (as 152 presented by IMG 29 ). Archaeal reference sequences were not included; however, none of the 153 putative archaeal environmental genomes had a sufficient number of markers for inclusion on the 154 tree. Each marker gene was aligned using MUSCLE (parameters: -maxiters 8) and automatically 155 trimmed using trimAL (parameters: -automated1). Automated trimming results were assessed (as 156 above) and re-aligned with MUSCLE, as necessary. Final alignments were concatenated and 157 used to construct an approximate ML tree with pseudo-bootstrapping with FastTree (parameters: 158 -gtr -gamma; Figure 4).

160
Relative Abundance of High Quality Genomes 161 To set-up a baseline that could approximate the "microbial" community (Bacteria, Archaea and 162 viruses) present in the various Tara metagenomes, which included filter sizes specifically 163 targeting both protists and giruses, reads were recruited against all contigs generated from the 164 Minimus2 and Megahit assemblies ≥2kb using Bowtie2 (default parameters). Some assumptions 165 were made that contigs <2kb would include, low abundance bacteria and archaea, bacteria and 166 archaea with high degrees of repeats/assembly poor regions, fragmented picoeukaryotic 167 genomes, and problematic read sequences (low quality, sequencing artefacts, etc.). All relative 168 abundance measures are relative to the number of reads recruited to the assemblies ≥2kb. Read 169 counts were determined using featureCounts (as above). Length-normalized relative abundance 170 values were determined for each high quality genome for each sample: Available Through iMicrobe 174 In keeping with the open-access nature of the Tara  files and contents can be found on Supplemental Table 3.

187
Results 188 Assembly 189 The initial Megahit assembly was performed on the publicly available reads for Tara stations  190 007, 009, 018, 023, 025, 030. Starting with 147-744 million reads per sample, the Megahit 191 assembly process generated 1.2-4.6 million assemblies with a mean N 50 and longest contig of 192 785bp and 537kb, respectively (Table 1). In general, the assmeblies generated from the Tara 193 samples targeting the protist size fraction (0.8-5 µm) had a shorter N 50 value than the bacteria 194 size fractions (mean: 554bp vs 892bp, respectively). Assemblies from the Megahit assembly 195 process were pooled and separated by length. Of the 42.6 million assemblies generated during 196 the first assembly, 1.5 million were ≥2kb in length ( contigs, all ≥2kb in length ( Table 2; further referred to as data-rich-contigs).

205
Binning 206 The set of data-rich-contigs was used to recruit the metagenomic reads from each sample using 207 Bowtie2. The data-rich-contigs recruited 15-81% of the reads depending on the sample. In 208 general, the protist size fraction recruited substantially fewer reads than the girus and bacteria 209 size fractions (mean: 19.8% vs 75.0%, respectively) ( Table 1). For the protist size fraction, the 210 "missing" data for these recruitments likely results from the poor assembly of more complex and 211 larger eukaryotic genomes. The fraction of the reads that do not recruit in the girus and bacterial 212 size fraction samples could be accounted for by the large number of low quality assemblies (200-213 500bp) and reads that could not be assembled due to low abundance or high complexity (Table  214 2). Coverage was determined as total reads per base pair, based on the number of reads recruited 215 to each contig. 216 Unsupervised binning was performed using both transformed and raw coverage values 217 for a subset of 95,506 contigs from the data-rich-contigs that were ≥7.5kb (referred to further as 218 binned-contigs) utilizing the tool BinSanity. An iterative process was performed that first used 219 coverage to generate putative bins, and then after removing putative high-quality genomes 220 (≥50% complete and <10% redundancy), based on estimates of redundancy through CheckM, 221 used two passes through the BinSanity refinement process, utilizing sequence composition. 222 Binning using the transformed coverage data generated 237 putative high-quality genomes (12 223 putative genomes are of slightly lower quality with >10% redundancy and have been noted) 224 containing 15,032 contigs. Contigs not in putative genomes were re-binned through the iterative 225 use of BinSanity based on raw coverage values, generating 53 additional putative high-quality 226 genomes encompassing 3,348 contigs. In total, 290 putative high-quality genomes were 227 generated with 50-100% completion (mean: 69%) with a mean length and number of putative 228 CDS of 1.7Mbp and 1,699, respectively (iMicrobe; Supplemental Table 1). All other contigs 229 were grouped in to bins with at least five contigs, but with estimated completion of 0-50% (2,786 230 low completion bins; 74,358 contigs; Supplemental Table 2) or did not bin (2,732 contigs). 231 Nearly a quarter of the low completion bins (24.7%) have an estimated completion of 0%.

233
Taxonomy, Phylogeny, & Potential Organisms of Interest 234 The 290 putative high-quality genomes had a taxonomy assigned to it via CheckM during the 235 pplacer step. All of the genomes, except for 20, had an assignment to at least the Phylum level, 236 and 83% of the genomes had an assignment to at least the Class level. Additionally, all of the 237 genomes were assigned putative taxonomies using a consensus method of the taxonomies 238 assigned to the putative CDS on a contig. The genomes were assigned four levels of taxonomic 239 information, roughly equivalent to Phylum, Order, Class, and Family. Due to the nature of this 240 method, especially at lower taxonomic levels, it is possible for a small number of assignments to 241 greatly influence the results. Because all of these methods have inherent biases, consistency 242 across several results should be viewed as reinforcing support for the accuracy of the genome, 243 while inconsistent results should not be used as evidence of an incorrectly binned genome. 244 Attempts were made to provide phylogenetic information for as many genomes as 245 possible. Genomes were assessed for the presence of full-length 16S rRNA genes. In total, 37 246 16S rRNA genes were detected in 35 genomes (mean 16S rRNA gene copy number, 1.05). 16S 247 rRNA genes can prove to be problematic during the assembly steps due the high level of 248 conservation that can break contigs 32 (Figure 3). Additionally, the conserved regions of the 16S 249 rRNA, depending on the situation, can over-or under-recruit reads, resulting in coverage 250 variations that can misplace contigs in to the incorrect genome. As such, several of the 16S 251 rRNA phylogenetic placements support the taxonomic assignments, while some are 252 contradictory. Further analysis should allow for the determination of the most parsimonious 253 result.

254
Beyond the 16S rRNA gene, genomes were searched for 16 conserved, syntenic 255 ribosomal markers. Sufficient markers (≥8) were identified in 193 of the genomes (67%) and 256 placed on a tree with 1,729 reference sequences (Figure 4). Phylogenies were then assigned to 257 the lowest taxonomic level that could be confidently determined.

258
The taxonomic and phylogenetic assignments are provided to give downstream users a 259 guide for determining which genomes prove to be most interesting for further analysis. Highest 260 confidence should be given to genomes with multiple lines of evidence supporting an assignment 261 and additional confirmation should be gathered for those with multiple conflicting results. These 262 putative results reveal a number of genomes were generated that represent multiple clades for 263 which environmental genomic information remains limited, including: Planctomycetes, 264 Verrucomicrobia, Marinimicrobia, Cyanobacteria, and uncultured groups within the Alpha-and 265 Gammaproteobacteria.

267
Relative Abundance 268 Based on the assembly and recruitment results, the assumption was made that the data-rich-269 contigs and their corresponding reads represent the dominant portion of the microbial (bacterial, 270 archaeal, and viral) community and that reads that did not recruit represent eukaryotes, low 271 quality assemblies, and/or less dominant portions of the microbial community. A length-272 normalized relative abundance value was determined for each genome in each sample based on 273 the number of reads recruited to the data-rich-contigs. The relative abundance for the individual 274 genomes was determined based on this portion of the read dataset. 275 In general, the genomes and their underlying contigs had low coverage (<1X coverage) 276 and low relative abundance (maximum relative abundance = 1.9% for TMED155 a putative 277 Cyanobacteria in TARA023-PROT-SRF; Supplemental Table 1). The high-quality genomes 278 accounted for 1.57-25.16% of the approximate microbial community as determined by the data-279 rich-contigs (mean = 13.69%), with the ten most abundant genomes representing 0.61-10.31% 280 (Table 1). 281 Almost all of the contigs in the binned-contigs were low coverage, only a small subset of 282 6,350 contigs (6.6%) had >1X coverage in at least one sample. Of these contigs, 1,962 were 283 assigned to putative high-quality genomes, while the other contigs were placed in the low 284 completion bins. Further, an additional 22,470 contigs (Total bp = 79,422,500bp, mean = 285 3,535bp, and longest contig = 7,498bp) within data-rich-contigs (1.3%) had greater than >1X 286 coverage, but were not included in the binning protocol.

288
Concluding Statement 289 The goal of this project was to provided preliminary putative genomes from the Tara Oceans 290 microbial metagenomic datasets. The 290 putative high-quality genomes and 2,786 low 291 completion bins were created using the 20 samples and six stations from the Mediterranean Sea. 292 We will continue to generate putative high-quality genomes from additional Tara Oceans 293 dataset, starting with the Red Sea and Arabian Sea in the near future. 294 We would like to take some time to highlight to interesting results created within this 295 dataset. For new genomes from environmental organisms, this project created approximately 14 296 new Cyanobacteria genomes within the genera Prochlorococcus and Synechococcus and 33 new 297 SAR11 genomes. Three unconfirmed members related to the Candidate Phyla Radiation (CPR) 298 as determined by placement of an internal node between the Parcubacteria and Microgenomates 299 (with long-branch characteristics; TMED88) and a node basal to the CPR genomes, potentially 300 related to the Wirthbacteria (TMED70 and TMED22), on the concatenated ribosomal marker 301 tree. Additionally, there are putative genomes from the marine Euryarchaeota (n = 11), 302 Verrucomicrobia (n = 17), Planctomycetes (n = 14), and Marinimicrobia (n ≈ 5). 303 Some additional perplexing results include, TMED58 a putative Deltaproteobacteria 304 with taxonomic assignments from the NCBI NR database to the Myoviridae. This result occurs 305 due the presence of a few large contigs assigned to the bacteria and many small contigs assigned 306 to the virus. However, if these two entities should be binned together remains unresolved. Lastly, 307 the low completion bins may house distinct viral genomes. Of particular interest may be the 40 308 bins with 0% completion (based on single-copy marker genes), but that contain >500kb of 309 genetic material (including 3 bins with >1Mb). These large bins lacking markers may be good 310 candidates for research in to the marine "giant viruses" and episomal DNA sources (plasmids, 311 etc.). 312 It should be noted, researchers using this dataset should be aware that all of the genomes 313 generated from these samples (and additional stations, which are on-going) should be used as a 314 resource with some skepticism towards the results being an absolute. Like all results for 315 metagenome-assembled genomes, these genomes represent a best-guess approximation of a 316 taxon from the environment 33 . Researchers are encouraged to confirm all claims through various 317 genomic analyses and accuracy may require the removal of conflicting sequences. 318 319 320 We would like to thank iMicrobe and FigShare for hosting data for this research. We are 321 indebted to the Tara Oceans project and team for their commitment to open-access data that 322 allows data aficionados to indulge in the data and attempt to add to the body of science contained 323 within. And we thank the Center for Dark Energy Biosphere Investigations (C-DEBI) for 324 providing funding to BJT and JFH (OCE-0939654).