Survey of the green picoalga Bathycoccus genomes in the global ocean

Bathycoccus is a cosmopolitan green micro-alga belonging to the Mamiellophyceae, a class of picophytoplankton that contains important contributors to oceanic primary production. A single species of Bathycoccus has been described while the existence of two ecotypes has been proposed based on metagenomic data. A genome is available for one strain corresponding to the described phenotype. We report a second genome assembly obtained by a single cell genomics approach corresponding to the second ecotype. The two Bathycoccus genomes are divergent enough to be unambiguously distinguishable in whole DNA metagenomic data although they possess identical sequence of the 18S rRNA gene including in the V9 region. Analysis of 122 global ocean whole DNA metagenome samples from the Tara-Oceans expedition reveals that populations of Bathycoccus that were previously identified by 18S rRNA V9 metabarcodes are only composed of these two genomes. Bathycoccus is relatively abundant and widely distributed in nutrient rich waters. The two genomes rarely co-occur and occupy distinct oceanic niches in particular with respect to depth. Metatranscriptomic data provide evidence for gain or loss of highly expressed genes in some samples, suggesting that the gene repertoire is modulated by environmental conditions.


Genomic data
Bathycoccus RCC1105 1 was isolated in the bay of Banyuls-sur-mer at the SOLA station at a depth of 3 m in January 2006. Sequences were downloaded from the Online Resource for Community Annotation of Eukaryotes 2 . Two metagenomes of uncultured Bathycoccus sorted by flow cytometry 3

Single-cell isolation and amplification
The four cells composing the final genome sequence assembly of TOSAG39-1 (for Tara Figure S13). Samples were preserved in 6% glycine betaine final and frozen quickly in liquid nitrogen. Samples were shipped to the Bigelow Laboratory Single Cell Genomics Center where they were thawed. Single cells were sorted into a lysis buffer by flow cytometry based on their cell size and chlorophyll content. The DNA content of each cell was amplified separately using Multiple Displacement Amplification (MDA), following previously described protocols 5 . The identification of cells was based on the 18S rRNA gene sequence. After multiple alignments using MUSCLE 6 , it appeared that the 18S rRNA sequence of TOSAG39-1 was strictly identical to that of Bathycoccus prasinos (GenBank: AY425315, FN562453).

DNA sequencing and assembly
The four cells, A, B, C and D were sequenced independently on 1∕8th Illumina HiSeq lane, producing a total of 96 million 101-bp paired-end reads. For the combined-SAG assembly, we pooled the reads from the different cells to increase the completion of the final assembly. To ensure that genomes of these cells could be correctly co-assembled, we first analyzed the contribution of each cell to a global assembly using the HyDA assembler 7 . HyDA produced a colored de Bruijn graph in which most contigs were covered by reads from at least three different cells, suggesting that the genomes were close enough to be successfully co-assembled. We used SPAdes 2.4 8 using parameter k = 21, 33 and 55 to obtain the final assembly, and we scaffolded contigs using the SSPACE program 9 . We used GapCloser (v 1.12-6 from SOAPdeNovo2 package 10 ) with default settings to perform gap filling on the resulting scaffolds. Scaffolds shorter than 500 bp were discarded from the assembly.
We obtained individual assemblies for each cell, A, B, C and D separately using the same versions of SPAdes, SSPace and GapCloser. We computed a merged-assembly by pooling all scaffolds from the four individual assemblies and removing the redundancy using CD-HIT 11,12 v 4.6.1. Scaffolds with ≥ 95% identity and ≥ 80% overlapping (considering the shortest sequence) were clustered together and the longest scaffold of each cluster was kept as representative. The combined-SAGs assembly is the longest and appears as the most complete (Table 1).

Gene prediction on the TOSAG39-1 assembly
To predict different structures or specific genes that would be absent from the RCC1105 genome, we performed a de novo gene prediction using three different resources: protein mapping from a custom database enriched in marine protists transcripts, including the RCC1105 proteome; ab initio gene predictions; and transcriptional evidence from Tara Oceans metatranscriptomic data. Before this process, we masked the TOSAG39-1 assembly against repeated sequences using RepeatMasker version open-3.3.0 13 .
We then mapped all proteins with BLAST+ 2.2.27 14 (e-value < 10 -2 ). The reference database was built with Uniref100 15 (version July 25 th 2013) and the MMETSP transcriptomes 16 (version August 2013). We obtained a total of 6 560 distinct matches. For ab initio predictions, we used the SNAP predictor 17 after calibration on Bathycoccus prasinos RCC1105 gene models. This resulted in the prediction of 6 797 gene models. Biological evidence was also provided by Tara Oceans metatranscriptomes. After mapping metatranscriptomic reads from all Tara Oceans samples of the 0.8-5 µm size fraction, we used the Gmorse pipeline 18 to define the gene structures from vertical coverage. We applied a minimum read coverage threshold of 32 because of the large abundance of Bathycoccus in Tara Ocean samples. We detected 6 112 genes. We finally integrated protein mapping, SNAP ab initio predictions and metatranscriptome derived gene models using a combiner process modified from the Gmorse software 16

TOSAG39-1 and RCC1105 genomic comparison
Best reciprocal hits (BRH) We identified orthologous genes between RCC1105 and TOSAG39-1. We aligned each pair of genes using the Smith-Waterman algorithm 19 and retained alignments having a score higher than 300 (BLOSUM62, gapo = 10, gape = 1). We defined 4 153 best reciprocal hits as orthologs. The distribution of the percent identities for these BRH between the two Bathycoccus genomes is shown in Supplementary Figure S3.

Synteny and collinear genes analysis
We aligned the RCC1105 genomic data against the twenty longest TOSAG39-1 scaffolds (containing 656 genes) using promer (default parameter) from the MUMmer 3.19 package 20 . We used mummerplot to select RCC1105 chromosomes that corresponded to TOSAG39-1 scaffolds. We identified 18 scaffolds having an alignment covering their entire length with 11 chromosomes. We identified 573 RCC1105 genes localized within these syntenic regions. One of the two remaining scaffolds had matches with one RCC1105 contig that is not mapped to any chromosome, and the other could not be aligned and had a lower GC% (0.44 vs. 0.48 averages for the other scaffolds) suggesting a chromosome 19 origin. To identify genes that are shared between the two genomes, we compared TOSAG39-1 scaffolds and RCC1105 in the six translated frames using tblastx 14 (e-value < 10 -3 ). We visually inspected genomic alignment regions using Artemis 21 and identified 52 RCC1105 genes localized in syntenic regions that lacked any alignments. We further compared these 52 genes against the whole genome at the protein level with tblastx 14 (e-value < 10 -3 ) and identified a total of 24 exclusive genes.

Comparison between Bathycoccus genomes and MMETSP transcriptome
We compared the RCC1105 and TOSAG39-1 gene sets to the two Bathycoccus transcriptomes available in the MMETSP collection 16 . We computed the best reciprocal hit at the amino acid level, as defined previously, and distributed their percentage of identity. We identified unambiguously MMETSP1460 (culture strain RCC716) and MMETSP1399 (culture strain CCMP1898) as corresponding to TOSAG39-1 and RCC1105, respectively (Supplementary Figure S5)

Comparison between Bathycoccus genome assemblies and metagenomes containing Bathycoccus
We compared by tblastn 14 (selecting e-value lower than 10 -3 ) the gene sets of RCC1105 and TOSAG39-1 to the two metagenomes (T142 and T149) from the Chile upwelling 3 and to the metagenome from the Atlantic Ocean DCM 4,22 . We selected matches covering more than 80% of the genes. We identified that RCC1105 corresponds to the T142 and T149 metagenome and TOSAG39-1 corresponds to the Atlantic Ocean metagenome (Supplementary Figure S5).

Metagenomic fragment recruitment
In order to analyze the diversity of Bathycoccus genomes and of dispensable genes, metagenomic reads from the Tara Oceans 0.8-5-µm fraction samples were recruited to whole sequence assemblies. We used Bowtie2-2.1.0 23 to align reads longer than 80 bp. We retained matches having more than 80% identity and more than 30% of high-complexity bases. From the initial 122 samples, we further analyzed the 36 samples for which at least 98% of the genes of Bathycoccus were detected (more than one mapped read).
Using R-package ′ggplot2′ 24 , we displayed the density of reads mapping along the genome in 5 000-bp bins and 1% identity height (Supplementary Figure S11). This representation reduces the granularity of the Y-axis, particularly for high identity levels, caused by the short length of reads.

Gene set filtering
Mitochondrial and plastid genes tblastn (e-value < 10 -20 ) 14 was used to compare the mitochondrial and chloroplast RCC1105 proteins against TOSAG39-1 scaffolds. To check the validity of these scaffolds, we compared these selected scaffold against the nr database 25 using blastn 14 . We identified 35 genes as putatively of chloroplast or mitochondrial origin. The corresponding scaffolds were not further considered in the analysis.

Foreign sequences in TOSAG39-1 assembly
To improve detection of non-Bathycoccus DNA sequences in the TOSAG39-1 assembly, we used the results of metagenomic fragment recruitments for Tara Oceans samples. We postulated that assembly contigs corresponding to Bathycoccus vs. to non-Bathycoccus would be mapped by metagenomic reads at different coverages in the various samples. Therefore, we analyzed the variations of coverage of each gene along Tara Oceans samples to retrieve the specific Bathycoccus coverage profile. We assumed that the coverage profile of the majority of genes was the signature of TOSAG39-1. Considering these profiles as a time series, we used the "diss.CORT" function of the "TSclust" R-package 26 to compute distances based on abundance values and spatial correlation between profiles. We tagged 533 genes having a profile quite different from that of TOSAG39-1. However, we untagged from this list genes having an ortholog in Bathycoccus prasinos RCC1105. Finally, we discarded scaffolds containing tagged genes only. The aim of this filter is to discard the maximum of contigs that have an outlier statistical signal on fragment recruitment to avoid any putative bias due to atypical genomic region. Using this approach, we removed 223 scaffolds from the assembly. We compared these scaffolds on public databases using blast 14 . Due to the stringency of this filter, some of these scaffolds (37.8%) seem to correspond to Bathycoccus, but the majority doesn't have any match or match different other organisms (Supplementary Table 6).
We also followed this rationale to detect genes having "outlier" profiles. We identified 826 and 1 051 genes on RCC1105 and TOSAG39-1, respectively. Among these, 111 and 223 were identified as crossmapped genes (see below).

Estimation of cross-species mapped genes
In order to analyze the abundance of the two Bathycoccus genomes in the Tara Oceans metagenomic samples, we checked the possibility that some genes could be cross-mapped, that is genes that could be mapped by metagenomic reads from both genotypes. These genes could lead to a background signal in species detection survey. We identified 1 057 and 1 020 genes from TOSAG39-1 and RCC1105, respectively, that could be aligned on the other genome using Bowtie 23 . In order to do this, we fragmented one genome into 100-bp fragments that we mapped on the second genome to simulate metagenomics fragment recruitment conditions. We retained results having more than 95% identity. Since TOSAG39-1 is 64% complete, we extrapolated the total number of cross-mapped genes to about 1 500.

Relative genomic abundance
We mapped metagenomic reads on RCC1105 and TOSAG39-1 genome sequence using Bowtie2 2.1.0 aligner with default parameters 23 . We filtered out alignments corresponding to low complexity regions using the dust algorithm 27 and we discarded alignments with less than 95% mean identity or with less than 30% of high complexity bases. For each Bathycoccus, we computed relative genomic abundances as the number of reads mapped onto non-outlier genes normalized by the total number of reads sequenced for each sample. We took into account the estimated fraction of genome recovery of TOSAG39-1 assembly to extrapolate the number of reads mapped on non-outlier genes to a complete genome assembly. Cross mapped genes, organelles and outlier genes were dismissed for the calculation. We generated the world maps and heatmaps with R-package maps_2.1-6, mapproj_1.1-8.3, gplots_2.8.0 and mapplots_1.4 (version R-2.13, https://cran.r-project.org/web/packages/maps/index.html).

RPKM MG and RPKM MT
Metagenomic and metatranscriptomic read counts per gene (RPKM MG and RPKM MT ) correspond to the number of mapped reads per gene (intron plus exon for RPKM MG ) or per CDS (for RPKM MT ) divided by the total number of reads sequenced for each sample multiplied by gene length. We used the following formula for figures: . We investigated relative transcriptomic activity of genes by dividing RPKM MT by RPKM MG . If RPKM MT > 0 but RPKM MG is null, we used the median of the total RPKM MG .

Metabarcoding
Metabarcoding abundance values (V9 region of 18S rRNA genes) were extracted from a previous study 28 and correspond to the proportion of all eukaryotic reads assigned to Bathycoccus.

Identification and characterization
To detect variations in gene content of the two Bathycoccus genomes in the different samples, in particular gene loss, we analyzed the coverage of metagenomic reads that were specifically mapped on each genome at high stringency. To avoid putative background signals, we restricted this analysis to samples where 98% of the genes were detected (metagenomic abundance > 0). We retained 34 samples for RCC1105 and 21 samples for TOSAG39-1. We then focused on genes that were detected in at least four samples, and not detected in at least five samples. We obtained 108 and 106 dispensable genes in RCC1105 and TOSAG39-1, respectively. We performed a Mann-Whitney-Wilcoxon test (using R function wilcox.test with default parameters) to compare RPKM values and gene length between dispensable and nondispensable genes. We considered a significant difference at a p-value < 10 −3 .

Validation of dispensable cassette genes in metagenomes
We aimed to validate the genomic pattern of gain or loss of cassettes of dispensable genes on RCC1105 using long metagenomic contigs from the Tara Ocean expedition data. We selected Tara Oceans stations having a high abundance of RCC1105 and a negligible abundance of TOSAG39-1 (relative abundance < 0.05%). We assembled merged metagenomic reads using SOAPdenovo 10 and a kmer size of 31. Most of the metagenomics contigs were short (N50 sizes ranged from 804 to 836 nt in the different samples) because of the difficulty of assembling eukaryotic metagenomes. However, we identified by blastn 14 several long metagenomics contigs that covered two dispensable cassettes, including the longest one.
These metagenomics contigs were from the following stations and depths: TARA_082 surface, TARA_093 surface, TARA_152 surface, TARA_089 surface, TARA_093 DCM and TARA_152 surface ( Figure 4, Supplementary Figure 13). These alignments confirmed the total absence of these dispensable cassettes in these metagenomic contigs. Furthermore, the positions of the insertion or deletion of a given cassette were identical for several metagenomic contigs, indicating a common event and suggesting the existence of only two genomic forms at these genomic positions in these samples

Analysis of environmental parameters
We used physicochemical parameter values related to the expedition sampling sites and available in the Pangaea database 29 . We extrapolated PAR values (corresponding to weekly averages values of Photosynthetically Active Radiation) at sample depth using the following formula with k derived from surface chlorophyll concentration (Chl sur ) using the following published formulas 30 . We carried these analyses for stations for which at least 98% of genes from one of the two Bathycoccus were detected. For each parameter, we performed a Mann-Whitney-Wilcoxon test (using the R function wilcox.test with default parameters) between the TOSAG39-1 and RCC1105 sets of values.

rRNA operon comparison
The Bathycoccus RCC1105 rRNA operon was used as the reference sequence to align the rRNA operons of TOSAG39-1, of two metagenomes (T142 and T149) from the Chile upwelling 3 , of a metagenome from the Atlantic Ocean DCM 4,22 , and the ITS from strains RCC715 and 716 (Genbank accession KT809427, KT809428) that have been isolated from the Indian Ocean. The alignments were done with MAFFT, as implemented in Geneious 7.1 (http://www.geneious.com/).