Dataset of genome sequence, de novo assembly, and functional annotation of Ruegeria sp. (PBVC088), a marine bacterium associated with the toxin-producing harmful dinoflagellate, Pyrodinium bahamense var. compressum

The dataset comprises a whole-genome sequence of Ruegeria sp. PBVC088, a symbiotic (Gram-negative) bacterium associated with Pyrodinium bahamense var. compressum, which has been associated with harmful algal blooms in the coastal waters of west Sabah, Malaysia. Harmful algal blooms contribute to economic losses for the aquaculture industry, as well as human illnesses and fatalities due to paralytic shellfish poisoning. Bacteria-algae dynamics have posited that the interaction is potentially responsible for the toxin production during a toxic harmful algal bloom event. Despite the expanding body of literature on the capabilities of these bacteria to metabolize, produce, and modify toxins autonomously, it has yet to be confirmed that these toxin-producing bacteria are capable of autonomous toxin synthesis. Saxitoxin, a paralytic shellfish poisoning toxin, is produced by a unique biosynthetic pathway, where the genetic basis for the saxitoxin production was first reported in the saxitoxin-producing cyanobacteria strain Cylindrospermopsis raciborskii T3 (NCBI accession no. DQ787200). The genes responsible for saxitoxin biosynthesis in dinoflagellates, have yet to be fully elucidated. The identification of cyanobacteria saxitoxin biosynthesis genes (sxt) may eventually lead to the identification of homologous genes within the dinoflagellates. Previous studies on the diversity of the bacterial communities associated with the same toxic P. bahamense harmful alga has been carried out by using both the culture-dependent 16S ribosomal RNA gene sequence analysis and culture-independent 16S metagenomic sequence analysis. This study extends the knowledge pertaining to the genomic aspect of an associated bacterium isolated from P. bahamense alga by adopting a whole genome sequencing approach. Here, we report the genome sequencing, de novo assembly, and annotation data of a bacterium, Ruegeria sp. PBVC088, associated with harmful alga P. bahamense, which can be referenced by researchers to identify the genes and pathways related to toxin biosynthesis from a much larger data set. The genome of Ruegeria sp. PBVC088 was sequenced using the Illumina MiSeq platform with 250 bp paired-end reads. The number of reads generated from the MiSeq sequencer was 1,135,484, with an estimated coverage of 100X. The estimated genome size for the marine bacterium was computed to be 5.78 Mb. Annotation of the genome predicted 5,689 gene sequences, which were assigned putative functions based on homology to existing protein sequences in public databases. In addition, annotation of genes related to saxitoxin biosynthesis pathway was also performed. Raw fastq reads and the final version of the genome assembly have been deposited in the National Center for Biotechnology Information (NCBI) (BioProject: PRJNA324753, WGS: LZNT00000000, SRA: SRR3646181). The genome data provided here are expected to better understand the genetic processes involved in saxitoxin biosynthesis in marine bacteria associated with dinoflagellates.


a b s t r a c t
The dataset comprises a whole-genome sequence of Ruegeria sp. PBVC088, a symbiotic (Gram-negative) bacterium associated with Pyrodinium bahamense var. c ompressum , which has been associated with harmful algal blooms in the coastal waters of west Sabah, Malaysia. Harmful algal blooms contribute to economic losses for the aquaculture industry, as well as human illnesses and fatalities due to paralytic shellfish poisoning. Bacteria-algae dynamics have posited that the interaction is potentially responsible for the toxin production during a toxic harmful algal bloom event. Despite the expanding body of literature on the capabilities of these bacteria to metabolize, produce, and modify toxins autonomously, it has yet to be confirmed that these toxin-producing bacteria are capable of autonomous toxin synthesis. Saxitoxin, a paralytic shellfish poisoning toxin, is produced by a unique biosynthetic pathway, where the genetic basis for the saxitoxin production was first reported in the saxitoxin-producing cyanobacteria strain Cylindrospermopsis raciborskii T3 (NCBI accession no. DQ787200). The genes responsible for saxitoxin biosynthesis in dinoflagellates, have yet to be fully elucidated. The identification of cyanobacteria saxitoxin biosynthesis genes ( sxt ) may eventually lead to the identification of homologous genes within the dinoflagellates. Previous studies on the diversity of the bacterial communities associated with the same toxic P. bahamense harmful alga has been carried out by using both the culture-dependent 16S ribosomal RNA gene sequence analysis and cultureindependent 16S metagenomic sequence analysis. This study extends the knowledge pertaining to the genomic aspect of an associated bacterium isolated from P. bahamense alga by adopting a whole genome sequencing approach. Here, we report the genome sequencing, de novo assembly, and annotation data of a bacterium, Ruegeria sp. PBVC088, associated with harmful alga P. bahamense , which can be referenced by researchers to identify the genes and pathways related to toxin biosynthesis from a much larger data set. The genome of Ruegeria sp. PBVC088 was sequenced using the Illumina MiSeq platform with 250 bp paired-end reads. The number of reads generated from the MiSeq sequencer was 1,135,484, with an estimated coverage of 100X. The estimated genome size for the marine bacterium was computed to be 5.78 Mb. Annotation of the genome predicted 5,689 gene sequences, which were assigned putative functions based on homology to existing protein sequences in public databases. In addition, annotation of genes related to saxitoxin biosynthesis pathway was also performed. Raw fastq reads and the final version of the genome assembly have been deposited in the National Center for Biotechnology Information (NCBI) (BioProject: PRJNA324753, WGS: LZNT0 0 0 0 0 0 0 0, SRA: SRR3646181). The genome data provided here are expected to better understand the genetic processes involved in saxitoxin biosynthesis in marine bacteria associated with dinoflagellates.

Value of the Data
• We present here the whole genome sequence data of Ruegeria sp., a marine bacterium associated with the saxitoxin-producing dinoflagellate, P. bahamense var. compressum . • The data also provides researchers with a better understanding of the genes associated with the saxitoxin biosynthesis pathway in the associated Ruegeria bacterium. • The data can be used by marine researchers to perform comparative genomic studies of bacteria associated with other saxitoxin-producing dinoflagellates. • The whole genome data can be helpful as a genomic reference to facilitate future studies on the characterization of marine bacteria associated with harmful algal blooms and their contribution to the toxin biosynthesis pathway.

Data Description
The whole genome sequencing of a bacteria (PBVC088) associated with the harmful dinoflagellate, P. bahamense var. compressum generated approximately 1,135,484 raw pair-end reads with approximately 542.5 Mb total bases were generated ( Table 1 ). The draft genome with the total assembly size of 5.78 Mb has 143 contigs with 64.96% GC content. The longest contig was 403,173 bp, whereas the shortest contig was 507 bp. Data generated from the MiSeq sequencer had approximately 100 times coverage, indicating that the reads were sufficient enough for a good de novo assembly. The contig N50 size was 149,793 bp. The N50 is the length of the average contig size and the value for this parameter was relatively high that was indicative of good contiguity of the assembly. A total of 5,635 protein coding sequences (CDSs) were identified with only 3,916 or 70 percent of the total CDSs having predicted functions and the remaining 30 percent functionally annotated as hypothetical proteins. The pie chart in Fig. 1 summarizes the functional distribution of the protein-coding genes in the PBVC088 genome, in which major genes fall within the carbohydrates (pink) and amino acids (light green) pathways.  The proposed biosynthetic pathway by Kellmann et al. [1] delineated 26 sxt ( sxtA -sxtZ ) genes with eight sxt proteins encoded by sxtA , sxtG , sxtB , sxtD , sxtS , sxtU , sxtH/T , and sxtI , which appeared to be directly involved in the synthesis of saxitoxin [ 1 , 2 ]. A total of eleven contigs (protein-coding genes) with the highest alignment score (bit score > 55) or best hit based on E-value were retrieved from the PBVC088 bacterial genome. Results of the protein similarity search have been summarized in Table 2 . The remaining cyanobacterial sxt genes did not have any significant similarity with the protein-coding gene sequences of PBVC088. Table 3 provides a summary of shared domains between the candidate genes in PBVC088 and sxt genes in the cyanobacteria.
The whole genome sequence of PBVC088 was compared to a selection of closely related bacteria genomes retrieved from the Roseobacter and the NCBI databases. The phylogenetic tree is depicted in Fig. 2 . The whole-genome phylogeny based on single nucleotide polymorphism (SNP) matrices resolved 21 species of the closed members of Roseobacter clade that followed the taxonomic classification of the sample, PBVC088. The kSNP analysis provided a direct visualization of the relationship among all the closed species of Roseobacter clade. All clusters have a minimum of 83% bootstrap support. The phylogeny showed that the genome PBVC088 was clustered together with the Ruegeria clade. Two well-defined branches were observed within the Ruegeria clade, where one branch contained a mixed bacteria group of Ruegeria pomeroyi, Ruegeria sp., and Pelagibaca bermudenis , whereas another branch contained only Ruegeria sp. bacteria. The PBVC088 was positioned within the former branch, implying high degree of similarity with R. pomeroyi and P. bermudenis. Despite the fact that P. bermudenis was the only distinct species within the Ruegeria clade, its genome may be highly similar to that of Ruegeria sp., indicating this organism's close association with bacteria of the genus Ruegeria .
In summary, we report the draft genome of Ruegeria sp., the first whole genome of a marine bacterium associated with the Malaysian harmful dinoflagellate, Pyrodinium bahamense var. compressum , to be sequenced, and identify genes associated with the biosynthesis of saxitoxin. Due to the limited genomic sequence resources for marine bacteria associated with toxic blooms, we believe that our research will help gain a better understanding of the biological processes that could aid in the long-term management of harmful algal blooms.

Study area
Seawater samples containing the toxic marine dinoflagellate, P. bahamense var. compressum was collected from Sepanggar Bay (6.08 °N, 116.12 °E) in December of 2012 during the period at which high concentrations of paralytic shellfish poisoning toxin was detected in the coastal waters of Sabah [3] .

Sample preparation
The PBVC088 bacterium was isolated from the clonal culture of the harmful alga P. bahamense var. compressum (CC-UHABS-040(M)) following the previously described method [4] . Early, late exponential, and stationary growth phases of the microalgal cultures were serially diluted (10fold dilution) in sterile seawater. The isolated bacterium was maintained on sterile marine agar (Difco) at 37 °C. To obtain pure culture, bacteria from a dilution containing 50 to 100 colonies were isolated from one replicate plates and replated separately onto the marine agar. For DNA extraction, the PBVC088 isolate was cultured overnight at 37 °C in 5 ml of marine medium (Difco) in an incubator shaker. A total of 1 ml of bacterial cell culture was pelleted in a 1.5 ml centrifuge tube by centrifugation at 50 0 0 x g for 10 min. The genomic DNA of the pelleted bacterial cell culture was extracted using the DNeasy Blood and Tissue DNA Isolation Kit (Qiagen Biotechnology) following the manufacturer's instructions. The concentration and quality of the extracted DNA were assessed using the Qubit 2.0 fluorometer (Life Technologies Corporation) and Nanovue Plus Spectrophotometer (GE Healthcare), respectively, before proceeding to library preparation.

Genome sequencing, assembly and annotation
Next generation sequencing library preparations were constructed following instructions listed in Illumina Nextera XT Library Preparation Kit. The prepared library was then loaded onto an Illumina MiSeq system (Illumina, USA) according to the manufacturer's instructions. Sequencing was carried out using 250 paired-end configurations; base calling and image analysis were conducted by the MiSeq Control Software on the MiSeq instrument. The obtained raw sequences were filtered based on quality using Fastq Quality Filter in Fastx Toolkit [5] . The reads were then subjected to adapter sequence removal and low-quality region or reads trimming using Scythe v0.994 [6] and Trimmomatic v0.35 [7] for quality trimming of reads based on quality threshold of Q-score 25. The draft genome was de novo assembled using Iterative De Brujin Graph De Novo Assembler IDBA-UD software [8] . The assembler is suitable for short reads sequencing data with highly uneven sequencing depths. The assembled reads were annotated using rapid prokaryotic genome annotation (PROKKA) [9] and Rapid Annotation using Subsystem Technology (RAST) server version 2.0 [10] .

Analyses of genes putatively involved in saxitoxin biosynthesis
The protein sequences (faa . files) that were predicted from the nucleotide assemblies were subjected to sxt genes identification and analysis. Saxitoxin (STX)-related genes were identified through sequence similarity searches using BLASTP (expectation value, E-value < 1e -5 ) against the 26 putative sxt genes in the toxic cyanobacterium C. raciborskii T3 [1] as queries. Candidate genes resulted from the protein similarity search was then analyzed against the Conserved Domain Database (CDD) using CD-search web service [11] .