Draft Nuclear Genome Sequence of the Halophilic and Beta-Carotene-Accumulating Green Alga Dunaliella salina Strain CCAP19/18

ABSTRACT The halotolerant alga Dunaliella salina is a model for stress tolerance and is used commercially for production of beta-carotene (=pro-vitamin A). The presented draft genome of the genuine strain CCAP19/18 will allow investigations into metabolic processes involved in regulation of stress responses, including carotenogenesis and adaptations to life in high-salinity environments.

The halotolerant green alga Dunaliella salina Teodoresco is the type species of the Dunaliella genus (class Chlorophyceae) (1, 2). The biflagellate cells grow best at salinities of about 1.5 M (3). Often, new isolates of halotolerant algae of this general appearance are incorrectly identified as D. salina. Strain CCAP 19/18 is a genuine representative of the species D. salina (4), making it a reference for all ongoing and future studies on this and other Dunaliella species.
Due to the absence of a rigid cell wall and the presence of a glycocalyx (5), D. salina cells can adjust rapidly to large changes in salinity. Thus, D. salina is a model for studies of salinity tolerance (6–11). Further, under abiotic stress, cells turn orange due to overaccumulation of beta-carotene in plastidic oil globules (12). Beta-carotene may accumulate to over 8% of the cellular dry weight, making the alga an important crop for the health food market for production of natural beta-carotene (13–17). With the recent surge in Dunaliella "omics" research and currently 19 BioProjects deposited in NCBI (NIH), we offer this early-release draft genome sequence of strain CCAP19/18 to support and encourage ongoing and future research.
Following genomic DNA extraction from cells, libraries were generated for sequencing using either the Illumina or the PacBio platforms, overall producing 554.53 Gbp (after quality filtering). The tight-insert protocol was used for library construction for the Illumina HiSeq 2500 paired-end (PE) runs, which produced 219.9 Gbp (after quality filtering). For Illumina HiSeq 2000 sequencing preparation, the ligation-free paired end (LFPE) construction protocol was used for four libraries, the Illumina standard PE unamplified protocol was used for three libraries, and the Illumina CLIP PE protocol was used for one library, in total providing 341.6 Gbp (after quality filtering). Sequences generated on the Illumina platform provided 62× coverage and those from paired-end reads 27× (from 2-kb, 4-kb, and 6-kb libraries). Additionally, 16 PacBio single-molecule real-time (SMRT) libraries were constructed and sequenced on the PacBio RS platform, providing, after quality filtering, a total of 3.162 Gbp and resulting in about 10× coverage. The chloroplast and mitochondrial sequences (18) were removed prior to assembly.
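As a rough illustration of how a coverage figure relates to sequencing yield, depth can be approximated as sequenced bases divided by genome size. The minimal Python sketch below applies this to the PacBio yield, assuming that the 343.7 Mb main assembly size reported in this announcement is a usable proxy for genome size. That assumption, and the function name `estimated_coverage`, are illustrative and do not describe the authors' stated method.

```python
def estimated_coverage(total_bases: float, genome_size: float) -> float:
    """Approximate sequencing depth as sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Values taken from the text; using the assembly size as the genome size
# is an assumption made only for this illustration.
pacbio_bases = 3.162e9   # 3.162 Gbp of quality-filtered PacBio data
assembly_size = 343.7e6  # 343.7 Mb main genome assembly

print(f"PacBio depth: {estimated_coverage(pacbio_bases, assembly_size):.1f}x")
```

Under this assumption the estimate is roughly 9×, in line with the "about 10×" reported for the PacBio data.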
The genome was assembled with ALLPATHS-LG (19). The resulting assembly was scaffolded using SSPACE (20) with three fosmid libraries (total of 0.50× read coverage). The final assembly was screened for contaminants against the NCBI nonredundant database. The D. salina v1.0 draft genome is provided through the Phytozome portal from the U.S. Department of Energy (DOE) Joint Genome Institute (JGI) at http://phytozome.jgi.doe.gov/. The main genome assembly provides approximately 343.7 Mb arranged in 5,512 scaffolds (N50 = 353.0 kbp) and a 49.1% G+C content. The contig N50 (L50) = 10,635 (7.8 kb), and 1,241 scaffolds are >50 kb in size, representing approximately 92.0% of the genome. With a conservative annotation approach using a repeat-masked assembly, the presented draft genome contains 16,697 loci providing 18,801 protein-coding transcripts. A less conservative annotation approach without masking repeats revealed 36,851 protein-coding genes. The assembly and annotation are expected to provide the foundation for future studies on halotolerance, secondary carotenoid accumulation, and algal stress biology.
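For readers less familiar with the contiguity statistics quoted above, the following Python sketch shows one common way to compute them from a list of sequence lengths. It uses the widespread naming in which N50 is a length and L50 is a count. The contig values in the text appear to pair the labels the other way around (a count followed by a length), so the two outputs should be matched to the text by unit rather than by label. The function name `n50_stats` and the toy lengths are hypothetical and are not data from this assembly.

```python
from typing import List, Tuple


def n50_stats(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50 length, L50 count) for a list of sequence lengths.

    N50 here is the length of the shortest sequence in the smallest set
    of longest sequences that together cover at least half of the total
    assembly length; L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable for non-empty input")


# Toy example with made-up lengths (not from the D. salina assembly).
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_stats(toy_lengths)
print(n50, l50)
```

In the toy example the two longest sequences (100 and 80) together reach at least half of the total length of 300, so the function returns a length of 80 and a count of 2.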
Accession number(s). The whole-genome sequencing project was deposited in GenBank under BioProject number PRJNA32771. The project BioSample number is SAMN02746051 (sample name 1014865, SRA number SRS1543949). A trace file archive is available under NCBI project ID 32771. The draft genome sequence is accessible under GenBank accession number NSFN00000000. This paper describes the first version, NSFN01000000.

ACKNOWLEDGMENTS
The