Dataset of de novo assembly and functional annotation of the transcriptomes of three native oleaginous microalgae from the Peruvian Amazon

Microalgae are photosynthetic organisms with cosmopolitan distribution (i.e., marine, freshwater and terrestrial habitats) and possess a great diversity of species [1] and consequently an immense variation in biochemical compositions [2]. To date, genomic information is available mainly from the model green microalga Chlamydomonas reinhardtii [3]. Here we provide the dataset of a de novo assembly and functional annotation of the transcriptomes of three native oleaginous microalgae from the Peruvian Amazon. Native oleaginous microalgae species Ankistrodesmus sp., Chlorella sp., and Scenedesmus sp. were cultured in triplicate using Chu-10 medium with or without a source of nitrate (NaNO3). Total RNA was purified, and the cDNA libraries were constructed and sequenced as paired-end reads on an Illumina HiSeq™ 2500 platform. Transcriptomes were de novo assembled using Trinity v2.9.1. A total of 48,554 transcripts (range from 250 to 7966 bp; N50 = 1047) for Ankistrodesmus sp., 108,126 transcripts (range from 250 to 8160 bp; N50 = 1090) for Chlorella sp., and 77,689 transcripts (range from 250 to 8481 bp; N50 = 1281) for Scenedesmus sp. were de novo assembled. Completeness of the assembled transcriptomes was evaluated with the Benchmarking Universal Single-Copy Orthologs (BUSCO) software v2/v3. Functional annotation of the assembled transcriptomes was conducted with TransDecoder v3.0.1 and the web-based platforms Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) and FunctionAnnotator.
The raw reads were deposited into NCBI and are accessible via BioProject accession number PRJNA628966 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA628966) and Sequence Read Archive (SRA) with accession numbers: SRX8295665 (https://www.ncbi.nlm.nih.gov/sra/SRX8295665), SRX8295666 (https://www.ncbi.nlm.nih.gov/sra/SRX8295666), SRX8295667 (https://www.ncbi.nlm.nih.gov/sra/SRX8295667), SRX8295668 (https://www.ncbi.nlm.nih.gov/sra/SRX8295668), SRX8295669 (https://www.ncbi.nlm.nih.gov/sra/SRX8295669), and SRX8295670 (https://www.ncbi.nlm.nih.gov/sra/SRX8295670). Additionally, transcriptome shotgun assembly sequences and functional annotations are available via Discover Mendeley Data (https://data.mendeley.com/datasets/47wdjmw9xr/1).



Value of the data
• These are the first datasets of the de novo assembly and functional annotation of the transcriptomes of three native oleaginous microalgae from the Peruvian Amazon.
• These data provide valuable information on identified genes encoding enzymes involved in the biosynthesis of lipids and carbohydrates appropriate for biofuel production by the native oleaginous microalgae from the Peruvian Amazon.
• The transcriptome datasets can be used to elucidate anabolic pathways involved in the production of human essential nutrients and the great diversity of health-promoting compounds by the native oleaginous microalgae from the Peruvian Amazon.

Data description
Included in this dataset are the de novo assembly and functional annotation of the transcriptomes of the native oleaginous microalgae Ankistrodesmus sp., Chlorella sp., and Scenedesmus sp. that were previously cultured in triplicate using Chu-10 medium with or without a source of nitrate (NaNO3). Total RNA was purified from each microalgae species in both culture conditions, then pooled in equimolar ratios to construct the six cDNA libraries, which were paired-end sequenced on an Illumina HiSeq™ 2500 platform. De novo transcriptome assembly was conducted using [...] 2) with an average number of orthologs per core gene of 1.49, 1.77 and 1.67 for Ankistrodesmus sp., Chlorella sp., and Scenedesmus sp., respectively. The de novo assembled transcriptomes were functionally annotated with TransDecoder v3.0.1 and the web-based platforms Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) and FunctionAnnotator. TransDecoder predicted open reading frames in four categories (Fig. 3): complete (from 10 to 21%), internal (from 42 to 53%), 5 prime partial (from 28 to 31%), and 3 prime partial (from 7 to 9%). Also, from 17,138 to 39,868 assembled transcripts were assigned gene ontology terms (Table S1). KAAS assigned KEGG Orthology (KO) IDs to 11,347 transcripts (2626 unique KO) for Ankistrodesmus sp., to 25,128 transcripts (3334 unique KO) for Chlorella sp., and to 17,627 transcripts (2942 unique KO) for Scenedesmus sp. BRITE hierarchies (KEGG modules, KEGG orthology, and KEGG reaction modules) and 135 metabolic pathway maps were generated with 2501, 2929, and 2702 enzymes/proteins mapped for Ankistrodesmus sp., Chlorella sp., and Scenedesmus sp., respectively (Table S1).
Finally, from 30,412 to 62,012 best hits against the NCBI non-redundant protein database were obtained, from 917 to 2205 enzymes were identified, and from 13,337 to 30,633 transcripts encoding at least one protein domain were identified (Table S2).
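The four open reading frame categories reported above differ only in whether a start codon and a stop codon are present at the ends of the predicted coding sequence. The following Python sketch illustrates that definition and how category percentages can be tallied. It is not TransDecoder's own prediction logic (which also scores candidate ORFs); the function names `classify_orf` and `category_percentages` are hypothetical, and the standard genetic code (ATG start; TAA, TAG, TGA stops) is assumed.

```python
from collections import Counter

# Assumption: standard genetic code, in-frame coding sequence on the sense strand.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(cds):
    """Assign an in-frame coding sequence to one of the four ORF categories.

    complete        start codon and stop codon both present
    5prime_partial  start codon missing (5' end incomplete), stop codon present
    3prime_partial  start codon present, stop codon missing (3' end incomplete)
    internal        both start and stop codons missing
    """
    cds = cds.upper()
    if len(cds) < 3 or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a positive multiple of 3")
    has_start = cds[:3] == START_CODON
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "3prime_partial"
    if has_stop:
        return "5prime_partial"
    return "internal"


def category_percentages(sequences):
    """Return the percentage of sequences falling in each ORF category."""
    counts = Counter(classify_orf(seq) for seq in sequences)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}
```

Applied to the predicted coding sequences of one transcriptome, `category_percentages` would yield a breakdown of the kind summarized in Fig. 3.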

Total RNA isolation, library preparation and next-generation DNA sequencing
Total RNA was isolated following the manufacturer's instructions using the RNeasy Plant Mini Kit (Qiagen, Hilden, Germany). The quantity and quality of the total RNA were evaluated spectrophotometrically using a NanoDrop 2000 Spectrophotometer, and RNA integrity was assessed using a 2100 Bioanalyzer (Agilent, CA, USA). Total RNA obtained from each microalgae species and every culture condition was pooled in equimolar ratios to construct the six cDNA libraries. The cDNA libraries were constructed following the manufacturer's instructions for the TruSeq Stranded mRNA Sample Preparation Kit (Illumina, San Diego, USA), quantified with the Qubit™ dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, USA) and paired-end sequenced (2 × 150 bp) on an Illumina HiSeq™ 2500 platform.

De novo assembly and functional annotation
Raw paired-end sequences were uploaded as FASTQ files to the Galaxy (https://usegalaxy.org/) bioinformatic platform. On this platform, the quality of the FASTQ files was evaluated with FastQC [4], and adaptor sequences, low-quality bases (≤ Q20) and short sequences (≤ 50 bp in length) were removed with Trimmomatic (Galaxy version 0.38.0) [5]. High-quality sequence reads were de novo assembled using Trinity (Galaxy version 2.9.1) [6] with default parameters and a minimum contig length of 250 bp. Completeness of the assembled transcripts was evaluated using the Benchmarking Universal Single-Copy Orthologs (BUSCO) software v2/v3 [7] as implemented in the web-based server gVolante [8].
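Assembly summary statistics such as the transcript count, length range and N50 reported for these transcriptomes can be derived directly from the contig lengths in an assembled FASTA file. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used by the authors: the helper names (`read_fasta_lengths`, `n50`, `summarize`) are hypothetical, and the 250 bp default simply mirrors the minimum contig length given above.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def summarize(lengths, min_len=250):
    """Summarize contigs of at least min_len bp (250 bp assumed, as in the text)."""
    kept = [n for n in lengths if n >= min_len]
    return {
        "transcripts": len(kept),
        "min": min(kept),
        "max": max(kept),
        "n50": n50(kept),
    }
```

For an assembly file, the lengths could be obtained with `read_fasta_lengths(open(path))`, where `path` is a placeholder for the location of the Trinity FASTA output, and then passed to `summarize`.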
Functional annotations of the de novo assembled transcriptomes were conducted with the following bioinformatic tools: 1) TransDecoder (Galaxy version 3.0.1) [9] to predict Open Reading Frames (ORFs) and to obtain protein sequences of at least 100 amino acids in length; 2) Kyoto