Insight into the genome data of commercially important giant kelp Macrocystis pyrifera

Kelps, or brown algae, are a diverse group of marine macroalgae that play an important role in aquatic ecosystems and generally have high commercial value. To facilitate brown algal studies, we report the draft genome sequence of the largest kelp, Macrocystis pyrifera. The genome is ∼428 Mb in size, comprises 44,307 scaffolds with an average GC content of 47%, and is predicted to contain 24,778 genes. An 18S sequence-based phylogenetic analysis indicated that the littoral brown seaweed Scytosiphon lomentaria is the closest relative of M. pyrifera among the species analyzed. Numerous genes identified in this dataset are involved in genetic information processing, signaling and cellular processes, carbohydrate metabolism, and terpenoid biosynthesis.


Keywords: Brown alga; Macrocystis pyrifera; Valuable bioproducts; Whole genome sequencing; Illumina; Nanopore

Specifications Table
Subject: Genomics
Specific subject area: Algal Genomics
Type of data: Tables, Figures, Charts
How the data were acquired: Illumina HiSeq 4000 (paired-end) and Nanopore GridION-X5
Data format: Raw, filtered, analyzed
Description of data collection: Genomic DNA was extracted and purified from apical frond tissue samples of Macrocystis pyrifera using the GeneJET Plant Genomic DNA Purification Kit (Thermo Scientific, USA) and sequenced on both Illumina HiSeq 4000 (paired-end) and Nanopore GridION platforms. The short-read (Illumina) and long-read (Nanopore) data from the two sequencing platforms were demultiplexed using bcl2fastq (Illumina) and guppy (Oxford Nanopore Technologies

Value of the Data
• This is a high-quality draft genome sequence report of the largest and commercially valuable brown alga, which will aid macroalgal genome research.
• The draft genome data facilitate the identification of several genes involved in the biosynthesis of industrially important complex algal cell wall polysaccharides, which will be useful to polymer biologists and researchers in the biochemical industries developing innovative products.
• The draft genome data facilitate the identification of a number of terpenoid biosynthetic genes, which will help in understanding terpenoid metabolism in macroalgae and accelerate marine bioprospecting.
• The draft genome data will boost macroalgal functional genomics studies.

Data Description
Kelps, or brown algae (Chromista, Phaeophyceae), are multicellular photosynthetic organisms that play crucial roles in the marine ecosystem. The giant kelp Macrocystis pyrifera is the largest and most widely distributed kelp species on the planet; it forms one of the richest ecosystems on earth [1] and is often cultured commercially for marine bioprospecting (the development of commercially valuable products from marine organisms) [2]. The economic importance of M. pyrifera is primarily due to its use in the industrial production of the high-value complex polysaccharides alginic acids/alginates and sulfated fucans/fucoidans, although other valuable carbohydrates such as mannitol and laminarin, as well as other biochemical compounds, are also extracted from this kelp species [3,4].

A total of 44.64 million paired-end reads were generated on the Illumina platform and 4 million reads on the Nanopore GridION. The scaffolded M. pyrifera genome was ∼428 Mb in size, with an N50 of ∼42.8 kb, and contained 44,307 scaffolds (Table 1). The GC content was calculated as ∼47%. A total of 24,778 genes and 20,026 annotated proteins were predicted. Genes with Pfam domains and genes with signal peptides numbered 9,705 and 521, respectively. The average lengths of the gene, coding sequence, and intron in the M. pyrifera genome were ∼5,949 bp, ∼978 bp, and ∼1,514 bp, respectively (Table 1), and the BUSCO assessment rated the genome as ∼50% complete (Table 1, Fig. 1a). The phylogenetic tree showed that the littoral brown seaweed Scytosiphon lomentaria is the closest relative of M. pyrifera among the species analyzed (Fig. 1b).

Protein-level comparative analysis using the OrthoVenn tool identified the paralogous and orthologous proteins shared among multiple related species (Fig. 2a). GO analysis of the annotated proteins from M. pyrifera showed that in the 'Molecular Function' (MF) category, the most represented GO term was ATP binding (17.67%), followed by metal ion binding (6.54%) and RNA binding (3.12%), while in the 'Biological Process' (BP) category, microtubule-based movement (1.69%) was the most represented term, followed by DNA repair (1.61%) and translation (1.08%). In the 'Cellular Component' (CC) category, most of the proteins were annotated as an integral component of the membrane (24.61%), followed by the nucleus (6.06%) and cytoplasm (5.0%) (Fig. 2b). Moreover, several important terpenoid biosynthetic genes were identified, including beta-ring hydroxylase, farnesyl-diphosphate farnesyltransferase, phytoene desaturase, transketolase, farnesyl diphosphate synthase, hydroxymethylglutaryl-CoA synthase, squalene monooxygenase, zeta-carotene desaturase, and geranyl diphosphate synthase. These could aid in understanding terpenoid metabolism in macroalgae and facilitate marine bioprospecting.
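The assembly statistics reported above (N50 and GC content) follow standard definitions, which can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names and the toy scaffolds are hypothetical, and the GC calculation assumes that N and other ambiguous bases are excluded from the denominator, which the source does not specify.

```python
from typing import List


def n50(lengths: List[int]) -> int:
    """Return the N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def gc_content(sequences: List[str]) -> float:
    """Return the GC percentage over A/C/G/T bases.

    Assumption: N and other ambiguity codes are ignored."""
    gc = at = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    return 100.0 * gc / (gc + at) if (gc + at) else 0.0


# Toy scaffolds (hypothetical data, not from the M. pyrifera assembly).
toy_scaffolds = ["ATGCGC", "GGCC", "ATAT", "NNGC"]
toy_n50 = n50([len(s) for s in toy_scaffolds])
toy_gc = gc_content(toy_scaffolds)
```

Applied to the scaffold lengths and sequences of a real assembly FASTA file, the same two functions would give values comparable to those in Table 1, subject to the stated assumption about ambiguous bases.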

Sample preparation and sequencing
Apical frond tissue samples of adult Macrocystis pyrifera were collected from the intertidal zone (water surface/0 m depth) at Punta San Juanito, Ica, Peru (Latitude 15°15′11.3″ S, Longitude 75°13′32.4″ W) during low tide, then washed with 100% ethanol and immediately frozen in liquid nitrogen until further analysis. Genomic DNA was extracted and purified from the frozen frond tissues using the GeneJET Plant Genomic DNA Purification Kit (Thermo Scientific, USA) following the manufacturer's protocol. A high-quality DNA sample was used for whole genome sequencing on both the Illumina HiSeq 4000 (Illumina) and GridION-X5 (Oxford Nanopore Technologies) platforms at Genotypic Technology Pvt Ltd (Bengaluru, India).

De novo genome assembly and assessment
The short-read (Illumina) and long-read (Nanopore) data were demultiplexed using bcl2fastq-v2.17.1.14 and Guppy-v2.3, respectively [5]. De novo hybrid assembly was performed with MaSuRCA-v3.3.4 [6] using both the Illumina and Nanopore reads. Scaffolding was done with pyScaf-v0.12a (https://bioinformaticsonline.com/bookmarks/view/30236/pyscaf) to improve the quality of the assembled genome. The repeat regions found in the draft scaffolded genome were then soft-masked using RepeatMasker-v4.0.6 (https://www.repeatmasker.org/) and RepeatModeler-v2.0 [7]. The improved assembly was used to predict genes and protein sequences with the BRAKER gene prediction tool [8]. Along with the scaffolded genome, reference protein data from Ectocarpus siliculosus and transcriptome data (NCBI SRAs: SRR3544557 and SRR3615022) from M. integrifolia as well as M. pyrifera [9] were used to support gene and protein prediction. The completeness of the scaffolded genome, predicted genes, and transcriptomes was further assessed using BUSCO-v3.0.2 [10]. The phylogenetic position of M. pyrifera was determined with MEGA-X-v10.0.5 [11] using the maximum likelihood method (Jukes-Cantor model). Because whole genome sequences of brown algal species are limited in the NCBI database, 18S rDNA sequences of M. pyrifera and other brown algae (Coccophora langsdorfii, Sargassum horneri, Dictyota dichotoma, Desmarestia viridis, Ectocarpus siliculosus, Colpomenia peregrina, Undaria pinnatifida, and Scytosiphon lomentaria) were selected to construct the phylogenetic tree. Protein-level comparative analysis was carried out between the predicted proteins from the sample and reference protein sequences from multiple related species, namely Ectocarpus siliculosus, Nemacystus decipiens, and Cladosiphon okamuranus.
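The Jukes-Cantor model named above assumes equal base frequencies and equal substitution rates. As an illustration of what the model implies, the following minimal Python sketch computes the Jukes-Cantor corrected distance d = -(3/4) ln(1 - 4p/3) between two aligned sequences, where p is the observed proportion of differing sites. It is a sketch of the pairwise distance only, not of the maximum likelihood tree search performed in MEGA; the function names and the toy alignment are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap or a non-ACGT symbol in either sequence are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    valid = set("ACGT")
    compared = diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            diffs += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def jukes_cantor_distance(p: float) -> float:
    """JC69 corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: JC69 distance is undefined (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned fragments (hypothetical, not real 18S rDNA sequences).
toy_p = p_distance("ACGTACGTAC", "ACGTACGTAT")
toy_d = jukes_cantor_distance(toy_p)
```

The corrected distance is always at least as large as the observed proportion p, because the correction accounts for multiple substitutions at the same site.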
Finally, Gene Ontology (GO) annotation of the predicted proteins was performed using the BlastP program of DIAMOND-v0.8.29 [12] against the UniProt brown algal protein database.
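How the BlastP hits were turned into GO term percentages is not described in the source. One plausible approach, sketched below in Python, is to keep the best-scoring hit per predicted protein from DIAMOND's tabular output (the 12-column BLAST tabular layout, with the bit score in the last column) and then tally GO terms through a lookup from UniProt accession to GO terms. The lookup `subject_to_go`, the function names, and the toy records are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Map each query to its (subject, bitscore) hit with the highest bit score.

    Expects tab-separated BLAST tabular records with 12 columns."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def go_term_percentages(
    best: Dict[str, Tuple[str, float]],
    subject_to_go: Dict[str, List[str]],
) -> Dict[str, float]:
    """Percentage of annotated proteins carrying each GO term."""
    counts: Counter = Counter()
    annotated = 0
    for subject, _ in best.values():
        terms = subject_to_go.get(subject)
        if not terms:
            continue
        annotated += 1
        counts.update(set(terms))  # count each term once per protein
    if annotated == 0:
        return {}
    return {term: 100.0 * n / annotated for term, n in counts.items()}


# Toy records and lookup (hypothetical identifiers, not real results).
toy_lines = [
    "g1.t1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0",
    "g1.t1\tP2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-40\t150.0",
    "g2.t1\tP2\t70.0\t90\t27\t0\t1\t90\t1\t90\t1e-30\t120.0",
]
toy_go = {"P1": ["ATP binding"], "P2": ["ATP binding", "RNA binding"]}
toy_best = best_hits(toy_lines)
toy_percentages = go_term_percentages(toy_best, toy_go)
```

Using the number of annotated proteins as the denominator is an assumption; a different denominator, such as all predicted proteins or all GO assignments within a category, would change the percentages.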

Ethics Statement
Not applicable.

Declaration of Competing Interest
The authors declare that there are no competing interests.