A draft genome assembly of reef-building octocoral Heliopora coerulea

Coral reefs are under existential threat from climate change and anthropogenic impacts. Genomic studies have enhanced our knowledge of resilience and responses of some coral species to environmental stress, but reference genomes are lacking for many coral species. The blue coral Heliopora is the only reef-building octocoral genus and exhibits optimal growth at a temperature close to the bleaching threshold of scleractinian corals. Local and high-latitude expansions of Heliopora coerulea were reported in the last decade, but little is known about the molecular mechanisms underlying its thermal resistance. We generated a draft genome of H. coerulea with an assembled size of 429.9 Mb, scaffold N50 of 1.42 Mb and BUSCO completeness of 94.9%. The genome contains 239.1 Mb repetitive sequences, 27,108 protein coding genes, 6,225 lncRNAs, and 79 miRNAs. This reference genome provides a valuable resource for in-depth studies on the adaptive mechanisms of corals under climate change and the evolution of skeleton in cnidarian.

www.nature.com/scientificdata www.nature.com/scientificdata/ Methods Sample collection. The blue coral was collected by SCUBA at 5 m depth from Green Island, Taiwan (22°40′37′′N 121°28′23′′E) in April 2018. Coral fragments were transported in seawater to Biodiversity Research Center, Academia Sinica, Taipei, where they were kept in a 5 L aerated aquarium. To avoid contamination by bacteria or algae in the water, the coral fragments were rinsed several times in Milli-Q water immediately prior  www.nature.com/scientificdata www.nature.com/scientificdata/ to DNA and RNA sampling. Coral fragments were immediately fixed in liquid nitrogen for DNA extraction and genome sequencing, whilst tissues were fixed in RNAlater (Invitrogen, CA, USA) for RNA sequencing. All samples were stored at −80 °C in a freezer until subjected to extraction. Genomic sequencing. Genomic DNA was extracted from the coral tissue using the CTAB method 25 .
DNA quality and quantity was measured using agarose gel electrophoresis and a Qubit fluorometer (Thermo Fisher Scientific, MA, USA), respectively. DNA samples were submitted to Novogene (Beijing, China) for library preparation and whole genome sequencing (Table 1). Briefly, 1 µg DNA was used to construct two libraries with 350-bp and 500-bp insert sizes using the NEBNext DNA Library Prep Kit (New England Biolabs, MA, USA), and sequenced on an Illumina HiSeq X Ten sequencer to generate 122.4 Gb paired-end reads with a read length of 150 bp. In addition, 10 µg DNA was used to construct a HiFi SMRTbell library using the SMRTbell Express Template Prep Kit 2.0, and sequenced on a PacBio Sequel II sequencer. Total of 31.8 Gb high-quality HiFi reads were produced using the circular consensus sequencing (CCS) mode on the PacBio long-read platform. RNA sequencing. Total RNA was extracted from the coral tissue using TRIzol reagent (Thermo Fisher Scientific, MA, USA) by following the manufacturer's protocol. The quality of the RNA samples was determined with agarose gel electrophoresis and the quantity was determined using a Qubit fluorometer (Thermo Fisher Scientific, MA, USA). RNA samples were submitted to Novogene (Beijing, China) for mRNA, long non-coding RNA (lncRNA), and microRNA (miRNA) sequencing (Table 1). mRNA library was constructed using Illumina NEBNext Ultra RNA Library Prep Kit (New England Biolabs, MA, USA) and sequenced using an Illumina HiSeq X Ten sequencer to produce 150-bp paired-end reads. For lncRNA, ribosomal RNA was depleted from total RNA using Epicentre Ribo-Zero rRNA Removal Kit (Epicentre, WI, USA). The cDNA libraries were prepared using the NEBNext Ultra RNA Library Prep Kit (New England Biolabs, MA, USA), and sequenced on an Illumina NovaSeq platform under the paired-end mode to produce 150-bp reads. In addition, miRNA libraries were prepared using the NEBNext Multiplex Small RNA Library Prep Kit (Illumina, CA, USA) and sequenced on an Illumina NovaSeq platform to produce 50-bp single-end reads.
Genome assembly. De novo assembly of HiFi reads (N50 of 14.0 kb and mean length of 13.5 kb; Table 1) were performed using nextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo) under default settings. Algal and microbial sequences were removed by binning genome assembly with MetaBAT2 v2.15 29 , and BLASTn v2.11.0 + search against the 14 cnidarian genomes in Table 4, four Symbiodiniaceae genomes from ReefGenomics database (http://reefgenomics.org/), and NCBI Prokaryotic Refseq genomes with an E-value threshold of 1e-20. The initial assembly generated 1,309.7 Mb metagenome sequences (Table 2). After binning, a total of 170 bins were identified and the "Bin167" with 600.2 Mb and >100X coverage of Illumina data was selected (Table 2 and S1). BLASTn analysis filtered the potential symbiont sequence and resulted in the 586.0 Mb genome with 2,248 contigs. Possible alternative heterozygous contigs were further eliminated using Purge Haplotigs v1.1.230 30 ( Table 2). The completeness of the final genome assembly was assessed by analyzing the Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.5 scores against the databases eukaryota_odb10 and eukaryota_odb10 under the genome mode 31 . QUAST v5.2 was used to assess the assembly statistics 32 . The total assembled size of the genome is 429.9 Mb in length and the N50 is 1.42 Mb (Table 3; Fig. 2).
Raw mRNA reads were trimmed using Trimmomatic v0.38 27 (quality score <30, length <40 bp). The clean reads were de novo and genome-guided assembled using Trinity v2.5.1 39 under the default settings. Cnidaria protein sequences from UniProt database were used as protein evidence. Augustus v3.4 40 and SNAP v2006-07-28 41 were used for ab initio gene prediction. All predicted gene models were integrated into a consensus weighted annotation with EVidenceModeler v1. 1 43 . Finally, we obtained 27,108 predicted protein-coding genes with an N50 of 1,754 bp ( Table 3).
The BUSCO completeness of predicted gene models was assessed against eukaryota_odb10 and metazoa_ odb10 datasets 31 under the protein mode. The predicted genes were functionally annotated using Diamond v2.0.13.151 BLASTp 44 against UniProt and Swissport databases under the "ultra-sensitive" option and an E-value threshold of 1e-5. Gene functional annotation was conducted using eggNOG-mapper v2 45 for Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Pfam domain. lncRNA annotation. The raw lncRNA reads were filtered to remove adapter and low-quality reads (quality score <30, length <40 bp) using Trimmomatic v0.38 27 . The clean lncRNA reads were mapped to the H. coerulea genome using HISAT2 v2.1.0 46 under the default settings. The resulting bam files were then assembled into transcript models using StringTie v1.3.4d 47 under the default settings. The assembled transcripts were processed through FlExible Extraction of LncRNAs (FEELnc) v0.2.1 48 for lncRNA identification and classification. Briefly, the script FEELnc filter.pl was used to remove transcripts with one exon, a size < 200 bp, and overlapping with predicted protein-coding regions. The coding potential score of each candidate transcript was calculated using the script FELLnc_codpot.pl under the shuffle mode. Finally, the FEELnc_classifier.pl was used to classify potential lncRNA with respect to the localization and the direction of transcription of nearby protein-coding genes. A total of 6,225 lncRNA genes were predicted in the H. coerulea genome (Tables S2, S3). miRNA annotation. miRNA analysis was conducted according to Ip et al. 36 . Briefly, raw miRNA reads were trimmed with fastp v0.20.0 49 under the settings of length_required = 18, max_length = 35, unqualified_percent_ limit = 30, n_base_limit = 0. The clean reads were then combined and mapped to the genome using the mapper. pl script in miRDeep2 v2.0.1.2 50 using bowtie v1.2.2 51 . miRNAs were predicted using the miRDeep2.pl script in miRDeep2 with the Cnidaria mature miRNAs from miRBase v22.1 52 . The predicted miRNAs were filtered with a miRDeep2 score ≥ 4, star (complementary) and mature read count ≥ 5, and a significant Randfold p-value. The target genes of miRNAs were predicted using miRanda v3.3a 53 with a miRanda score ≥ 140, a dimer binding free energy < −5 kcalmol −1 , and strict 5′ seed pairing. In total, we detected 79 miRNA candidates ranging from 20 to 24 nt in length, and 10,636 mRNAs were predicted as their potential targets (Tables S4, S5).  anthozoans with the outgroup species Hydra vulgaris (details in Table 4 and Table S6) were identified using OrthoFinder v2.5.4 under the "diamond_ultra_sens" option 54 . A total of 407 single-copy genes were aligned using MUSCLE v3.8.31 55 and trimmed using TrimAL v1.4 56 . The aligned sequences with 91,426 amino acid positions and 1.1-13.9% gaps were concatenated for phylogenetic analysis using a maximum-likelihood method implemented in IQ-TREE v2.13 57 , with the best model of Q.insect + F + I + G4 and 1000 bootstrapping replicates. MCMCtree implemented in PAML v4.9h 58 was used to estimate divergence times using the burn-in, sample frequency and number of samples of 10000000, 1000 and 10000, respectively. The node calibration among cnidarians was based on fossil records (i.e., ~55 MYA for Acropora 59 , ~145 MYA for Helioporacea 18 , ~540 MYA for Hexacorallia 60 ) and TIMETREE database 61 (i.e., Edwardsiidae for 280 -490 MYA, Anthozoa for 520 -740 MYA). Using the orthologous results, we performed the gene family expansion and contraction for each node using CAFÉ v4.2 62 . These analyses revealed that H. coerulea is sister to the soft coral Dendronephthya gigantea, which split during Triassic (~216 MYA, 95% confidence interval of 157-301 MYA; Fig. 4). This D. gigantea + H. coerulea clade is then sister to the Hexacorallia clade, consistent with a previous phylogenetic analysis of 234 anthozoans 63 . Gene family analysis detected 167 expanded and 61 contracted gene families in H. coerulea ( Fig. 4; Table S7).

technical Validation
The quality of H. coerulea genome assembly was assessed by several approaches: (i) comparison with the estimated genome size, which is also ~430 Mb in total length (Figs. 1b, 2); (ii) obtaining the complete mitogenome, which is 100% identical in size and gene order with a published mitogenome of the same species (GenBank: OL616236; Fig. 3); (iii) conducting QUAST analysis, which showed that the assembly statistics of H. coerulea is comparable with published cnidarian genomes (Table 4); (iv) conducting BUSCO analysis, which identified 98.4% eukaryotic BUSCOs and 94.4% metazoan BUSCOs in the H. coerulea genome, and 98.4% eukaryotic BUSCOs and 95.3% metazoan BUSCOs in its predicted gene models (Table 4); (v) conducting the analysis of genome coverage using SAMtools v1.15.1 75 , which showed 100% genome coverage and 91.4% mapping rate of PacBio HiFi reads, and 94.8% genome coverage and 88.4% mapping rate of Illumina short reads (Table 3). These results indicated the H. coerulea assembly is of high-quality.

Code availability
All bioinformatic tools used in this study were executed according to the corresponding manual and protocols. The version and code and parameters of the main bioinformatic tools are described below.     Table S7).