Nanopore Sequencing and De Novo Assembly of a Black-Shelled Pacific Oyster (Crassostrea gigas) Genome

The Pacific oyster, Crassostrea gigas, belongs to one of the most species-rich phyla and provides important ecological and economical services. Here we present a genome assembly for a variety of this species, black-shelled Pacific oyster, using a combination of 61.8 Gb Nanopore long reads and 105.6 Gb raw BGI-seq short reads. The genome assembly comprised 3,676 contigs, with a total length of 587 Mb and a contig N50 of 581 kb. Annotation of the genome assembly identified 283 Mb (48.32%) of repetitive sequences and a total of 26,811 protein-coding genes. A long-term transposable element active, accompanied by recent expansion (1 million years ago), was detected in this genome. The divergence between black-shelled and the previous published Pacific oysters was estimated at about 2.2 million years ago, which implies that species C. gigas had great intraspecific genetic variations. Moreover, we identified 148/188 specifically expanded/contracted gene families in this genome. We believe this genome assembly will be a valuable resource for understanding the genetic breeding, conservation, and evolution of oysters and bivalves.


INTRODUCTION
Fossil records show that oysters appeared about 200 million years ago (Upper Triassic; Stott, 2004). From then on, they began their work of filtering the oceans. Because of their "advanced" defense mechanism (thick shells), which confers strong resistance to desiccation and sunlight exposure, they underwent a large increase during the following 70 million years (from the Jurassic to the Cretaceous). By about 135 million years ago, oysters were the predominant shellfish in the world's ocean (Stott, 2004). Following settlement and metamorphosis, oysters attach their left shells to other objects and begin a sessile lifestyle (Zhang et al., 2012). Individuals of all ages cluster together and eventually form reefs, which harbor many kinds of marine organisms. Thus, the oyster is of great significance to marine ecology (Beck et al., 2011).
Because oysters are sessile animals living in estuarine and intertidal regions, they have to cope with drastically changing environments. Hence, they are ideal models for investigating adaptations of living organisms to climate change (Li et al., 2018). In addition, oysters are frequently used as models in studies on neurobiology, biomineralization, ocean acidification, immunity, and developmental biology (Caldeira and Wickett, 2003).
The Pacific oyster, Crassostrea gigas (Bivalvia, Osteroida, Ostreidae), is economically important in aquaculture around the world (Zhang et al., 2016). Several famous types of oysters, such as the Gillardeau oyster from France and the Rock Oyster from Australia, are Pacific oysters. Because they are of great value to the economy, scientific research, and ecology, Pacific oysters have received extensive attention (Wrange et al., 2010;Du et al., 2017;Sun et al., 2017;De Lorgeril et al., 2018;Powell et al., 2018). Moreover, a number of varieties of this species have been bred, such as the black-shelled Pacific oyster (BPO; Hao et al., 2015). Prior to this work, a Pacific oyster genome sequence had been reported in 2012 (Zhang et al., 2012). However, the extensive intraspecific variations that characterize the Pacific oyster (Li et al., 2018) mean that the previous genome has been of limited assistance to oyster research and the oyster industry. Hence, there is a pressing need for high-quality genome maps of multiple varieties.
Here, we present a high-quality genome assembly for the BPO bred by our lab, constructed using both Nanopore long reads and BGI-seq short reads. Some aspects of this assembly are superior to those of the earlier one mainly due to much advanced technologies. We think that this well-annotated genome assembly and the massive amount of sequencing data generated in this study will serve as substantial resources for future oyster academic research and industrial development.

Sample Collection, library Construction, and Sequencing
An individual of the BPO (C. gigas, NCBI taxonomy ID: 29159; Figure 1), collected from Changdao County, Yantai City, Shandong Province, China, was used for whole-genome sequencing. Genomic DNA was isolated from the mantles using a Qiagen Blood & Cell Culture DNA Mini Kit according to the manufacturer's instructions. A BGI-seq library was constructed using a MGIEasy Library Prep Kit V1.1 (MGI Tech), and paired-end 150 bp single-indexed sequencing was performed on the MGISEQ-2000 platform (BGI, Shenzhen, China;Winnepenninckx et al., 1993). Nanopore libraries were prepared and sequenced on two flow-cells using a GridION DNA sequencer according to the manufacturer's instructions (Oxford Nanopore, Oxford, UK; Jain et al., 2016;Jain et al., 2018;Gong et al., 2019).

Data Filtering and Genome Size estimation
Both Nanopore and BGI-seq reads were used to achieve a highquality genome assembly. Before assembly, these two kinds of reads were filtered as follows: For the Nanopore data, reads with mean quality scores >7 were retained and further corrected with NextDenovo (https://github.com/Nextomics/NextDenovo). For the BGI-seq data, adaptor sequences and low-quality reads were filtered out using the program fastp (v 0.20; Chen et al., 2018). Any reads with more than 30 low-quality bases or 5% unknown bases were abandoned. These reads were used for further assembly and subsequent analyses.
The genome size (G) of this Pacific oyster was estimated from k-mer (k = 23 in this case) frequency distribution analysis using the clean BGI-seq data. The term k-mer is used to refer to each of the possible successive subsequences of length k in a read. If the length of each k-mer is 23 bp, a filtered read that is L bp in length contains (L-23+1) k-mers. The genome size is estimated by the formula: G = K_num/K_depth, where K_num and K_depth are the total number and the peak frequency of 23-mers respectively (Koren et al., 2012;Ross et al., 2013;Wences and Schatz, 2015;Vurture et al., 2017).

Genome Assembly
The corrected Nanopore reads were assembled into contigs by wtdbg (v 2.2; Ruan and Li, 2019) with the parameter "-x FIGURe 1 | A photograph of black-shelled and white-shelled Pacific oysters, Crassostrea gigas. B-L denotes the left shell of a black-shelled oyster; B-R denotes the right shell of a black-shelled oyster; W-L denotes the left shell of a white-shelled oyster; W-R denotes the right shell of a white-shelled oyster.
corrected". After that, the filtered BGI-seq reads were mapped to the contigs using Burrows-Wheeler Aligner (BWA, RRID: SCR_010910; Li and Durbin, 2010) and then subjected to two rounds of polishing with Pilon (v 1.21; Walker et al., 2014). Finally, we employed Purge Haplotigs (v 1.0.4; Roach et al., 2018) to resolve redundancy in the assembly based on Nanopore reads depth information.
The accuracy of the genome assembly was evaluated by mapping the clean BGI-seq short reads against the genome using BWA (Li and Durbin, 2010). Furthermore, we employed Benchmarking Universal Single-Copy Orthologs (BUSCO, v 3.0.2; Simao et al., 2015), a software package that can quantitatively measure completeness of genome assembly based on evolutionarily informed expectations of gene content, to evaluate the completeness of the genome assembly, using 978 genes that are expected to be present in all metazoans (Cai et al., 2019).

Gene Prediction and Annotation
We combined de novo and homology-based predictions to identify protein-coding genes. For homology-based annotation, three genomes of Pteriomorphia species were downloaded from NCBI: C. gigas (GCA_000297895.1), Crassostrea virginica (GCA_002022765.4), and Mizuhopecten yessoensis (GCA_002113885.2). We picked the longest transcript of each gene and removed those with premature terminations. The remaining genes were translated into proteins and aligned to our genome with TBLASTN (v 2.7.1; Altschul et al., 1990) to search for the approximate positions of potential gene homologs. Next, we used Exonerate (v 2.2; Slater and Birney, 2005) to obtain an accurate gene structure for each locus. For de novo annotation, we first picked 898 complete genes in order to obtain parameters suitable for C. gigas. Then we performed de novo prediction on the repeat-masked genome using AUGUSTUS (v 3.2.1; Stanke et al., 2008) with the gene parameters so obtained. Finally, we used EVidenceModeler (v 1.1.1; Haas et al., 2008) to integrate homologs and de novo predicted genes and generate a comprehensive, non-redundant gene set. The completeness of the genome annotation was investigated using BUSCO (v 3.0.2; Simao et al., 2015) with the parameter "-l metazoa _odb9".
We used InterProScan (v 5.30-69.0; Jones et al., 2014) with default parameters to annotate the functions of detected motifs and domains by searching public databases (GO, INTERPRO, PFAM, KEGG, and PANTHER).

Gene Family expansions and Contractions
We used computational analysis of gene family Evolution (CAFÉ; V 4.0.1; De Bie et al., 2006) with default settings to identify the expanded and contracted gene families, along the timetree created in the previous step. any gene family under both family-wide and viterbi P-Values < 0.01 was retained. Then we conducted GO enrichment analysis to investigate functional categories of the expanded and contracted gene families under the standard of Chi.FDR < 0.05 (Xiao et al., 2018).

ReSUlTS AND DISCUSSIONS
In total, 105.6 Gb of raw BGI-seq data and 61.8 Gb of Nanopore reads were produced. After the removal of low-quality reads, we harvested 104.9 Gb of clean BGI-seq reads, and 39.9 Gb of corrected Nanopore reads with a mean subread length of 21.9 kb (Table S1). For the 23-mer analysis, K_num was 83,754,003,275 and K_depth was estimated as 141, giving an estimated genome size of c. 594 Mb (Table S2). This genome size is within the range of 545-637 Mb reported by (Zhang et al., 2012). A bimodal pattern was observed in the 23-mer frequency distribution analysis ( Figure S1). The heterozygous peak (the first peak) was much higher than the second, homozygous peak, indicating that the BPO had a diploid genome with a high level of heterozygosity.
Initial assembly yielded a total length of 656 Mb, comprising 6,815 contigs with a contig N50 length of 436 kb ( Table S3). The genome assembly was larger than the estimated genome size of c. 594 Mb (see above), because some allelic variants failed to be merged due to high heterozygosity (Zhang et al., 2012). After eliminating the redundancy, we obtained a final genome assembly of 587 Mb for the BPO, which was pretty close to the estimated genome size, comprising 3,676 contigs with a contig N50 length of 581 kb (Table 1 and S3).
BUSCO analysis showed that 919 (94.0%) of 978 metazoan BUSCOs were detected in the BPO genome assembly, with 890 (91.0%) and 29 (3.0%) being identified as complete and fragmented respectively ( Table 2). Mapping rate test suggested that more than 97.42% of the clean BGI-seq reads were properly mapped to the genome, and of these 91.42% mapped to their mates. These reads covered 98.08% of the genome assembly (Table S4). These analyses indicated that the assembly was high quality in both levels of completeness and accuracy.
Repeat annotation identified 283 Mb (48.32%) of repetitive sequences in the assembled genome. The proportion of repetitive sequences was a bit higher than that of the previous C. gigas genome (36.1%;Zhang et al., 2012). DNA transposon was the most abundant repeat and accounted for 21.28% (125 Mb in total) of the genome (Tables S5 and S6), while it was only 4.1% of the previous C. gigas genome (Zhang et al., 2012). TE accumulation analysis suggested long-term TE actives, accompanied by recent expansion within 1 million years ago (Ma), in this genome (Figure  2A and Figure S2). DNA transposons and rolling circles (RC) presented landscapes similar to the whole, whereas long terminal repeat elements (LTR) and long interspersed nuclear elements (LINE) showed ancient retention and recent expansion of the TE insertions ( Figure S2). The BPO genome exhibits a remarkable level of recent TE actives that differed from the previous version (Zhang et al., 2012). The specific TEs accumulation may be an important reason underlying the difference in repeats repertoire between these two C. gigas genomes. In addition, 26,811 proteincoding genes were identified, averaging 7.0 exons and a 1,225 bp coding region per gene, which was comparable to the other Pacific oyster genome published in 2012 ( Figure S3 and Table  S7; Zhang et al., 2012). 23,111 genes (86.2% of the predicted genes) were successfully annotated with predicted functions or conserved functional motifs. Respectively,11,810,14,400,16,853,7,575,and 15,997 genes showed positive hits in GO, PFAM, INTERPRO, KEGG, and PANTHER (Table S8). BUSCO (Simao et al., 2015) analysis suggested that 94.9% (928) of the metazoa core conserved genes were detected in the BPO gene set, with 894 (91.4%) and 34 (3.5%) being identified as complete and fragmented, respectively ( Table 2). BUSCO analysis showed that the performance in the genome is better than the genome assembly, which was presumably most that it is simpler to search for genes in a transcriptome or proteome than in a genome (Cai et al., 2019).
OrthoFinder analysis identified 199 single-copy orthologues. RaxML analysis supported that the BPO was clustered with the other C. gigas in the phylogenetic tree. Based on four solid timecalibrations (see MATERIALS AND METHODS), molecular clock analysis suggests that they diverged at about 2.2 million years ago (Figure 2B). Previous phylogenetic studies and fossil records (http://fossilworks.org/bridge.pl?a = taxonInfo&taxon_ no = 109465) indicated that species divergence between C. gigas and C. virginica was at 63.2-82.7 Ma (Plazzi and Passamonti, 2010;Ren et al., 2010). Thus, the divergence time between the BPO in this study and the other C. gigas sequenced in 2012 should be rational but still very long. The deep divergence, combined with remarkable difference in repeats repertoire, suggested that C. gigas comprised great intraspecific genetic variations. Our result was consistent with studies of Li et al. (2018), who found unexpected genetic divergence in several populations of C. gigas. Estimated effective population size (Ne) dynamic indicated that the BPO reached a top Ne at around 0.6 Ma, and then    Table S9). The contracted gene families were enriched in 31 GO categories: peroxidase activity, response to oxidative stress, response to stress, and 28 others (for details, see Table S10). It was possible that TE activities after divergence affected the expansion and contraction of gene families (Joly-Lopez et al., 2012;Kapusta et al., 2017). However, whether these notable changes in BPO tell about special adaptations needs further exploration. Nanopore sequencing produces long read lengths, which can effectively improve the quality of genome assembly and alleviate assembly errors (Xiao et al., 2017). The C. gigas genome, known as its high heterozygosity and high repetition rate, once was a challenging project in the age of Next-generation sequencing (Zhang et al., 2012). Current assembly (contig N50 of 581.9 kb) using Nanopore sequencing is superior to the previous version (contig N50 of 19.4 kb and scaffold N50 of 401 kb, Zhang et al., 2012) in genome continuity. Accuracy and integrity of this genome are also comparable to that of the previous genome. Although, reads produced by this technology have high error rates , several rounds of polishing before and after assembly can effectively eliminate the negative effects of sequencing errors.

CONClUSIONS
Here, we present high-quality genome assembly and annotation of a variety of Pacific oyster, the BPO. The deep divergence of history, specific TE repertoire, and significant expansion and contraction of gene families in this genome suggested that BPO also represents a special Pacific oyster in genetics. Hence, the BPO genome and the massive amount of data created in this study will serve as valuable resources for studying the genetic breeding, conservation, and evolution in oysters. A series of better assembly parameters suggested that nanopore sequencing technology is qualified for the assembly of complex genomes like oysters.

DATA AVAIlABIlITY STATeMeNT
Raw reads from BGI-seq and Nanopore sequencing had been deposited in the NCBI Sequence Read Archive (SRA) database under accession number SRP193912 and BioProject accession PRJNA534417. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession SZQM00000000. The version described in this paper is version SZQM02000000.

AUThOR CONTRIBUTIONS
XW and CF designed and supervised the project. XW, LW, CH, HS, ZC, WY, QJ, and LL prepared the samples. WX, XW, CZ, and CF analyzed the data. WX, XW, and LW wrote the manuscript with the help of the other authors. KW and CF revised the manuscript. All authors read and approved the final manuscript.  NWPU (19SH030408). We gave our thanks to Dr. Feng Shao for his kind help in TE analysis.