Whole Genome Assembly of the Snout Otter Clam, Lutraria rhynchaena, Using Nanopore and Illumina Data, Benchmarked Against Bivalve Genome Assemblies.

Citation: Thai BT, Lee YP, Gan HM, Austin CM, Croft LJ, Trieu TA and Tan MH (2019) Whole Genome Assembly of the Snout Otter Clam, Lutraria rhynchaena, Using Nanopore and Illumina Data, Benchmarked Against Bivalve Genome Assemblies. Front. Genet. 10:1158. doi: 10.3389/fgene.2019.01158 Whole Genome Assembly of the Snout Otter Clam, Lutraria rhynchaena, Using Nanopore and Illumina Data, Benchmarked Against Bivalve Genome Assemblies


INTRODUCTION
Production of cultured bivalve molluscs was 17.1 million tons in 2016 accounting for 21.4% of global aquaculture production (FAO, 2018). The lack of genomic resources coupled with limited understanding of the molecular basis of gene expression and phenotypic variation have limited advances in aquaculture-based productivity of marine bivalves. Understanding the molecular basis of phenotypic variation and gene function is therefore important for selective breeding programs for traits such as increased growth and disease resistance. Similarly, whole genome assemblies support GWAS studies to identify trait-specific loci and for genomic-based selective breeding. To this end, whole-genome sequencing has been conducted on several commercial bivalve species, including the edible oysters Crassostrea virginica (Gómez-Chiarri et al., 2015), Crassostrea gigas (Gerdol et al., 2015), pearl oysters (Takeuchi et al., 2012) and clams (Mun et al., 2017). However, in general, genomic data for bivalve molluscs, which includes a taxonomically diverse group of species, are sparse (Takeuchi et al., 2012;Zhang et al., 2012;Murgarella et al., 2016;Du et al., 2017;Li et al., 2017;Mun et al., 2017;Sun et al., 2017;Uliano-Silva et al., 2017;Wang et al., 2017;Li et al., 2018;Powell et al., 2018;Renaut et al., 2018;Bai et al., 2019). In this study we present the first genomic resources for a species of clam from the superfamily Mactroidea and for a Vietnamese shellfish species and generate a draft reference genome to form the basis of on-going selective breeding studies. This study also demonstrates the efficacy of using Oxford Nanopore Technology (ONT) reads to scaffold bivalve genome assemblies and shows the value of these relatively inexpensive long reads for spanning large repetitive regions and overcoming complex assembly issues caused by high heterozygosity, which typically confounds short read only assemblies. The quality of our assembly is also benchmarked against other bivalve genome assemblies and we present an initial phylogenomic analysis for the class Bivalvia, which illustrates the value and potential of the increasing number of high quality genomic data sets for phylogenetics.

extraction, Library preparation, and Sequencing
Muscle tissue was collected from a sample obtained from a snout otter clam farm in Van Dong district, Quang Ninh province in Vietnam. For short read sequencing of the clam genome, genomic DNA (gDNA) was extracted according to Sokolov's improved method (Sokolov, 2000). Two sequencing libraries were prepared. These include a PCRfree library prepared using NuGen Celero DNA-Seq Library Preparation Kit (Tecan Genomics, San Carlos, CA) according to manufacturer's instruction and a PCR-based library prepared with NEBNext Ultra DNA Library Preparation Kit (New England Biolabs, Ipwich, MA). For transcriptomic data, RNA was extracted from digestive gland tissue using the Zymo Quick-RNA Miniprep kit, followed by RNA library constructed using the Nugen Universal Plus mRNA-Seq Kit (Tecan Genomics, San Carlos, CA) according to manufacturer's instruction. Both DNA and RNA libraries were sequenced on an Illumina NovaSeq6000 platform at the Deakin Genomics Centre with 2 × 150 bp configuration. For Nanopore long read sequencing, gDNA was extracted from the muscle tissue of the same individual using the column-based Zymo Quick DNA miniprep plus. Approximately 1 µg of the purified gDNA was used as the input for library preparation using the LSK109 kit followed by sequencing on a FLO-MIN106 revD SpotON R9.4 Flow Cell for 48 h.

De Novo Hybrid Assembly of the Snout Otter Clam Genome
Long reads generated by the ONT MinION sequencing device were basecalled and trimmed (adapter and quality) with Guppy v3.0.3 (high accuracy mode, min_qscore 7), which is available via the ONT community site (https://community.nanoporetech. com). A hybrid de novo assembly approach was performed with the MaSuRCA v3.2.8 assembler (USE_LINKING_MATES = 0, cgwErrorRate = 0.15, KMER_COUNT_THRESHOLD = 1) (Zimin et al., 2013), using the previously trimmed long Nanopore reads and polyG-trimmed short Illumina reads as input. The MaSuRCA assembler first reduces high coverage Illumina reads into longer super-reads, then aligns these to long Nanopore reads to build even longer mega-reads (Zimin et al., 2017). These are then assembled by the CABOG assembler within MaSuRCA. Additionally, this version of MaSuRCA uses the high coverage of the longer more error prone Nanopore reads to build consensus sequences for regions not captured by short accurate Illumina reads. As a result, this approach combines the benefits from the accuracy of Illumina reads and the length of Nanopore reads. Completeness of the assembly was assessed with BUSCO v3.0.2 (Simão et al., 2015), using single-copy orthologs from the Metazoan dataset (metazoa_odb9).
As bivalve genomes have been previously shown to be highly heterozygous (Zhang et al., 2012;Wang et al., 2017;Powell et al., 2018), an observation also apparent from the bimodal distribution of k-mer profiles obtained from the GenomeScope analysis in this study (Data Sheet 1), it was necessary to optimize the assembly by removing haplotigs or duplications in the haploid representation of this genome. This was achieved by aligning the long reads back to the assembly with minimap2 v2.15 (-x map_ont, -secondary = no) (Li, 2018) and passing the alignment through the Purge Haplotigs pipeline v1.0.4 (-l 5, -m 20, -h 150) (Roach et al., 2018) to remove artifactual scaffolds.

Transcriptome Assembly
One RNA-seq library was sequenced and a transcriptome was generated to assist in gene prediction. Paired-end, strandspecific RNA reads were polyG-, adapter-and quality-trimmed with the same tools and parameters as outlined for genomic reads. The resulting reads were assembled de novo with the Trinity v2.8.5 assembler (-SS_lib_type FR) (Grabherr et al., 2011). Short reads were subsequently mapped back to the assembly with bowtie2 v2.3.3.1 (Langmead and Salzberg, 2012). Additionally, GMAP v2019-06-10 (default parameters) (Wu and Watanabe, 2005) was used to align the assembled transcriptome to the current genome assembly.

Gene prediction and Annotation
Gene prediction was carried out with the MAKER v2.31.10 annotation pipeline (Holt and Yandell, 2011). Providing this pipeline with the previously identified repeat families, the assembled transcript sequences and a set of protein sequences from the 13 published bivalve genomes (Data Sheet 2) (Takeuchi et al., 2012;Zhang et al., 2012;Murgarella et al., 2016;Du et al., 2017;Li et al., 2017;Mun et al., 2017;Sun et al., 2017;Uliano-Silva et al., 2017;Wang et al., 2017;Li et al., 2018;Powell et al., 2018;Renaut et al., 2018;Bai et al., 2019), MAKER identifies repeats and subsequently aligns transcripts and proteins to the genome to produce initial gene models and sequences in its first iteration (est2genome = 1, protein2genome = 1). These gene models were used to train two ab initio gene prediction softwares AUGUSTUS (using BUSCO with -long and metazoa_odb9) (Stanke et al., 2006) and SNAP (-categorize 1000, -export 1000) (Korf, 2004), followed by a second iteration of MAKER incorporating gene models generated from ab initio predictors. Gene models were again retrained with a second iteration of AUGUSTUS and SNAP before MAKER was repeated for a third iteration.

Maximum Likelihood and Bayesian phylogenetic Analyses
BUSCO-predicted protein-coding genes were used to construct a multi-gene supermatrix in order to infer the phylogenetic position of L. rhynchaena in relation to bivalve species for which published genomes are available (Data Sheet 2), with the exception of M.

Genome Size and Sequencing Coverage
This study generated a large volume of genomic data made up of 122 Gbp of short paired-end reads (150 bp) and 14 Gbp of long reads (average: 4,960 bp, longest: 58,804 bp, shortest: 200 bp) (Data Sheet 3). In addition, 34 Gbp of transcriptomic reads were generated to facilitate gene prediction. Raw Illumina and Nanopore reads generated in this study are available in the Sequence Read Archive (SRP201027) under BioProject PRJNA548223.
The snout otter clam genome size was estimated to range from 545 to 547 Mbp based on 19-, 21-and 25-mers as summarised in Data Sheet 1, with histograms provided in Data Sheet 4. Based on these estimates this study generated short and long read data with sequencing depths of over 200× and 25×, respectively. The haploid genome size for the snout otter clam is at the lower end of a large range of estimated genome sizes for the species of the family Mactridae and also within the phylum Bivalvia more generally (Data Sheet 5).

Genome and Genes of the Snout Otter Clam
The snout otter genome assembly has used the largest volume of ONT data generated for a bivalve species to date and joins the blood clam (S. broughtonii) (Bai et al., 2019) as the only other bivalve genome assembly to incorporate the use of long Nanopore reads and has generated the largest volume of Nanopore data for a bivalve species to date. As the first study to use a combination of only Illumina paired-end reads and Nanopore long reads, it also demonstrates the efficacy of the addition of Nanopore reads as a cost-effective option to achieve a high quality and contiguous genome assembly in combination with Illumina reads. This hybrid assembly approach has generated one of the best bivalve draft genomes currently available consisting of only 1,502 scaffolds (N 50 length: 1.84 Mbp) and a total assembly size of 586.5 Mbp (Table  1), slightly exceeding the kmer-based genome size estimates. The assembly displayed a high overall completeness of 95.9%. However, 2.9% of the metazoan BUSCOs detected in this assessment occurred in duplicates within the assembly, likely to be a result of the elevated heterozygosity of the genome, which is not uncommon for marine invertebrate species. These duplicated regions can cause issues in downstream analyses such as variant discovery and therefore were removed from the assembly before its use for other applications. This then produced a final curated assembly of 544 Mbp contained in 622 scaffolds (N 50 length: 2.14 Mbp) ( Table 1), to which 95.6% of Illumina short reads were successfully aligned using bowtie2 v2.3.3.1 (Langmead and Salzberg, 2012). BUSCO reports a similar overall completeness of 95.8% but with 1.5% BUSCOs detected in duplicates, half of that reported prior to the haplotig purging strategy. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession VIBL00000000 under BioProject PRJNA548223.
Variant calling analysis detected a total 4,903,576 SNPs, which translates into a heterozygosity estimate of 0.90%. In contrast, heterozygosity was estimated at 1.60% using the kmer-based approach with GenomeScope (Data Sheet 1). Since variant calling methods are typically more conservative than kmerbased ones, the former approach may have excluded some true positives and, thereby leading to an underestimation of genetic variation. Nevertheless, the heterozygosity rates estimated by both methods (0.90 and 1.60%) is within the range of values that have been reported for other bivalves (0.51 to 2.02%). Repeat modelling and masking of the genome masked 29.4% of the L. rhynchaena assembly.
This study also generated 34.6 Gbp of short paired-end, strand specific RNA reads. The resultant transcriptome assembly produced a set of 295,234 transcripts containing 83.9% complete BUSCO genes. A total of 96.4% of these RNA reads were mapped back to this set of assembled transcripts and of these, 79% were aligned to the genome in a splice-aware manner. Both alignment rates indicate that the quality of the set of transcripts is sufficient to improve the gene prediction. Based on hints and evidence from the clam transcripts and protein sequences from other bivalves, gene prediction from the assembled genome resulted in a final set of 26,380 protein-coding genes (AED ≤0.5), 89.8% of which were functionally annotated (i.e. a protein would have at least one associated functional annotation) ( Table 1). The assembled transcriptome, predicted protein-coding genes and annotation information are available as Data Sheet 6 and Data Sheet 7.

Comparison Against Other Bivalves
A summary of assembly sizes, scaffold N 50 lengths, sequencing technologies and other information for all 13 assemblies is available in Data Sheet 2 , with a comparison of repeat content, genome and assembly sizes and BUSCO completeness for all 14 bivalve genomes is visualised in Figure 1. The assembly for M. galloprovincialis has a high level of missing BUSCO genes (14.8% complete, 29.6% fragmented, 55.6% missing) whereas the V. ellipsiformis assembly contained relatively more fragmented BUSCO genes (22.8%). This potentially points to limitations of the data types used, since the former was assembled with only short Illumina paired-end reads, while the latter had both paired-end (PE) and mate-pair (MP) reads but was assisted by only 0.3× of PacBio long reads. While three other assemblies also had only PE and MP reads, most of the remaining assemblies employed the use of other scaffolding strategies including a greater volume of long PacBio reads, fosmid and bacterial artificial chromosome (BAC) libraries (see Data Sheet 2 for details). Furthermore, there is a large discrepancy in estimated genome size and assembly size for R. philippinarum (Figure 1), where the reported assembly is almost twice the size of the expected genome size. For this assembly, a high proportion of duplicated BUSCO genes was also detected (19.1%), highlighting the potential importance of haplotig purging within bivalve assemblies to remove paralogous scaffolds. Nevertheless, this step should be executed with caution as there is also the potential to "over-purge" and exclude actual parts of the genome.

phylogeny of Bivalvia Based on Nuclear Loci
Phylogenies inferred by both methods showed strong UFBoot, SH-aLRT and Bayesian posterior probability (BPP) support for most nodes, the exception being the sister relationship between the bivalve orders Mytilida and Pectinida (Figure 1). The snout otter clam, L. rhynchaena, is the only representative for Mactridae and forms a sister relationship with R. philippinarum from the Veneridae, consistent with their placement in the order Venerida. This initial phylogenomic analysis of the Bivalvia contains species from the three subclasses Imparidentia, Palaeoheterodonta, and Pteriomorpha. All bivalve species are placed within groups consistent with their taxonomic classification at the family and ordinal levels and observed relationships among these three subclasses are consistent with findings from recent studies based on Sanger and transcriptome sequencing (Kocot et al., 2011;González Vanessa et al., 2015;Combosch et al., 2017;Lemer et al., 2019). While these studies include greater taxon sampling from a range of bivalve subclasses, the data matrices used to infer phylogeny are often gappy with a higher level of missing data (~16 to 75%). In contrast, our study successfully used information from a substantial number of genes (> 100) across across a wide density of bivalve taxa with minimal gaps. Nevertheless, our approach to phylogenomics is limited for the time being by the scarcity in genome level resources for bivalve species generally and specifically in relation to missing representatives from important bivalve subclasses including Protobranchia, Archiheterodonta and Anomalodesmata. This initial phylogenomic analysis for the Bivalvia demonstrates the value and potential of this approach as a greater number of high quality genomic data sets become available that can be used for phylogenetic studies, especially for the resolution of basal relationships in what has been a challenging group (Plazzi et al., 2011;Sharma et al., 2012) Multiple sequence alignment and the resulting Newick tree from our analyses are available as Data Sheet 8.

CONCLUSION
We present the first draft genome of the snout otter clam (L. rhynchaena) based on relatively large volumes of short and long genomic reads from Illumina and ONT platforms, respectively, providing sufficient sequencing depth to facilitate the generation of one of the best quality genome assemblies for a bivalve mollusk. The use of long Nanopore reads in a hybrid assembly, presents an effective yet economical approach to achieving a good quality assembly and highlights the importance of long reads in spanning large repeat regions and resolving other complex regions that arises for species with high levels of heterozygosity. This highly contiguous and complete assembly makes this draft genome an important and valuable resource to support ongoing genomic and molecular-based breeding studies for aquaculture. In addition, a transcriptomic data set was generated and assembled to support more refined gene prediction, further adding to the currently scarce transcriptomic resources available for the Mactridae family. Ultimately, we expect results of this study to be used as valuable genomic references for a range of genetic, genomic, phylogenetic and population studies of the snout otter clam and other bivalve species.

DATA AVAILABILITY STATeMeNT
The datasets generated for this study can be found in the NCBI BioProject PRJNA548223 (WGS accession: VIBL00000000, SRA accession: SRP201027).

AUTHOR CONTRIBUTIONS
BT and CA conceived the project. TT collected the samples. YL and HG performed the sequencing. MT analyzed the data. MT and CA wrote the manuscript. MT, CA, BT, YL, HG, TT and LC read and reviewed the manuscript.

FUNDING
This study was funded through a grant to Binh Thanh Thai from the Government of Vietnam through its Aquaculture Biotechnology programs (01/2017/HĐ-TS-CNSH) and the Deakin Genomics Centre, Deakin University, Australia.

ACKNOWLeDGMeNTS
This research utilized computational resources and services provided by the National Computational Infrastructure (NCI), which is supported by the Australian Government. We would also like to thank Robert Ruge of Deakin University for assistance and use of the SIT HPC Cluster.

SUppLeMeNTARY MATeRIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01158/ full#supplementary-material DATA SHeeT 1 | Estimation of genome size, repeat content and heterozygosity by GenomeScope.
DATA SHeeT 2 | Comparisons of published bivalve genome assemblies.
DATA SHeeT 3 | Sequence methods and library kits used in the hybrid genome assembly.
DATA SHeeT 5 | Distribution of genome sizes of bivalve species.
DATA SHeeT 6 | Snout otter clam transcriptome. DATA SHeeT 7 | Protein sequences predicted from the snout otter clam genome and associated annotation.
DATA SHeeT 8 | Supermatrix and phylogenetic tree for the bivalve clade.