A chromosome-scale Mytilus edulis genome assembly for aquaculture, marine ecology, and evolution

Abstract The smooth-shelled blue mussel, Mytilus edulis is part of the Mytilus species complex, encompassing at least three putative species: M. edulis, Mytilus galloprovincialis, and Mytilus trossulus. These three species occur on both sides of the Atlantic and hybridize in nature, and both M. edulis and M. galloprovincialis are important aquaculture species. They are also invasive species in many parts of the world. Here, we present a chromosome-level assembly of M. edulis. We used a combination of PacBio sequencing and Dovetail's Omni-C technology to generate an assembly with 14 long scaffolds containing 94% of the predicted length of the M. edulis genome (1.6 out of 1.7 Gb). Assembly statistics were as follows: total length = 1.65 Gb, N50 = 116 Mb, L50 = 7, and L90 = 13. BUSCO analysis showed 92.55% eukaryote BUSCOs identified. AB-Initio annotation using RNA-seq from mantle, gills, muscle, and foot predicted 47,128 genes. These gene models were combined with IsoSeq validation resulting in 45,379 full CDS protein sequences and 129,708 isoforms. Using GBS and shotgun sequencing, we also sequenced several eastern Canadian populations of Mytilus to characterize single-nucleotide as well as structural variance. This high-quality genome for M. edulis provides a platform to develop tools that can be used in breeding, molecular ecology and evolution to address questions of both commercial and environmental perspectives.


Introduction
The smooth-shelled blue mussel (Mytilus edulis) is common to the North Atlantic from Arctic to Mediterranean regions, with habitat ranging from upper shore to the shallow subtidal (Hayward and Ryland 2017).M. edulis is known to hybridize with Mytilus trossulus in North America and with M. trossulus and Mytilus galloprovincialis in Europe.Together, these three species form what is referred to as the Mytilus edulis species complex (Fig. 1) (McDonald et al. 1991).However, these species are known to also hybridize with putative Mytilus species in the southern hemisphere: Mytilus chilensis (Chile), Mytilus platensis (Argentina and Uruguay), Mytilus aoteanus (New Zealand) and Mytilus planulatus (Australia) (Oyarzún et al. 2021).Collectively, it is more appropriate to refer to them as the Mytilus spp.complex.
This reef-building bivalve is an ecosystem engineer.Mytilus spp.dominate fouling communities in shallow and substrata providing important secondary habitat (Norling and Kautsky 2007).Offshore wind energy structure surveys found that they can cover the structures with up to 3.4 kg of biomass m −2 (Krone et al. 2013).Through filter-feeding, eutrophication is reduced which can alter ecosystems (Broszeit et al. 2016).This nutrient cycling ability has been harnessed by using M. edulis to study the fate of persistent organic and metal pollutants (Chase et al. 2001;McEneff et al. 2014), for the bioremediation of waste (Broszeit et al. 2016) and reduce environmental effects from salmon farms (MacDonald et al. 2011).
Blue mussels (Mytilus spp.), a key bivalve production species (FAO 2020), face decreasing wild spat availability for aquaculture in the UK and elsewhere (Regan et al. 2021).These losses are attributed to multiple stressors including warmer seas (Seuront et al. 2019) causing a poleward range contraction (Jones et al. 2010).Additionally, warmer oceans elevate dissolved CO 2 leading to Ocean Acidification (OA) impacting mussel viability (Asplund et al. 2014) and disease resistance (Ellis et al. 2015).This increases the susceptibility of mussels to bacterial pathogenesis (Ripabelli et al. 1999;Eggermont et al. 2017), with emerging pathogens representing a persistent threat (Charles et al. 2020;Cano et al. 2022).Infectious disease such as disseminated Neoplasia (DN) of M. trossulus origin is associated with reduced fitness in Mytilus spp.(Burioli et al. 2021).Furthermore, the effects of hybridization between the species of the Mytilus spp.complex in the northern hemisphere are yet uncertain with suggested negative effect of M. trossulus hybridization and potential adaptive introgression in the case of M. galloprovincialis hybridization (Fraïsse et al. 2014;Kenchington et al. 2020;Michalek et al. 2021).
To protect the aquaculture industry from these threats, hatchery efforts have been launched in the UK (Regan et al. 2021), elsewhere in Europe (Kamermans et al. 2013) andin Canada (Gurney-Smith et al. 2017).However, these efforts have not been straightforward, and a better understanding of fundamental biology is required to achieve commercial success.Despite their importance in aquaculture and the valuable ecosystem services they provide, no chromosomal assembly existed for any species within the Mytilus species complex prior to this study in 2021.Improved genomic tools are required to address fundamental biological questions such as inheritance patterns and adaptations.Like many bivalves, the mussel genome is highly heterozygous (3.5%) with an estimated 43% repeat content (Corrochano-Fraile et al. 2022).The linear plot of k-mer abundance analysis clearly shows a heterozygous peak in addition to the homozygous peak and estimates a shorter haploid genome length of 1.18 Gb compared with 1.71 G estimate from flow-cytometry (Rodríguez-Juíz et al. 1996).These characteristics make assembly of these genomes challenging.However, recently genomes for the American oyster, Pacific Oyster, Mytilus corruscus and Mytilus californianus have been assembled to chromosome-level using short and longread sequencing technologies as well as Hi-C-based scaffolding (Yang et al. 2021;Paggeot et al. 2022a;Gomez-Chiarri et al. 2015;Penaloza et al. 2021).We used a similar approach in this project to produce a highly contiguous assembly.Practical application of this assembly is demonstrated in cross-species synteny analyses and in population structure of Mytilus individuals sampled from different regions of the Canadian Atlantic.

DNA extraction, library preparation for genome assembly
One naïve (unsexed) M. edulis sample (named "Anne") was selected from samples collected by the Provincial Department of Communities and Fisheries in the estuary Foxley river in Prince Edward Island (PEI).Foxley River was selected as a sampling site because there is no grow-out aquaculture there (i.e.seed from other bays are not transferred to the area).Due to potential introgression of other species of the Mytlius species complex (e.g.M. trossulus), this sample was genotyped using 12 SNPs described by (Wilson et al. 2018).We sampled the foot, gill, mantle, hemolymph, and muscle of sample "Anne" aseptically, flash froze fresh tissues in liquid nitrogen and preserved them at −80°C.Tissues were shipped to Dovetail Genomics in Scott's Valley, CA, in excess dry ice.Dovetail extracted high molecular weight (HMW) DNA from adductor muscle tissue using an in-house modified CTAB method.Dovetail prepared and sequenced PacBIO SMRTbell libraries with SPRI bead size selection of 30 kb + to a depth of 196 × using the Sequel II sequencer producing ∼15 million PacBio CLR reads (∼340 Gb).Processed PacBio data (from Dovetail, SMRT LINK V11) can be found on SRA under accession number SRX11246493.

Raw contig assembly, scaffold formation, and polishing
Dovetail generated a primary contig-level assembly using wtdbg2 (Ruan and Li 2020).This primary assembly was 1.96 Gb long in 17,825 contigs and a N50 of 443 Kb.The contig-level assembly was filtered of putative duplicated haplotypes and contaminants using Purge Haplotigs (Roach et al. 2018) and Blobtools2 (Challis et al. 2020), respectively.Non-molluscan contigs were removed according to Blobtools results.After haplotype purging and contaminant removal, the final contig assembly had 10,111 contigs, for a total of 1.65 Gb and a N50 of 518 Kb.Scaffolding was performed using Omni-C libraries of muscle tissue (SRA Accession SRR28966249) sequenced using Illumina HiSeq 2000 and the HiRise assembler (v1.0) to generate a primary chromosome-level assembly made of 2,117 scaffolds.The same DNA sample used by Dovetail was shipped to UWM (University of Wisconsin-Madison) and sequenced to a 100 × using the NovaSeq 2000 sequencer (SRR28968509).These data were trimmed using Trimmomatic (v0.39) and used for polishing with racon (v1.4.3)Fig. 1.TimeTree for M. edulis mytilidae diverged ∼387 MYA, Mytilus diverged ∼78 MYA and the M. edulis species complex diverged ∼20 MYA.Generated using TimeTree (Kumar et al. 2022).(Vaser et al. 2017).This code is available in the Github repository https://github.com/timregan/M.-edulis-assembly.This final assembly was further filtered to contain only sequences > 5,000 bp.The resulting draft is deposited on NCBI assembly under accession number GCA_019925275.1.The final assembly is made of 1,119 scaffolds and has an N50 of 116 Mb.The 14 putative M. edulis chromosomes are deposited under accession numbers CM034349.1 to CM34362.1.

RNA preparation and IsoSeq analysis
Circular Consensus Sequencing (CCS) data was generated from gill, muscle, hemolymph, and foot, which was analyzed using IsoSeq3.We shipped flash-frozen gill and adductor muscle tissue from sample "Anne" to the Biotechnology Centre Core facility at the University of Wisconsin, Madison (UWM).Samples were homogenized using a Qiagen Tissuelyser (2 min @ 20 Hz).RNA was extracted using the RNeasy Mini Kit (Qiagen) with on-column DNAse treatment.UWM performed RNA QC with a nanodrop and bioanalyzer and prepared libraries using the Iso-seq Library SMRTbell express template prep kit.Libraries were sequenced using one Sequel II SMRT cell in CCS mode (i.e.HiFi reads).UWM provided de-multiplexed processed HiFi reads for gill (1.3 million reads-∼3.9Gbs) and muscle (1.6 million reads-∼4.9Gbp) (Gill-SRR29044552; Muscle-SRR29044553). RNA samples from foot, mantle, gill, hemolymph and muscle were also sent to the Genome Excellence Centre (Genome Quebec) in Montreal.Additional HiFi sequencing was carried out as above for hemolymph (1.7 million reads-∼5.2Gbp and foot (2.03 million reads -∼7.4 Gbp) (SRA Accession Numbers SRR28976541 to SRR28976543).Using gill, mantle, muscle and foot samples four stranded pair-end libraries were made and sequenced in an Illumina HiSeq2000 (SRA accession numbers SRR28976540 to SRR28976543).Putative full-length (FL) transcripts were identified using the IsoSeq3 pipeline (https://github.com/ylipacbio/IsoSeq3). Putative open read frames (ORFs) were identified using TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder/wiki).

Annotation
Repeat modeler (4.1.0)(Flynn et al. 2020) was used to predict repeat motifs for M. edulis and Repeat Masker (2.0.2a) (Smit et al. 2015) was used to mask the final assembly (Supplementary Table 1).Ab-initio annotation was done using Augustus (v3.4.0) (Stanke et al. 2006) trained with genes from M. galloprovincialis, M. coruscus, and Crassostrea virginica.Additional Augustus "hints" were generated using short-read RNA-seq (gill, foot, muscle, and mantle) and IsoSeq data generated herein.The ab-initio annotation was updated using PASA (v2.5) with alignments of a de-novo transcriptome produced using Trinity (v2.8.15) (Grabherr et al. 2011) in Genome-Guided mode.Two runs through PASA were used to update the ab-initio annotation.FL transcripts from IsoSeq3 pipeline were mapped to the genome using pbmm2 with pre-sets for IsoSeq and filtered based on quality IsoSeq3 collapse (minimum alignment identity/coverage: 0.90/0.90)and IsoSeq3 refine.We used the resulting gff annotation to run SQANTI3 (v4.3.0).We used sqanti_qc.pyto generate quality data for sqan-ti_filter.py.Filtering to remove artifacts was carried out using the default parameters.Finally, ab-initio predictions and filtered IsoSeq FLs were merged with AGAT (v.1.2.0).Using agat_sp_merge.pl,we removed duplicate gene models/isoform and assigned orphan isoforms from the IsoSeq data when possible.Amino acid sequences were translated from the CDS using agat_sp_ex-tract_sequences using options -t "CDS" -p -cfs -acs -asc.

Sample collection for SNP discovery and population structure
We collected gill samples for DNA extractions using standard molecular biology techniques and preserved them in non-denatured 100% ethanol.For population genetics analysis, we sampled mussels in sets of 96 from west to east PEI in Foxley River (FOX) (wild population), Malpeque Bay (MB) (North Shore, Prince County); French River (FR), Stanley Bridge (ST), Whetley River (WR), Tracadie (TRC) (North Shore, Queen county); Orwell (OR) (South Shore, Queen's county; Morell (MRL) (North Shore-King's county) and; Murray River (MR) (South Shore, King county) (Fig. 2).Seed deployed on these sites were originally collected in St. Peter's bay, Brudenell River, and MB.We also collected samples in the Bras D'or Lake in Cape Breton (Nova Scotia), the Magdalene Island (Québec) and Notre Dame bay in Newfoundland.

SNP markers, admixture analysis, population structures
Samples for SNP discovery and population genetics were shipped to LGC genomics in Berlin.Restriction-associated DNA libraries (RAD) libraries were prepared by LGC with using restriction enzyme MslI, normalized and Illumina sequenced using a NextSeq 2000 sequencer.The resulting reads were trimmed and checked for the restriction site by LGC.This final read set was used to identify SNPs and call individual genotypes using Tassel5 (v.5.2.4).SNPs were filtered by Minor Allele Frequency (MAF) < 0.01, depth (75 > DP < 500), and % of individuals with genotypes (90%).For the SNP discovery, the samples from Cape Breton were excluded from the filtering analysis.These samples were shown to be a pure M. trossulus population and had missing calls for a significant number of sites.For population genetics analysis, a second set of SNPs (MAF > 0.05, 10 > DP < 100, only bi-allelic sites with genotypes called in 95% of samples-4,610 SNPs) that were successfully called across all populations was used.Allele frequency distribution and the deviation of observed from expected heterozygosity frequencies (based on HW equilibrium) are shown in Supplementary Fig. 1.The observed heterozygosity levels are not greater than expected levels over the vast majority of the genome, indicating no large levels of selection or drift, as expected for an unbiased set of SNPs.All samples were also genotyped for the 12 SNPs from (Wilson et al. 2018) used to discriminate between M. edulis, M. trossulus and M. galloprovincialis.

Genome assembly and annotation
The chromosome-level assembly presented herein was produced in two stages.First, PacBio CLR reads (∼340 Gb) were produced, representing coverage of 196 × for an estimated genome size of 1.7 Gb (Hinegardner 1974;Rodríguez-Juíz et al. 1996).These reads were assembled into contigs using wtdbg2, which uses uncorrected reads (Ruan and Li 2020).The primary assembly was 1.96 Gb long in 17,825 contigs and a N50 of 443 Kb.After haplotype purging and contaminant removal, the final contig assembly had 10,111 contigs, for a total of 1,65 Gb and a N50 of 518 Kb.Following scaffolding using Omni-C libraries and the HiRise assembler, we generated a primary chromosome-level assembly made of 2,117 scaffolds.This assembly was further filtered to contain only sequences > 5,000 bp.The resulting draft is deposited on NCBI assembly under accession number GCA_019925275.1.The final assembly is made of 1,119 scaffolds and has an N50 of 116 Mb.The 14 putative M. edulis chromosomes are deposited under accession numbers CM034349 to CM34362.Detailed statistics for the assemblies can be found in Table 1.
Compared to the three most recent chromosome-level assemblies available on NCBI, the assembly presented herein has better contiguity as measured by scaffold N50 (Table 1).Both M. edulis assemblies' length falls close to the estimated size of the genome based on c-value (1.7) (Rodríguez-Juíz et al. 1996) and is significantly longer than what is estimated by k-mer abundance analysis with GenomeScope (1.18 Gb) (Supplementary Fig. 2).
Despite the total assembly length being close to that estimated using c-values, k-mer-based completeness analysis recovers only 76% in a set of Illumina reads origination from sample "Anne".When the putative purged haplotigs were added back to the assembly, recovery was ∼83%.This apparent low k-mer recovery is likely due to the error rate in the original PacBio data.However, other possible explanations are the consensus being different from other haploid genomes, the fact that the reads used for polishing were not used from the primary assembly, the contigs removed based on length or contaminant status, and the gaps arising from the Omni-C scaffolding.It is also possible that the high heterozygosity affects the accuracy of k-mer abundance analysis, as shown by the large discrepancy between genome size estimates.Completeness analysis in Merqury resulted in 76.71% recovery of k-mers present in the polishing Illumina data from the final version of the assembly.We also evaluated the completeness of the assembly when combined with purged haplotypes, which was 83.44%.QV value for the primary assembly was 32.39, while the combined draft had a QV of 30.73.While Merqury relies on k-merbased analyses between raw reads and assembly, completeness was also estimated using compleasm which relies on a BUSCO (Benchmarking Universal Single-Copy Orthologs) database.Compleasm BUSCO analysis showed 92.55%, 91.95%, and 88.5% complete BUSCOS against the eukaryote, metazoan and molluscan databases, respectively.The 14 putative chromosomes represent ∼96% of the assembly, with lengths varying from 140 Mbp to 90 Mbp.These data and the N50 metric show that this assembly has high contiguity and that this assembly and its annotation will be highly useful for aquaculture, evolution and molecular ecology studies.Herein, we illustrate the possible applications of this assembly by performing population and synteny analyses.

SNP discovery and population structure
Due to the close relationship between the members of the Mytilus species complex, we wanted to verify that the individual sampled (Anne) was pure M. edulis.We genotyped 895 PEI individuals, 96 samples from Magdalene Island and 96 individuals from Newfoundland of the same 12 SNP panel.We also included 96 individuals from Cape Breton (NS), which has long been considered a pure M. trossulus population.We generated two sets of SNPs from the genotyping-by-sequencing (GBS) data: the first set totaling 71,231 SNPs using only samples from PEI and made a polymorphic collection of SNPs in M. edulis.The second set, with 4,610 markers, is a polymorphic set of SNPs in both M. edulis and M. trossulus.Genotypes for both marker sets were called in 1,183 samples above using GBS.
In the PCA, untrained clustering clearly separated both CB from PEI/NL/MAG and also the putative populations in the Gulf of St. Lawrence.PCA and fastSTRUCTURE analysis indicates that there is low population stratification between different regions of PEI, while other off-island populations display more separation.PCA and fastSTRUCTURE also indicate that there is no introgression of Cape Breton Genetics (i.e.M. trossulus) in PEI.Therefore, we are confident that the sample "Anne" represents an M. edulis individual.However, we only genotyped 96 samples collected in an area with no grow-out leases (Foxley River-FOX) while all other samples came from areas with the presence of mussel aquaculture.Although unlikely, we cannot rule out the possibility of minor introgression of M. trossulus genetics masked by a sampling bias in aquaculture sites favoring M. edulis.Population structure and putative admixture are shown in Fig. 3.

Annotation and synteny analysis
Ab-initio gene prediction in Augustus detected 46,604 gene models that produced 46,604 transcripts after filtering based on evidence support.After two rounds of PASA updates, the final number of gene models was 47,128 and the number of transcripts was 55,138.Using IsoSeq3 analysis of CCS data from muscle and gill (2.9 million reads, 188,165 Mb), we identified 196,111 putative open reading frames from the 216,343 FL transcripts.BLASTp analysis against the uniref90 database returned informative hits for ∼80% (164,969) of these translated transcripts.After IsoSeq refine and collapse 85,099 isoforms survived, and following SQANTI3 filtering, 70,592 isoforms remained.They were assigned to 31,211 unique genes.Finally, these models were combined with the Augustus/PASA analyses to yield 65,505 gene models and 129,708 isoforms.Proteins were translated from the CDS ensuring only complete CDS were translated and that isoforms were not incorrectly fused together.This resulted in 45,379 amino acid sequences.The annotation pipeline and density of the final 45,379 protein sequences along each of the chromosomes is displayed in Fig. 4. Compleasm BUSCO analysis of these 45,379 proteins showed a recovery of 78.43 and 76.83% complete BUSCOS against the eukaryote and metazoan databases, respectively.
Synteny analysis showed a high degree of collinearity between putative chromosomes of M. edulis and M. trossulus (Fig. 5).However, putative inversions, transpositions and deletions can be observed in almost all chromosomes.Gene order in chromosomes represented by sequences CM034349 (M.edulis chromosome 1)-NC086373 (M.trossulus chromosome 1) and CM034353 (M.edulis chromosome 5)-NC086374 (M.trossulus chromosome 2) showed the highest conservation.Putative orthologous relationships between M. edulis and M. trossulus chromosomes are shown in Table 2.
The taxonomic status of the "species" in the Mytilus species complex remains in debate.Chromosome-level assemblies allow the study of macroevolution of the genome by looking at synteny across species.Herein, we present the synteny analysis between the 14 putative chromosomes of M. edulis and M. trossulus to exemplify how chromosome-level assemblies may allow us to better understand the phylogenetic relationships within the genus Mytilus.This analysis shows that some of putative orthologous chromosomes of the 2 species maintain high levels of collinearity (e.g.chromosomes 1 and 5 from M. edulis with 1 and 2 from M. trossulus, respectively) while others present significant levels of re-arrangements (e.g.chromosomes 7 and 11 from M. edulis with chromosomes 4 and 12 from M. trossulus respectively).Future studies providing an in-depth analysis of chromosome synteny will shed light on the level of collinearity between multiple members of the genus Mytilus.Homology between chromosomes is a key element of the viability of hybrids.Reproductive isolation tends to increase during speciation (Mayr 1947), and these  resources will permit further studies on the reproductive compatibility of the species in genus Mytilus at the chromosome level.
Here, we present a highly contiguous chromosome assembly for Mytilus edulis confirming species-level individual purity through resequencing.To date, our resource has been applied in multiple studies analyzing Mytilus genome assemblies (Paggeot et al. 2022;Gallardo-Escarate et al. 2023) and cross-species gene orthology analyses (Saco et al. 2023a).Recently, the concept of a pangenome for M. galloprovincialis has received increased attention (Gerdol et al. 2020) with "core genes" and "dispensable genes" being described across populations of M. galloprovincialis (Saco et al. 2023b).Future studies investigating hybridization across Mytilus spp.will be able to investigate pangenome structure across Mytilus spp. with highly contiguous assemblies such as our own.The gene annotations produced in this study were generated using Augustus gene model predictions integrating full transcript IsoSeq data and applying stringent filtering parameters.This comprehensive approach provides a robust foundation for future cross-species analyses and biological studies on gene function within the Mytilus species complex.

Fig. 4 .
Fig. 4. Annotation schematic and gene density.a) A schematic of the annotation pipeline used and (b) the density of genes along each of the chromosomes (made using R package karyoploteR).

Fig. 5 .
Fig. 5. Synteny between the M. edulis and M. trossulus putative chromosomes.Syntenic orthologous regions between the M. edulis and M. trossulus genome assemblies are displayed by Circos plot (left) and dot plot (right).
Assembly database molecule name and accession number.