The genome sequence of the surf clam, Spisula solida (Linnaeus, 1758)

We present a genome assembly from an individual Spisula solida (the surf clam; Mollusca; Bivalvia; Venerida; Mactridae). The genome sequence is 932.1 megabases in span. Most of the assembly is scaffolded into 19 chromosomal pseudomolecules. The mitochondrial genome has also been assembled and is 19.3 kilobases in length. Gene annotation of this assembly on Ensembl identified 13,833 protein coding genes.


Background
Surf clams (Mactridae) are commonly eaten worldwide and are an important fishery resource. Spisula solida is one of three British Spisula species found on clean, sandy, often exposed beaches, living buried low in the intertidal zone and offshore. It is a filter feeder, preferring clean, sandy environments and avoiding muddy sediments. Its range extends from Norway south to Morocco, but excludes the Mediterranean. Instead, Spisula elliptica and Spisula subtruncata are recorded from the Mediterranean (GBIF). Another species, Spisula solidissima, has been documented in Britain; however, it is an imported species originating from the USA.
Spisula solida, solid-shelled and triangular in outline, can grow up to 45 mm in length. It has a creamy shell covered with a thin brown periostracum that flakes off easily. The growth lines are clear, with numerous fine concentric lines between each pair of growth lines (Amgueddfa Cymru - National Museum Wales, 2016). Spisula species have thicker shells than other genera within the family, such as Mactra, and the internal dentition can aid in their separation. The lateral teeth are serrated in Spisula but not in Mactra, and the length and angle of the cardinal teeth serve as a diagnostic character.
The genome of the surf clam, Spisula solida, was sequenced as part of the Darwin Tree of Life Project, a collaborative effort to sequence all named eukaryotic species in the Atlantic Archipelago of Britain and Ireland.

Genome sequence report
The genome was sequenced from one Spisula solida specimen (Figure 1) collected from Jennycliff Bay, Plymouth, UK. A total of 31-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 35 missing joins or mis-joins and removed four haplotypic duplications, reducing the scaffold number by 32.14% and increasing the scaffold N50 by 1.37%.
The final assembly has a total length of 932.1 Mb in 38 sequence scaffolds with a scaffold N50 of 52.7 Mb (Table 1). Most (99.95%) of the assembly sequence was assigned to 19 chromosomal-level scaffolds. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 2-Figure 5; Table 2). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial genome was also assembled and can be found as a contig within the multifasta file of the genome submission.
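The 31-fold figure can be converted into an approximate sequencing yield. The sketch below assumes that coverage was calculated relative to the 932.1 Mb assembly span given in the abstract; the text does not state which genome size was used as the denominator, so the result is indicative only. The function and constant names are illustrative.

```python
def approx_yield_bp(coverage: float, genome_span_bp: float) -> float:
    """Approximate total bases sequenced as fold coverage times genome span."""
    return coverage * genome_span_bp


# Assumption: coverage is expressed relative to the final assembly span.
ASSEMBLY_SPAN_BP = 932.1e6  # 932.1 Mb, from the abstract
HIFI_COVERAGE = 31          # fold coverage, from the text

yield_bp = approx_yield_bp(HIFI_COVERAGE, ASSEMBLY_SPAN_BP)
print(f"Approximate HiFi yield: {yield_bp / 1e9:.1f} Gb")
```

Under that assumption the HiFi data would amount to roughly 28.9 Gb; a different genome size estimate would change the figure proportionally.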
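For readers unfamiliar with the metric, the scaffold N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together contain at least half of the assembly. A minimal sketch follows, using made-up scaffold lengths rather than the real xbSpiSoli1.1 scaffolds; the helper names `nx` and `pre_curation_count` are illustrative. The second helper back-calculates a pre-curation scaffold count from the reported 38 final scaffolds and 32.14% reduction; that value is inferred here and is not stated in the text.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length: scaffolds at least this long hold >= fraction of the total."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    target = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return min(lengths)  # guard; not reached when 0 < fraction <= 1


def pre_curation_count(final_count, fractional_reduction):
    """Infer the starting count from the final count and a fractional reduction."""
    return round(final_count / (1 - fractional_reduction))


# Toy scaffold lengths (arbitrary units), NOT the real assembly.
toy_lengths = [80, 70, 50, 40, 30, 20, 15]
print("toy N50:", nx(toy_lengths, 0.5))
print("toy N90:", nx(toy_lengths, 0.9))

# Inferred, not reported: scaffold count before manual curation.
print("inferred pre-curation scaffolds:", pre_curation_count(38, 0.3214))
```

The same `nx` function gives N90 when called with `fraction=0.9`, which is the second arc shown on a BlobToolKit snail plot.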
Metadata for specimens, spectral estimates, sequencing runs, contaminants and pre-curation assembly statistics can be found at https://links.tol.sanger.ac.uk/species/31201.

Sample acquisition and nucleic acid extraction
Two Spisula solida individuals (specimen numbers MBA-201117-015A and MBA-201117-015B, ToLIDs xbSpiSoli1 and xbSpiSoli2) were collected from Jennycliff Bay, Plymouth, UK (latitude 50.34, longitude -4.13) on 17 November 2020. The specimens were retrieved from sand using a grab sampler (MV Sepia). The specimens were collected and identified by Anna Holmes (Amgueddfa Cymru) and then flash-frozen in liquid nitrogen prior to sample processing.
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute (WSI). The xbSpiSoli1 sample was weighed and dissected on dry ice. The tissue was disrupted using a Nippi Powermasher fitted with a BioMasher pestle. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. HMW DNA was sheared into an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.

Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers' instructions. DNA sequencing was performed by the Scientific Operations core at the WSI on a Pacific Biosciences SEQUEL II (HiFi) instrument. Hi-C data were also generated from tissue of xbSpiSoli2 using the Arima2 kit and sequenced on the Illumina NovaSeq 6000 instrument.

Genome assembly, curation and evaluation
Assembly was carried out with Hifiasm (Cheng et al., 2021) and haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou et al., 2023). The assembly was checked for contamination. Table 3 contains a list of relevant software tool versions and sources.

Genome annotation
The Ensembl gene annotation system (Aken et al., 2016) was used to generate annotation for the Spisula solida assembly (GCA_947247005.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019). Further, the Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use.
The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible.
The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material

Is the rationale for creating the dataset(s) clearly described? Partly
Are the protocols appropriate and is the work technically sound? Yes

Are the datasets clearly presented in a useable and accessible format? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: genome, genetic breeding
I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Taro Maeda
Keio University, Keio, Japan
I appreciate the opportunity to review this manuscript. Bivalve genomic information is quite scarce, making this paper particularly valuable for its coverage of non-sessile bivalve genomes. The N50 is impressively high, and it appears that chromosome-level sequences were obtained.
Given the difficulty of DNA extraction in this group, these results are certainly noteworthy enough to merit reporting. An omission in this manuscript is the lack of mention of the total genome size. The authors note that the sequence data represent 31-fold coverage, but no clear basis for the genome size estimate is provided. Additionally, the total amount of sequence in base pairs is not reported. The total number of genes is relatively low for a eukaryote, and the BUSCO scores are surprisingly low given the length of the assembly. This may reflect unique biological characteristics of the organism, such as parasitism, which may influence the number of genes in the genome. If the authors could describe whether such a biological background exists for this species, it would enhance the readability and understanding of the paper. At this point, I do not know whether this species has characteristics that could cause such a genomic feature. If so, the lack of a clearly stated genome size combined with this could give the impression that the sequenced genome may be a partial subset missing significant gene regions.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: molluscan genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
"Gene annotation of this assembly on Ensembl identified 13,833 protein coding genes." This seems an extremely low number of genes compared with expectations for bivalves, and a quick check of the gene models available through the Ensembl rapid annotation web portal indicates that the annotation is largely incomplete. Although making a gene annotation track available to the scientific community is a great addition, I would urge the authors to improve this aspect, considering the high complexity of bivalve genomes and the limited usefulness of such an incomplete annotation track, which could most definitely be improved considerably by the use of a standard annotation pipeline, such as BRAKER3.
Does the size of the assembly match the estimated c-value (if available)? Does the number of scaffolded pseudochromosomes match the expected number of metaphasic chromosome pairs (if this data is available for this species)?
The mollusca_odb10 reference set is notoriously inadequate for estimating the completeness of bivalve genomes, due to its reliance on a very low number of molluscan species (mostly gastropods). I would suggest that the authors also report the BUSCO scores obtained against the metazoan dataset.
Genome annotation report: as previously said, the annotation performed rather poorly. This is not a specific issue of this genome, since pretty much all the other bivalve genomes annotated using this pipeline suffer from similar issues, suggesting that the Ensembl rapid pipeline has some performance issues with bivalve genomes, in particular when only protein alignment data are used (i.e. without RNA-seq data), as is the case for this genome.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound?

Daniel Garcia-Souto
University of Santiago de Compostela, Santiago de Compostela, Galicia, Spain
Dear authors, I've had the opportunity to review the manuscript entitled "The genome sequence of the surf clam, Spisula solida (Linnaeus, 1758)", authored by Holmes et al. I commend the authors for their significant contribution, particularly in achieving a genome assembly of Spisula solida at the chromosome level, accompanied by commendable N50 metrics. Here they present their results on the genome assembly of Spisula solida, assembled to chromosome level and achieving magnificent N50 metrics. Moreover, it is always worth highlighting the context of assembling genomes in this taxonomic group (bivalves), which is largely unrepresented in genomic analysis despite its biological relevance and which (usually) poses a great challenge due to intrinsic DNA quantity/quality issues. As for the manuscript itself, it's very correctly elaborated, both from the methodological and written points of view. I wanted to express some concerns/suggestions in this regard.
The manuscript itself is impeccably elaborated, demonstrating precision both in its methodology and written presentation.However, I would like to express some considerations and suggestions for further enhancement.
Primarily, my principal concern revolves around the taxonomic identification of the specimens.While acknowledging the authors' expertise and the inclusion of specimen photographs, these images are insufficient for the rigorous verification of taxonomic identification.I recommend the inclusion of additional photographs depicting the hinge and teeth, as these features serve as crucial diagnostic characters.If the shells are still accessible, kindly consider providing pertinent details in this section.
The achievement of encapsulating the entire genome within 126 haploid contigs is indeed remarkable. However, I would like to inquire about the handling of the "purged duplicates." Given that these organisms typically contain extensive heterochromatin blocks comprising highly repeated, low-variability satellite DNAs, it is common practice to exclude them from the definitive genome version. This exclusion may influence final metrics such as ATGC composition, satellite/transposable element content, or even BUSCO completeness scores, as certain repetitive gene families might also be omitted. I kindly suggest that the authors state in a brief sentence whether these contigs were excluded from the final genome version.
Finally, and despite the chromosome-level assembly of the genome, to my surprise, the BUSCO completeness reported for the molluscan dataset is quite low. Could the authors pinpoint why this is happening?
In conclusion, beyond these considerations, I find the manuscript to be meticulously prepared.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genome assembly, repetitive DNA, molecular cytogenetics, phylogenetics, bivalves
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 2. Genome assembly of Spisula solida, xbSpiSoli1.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 932,108,243 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (69,035,870 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (52,695,421 and 42,436,359 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the mollusca_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/xbSpiSoli1.1/dataset/CAMXUH01/snail.
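The binning and base-composition ring described in this caption can be illustrated with a short sketch. It is not BlobToolKit's implementation: it only derives the bin width from the two numbers in the caption and shows how GC, AT and N percentages could be computed for the sequence falling in one bin. The function name and the example sequence are hypothetical, and IUPAC ambiguity codes other than N are simply left out of the three categories.

```python
from collections import Counter

ASSEMBLY_SPAN_BP = 932_108_243  # total assembly length, from the caption
N_BINS = 1_000                  # number of bins, from the caption

# Each bin covers 0.1% of the assembly.
BIN_WIDTH_BP = ASSEMBLY_SPAN_BP / N_BINS


def base_composition(seq):
    """Return GC, AT and N content of a sequence as percentages of its length."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq.upper())
    total = len(seq)
    return {
        "GC": 100 * (counts["G"] + counts["C"]) / total,
        "AT": 100 * (counts["A"] + counts["T"]) / total,
        "N": 100 * counts["N"] / total,
    }


print(f"Bin width: {BIN_WIDTH_BP:,.0f} bp")
print(base_composition("GGCCAATTNN"))  # made-up 10 bp example
```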

Figure 3. Genome assembly of Spisula solida, xbSpiSoli1.1: GC coverage. BlobToolKit GC-coverage plot. Scaffolds are coloured by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/xbSpiSoli1.1/dataset/CAMXUH01/blob.

Figure 4. Genome assembly of Spisula solida, xbSpiSoli1.1: cumulative sequence. BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all scaffolds. Coloured lines show cumulative lengths of scaffolds assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/xbSpiSoli1.1/dataset/CAMXUH01/cumulative.

Figure 5. Genome assembly of Spisula solida, xbSpiSoli1.1: Hi-C contact map. Hi-C contact map of the xbSpiSoli1.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure is available at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=bf1x8DCYTLO3Bnq58LJnXA.
• Legality of collection, transfer and use (national and international)
Each transfer of samples is further undertaken according to a Research Collaboration Agreement or Material Transfer Agreement entered into by the Darwin Tree of Life Partner, Genome Research Limited (operating as the Wellcome Sanger Institute), and in some circumstances other Darwin Tree of Life collaborators.

Table 3. Software tools: versions and sources.
Software tool, Version, Source
Legal and ethical review process for Darwin Tree of Life Partner submitted materials
The materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner.
This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Partly
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
https://doi.org/10.21956/wellcomeopenres.21584.r71344
© 2024 Garcia-Souto D. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.