The genome sequence of the Atlantic cod, Gadus morhua (Linnaeus, 1758)

We present a genome assembly from an individual male Gadus morhua (the Atlantic cod; Chordata; Actinopteri; Gadiformes; Gadidae). The genome sequence is 669.9 megabases in span. Most of the assembly is scaffolded into 23 chromosomal pseudomolecules. Gene annotation of this assembly on Ensembl identified 23,515 protein coding genes.


Background
Atlantic cod (Gadus morhua) (Figure 1) is an ecologically important keystone marine fish species widely distributed throughout the Northern Atlantic Ocean. It has played a major role in fisheries and trade in Northern Europe for hundreds of years (Star et al., 2017), and is still a highly valued marine resource worldwide. The first version of the Atlantic cod genome, which was released in 2011, was one of the first vertebrate genomes sequenced using only next generation sequencing technologies. Here, we showed that Atlantic cod has lost major histocompatibility complex (MHC) II genes, which are an essential part of the adaptive immune system (Star et al., 2011). Additional genome sequencing of a larger selection of Gadiformes and other fish clades further demonstrated that all Gadiformes species lack MHC II, and that this loss was caused by a single evolutionary event around 80-100 Mya (Malmstrøm et al., 2016). Both the original draft genome and the improved second version genome (gadMor2) showed that Atlantic cod has an unusually high proportion of simple tandem repeats (STRs) compared to most other vertebrates (Star et al., 2011; Tørresen et al., 2017), with putatively important evolutionary implications (Reinar et al., 2023).
Moreover, population genome sequencing in combination with the chromosome-anchored reference genome revealed that three large chromosomal inversions largely distinguish the iconic migratory ecotype, the Northeast Arctic cod, from the stationary, non-migratory Norwegian coastal cod (Berg et al., 2016; Berg et al., 2017). Extending the geographical range further uncovered two genetically distinct co-existing ecotypes in the southernmost fjord systems of Norway: one fjord-type and one more offshore oceanic-type (Barth et al., 2017; Barth et al., 2019; Jorde et al., 2018), with allele frequency differences in a total of four chromosomal inversions as well as differentiation at the genome-wide level (Barth et al., 2017; Barth et al., 2019; Sodeland et al., 2016; Sodeland et al., 2022). Recent studies using the gadMor2 genome have shown that the four large chromosomal inversions most likely arose as separate evolutionary events, from 400,000 years ago to more than 1 Mya (Matschiner et al., 2022).
The second version of the genome assembly (gadMor2) was generated using a combination of 454, Illumina and PacBio reads, with scaffolds anchored into chromosomes based on a linkage map (Tørresen et al., 2017), and had a contig N50 fifty-fold larger than that of the first version. However, further improvements in sequencing technologies have enabled even more complete genome assemblies.

Genome sequence report
The genome was sequenced from one heterozygous male Gadus morhua, individual NEAC_001, a wild-caught male from the Northeast Arctic cod (NEAC) population, estimated to be 8 years of age based on otolith rings. A total of 130-fold coverage in Pacific Biosciences single-molecule long reads, 167-fold coverage in 10X Genomics read clouds and 1416-fold coverage in BioNano reads were generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data (76× coverage). Manual assembly curation corrected 429 missing joins or mis-joins and removed 14 haplotypic duplications, reducing the assembly length by 1.74% and the scaffold number by 42.05%, and increasing the scaffold N50 by 23.03%.
The final assembly has a total length of 669.9 Mb in 226 sequence scaffolds with a scaffold N50 of 28.7 Mb (Table 1). The snail plot in Figure 2 provides a summary of the assembly statistics, while the distribution of assembly scaffolds on GC proportion and coverage is shown in Figure 3. The cumulative assembly plot in Figure 4 shows curves for subsets of scaffolds assigned to different phyla. Most (97.52%) of the assembly sequence was assigned to 23 chromosomal-level scaffolds (Figure 5). Chromosome-scale scaffolds were named based on a genetic map provided by the Jakobsen lab (Table 2). While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited.
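For readers unfamiliar with the scaffold N50 and N90 metrics quoted in this report, the following is a minimal illustrative sketch of how such values can be computed from a list of scaffold lengths. It is not the pipeline used for this assembly; the function name `nx` and the toy lengths are hypothetical and chosen only for illustration.

```python
def nx(lengths, x=50):
    """Return the Nx value: the length L such that scaffolds of length >= L
    together contain at least x% of the total assembly span."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold to the shortest, accumulating span.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Hypothetical toy lengths (arbitrary units), not data from this assembly.
toy_lengths = [50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

Under this definition, a higher N50 for the same total span indicates that more of the assembly is contained in long scaffolds, which is why curation steps that join scaffolds tend to increase it.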

Sample acquisition and nucleic acid extraction
The sequenced cod (NEAC_001) was a wild-caught male from the NEAC population, estimated at 8 years of age based on otolith rings. High molecular weight DNA was extracted from i) flash-frozen blood (at Sanger) and ii) agarose blood plugs (at UiO) from NEAC_001. DNA was dissolved overnight in 1 ml of TE buffer. Quality and quantity of DNA were checked using NanoDrop (NanoDrop Products), PicoGreen Quant-iT™ (Invitrogen) and FLUOstar Optima (BMG Labtech) and through visual inspection of agarose gels.

Sequencing
PacBio data previously generated on the RSII and Sequel systems by the Jakobsen lab at the University of Oslo, Norway, were combined with data from 5 additional SMRTcells generated at the Wellcome Sanger Institute (WSI). In addition, Chromium 10X Genomics data were generated on the Illumina HiSeqX platform at WSI, and BioNano Saphyr DLE maps were produced for structural variant analysis. Arima Hi-C data were generated from heart and gill tissue at the Jakobsen lab and sequenced on Illumina HiSeq. Raw data can be accessed at GenomeArk.
[...] et al., 2020) and BUSCO scores (Manni et al., 2021; Simão et al., 2015) were calculated.
Table 3 contains a list of relevant software tool versions and sources.

Genome annotation
The Ensembl Genebuild annotation system at the EBI (Aken et al., 2016) was used to generate annotation for the Gadus morhua assembly (GCA_902167405.1). Annotation was created primarily through alignment of transcriptomic data to the genome, with gap filling via protein-to-genome alignments of a select set of proteins from UniProt (UniProt Consortium, 2019).

Wellcome Sanger Institute - Legal and Governance
The materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the 'Darwin Tree of Life Project Sampling Code of Practice', which can be found in full on the Darwin Tree of Life website here. By agreeing with and signing up to the Sampling Code of Practice, the Darwin Tree of Life Partner agrees they will meet the legal and ethical requirements and standards set out within this document in respect of all samples acquired for, and supplied to, the Darwin Tree of Life Project.
Further, the Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material
[...] et al., 2021). Manual curation was performed using gEVAL, HiGlass (Kerpedjiev et al., 2018) and Pretext (Harry, 2022). Chromosome-scale scaffolds were named based on a genetic map provided by the Jakobsen lab.
A Hi-C map for the final assembly was produced using bwa-mem2 (Vasimuddin et al., 2019) in the Cooler file format (Abdennur & Mirny, 2020). To assess the assembly metrics,

Software tool Version
The minor comments presented below agree with the previous reviewer. Please check.
The Background reads more like a discussion than an introduction. It introduces other resources, but the reasoning behind a new effort to assemble a genome, while somewhat trivial, is not explained. In addition, it is not clear if some of the results are from previous papers or a result of the current data note.
For example, in the Background Section, 4th sentence: Here, we showed that Atlantic cod has lost major histocompatibility complex (MHC) II genes, which are an essential part of the

Leif Andersson
Department of Biochemistry and Microbiology, Uppsala University, College Station, Texas, Sweden
This brief report documents the release of a high-quality assembly of the cod genome. This will be a very valuable resource for future work on cod biology.

I have only a few comments for minor improvements:
○ Background line 11. The citation to Reinar et al. is a bit misleading since the reader may get the impression that this is a cod paper, but Reinar et al. is a paper on Arabidopsis; this should be rephrased.
○ Your statement that one heterozygous male was sequenced is confusing; all cod are heterozygous at some loci. What do you refer to here? A non-inbred male or a male heterozygous for a particularly important locus?

Reviewer Expertise: Population and Evolutionary Genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
1. Introduction: The introduction reads more like results. While it introduces findings from previous versions of the genome, it should also discuss their limitations and explain how the new version improves upon them.
2. "Here, we showed that Atlantic cod has lost major histocompatibility complex (MHC) II genes,": Clarify whether this finding is original to this study or based on previous research.
3. Biological/Ecological Background: Expand on the biological and ecological significance of the species in the introduction.
4. Mitochondrial Genome: Please double check if the mitochondrial genome was assembled and whether MitoHiFi was used.
Reviewer Expertise: Genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Figure 1. Photograph of Gadus morhua (not the specimen used for genome sequencing). Photograph by Hans-Petter Fjeld.

Figure 2. Genome assembly of Gadus morhua, gadMor3.0: metrics. The BlobToolKit snail plot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 584,119,146 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (2,763,216 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (278,683 and 82,741 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the actinopterygii_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/Gadus%20morhua/dataset/CABHMC01.1/snail.

Figure 5. Genome assembly of Gadus morhua, gadMor3.0: Hi-C contact map of the gadMor3.0 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=N-q-q64SS4-WKerHMnMklw.

5. Specimen Photo: Include a photo of the specimen used for sequencing.
Is the rationale for creating the dataset(s) clearly described? Partly
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.

Is the rationale for creating the dataset(s) clearly described? Partly
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests:
Is that statement a result of the current paper? If so, how was this identified? If not, referring to previous studies, maybe the sentence should be rephrased: "… In Star et al. 2011, it was shown that the Atlantic cod has lost…"
It is not explained whether the mitochondrial genome has been generated, and Table 3 references the MitoHiFi pipeline but does not provide information on the results. If it was generated, please introduce some summary metrics for it.
No competing interests were disclosed.

https://doi.org/10.21956/wellcomeopenres.23364.r91121
© 2024 Andersson L. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original is properly cited.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes
Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
○ It would be useful if the authors could comment on why the QV score (38.6) does not reach the benchmark (>50) despite the extensive amount of sequence data.
○ Table 2. I recommend that a consistent number of decimal places is used for Length in this table.