The genome sequence of the channel bull blenny, Cottoperca gobio (Günther, 1861)

We present a genome assembly for Cottoperca gobio (Günther, 1861) (channel bull blenny; Chordata; Actinopterygii, ray-finned fishes), a temperate-water outgroup for Antarctic notothenioids. The size of the genome assembly is 609 megabases, with the majority of the assembly scaffolded into 24 chromosomal pseudomolecules. Gene annotation of this assembly on Ensembl has identified 21,662 coding genes.


Background
Cottoperca gobio (channel bull blenny) is a member of the Bovichtidae family of the Notothenioidei, a fish group endemic to the Southern Ocean. The Bovichtidae (thornfishes) are considered to be the most basally diverging family of notothenioids and are less adapted to life in extreme cold than Antarctic members of the clade (Near et al., 2015). C. gobio occupies the Patagonian regions of Chile and Argentina, and the area around the Falkland Islands. In contrast to Antarctic notothenioids (cryonotothenioids), the Bovichtidae do not produce antifreeze glycoproteins (AFGPs), a key adaptation to extreme Antarctic cold (Chen et al., 1997; Cheng et al., 2003), and their hemoglobins possess slightly higher oxygen affinity than those of most high-Antarctic species (Giordano et al., 2006; Giordano et al., 2009). Cytogenetic investigation of C. gobio showed that the karyotype of this species consists of 2n=48 chromosomes (Pisano et al., 1995). This condition, shared by other Bovichtidae, is considered to be the ancestral karyotype condition for all notothenioids (Mazzei et al., 2006).
Here, we present a chromosomally complete genome sequence of Cottoperca gobio generated using specimens collected south of the Falkland Islands/Islas Malvinas. We trust that this genome sequence will be used to aid analysis of population structure and phylogeography of non-Antarctic and Antarctic notothenioid fish species, which are increasingly under threat due to climate change and human activities (Dornburg et al., 2017).

Genome sequence report
The C. gobio genome was sequenced from a specimen collected under permits to fish in territorial waters of the Falkland Islands/Islas Malvinas issued by the United Kingdom, by the Falkland Islands Government, and by Argentina. The genome assembly for C. gobio (fCotGob3.1) is based on a combination of data from four technologies, including 75x coverage Pacific Biosciences (PacBio) single-molecule long reads (N50 14 kb), 54x coverage of Illumina data generated from a 10X Genomics Chromium library (estimated molecule length N50 43 kb), and BioNano Saphyr two-enzyme data (BspQI and BssSI). Additionally, 145x coverage of Illumina HiSeqX data were obtained from a Hi-C library prepared by Arima Genomics using tissue from a second individual (fCotGob2, spleen tissue).
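As a rough guide to what these coverage figures mean in sequenced bases, the sketch below multiplies each reported coverage by a genome size. It assumes that the 609 Mb assembly length is an acceptable stand-in for genome size, which the text does not state; the function name `estimated_yield_gb` is illustrative only.

```python
# Rough sequencing-yield estimate: yield ≈ coverage × genome size.
# Assumption (not from the source): the 609 Mb assembly length is used
# as a proxy for the genome size against which coverage was calculated.
ASSEMBLY_LENGTH_BP = 609_000_000

# Coverage values as reported in the text.
COVERAGE = {
    "PacBio CLR": 75,
    "10X Genomics Chromium (Illumina)": 54,
    "Arima Hi-C (Illumina HiSeq X)": 145,
}


def estimated_yield_gb(coverage, genome_size_bp=ASSEMBLY_LENGTH_BP):
    """Return the approximate number of sequenced bases, in gigabases."""
    return coverage * genome_size_bp / 1e9


if __name__ == "__main__":
    for library, cov in COVERAGE.items():
        print(f"{library}: ~{estimated_yield_gb(cov):.1f} Gb")
```

These are order-of-magnitude estimates only; the deposited read sets referenced in Table 1 remain the authoritative record of the data generated.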
The final assembly has a total length of 609 Mb in 322 sequence scaffolds, with a scaffold N50 of 25 Mb (Figure 1; Table 1). The majority (94.36%) of the assembly sequence was assigned to 24 chromosomal-level scaffolds using the Hi-C data (Figure 2; Table 2). The assembly has a BUSCO (Simão et al., 2015) gene completeness score of 93.4% using the actinopterygii reference set (with -sp zebrafish parameter). The chromosomes clearly show a one-to-one relationship with those in the Japanese medaka (Oryzias latipes) HdrR assembly GCA_002234675.1 (Figure 3 and Figure 4): 3671 of the 3780 complete and single-copy BUSCO genes present in both genomes (97.1%) were found on homologous chromosomes, and the C. gobio chromosomes were named correspondingly. Analysis of conserved syntenies detected no major interchromosomal rearrangements in the approximately 195 million years since the divergence of the medaka and C. gobio lineages (Steinke et al., 2006), but did detect many intrachromosomal rearrangements (Figure 4). While not fully phased, the deposited assembly represents one haplotype. Contigs corresponding to the second haplotype have also been deposited.
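For readers unfamiliar with the summary statistics quoted above, the following minimal sketch shows the usual definition of N50 and recomputes the two percentages from the reported counts. The scaffold lengths in the example are toy values, not the real fCotGob3.1 scaffolds, and the helper name `n50` is an illustrative choice.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (arbitrary units), purely for illustration.
toy_lengths = [8, 5, 4, 3, 2]
toy_n50 = n50(toy_lengths)

# Figures reported in the text.
busco_on_homologous = 3671
busco_shared_single_copy = 3780
concordance_pct = 100 * busco_on_homologous / busco_shared_single_copy

assembly_mb = 609
chromosomal_fraction = 0.9436
chromosomal_mb = assembly_mb * chromosomal_fraction

if __name__ == "__main__":
    print(f"toy N50: {toy_n50}")
    print(f"BUSCO genes on homologous chromosomes: {concordance_pct:.1f}%")
    print(f"sequence in chromosomal scaffolds: ~{chromosomal_mb:.0f} Mb")
```

In the toy example the total is 22, so the N50 is the length at which the running sum first reaches 11, which is the second-longest scaffold.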

Gene annotation
An Ensembl annotation was generated for the fCotGob3.1 assembly using RNA-seq data generated from four tissues (brain, muscle, ovary, and spleen). The annotation for assembly fCotGob3.1 was released in Ensembl under database version 99.31 (Hunt et al., 2018) (for fish clade annotation information see 2019-09: fish clade gene annotation). The resulting Ensembl annotation includes 60,811 transcripts assigned to 21,662 coding and 2,823 non-coding genes (Channel bull blenny - Ensembl). A RefSeq annotation is also available as NCBI Cottoperca gobio Annotation Release 100 (Table 2).

Specimen acquisition and nucleic acid extractions
Both specimens used to generate the genome assembly were collected south of the Falkland Islands/Islas Malvinas in 2004 (Lat Long: -52° 40', -59° 12') during the ICEFISH 2004 Cruise (International Collaborative Expedition to collect and study Fish Indigenous to Sub-Antarctic Habitats; led by H. W. Detrich (Detrich et al., 2012)) of the RVIB Nathaniel B. Palmer. Following euthanasia, fresh blood was collected from specimen fCotGob3, and spleen tissue (used for Hi-C) was collected from specimen fCotGob2 and flash frozen in liquid nitrogen. Blood was processed immediately, whereas the flash-frozen spleen was stored at -80°C until processing. For RNA sequencing, tissue samples from two specimens were used (fCotGob2: spleen; fCotGob1: brain, skeletal muscle and ovary). The additional tissues (fCotGob1)
High molecular weight (HMW) DNA from fresh blood cells was prepared using an agarose plug extraction protocol (Smith et al., 2010). Blood DNA was initially stabilised in agarose plugs and then shipped to the Wellcome Sanger Institute, where the final steps of the extraction were performed using a BioNano Tissue extraction protocol. Quality control (QC) of HMW DNA was performed using the Femto Pulse instrument (Agilent). Total RNA was extracted from approximately 20-40 mg each of brain, skeletal muscle, ovary and spleen tissue using the RNeasy extraction kit (Qiagen). QC was performed using the Qubit HS RNA kit and Agilent Bioanalyzer Nano chips. Only extracts with a RIN value >8 were used for sequencing.

Sequencing
PacBio continuous long read (CLR) and 10X Genomics linked-read sequencing libraries were constructed according to the manufacturers' instructions. Sequencing was performed by the Scientific Operations core at the Wellcome Sanger Institute on PacBio SEQUEL I and Illumina HiSeq X instruments. Hi-C data were generated using the Arima Hi-C kit v1 by Arima Genomics. BioNano data were generated on Saphyr (dual enzyme) at Bionano Genomics. RNA-seq was performed on HiSeq 4000 with 150 bp insert paired-end (PE) libraries.

Genome assembly
An initial PacBio assembly was made using Falcon-unzip (Chin et al., 2016) without repeat-masking during overlap detection with Dazzler. The contigs from this assembly were first scaffolded by comparing them to a second wtdbg (Ruan & Li, 2019) assembly using cross_genome, then scaffolded further using the 10X data with scaff10X, and then with BioNano two-enzyme hybrid scaffolding. Contiguity was increased further by filling gaps with the contigs from a second wtdbg assembly, which was made using PacBio reads corrected with Canu (Koren et al., 2017). This assembly was re-polished with Arrow and freebayes, and retained haplotigs were identified with Purge Haplotigs (Roach et al., 2018). Finally, the assembly was scaffolded to chromosomes using Arima Hi-C data with Salsa (Ghurye et al., 2017). The scaffolded assembly was checked for contamination and manually improved using gEVAL (Chow et al., 2016). The manual curation included correcting mis-joins, improving concordance with all available data types, and using the Hi-C 2D map visualised in Juicebox (Durand et al., 2016) to produce complete chromosomal units. Curation resulted in 9 manual breaks, 114 manual joins and the removal of 102 regions representing false duplications, decreasing the scaffold count by 39% to 322 and increasing the scaffold N50 by 68% to 25.2 Mb. The chromosomal-level scaffolds were named based on conserved synteny to the medaka assembly (Oryzias latipes, assembly accession GCA_002234675.1). The genome was further analysed within the BlobToolKit environment (Challis et al., 2020). Software tools and versions used for assembly are listed in Table 3.
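The curation figures above can be cross-checked with simple arithmetic. The sketch below assumes that each manual break adds one scaffold, each manual join removes one, and each removed false duplication was a whole scaffold; these bookkeeping rules are an assumption made for illustration, and the pre-curation values they imply are not reported in the text.

```python
# Curation figures as reported in the text.
BREAKS = 9
JOINS = 114
REMOVED = 102
FINAL_SCAFFOLDS = 322
FINAL_N50_MB = 25.2
N50_INCREASE = 0.68  # reported 68% increase


def implied_initial_scaffolds(final, breaks, joins, removed):
    """Undo the curation steps. Assumption: a break adds one scaffold,
    while a join or a removed duplicate takes one away."""
    return final - breaks + joins + removed


initial_scaffolds = implied_initial_scaffolds(
    FINAL_SCAFFOLDS, BREAKS, JOINS, REMOVED
)
reduction_pct = 100 * (initial_scaffolds - FINAL_SCAFFOLDS) / initial_scaffolds

# Pre-curation N50 implied by the reported relative increase.
implied_initial_n50_mb = FINAL_N50_MB / (1 + N50_INCREASE)

if __name__ == "__main__":
    print(f"implied pre-curation scaffolds: {initial_scaffolds}")
    print(f"implied reduction: {reduction_pct:.1f}%")
    print(f"implied pre-curation N50: ~{implied_initial_n50_mb:.1f} Mb")
```

Under these assumptions the implied reduction in scaffold count agrees with the reported 39%, which suggests the break, join and removal counts are internally consistent with the final scaffold total.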
The C. gobio genome sequencing is part of the Wellcome Sanger Institute's Vertebrate Sequencing project, and of the Vertebrate Genomes Project (VGP) ordinal references programme (Rhie et al., 2020). All raw data and the assembly have been deposited in the ENA. Raw data and assembly accession identifiers are reported in Table 1.

Reporting guidelines
Not applicable.

Consent
Not applicable.