Improvements to the Gulf pipefish Syngnathus scovelli genome

The Gulf pipefish Syngnathus scovelli has emerged as an important species for studying sexual selection, development, and physiology. Comparative evolutionary genomics research involving fishes from Syngnathidae depends on having a high-quality genome assembly and annotation. However, the first S. scovelli genome, assembled using short-read sequences and a smaller RNA-sequence dataset, has limited contiguity and a relatively poor annotation. Here, using PacBio long-read high-fidelity sequences and a proximity ligation library, we generate an improved assembly to obtain 22 chromosome-level scaffolds. Compared to the first assembly, the gaps in the improved assembly are smaller, the N75 is larger, and our genome is ~95% BUSCO complete. Using a large body of RNA-Seq reads from different tissue types and NCBI's Eukaryotic Annotation Pipeline, we discovered 28,162 genes, of which 8,061 are non-coding genes. Our new genome assembly and annotation are tagged as a RefSeq genome by NCBI and provide enhanced resources for research work involving S. scovelli.

For NGx and LGx calculations, S. acus was used as the reference species. All the Syngnathus genomes (except S. scovelli) were last accessed from NCBI on 26 July 2022.
[...] assembly that not only improves completeness and accuracy but is also the most contiguous genome yet produced for the genus Syngnathus (Table 1).

Context
Evolutionary novelties are widespread across the tree of life. However, the origin of de novo genes and their associated regulatory networks, as well as their effects on the phenotype, remain mysterious in most species. Syngnathidae is a family of teleost fishes that includes pipefishes, seahorses, and seadragons [1, 12–16]. Syngnathid fishes are known for their evolutionary novelty with respect to morphology and physiology. For instance, species in this family have variously evolved elaborate leafy appendages, male brooding structures, prehensile tails, elongated facial bones, and numerous other unusual traits [1, 12–14]. With a variety of mating systems and sex roles [12–16], the syngnathid fishes also provide an excellent study system to investigate the generality of theories on sexual selection and reproductive biology [15, 16]. Advances in comparative genomics and the evolutionary developmental biology of novel traits in syngnathids require the development of additional genomic tools. Among these are well-assembled and annotated genomes [1]. Here, we took a step in this direction by producing an improved reference genome for the Gulf pipefish.

DNA and RNA extraction
We collected S. scovelli from the Gulf of Mexico in Florida, USA (Tampa Bay), and flash froze them in liquid nitrogen. We pulverized approximately 50 mg of whole-body tissue (posterior to the urogenital opening) from a single male on liquid nitrogen, which we submitted to the University of Oregon Genomics and Cell Characterization Core Facility (UOGC3F) for high-molecular-weight DNA isolation using the PacBio Nanobind tissue kit. We submitted similar (but unpulverized) frozen tissue from the same individual fish to Phase Genomics to generate a Hi-C library using Proximo Animal (v4) technology.
In addition, we extracted mRNA from the brain, eye, gills, muscle/skin, testis, ovary, brood pouch, and flap tissues by organic extraction with TRIzol Reagent, followed by column-based binding and purification using the Qiagen RNeasy MinElute Cleanup Kit.

Sequencing and assembly
After size selection of the genomic DNA using the Blue Pippin (11 kb cutoff), the UOGC3F constructed a sequencing library using the SMRTbell Express Template Prep Kit 2.0. The UOGC3F sequenced one SMRT cell using PacBio Sequel II technology, yielding 33.39 Gb in 2.05 million CCS reads (out of 6.298 million Hi-Fi reads in total). We sequenced 70.4 Gb of paired-end 150 nucleotide reads (234.6 million in total) from the Hi-C library using an Illumina NovaSeq 6000 at the UOGC3F. The RNA sequencing libraries were prepared using the KAPA mRNA HyperPrep Kit. For annotation, we sequenced 159 bp paired-end reads from the RNA sequencing library of each tissue using an Illumina NovaSeq 6000.
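The yields reported above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original workflow: it assumes that the 234.6 million figure counts Hi-C read pairs and borrows the approximately 380 Mb genome size estimate reported in the assembly statistics; the function names are hypothetical.

```python
def yield_gb(n_pairs, read_len, reads_per_pair=2):
    """Total sequenced bases in gigabases (1 Gb = 1e9 bases)."""
    return n_pairs * reads_per_pair * read_len / 1e9


def fold_coverage(total_gb, genome_mb):
    """Approximate fold coverage: sequenced bases divided by genome size."""
    return (total_gb * 1e9) / (genome_mb * 1e6)


# Assumption: 234.6 million is the number of read pairs, each 2 x 150 nt.
hic_gb = yield_gb(234.6e6, 150)

# Assumption: genome size of about 380 Mb, as estimated for the species.
GENOME_MB = 380
hifi_cov = fold_coverage(33.39, GENOME_MB)
hic_cov = fold_coverage(hic_gb, GENOME_MB)

print(f"Hi-C yield: {hic_gb:.2f} Gb")
print(f"Hi-Fi coverage: {hifi_cov:.1f}x, Hi-C coverage: {hic_cov:.1f}x")
```

Under these assumptions the computed Hi-C yield agrees with the reported 70.4 Gb, which suggests the read count refers to pairs rather than individual reads.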
Using the Hi-Fi sequences, we estimated the genome size using genomescope2 (v2.0, RRID:SCR_017014) [17] and meryl (v2.2) [18] with a default k-mer size of 21 (Figure 1). The paired-end Hi-C reads were trimmed using trimmomatic (v0.39, RRID:SCR_011848) [19] with the parameter HEADCROP:1 to remove the first base, which was of low quality. We then generated the first-pass assembly, that is, the first draft consensus assembly from the Hi-Fi and Hi-C data, by running hifiasm (v0.16.1, RRID:SCR_021069) [18] in Hi-C integrated mode with default parameters. We extracted the consensus genome from hifiasm in fasta format and assembled the contigs into scaffolds using juicer (v1.6, RRID:SCR_017226) [20]. We used the 3D-DNA (version date: Dec 7, 2016) [21] pipeline solely to order the scaffolds. The Hi-C contact map of the ordered scaffolds was visualized using juicebox (v1.9.8, RRID:SCR_021172) with no breaking of the original contigs.
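The step of extracting the consensus sequence in fasta format can be illustrated as follows. This is a minimal sketch, not the authors' script: it relies only on the GFA convention that segment records start with "S" and carry the name and sequence in the second and third tab-separated fields. The contig names and sequences in the toy input are hypothetical.

```python
import io


def gfa_to_fasta(gfa_lines, width=60):
    """Yield FASTA lines for every segment (S) record in a GFA stream."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip non-segment records (e.g. links, read alignments)
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # GFA allows an absent sequence; nothing to write
        yield f">{name}"
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Toy input with hypothetical contig names; a real file would be read from disk.
toy_gfa = io.StringIO(
    "S\tctg1\tACGTACGTAC\tLN:i:10\n"
    "A\tctg1\t0\t+\tread1\t0\t10\tid:i:0\n"
    "S\tctg2\tGGCC\tLN:i:4\n"
)
fasta = list(gfa_to_fasta(toy_gfa, width=8))
print("\n".join(fasta))
```

Streaming the file line by line keeps memory use bounded by the longest contig, which matters for chromosome-scale sequences.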

Annotation using the NCBI Eukaryotic annotation pipeline
The NCBI Eukaryotic Genome Annotation Pipeline (v10.0) is an automated software pipeline that identifies coding and non-coding genes, transcripts, and proteins on complete and incomplete genome submissions to NCBI. The core components of this pipeline are the RNA alignment programs (STAR and Splign) and Gnomon, a gene prediction program. In this pipeline, the RNA-Seq reads from the various tissues (brain, eye, gills, muscle/skin, testis, ovary, brood pouch, and flap) of multiple samples, including the S. scovelli individual used for Hi-Fi and Hi-C sequence data (SRR20438584-SRR20438604), were aligned to the genome. Gnomon combines the information from alignments of the transcripts and the ab initio models from a Hidden Markov Model-based algorithm to create a RefSeq annotation. This RefSeq annotation produces a non-redundant predicted transcriptome and proteome that can be used for various analyses. The Eukaryotic annotation pipeline is not publicly available; thus, we asked the staff at NCBI to annotate the S. scovelli genome.

Assembly statistics
With approximately 2 million Hi-Fi reads and 234.6 million Hi-C reads, we generated the first-pass consensus assembly with 585 contigs. The N50 and L50 for this assembly were 15.5 Mb and 11, respectively. We scaffolded this assembly to correct misassemblies and [...]
Figure. Visualization of contact maps from Hi-C reads for Syngnathus scovelli (v2). The first 22 primary assembly features (blue lines) sum to about 380 Mb in size, which is the estimated genome size for the species. The green lines reflect the individual contigs from the hifiasm assembly that were organized into chromosome-level scaffolds based on Hi-C contact data.
[...] complete with a quality value (QV) of 61.37 and an error rate of 7.3 × 10^-5 % (see GigaDB [28] for more details; Tables 3 and 4).
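For readers less familiar with these statistics, the sketch below shows how N50/L50 (and their genome-size-normalized variants NGx/LGx) are conventionally computed, and how a QV relates to an error rate under the standard Phred scaling QV = -10 log10(error). It is illustrative only: the function names are hypothetical, the contig lengths are toy values, and the source does not state which tool produced the reported figures.

```python
def nx_lx(lengths, x=50, genome_size=None):
    """Return (Nx, Lx) for a list of contig or scaffold lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences covering x% of the total; Lx is the size of that set.
    Passing genome_size uses it as the total instead, giving NGx and LGx.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total * x / 100
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    return None, None  # the assembly never reaches the threshold (NGx only)


def qv_to_error_percent(qv):
    """Convert a Phred-scaled quality value to a per-base error rate in %."""
    return 10 ** (-qv / 10) * 100


toy_lengths = [8, 5, 4, 2, 1]                      # toy values, not real contigs
n50, l50 = nx_lx(toy_lengths)                      # total 20, threshold 10
ng50, lg50 = nx_lx(toy_lengths, genome_size=30)    # threshold 15
error_pct = qv_to_error_percent(61.37)

print(n50, l50, ng50, lg50)
print(f"{error_pct:.2e} %")
```

Applied to the reported QV of 61.37, the conversion gives roughly 7.3 × 10^-5 %, matching the error rate quoted in the text.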
Consistent with the contiguity metrics, the genome is on par with S. acus for BUSCO completeness; the S. acus assembly is also around 95% complete. Missing genes make up the majority of the remaining 5% of genes. We identified genes likely to be truly missing from the S. scovelli genome, and more broadly from members of Syngnathidae (including the seahorses, genus Hippocampus, along with Syngnathus), by confirming their absence across the BUSCO results from the present assembly, four additional members of the genus Syngnathus, and six publicly available Hippocampus assemblies (see GigaDB [28] for additional details). Of the missing BUSCO genes, 83 are shared among all the species of Syngnathus, and 38 are missing from both genera (see GigaDB [28] for additional details). Future work could profitably explore these missing genes, as some may be related to the interesting novel traits in syngnathid fishes.
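The comparison of missing BUSCO genes across assemblies amounts to a set intersection. The sketch below illustrates the idea under stated assumptions: the BUSCO identifiers are invented placeholders, the Hippocampus labels are hypothetical, and the toy sets are not the study's results.

```python
def shared_missing(missing_by_assembly):
    """Return the BUSCO IDs reported as missing in every assembly given."""
    sets = [set(ids) for ids in missing_by_assembly.values()]
    return set.intersection(*sets) if sets else set()


# Toy data: placeholder BUSCO IDs, not taken from any real BUSCO run.
missing = {
    "S. scovelli": {"b1", "b2", "b3", "b4"},
    "S. acus": {"b1", "b2", "b3"},
    "Hippocampus sp. A": {"b1", "b2", "b5"},
    "Hippocampus sp. B": {"b1", "b2", "b3", "b6"},
}

# Genes missing from every Syngnathus assembly in the toy data.
syngnathus_only = {k: v for k, v in missing.items() if k.startswith("S. ")}
shared_in_syngnathus = shared_missing(syngnathus_only)

# Genes missing from every assembly of both genera.
shared_in_both_genera = shared_missing(missing)

print(sorted(shared_in_syngnathus))
print(sorted(shared_in_both_genera))
```

Requiring absence in every assembly is a conservative filter: a gene missed in a single assembly because of a local gap will not be counted as lost in the lineage.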

Annotation results
After masking about 43% of the genome, the annotation predicted 28,162 genes, of which 8,061 are non-coding genes (see GigaDB [28]; Tables 5 and 6).

REUSE POTENTIAL
The new version of the S. scovelli genome should support more accurate comparative genomic analyses and facilitate the creation of robust tools for molecular genetic studies. We generated the original version of the genome to focus on the genetic mechanisms underlying the unique body plan among pipefishes and seahorses. This genome version takes us one step closer to uncovering these evolutionary

DATA AVAILABILITY
The genome is available on NCBI with the assembly accession number GCA_024217435.2.
The genome is annotated via the NCBI eukaryotic genome annotation pipeline, and the annotation report release (100) is available here. Several smaller contigs and contaminant microbes were removed in the annotation pipeline, yielding a more robust genome assembly. The sequence identifiers for the chromosome-level scaffolds are available in GigaDB [28]. The NCBI BioProject accession number is PRJNA851781, the raw Hi-Fi sequence accession is SRR19820733, the Hi-C sequence accession is SRR22219025, and the RNA-Seq sequence files from various tissues are SRR20438584-SRR20438604. Additional data are available in GigaDB [28].

Ethical approval
Not applicable.

Consent for publication
Not applicable.