A chromosome‐anchored genome assembly for Lake Trout (Salvelinus namaycush)

Abstract Here, we present an annotated, chromosome‐anchored, genome assembly for Lake Trout (Salvelinus namaycush) – a highly diverse salmonid species of notable conservation concern and an excellent model for research on adaptation and speciation. We leveraged Pacific Biosciences long‐read sequencing, paired‐end Illumina sequencing, proximity ligation (Hi‐C) sequencing, and a previously published linkage map to produce a highly contiguous assembly composed of 7378 contigs (contig N50 = 1.8 Mb) assigned to 4120 scaffolds (scaffold N50 = 44.975 Mb). Long read sequencing data were generated using DNA from a female double haploid individual. 84.7% of the genome was assigned to 42 chromosome‐sized scaffolds and 93.2% of Benchmarking Universal Single Copy Orthologues were recovered, putting this assembly on par with the best currently available salmonid genomes. Estimates of genome size based on k‐mer frequency analysis were highly similar to the total size of the finished genome, suggesting that the entirety of the genome was recovered. A mitochondrial genome assembly was also produced. Self‐versus‐self synteny analysis allowed us to identify homeologs resulting from the salmonid specific autotetraploid event (Ss4R) as well as regions exhibiting delayed rediploidization. Alignment with three other salmonid genomes and the Northern Pike (Esox lucius) genome also allowed us to identify homologous chromosomes in related taxa. We also generated multiple resources useful for future genomic research on Lake Trout, including a repeat library and a sex‐averaged recombination map. A novel RNA sequencing data set for liver tissue was also generated in order to produce a publicly available set of annotations for 49,668 genes and pseudogenes. Potential applications of these resources to population genetics and the conservation of native populations are discussed.


| INTRODUC TI ON
Key questions in evolution and conservation biology can only be addressed using genomic approaches and appropriate study species.
Lake Trout (Salvelinus namaycush; Figure 1) are a top predator in many lentic ecosystems across northern North America and express exceptional levels of ecotypic variation (Muir et al., 2014, making them an ideal study species for exploring the processes of ecological speciation and adaptive diversification. The post-Pleistocene parallel evolution of diverse Lake Trout ecotypes has been likened to the adaptive radiation of cichlid species in the Great Lakes of east Africa ; however, the radiation of Lake Trout ecotypes appears to have occurred over a relatively short evolutionary timescale (Harris et al., 2015, ~8000 years). At least three distinct Lake Trout ecotypes (lean, siscowet, and humper) once existed throughout the Laurentian Great Lakes (Hansen, 1999) and anecdotal evidence suggests that as many as 10 easily differentiable forms once existed in Lake Superior (Goodier, 1981). High levels of ecotypic variation have also been documented in contemporary populations across the species range (Blackie et al., 2003;Chavarie et al., 2015;Hansen et al., 2016;Zimmerman et al., 2006), with as many as five trophic ecotypes being found in a single lake (Marin et al., 2016). Lake Trout are also ancestrally autotetraploid, with the common ancestor of all salmonids having undergone a whole genome duplication event (WGD) approximately 80-100 million years ago (Berthelot et al., 2014;Lien et al., 2016;Macqueen & Johnston, 2014). For this reason, salmonids have long been considered ideal study species for understanding the evolutionary consequences of WGD (Allendorf & Thorgaard, 1984;Ohno, 1970). Previous studies have demonstrated that salmonid genomes exhibit a mixture of disomic and tetrasomic inheritance (Allendorf & Danzmann, 1997) and have suggested that salmonid homeologs can be partitioned into two broad categories -ancestral ohnologue resolution regions (AORe) and lineage specific ohnologue resolution regions (LORe; Robertson et al., 2017). AORe regions exhibit elevated differentiation between homeologs because these regions returned to a state of disomic inheritance prior to the radiation of salmonid species. Conversely, LORe regions are characterized by extremely low levels of sequence differentiation between homeologs due to delayed rediploidization. Given the high levels of ecotypic diversity observed in Lake Trout, and the potential for WGD to facilitate the evolution of novel phenotypes (Ohno, 1970;Macqueen & Johnston, 2014; Van de Peer et al., 2017) and reproductive isolation (Lynch & Force, 2000), research exploring the genetic basis for ecotypic differentiation and incipient speciation in Lake Trout could provide important insights about the role of LORe and AORe regions in more recent adaptive radiations.
Furthermore, many Lake Trout populations, particularly those in the Laurentian Great Lakes, have been severely reduced in abundance or distribution, or extirpated, due to invasive species introductions and overfishing (Smith, 1968). Following the basin-wide collapse of the lake whitefish (Coregonus clupeaformis) commercial fishery in the Great Lakes during the early 20th century, fishing pressure was transferred to Lake Trout populations, which partially contributed to population declines starting in the 1930s (Hansen, 1999). A novel predator, the sea lamprey (Petromyzon marinus), also invaded the Great Lakes during this time, leading to further increases in adult Lake Trout mortality and functional extirpation from all lakes except Lake Superior and a small, isolated, population in Lake Huron (Hansen, 1999). The restoration program that commenced largely focused on reducing sea lamprey predation, reducing fishing pressure, creating aquatic refuges, and stocking juvenile Lake Trout from a diverse collection of domesticated strains originating from multiple source populations (Hansen, 1999;Krueger et al., 1983). Lake Trout populations in Lake Superior rebounded allowed us to identify homologous chromosomes in related taxa. We also generated multiple resources useful for future genomic research on Lake Trout, including a repeat library and a sex-averaged recombination map. A novel RNA sequencing data set for liver tissue was also generated in order to produce a publicly available set of annotations for 49,668 genes and pseudogenes. Potential applications of these resources to population genetics and the conservation of native populations are discussed.

K E Y W O R D S
genome assembly, genomics, Lake Trout, salmonid, Salvelinus F I G U R E 1 Photograph of an adult Lake Trout (Salvelinus namaycush) from Great Bear Lake, Northwest Territories, Canada. Photo credit: Andrew Muir relatively quickly; however, the re-emergence of natural reproduction in other lakes was hindered by high levels of lamprey predation on adult Lake Trout (Pycha, 1980), predation on juveniles by invasive alewife (Madenjian et al., 2008), reduced juvenile survival caused by thiamine deficiency (Fitzsimons et al., 2009), and potentially reduced hatching success associated with PCB contamination (Mac & Edsall, 1991).
Today, Lake Superior populations remain relatively stable and recruitment has been observed in lakes Huron (Riley et al., 2007), Michigan (Hanson et al., 2013), and Ontario (Lantry, 2015). Recent research suggests that domesticated strains used for reintroduction have variable fitness in contemporary Great Lakes environments (Larson et al., 2021;Scribner et al., 2018) and may be differentially contributing to recent recruitment. However, the biological mechanisms that underly these differences in fitness and recruitment remain unclear.
Genomic and transcriptomic approaches have been widely used to identify loci associated with adaptive diversity and ecotypic divergence in salmonids (Prince et al., 2017;Rougeux et al., 2019;Veale & Russello, 2017;Willoughby et al., 2018). This work has been partially driven by the publication of high-quality genome assemblies and linkage maps for numerous salmonid species De-Kayne et al., 2020;Gagnaire et al., 2013;Lien et al., 2016;Pearse et al., 2019); however, genomic resources are notably lacking for Lake Trout. An annotated, chromosome-anchored, genome assembly is arguably the most valuable resource for advancing genomic research on any species. A publicly available reference genome for Lake Trout would eliminate many challenges associated with conducting conservationoriented genetic research aimed at restoring ecotypic diversity and viable wild populations. Until recently, the assembly of non-model eukaryotic genomes was prohibitively expensive, computationally challenging, and required the collaborative efforts of large genome consortia; however, the development of long-read ("third generation") sequencing technologies has to some extent eliminated these hurdles (Hotaling & Kelley, 2020;Whibley et al., 2020).
Long-read sequencing data can be useful for scaffolding and filling gaps in existing, fragmented, short-read assemblies (English et al., 2012). A number of assembly algorithms also seek to assemble contigs directly from long-read sequencing data (Falcon; Chin et al., 2016;Canu;Koren et al., 2017;wtdbg2;Ruan & Li, 2020) and recent work suggests that this approach can be highly effective for assembling chromosome-anchored salmonid genomes when combined with additional scaffolding information (De-Kayne et al., 2020; also see RefSeq: GCF_002021735.2). Salmonid genomes are highly complex and relatively difficult to assemble owing to the existence of large LORe regions (Robertson et al., 2017) and high repeat content (De-Kayne et al., 2020;Kajitani et al., 2014;Lien et al., 2016). Sequencing low-diversity individuals from inbred lines or homozygous individuals produced via chromosome set manipulations provides one route for simplifying the assembly process and correctly assembling regions with low levels of differentiation between homeologs. Previous salmonid genome assemblies have made use of doubled haploid individuals Lien et al., 2016;Pearse et al., 2019) because these individuals are theoretically homozygous at all loci (but see Hansen et al., 2020).
Additionally, long read sequencing data has been shown to be highly effective for assembling polyploid genomes (Du et al., 2020), and these data would probably improve our ability to resolve LORe regions in salmonids. For instance, De-Kayne et al. (2020) recently published a highly contiguous assembly for European Whitefish (Coregonus sp. balchen); however, this assembly was produced using data from an outbred, wildcaught, individual rather than a double haploid.
Here, we present a chromosome-anchored reference genome for a female Lake Trout that was assembled using Pacific Bioscience long-read sequencing data and scaffolded using a high-density linkage map (Smith et al., 2020) and genome-wide chromatin conformation capture followed by massively parallel sequencing (Hi-C).
We also produced a number of complementary resources including a custom repeat library and an interpolated recombination map in order to facilitate additional research on this important species. A publicly available set of gene annotations was also produced using the NCBI Eukaryotic Genome Annotation Pipeline. Additionally, we identify Lake Trout homeologs resulting from the salmonid specific autotetraploid event (Ss4R) and establish homologous relationships with chromosomes from other salmonid species.

| Crossing and sample collection
Gynogenetic double haploids were produced by fertilizing eggs with UV irradiated sperm, then pressure shocking embryos immediately following the first mitotic division (as described in Limborg et al., 2016;Thorgaard et al., 1983). Double haploid (DH) offspring were created at Iron River National Fish Hatchery using eggs and sperm collected from captive adult Lake Trout from the U.S. Seneca Lake brood stock. The U.S. Seneca Lake hatchery strain was entirely founded by early Autumn spawning Lake Trout initially collected from Seneca Lake, New York (see Page et al., 2003 andIhssen, 1995). Due to low survivorship of DH offspring (Komen & Thorgaard, 2007), we tested multiple UV and pressure shock treatments on eggs from five different females. Batches of 900 eggs from each female were fertilized with sperm that was irradiated for 140, 280, or 1260 s. Each batch was then split and subbatches were pressure shocked at 11,000 PSI for 5 min at either 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 h post-fertilization. A total of 13,500 eggs were exposed to various UV and pressure shock treatments. One batch of 900 eggs from each female was also exposed to a control treatment which involved no sperm irradiation or pressure shock. Embryos were incubated in heath trays at ambient temperature until eye-up stage (E 3 6 per Balon, 1980), with dead embryos being removed from trays on a daily basis. A single individual that survived past post-embryo stage (sensu Marsden et al., 2021) was grown to a size of approximately 5 cm before being sampled post-mortality and stored at -20°C. The postembryo stage in Lake Trout is characterized by a fully absorbed yolk sac, parr marks, and an inflated gas bladder (Marsden et al., 2021).

| Laboratory methods
High molecular weight (HMW) DNA was extracted from white muscle sampled from the DH individual using a MagAttract HMW DNA Extraction kit (Qiagen). The manufacturer's recommended protocol was used except tissue digestion was done at room temperature for 140 min rather than 12-16 h at 55°C. Fragment size and yield were determined using pulse field gel electrophoresis and an AccuClear Ultra High Sensitivity DNA Quantification assay (Biotium). Prior to sequencing and assembly, we verified that the DH individual was completely homozygous at 15 microsatellite loci that are typically highly heterozygous in Lake Trout populations (Valiquette et al., 2014). A long-read sequencing library was then prepared using the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences), with the optional DNA Damage Repair step after size selection. Size selection was made for fragments >10 kb using a Blue Pippin instrument (Sage Science) according to the manufacturer recommended protocol for 20 kb template preparation. We used 5 µg of concentrated DNA as input for the library preparation reaction. Library quality and quantity were assessed using a genomic DNA Tape Station assay (Agilent), as well as Hi-C proximity ligation libraries were generated using tissue from a 7-year-old diploid female Lake Trout originating from the Killala Provincial hatchery strain. Four Hi-C libraries were prepared using spleen and white muscle tissue using the Arima Hi-C kit according to the manufacturer's protocol (A510008, Arima-HiC_AnimalTissue_ A160132_v00, Arima Genomics) and library preparation kits from Kapa Biosystems and Lucigen. Each Hi-C library was spiked into a portion of an Illumina HiSeqX lane in order to assess how effectively reads could be mapped against the draft contig assembly. hicup version 0.7.2 (Wingett et al., 2015) within genpipes version 3.1.5 (Bourgey et al., 2019) was used to map Hi-C sequencing reads against draft contigs. The Hi-C library prepared using muscle tissue and prepared using the Arima-Hi-C and Lucigen Kits, was selected for further sequencing given that this library produced the highest proportion of reads mapped to draft contigs. This kit employs a restriction enzyme cocktail that digests chromatin at N^GATC and G^ANTC sequence motifs. The selected library was sequenced to high coverage in a single HiSeqX lane using the 2X150 bp paired end read format.
Sequencing produced 182,781,953 paired end reads.
DNA was also extracted from fin tissue collected from an adult (diploid) female Lake Trout from the Seneca Lake broodstock using a MagAttract HMW DNA extraction kit (Qiagen) and protocol recommended by the manufacturer. Sequencing reads from this Seneca strain female were later used for contig polishing and correction (described below in Assembly and Scaffolding). The library was prepared using 100 ng of input DNA and the NEBNext Ultra Library Preparation Kit for Illumina (New England Biolabs).
The library was sheared to approximately 400 bp using a Covaris M220 Ultrasonicator, amplified for eight cycles, and quantified using Quant-It Picogreen dsDNA assays (Thermo Fisher) run in triplicate.
Fragment size was assessed using a genomic DNA Tape Station assay (Agilent). The library was sequenced in multiplex with three other Lake Trout in two HiSeqX lanes using the paired end 2 × 150 read format. Sequencing produced 316,557,707 read pairs for this individual.

| Assembly and scaffolding
Contig assembly using PacBio reads was carried out using the pol-ished_falcon_fat assembly workflow run using the smrt analysis v3.0 pbsmrtpipe workflow engine provided with an installation of smrt link v5.0 (smrtlink-release_6.0.0.47841; https://github.com/Pacif icBio scien ces/pbsmr tpipe). Read metadata were extracted using the smrt analysis v3.0 data set tool with the merge option. Sequencing read metadata, pipeline settings, and an output directory were specified for the polished_falcon_fat pipeline option. Default assembly settings were used except genome size (HGAP_GenomeLength_str) was set to 3 gigabases (Gb), seed coverage (HGAP_SeedCoverage_ str) was set to 40x, and the minimum read length to use a read as a seed (HGAP_SeedLengthCutoff_str) was set to 1000. Multiple settings were also changed. The resulting assembly settings file, read metadata file, and commands used to run the pipeline are available at https://github.com/smith sr90/LakeT routG enome.
The polished_falcon_fat workflow uses FALCON assembly algorithm (Chin et al., 2013) and the quiver/arrow consensus tool (https:// github.com/Pacif icBio scien ces/Genom icCon sensus) to generate a polished contig assembly. The Falcon method operates in two phases: First, overlapping sequence reads were compared to generate accurate consensus sequences with read N50 >10.9 Kb. Next, overlaps between the corrected longer reads were used to generate a string graph. The graph was reduced so that multiple edges formed by heterozygous structural variation were replaced to represent a single haplotype. Contigs were formed by using the sequences of nonbranching paths. Two supplemental graph cleanup operations were applied to improve assembly quality by removing spurious edges from the string graph: tip removal and chimeric duplication edge removal. Tip removal discards sequences with errors that prevent 5′ or 3′ overlaps. Chimeric duplication edges may be produced due to the production of chimeric molecules during library preparation or during the first sequence cleanup step and these errors artificially increase the copy number of a duplication. In a second and final workflow stage, the polished_falcon_fat workflow used the arrow consensus tool to perform error correction on the assembly using PacBio reads in order to generate an initial polished assembly. The resulting contigs were passed through a second round of error correction using Pilon in order to resolve SNP, indel, and local assembly errors before proceeding with scaffolding (https://github.com/broad insti tute/ pilon). The Illumina paired-end sequencing data set from a Seneca strain female (described above) was mapped to draft contigs using BWA mem with default settings (Li, 2013). Reads with mapping qualities <20 were removed from the data set in order to exclude low quality alignments and reads mapping to multiple locations. Improperly paired reads were also excluded using samtools view (Li et al., 2009). The resulting filtered bam file was used as input for Pilon with the --fix all --mindepth 5, and --diploid options. Pilon was run prior to scaffolding in order to identify and correct local assembly errors that could potentially cause downstream scaffolding errors. We adopted a multifaceted scaffolding approach leveraging information from Hi-C sequencing and a high-density linkage map for Lake Trout (Smith et al., 2020). Hi-C reads were mapped to Pilon corrected contigs with default setting using the Arima Genomics Mapping pipeline (Arima Genomics, https://github.com/Arima Genom ics/mappi ng_pipeline), which included four primary steps. First, forward and reverse reads were mapped to the reference genome using bwa version 0.7.17 (Li, 2013) separately. Next, the 5' end of the mapped reads were trimmed. samtools version 1.9 (Li et al., 2009) was then used to filter reads with mapping qualities <10 in order to remove low quality alignments and reads mapping to multiple locations. Finally, picard version 2.17.3 (https://broad insti tute.github.io/picar d/) was used to add read group information and mark duplicate reads. The resulting BAM file was used as input for salsa v2.2 (Ghurye et al., 2017) run with default settings (three iterations). We also tested Salsa2 using five iterations and compared results with those produced using default settings by calculating Spearman's rank order correlation coefficients between the order of loci on the Lake Trout linkage map (Smith et al., 2020) and the order of loci on the 50 largest scaffolds. Linkage mapped RAD contigs were aligned to the reference assembly using Minimap2 (Li, 2018) using the -asm5 option. RAD contigs with mapping qualities less than 60 were removed before calculating correlation coefficients using the r function cor from the stats package (R Core Team, 2017) and the method argument set to "spearman." Additional scaffolding was carried out using chromonomer v1.13 (Catchen et al., 2020). The assembly was initially scaffolded using default settings, which yielded chromosome length scaffolds with a high degree of concordance with the linkage map; however, structural differences between the linkage map and scaffolds were apparent on six chromosomes. In order to resolve these inconsistences, we aligned the full set of PacBio subreads to the assembly using minimap2 (Li, 2018) using the preset option for PacBio data. The resulting bam file was sorted, indexed, and per-base coverage was calculated for all positions using samtools depth with the --aa option. We then ran a second round of Chromonomer using the --rescaffold, --depth, and depth_stdevs = 2 options, which allowed for gaps to be opened in contigs if the site-specific depth within a sliding window of 1000 base pairs was >2 standard deviations from the mean, suggesting an assembly error. This resulted in an assembly with improved concordance with the linkage map; however, linkage group 41 still exhibited a large inversion relative to the scaffolds. We determined the approximate location of this assembly error by identifying the pair of linkage mapped loci for which the level of discordance between the linkage map and assembly was maximized. The scaffold was manually broken and reoriented using an existing gap that existed between these two loci.
Gaps were filled using pbjelly from pbsuite v15.8.24 (English et al., 2012). All PacBio reads were aligned to the draft assembly using Minimap2 using the -pb preset option and reads mapping within 5000 base pairs of a gap were retained for gap filling using bedtools intersect (Quinlan & Hall, 2010). Retained reads were remapped with blasr v5.3.2 (Chaisson & Tesler, 2012) using the options --minMatch 11, --minPctIdentity 75, --bestn 1, --nCandidates 10, --maxScore --500, and --fastSDP. The "maxWiggle" argument was set to 100 kilobases (Kb) for the PBJelly assembly stage in order to account for gaps of unknown length. After filling gaps, we corrected single nucleotide and short indel errors by running 3 iterations of Polca (distributed with masurca v. 3.4.2; Zimin & Salzberg, 2020) using Illumina data from a Seneca strain female as input. Polca was chosen because this error correction approach has been shown to be more effective for correcting single nucleotide and indel errors than comparable tools (Zimin & Salzberg, 2020). Default settings were used except low quality alignments (MQ < 10) and alignments overlapping gaps were removed from bam files using bedtools intersect (Quinlan & Hall, 2010) prior to running the Polca variant calling step.
Illumina paired end data from the same individual used for genome polishing and PacBio data from one SMRTcell were aligned to the Arctic Char (Salvelinus alpinus) mitochondrial genome (RefSeq: NC_000861.1) in order to obtain reads useful for assembling the Lake Trout mitochondrial genome. Reads were aligned using Minimap2 using the sr and map-pb present options for short-reads and long-reads, respectively.
Reads aligning to the Arctic Char mitochondrial genome were extracted from original fastq files using seqtk subseq (https://github.com/lh3/ seqtk) and hybrid assembly was conducted using unicycler v0.4.8 (Wick et al., 2017) using the settings --min_fasta_length 15000 and --keep 0. Unicycler implements a hybrid-assembly approach using Spades (Bankevich et al., 2012), SeqAn (Döring et al., 2008), and Pilon. First, spades (v3.13.1) was used to assemble Illumina short-reads and contigs with graph coverage less than half the median coverage were removed due to potential contamination from the nuclear genome. Contigs were then scaffolded using long-reads and SeqAn (Döring et al., 2008) was used to generate gap consensus sequences. Finally, Pilon was used to resolve assembly errors using short-read alignments as input. The resulting mitochondrial genome assembly was aligned to all salmonid sequences in the NCBI nucleotide collection using blastn and standard settings in order to verify that it was consistent with previous Lake Trout mitochondrial assemblies. A neighbour-joining tree constructed from blast pairwise alignments was exported from the NCBI website and is available in Figure S6.

| Assembly quality control
We used multiple approaches to assess the accuracy, contiguity, and completeness of the genome assembly. First, we determined the proportion of the genome that was recovered in our assembly by comparing total assembly size with an estimate of genome size based on the distribution of k-mer frequencies from Illumina pairedend 2 × 150 data generated using DNA from a Seneca strain female.
The frequency of all 19 mers in the read data was calculated using the count function in jellyfish v2.2.6 (Marçais & Kingsford, 2011) with the options -m 19 and -C. K-mer counts were then exported to the histogram format using the histo function. This file was used as input for genomescope v1.0 (http://qb.cshl.edu/genom escop e/; Vurture et al., 2017) with read length set to 150 bp and k-mer length set to 19.
Basic assembly statistics were calculated using the program summarizeassembly.py from pbsuite v15.8.24 (English et al., 2012). Statistics included total assembly size, contig and scaffold N50s, and minimum and maximum contig and scaffold lengths. Assembly statistics were calculated with and without gaps. Contig and scaffold N50s and counts were obtained for 14 additional salmonid assemblies from NCBI for comparison. Single base consensus accuracy was estimated during each iteration of polishing with Polca as the proportion of bases in input sequences overlapping detected errors.
Next, we calculated percentages of complete singleton, complete duplicated, fragmented, and missing Benchmarking Single-Copy Orthologues (BUSCOs) for seven chromosome-level salmonid assemblies and compared these with scores for the Lake Trout assembly discussed here. These included genomes for Brown Trout  . It should be noted that the assembly originally produced for Arctic Char (GCA_002910315.1; Christensen, Rondeau, et al., 2018, re-ferred to as the Dolly Varden assembly here) was later found to be from a Dolly Varden or potentially a Dolly Varden -Arctic Char hybrid (see Shedko, 2019 andChristensen et al., 2021). BUSCO scores were also calculated for the Northern Pike genome (Esox lucius;

GCA_000721915.3; Rondeau et al., 2014), a member of the order
Esociformes that is commonly used as a pre-Ss4R outgroup species.
Finally, we aligned the linkage mapped contigs from Smith et al. (2020) to the final assembly and calculated Spearman's rank order correlation coefficients between physical mapping locations and the order of loci along linkage groups. Linkage mapped contigs were aligned to the reference assembly using Minimap2 using the -asm5 preset parameters and the resulting sam file was filtered to exclude contigs with mapping qualities less than 60. Correlation coefficients were calculated using the cor function in r (R Core Team, 2017) with the method argument set to "spearman." Correlation coefficients were then converted to absolute values using the abs function in order to compare chromosomes and linkage groups with reversed orientations.

| Repetitive DNA
A custom repeat library was created using repeatmodeler v2.0.1 (Flynn et al., 2020) and repeats were subsequently classified using repeatclassifier (Smit et al., 2015). Repeats were then masked using repeatmasker (Smit et al., 2015) and the output of repeatmasker was used to determine the genome-wide abundance of different repeat families and the relative density of repeat types across chromosomes. The density of the most abundant repeat type (Tcl-mariner) was visualized across chromosomes using the r-package circlize (Gu et al., 2014; Figure 2).

| Homeolog identification and synteny
We performed a self-versus-self synteny analysis using symap v5 (Soderlund et al., 2006(Soderlund et al., , 2011 to identify Lake Trout homeologs resulting from the salmonid specific autotetraploid event (Lien et al., 2016;Macqueen & Johnston, 2014). Prior to running symap, we hardmasked the genome using repeatmasker v4.1.0 (Smit et al., 2015) using our custom repeat library as input and RMblast as the search engine (-e ncbi). Nucmer (Marçais et al., 2018) was used for SyMap alignments and options were set to min-dots = 30, top_n = 2, and merge_blocks = 1. We then used Symap to identify blocks of synteny between Lake Trout and Dolly Varden, Rainbow Trout, and Atlantic Salmon. These alignments were conducted using Promer (Marçais et al., 2018), and we used the options min_dots = 30, top_n = 1, merge_blocks = 1, and no_overlapping_blocks = 1. Results from self-versus-self synteny analysis were visualized using the r package circlize (Gu et al., 2014). Additionally, we identified syntenic relationships with Northern Pike using synmap2 (Haug-Baltzell et al., 2017). We used the last algorithm to align genomes, DAGChainer to identify syntenic blocks (-D20, -A5), Quota Align Merge to merge syntenic blocks (-Dm 0), and Quota Align (Overlap Distance = 40) to enforce a 1-to-2 ploidy relationship between Northern Pike and Lake Trout (Haas et al., 2004;Tang et al., 2011). We also repeated our self-versus-self synteny analysis using SynMap2 while enforcing 2-to-2 synteny relationships. Sequence identity between homeologs was extracted from the output of SynMap2 for all merged regions composed of more than 1000 blocks. We then computed the moving average of local homeolog identity across chromosomes using sliding windows containing 200 blocks. We then fit a Gaussian mixture model to the distribution of homeolog identities using mixtools (Benaglia et al., 2009) and the function normalmixEM (k = 2) after observing that the values were bimodally distributed. Posterior probabilities of assignment to clusters with high and low homeolog divergence were determined for each window in addition to cluster means and mixing proportions (lambdas) for the data set. Results from the Lake Trout-versus-other salmonids synteny analysis were visualized using the Chromosome Explorer option in symap v5. Syntenic relationships between Lake Trout and Northern Pike were visualized as a dotplot generated in r ( Figure S5). The fourth ring displays local homeolog identity between syntenic blocks detected by SynMap2. Red points correspond to windows with elevated sequence identity putatively resulting from delayed rediploidization (posterior probability >0.5). Blue points correspond to windows with elevated sequence divergence between homeologs. (E) The fifth ring displays map distance (centimorgans) for male (red) and female (blue) linkage maps (y-axis) versus physical distance (x-axis) for each of the 42 chromosomes. Connections are drawn between syntenic blocks identified by symap v5 putatively resulting from Ss4R wipe and stored in RNAlater (Invitrogen, Thermo Fisher Scientific) for 24-48 h at room temperature. RNALater was pipetted from the liver tissue and the samples were stored at -80°C until RNA isolation.

| RNA sequencing and gene annotation
Liver tissues were homogenized individually in 2 ml Lysing Matrix D tubes (MP Biomedicals) with 1 ml of Trizol reagent (Invitrogen, Thermo Fisher Scientific). RNA was extracted from the homogenate using phenol-chloroform extraction (Chomczynski & Sacchi, 2006). RNA was precipitated with RNA precipitation solution (Sambrook & Russell, 2001) and isopropanol, and washed with 75% ethanol. RNA samples were resuspended in nuclease-free water (Thermo Fisher Scientific). The purity and concentration of the RNA were initially determined using a NanoDrop-8000 spectrophotometer. RNA quality was also assessed using a Bioanalyzer (Agilent) and resulting RNA integrity numbers (RIN). All RNA samples met our minimum RIN threshold of 7.5.

| Recombination rates and centromeres
Sex averaged recombination rates were estimated across chromosomes using the sliding window interpolation approach implemented in mareymap (Rezvoy et al., 2007). Restriction site associated DNA (RAD) contigs from the Lake Trout linkage map (Smith et al., 2020) were mapped to chromosomes using minimap2 using the -asm5 preset option and reads with mapping qualities <60 were removed. At this point, RAD loci overlapping centromere mapping intervals for each linkage group were extracted and the centromere centre was considered to be the mean mapping position for centromere associated RAD tags. Centromere positions were visualized using the r-package circlize (Gu et al., 2014).
In order to remove contigs with anomalous mapping positions that could bias recombination rate estimates, we fit a loess model describing linkage map position as a function of physical position for each chromosome, extracted model residuals, and removed markers with residuals that were greater than one standard deviation from the mean. Loess models were fit using the loess function in r with the span argument set to 0.2 and the degree argument set to 2.
The remaining markers were output to MareyMap format and were manually curated using mareymap Online (Siberchicot et al., 2017). A sex averaged recombination map was calculated using sliding window interpolation and exported from the program (Appendix S1 -Recombination Map).

| Sequencing, assembly and scaffolding
Of 13,500 embryos exposed to UV irradiation and pressure shock treatments, only two individuals survived beyond post-embryo stage and one individual survived to a size of approximately 5 cm.
This individual was found to be homozygous at all 15 genotyped microsatellite loci, suggesting that chromosome set manipulations were successful at inducing double haploidy. HMW DNA extraction yielded a DNA concentration of 70 ng/µl based on nanodrop TA B L E 1 General summary statistics for the Lake Trout (Salvelinus namaycush) genome assembly. The total number of chromosomes, scaffolds (including chromosomes), and contigs are listed in the top row. Metrics reported for chromosomes and scaffolds include gaps of unknown length. Consensus accuracy was obtained from the output of POLCA after running three iterations of the program

| Assembly quality control
We estimated the total haploid genome size for Lake Trout to be between 2.119 and 2.122 Gb using k-mer analysis and GenomeScope v1.0, with 38% of the genome composed of unique sequence and 62% composed of repetitive sequence (Table S1, Figure S1). Heterozygosity for the sample used for polishing was estimated to be between 2.78 and 2.9 heterozygous sites per 1000 base pairs. It should be noted that the individual used for polishing was a diploid and not a gynogenetic double haploid. The estimated coverage for the sample used for genome-size estimation was 16×, which should be sufficient for k-mer based methods (Williams et al., 2013).
We recovered 93.2% of BUSCO genes with 60.3% and 32.9% being present as singletons and duplicates, respectively ( Figure 3; Table S2). The salmonid genomes evaluated recovered between 88.1% and 95.3% complete BUSCOs with between 25.3% and 34.9% being duplicated and between 58.3% and 65% being singletons. The proportion of duplicated BUSCOs in the Lake Trout genome was the second highest among the compared salmonid genomes (32.9%) and appears to be comparable to the Brown Trout genome (GCA_901001165.1; River Trout), which was also assembled using Falcon (Falcon-unzip) and polished using a method based on the Freebayes variant caller (Garrison & Marth, 2012).
We found that the mitochondrial genome assembly produced here falls within a monophyletic group entirely composed of mitochondrial sequences previously generated for Lake Trout ( Figure S6; Schroeter et al., 2020). The assembly was most similar to one produced for a Lake Trout sampled from Lake Ontario, Pennsylvania (99.96% Identity; Accession: MF621746.1). The Seneca Lake hatchery strain is heavily stocked in Lake Ontario and appears to have elevated fitness in this environment (Perkins et al., 1995).

| Repetitive DNA
RepeatModeler 2 identified 2810 interspersed repeats and 462 of these were classified by RepeatClassifier. RepeatMasker reported that 53.8% of the Lake Trout genome is composed of sequences from this repeat library. A total of 13.04% of the genome was composed of retroelements, with 10.47% being LINEs and 2.57% being LTR elements, and 9.97% of the genome was composed of DNA transposons. As has been observed in other salmonids, TcMar-Tc1 was the most abundant superfamily and these repeats were most abundant near centromeres (Figure 2; Lien et al., 2016;Pearse et al., 2019). A total of 30.79% of the genome was composed of interspersed repeats that were not classified by RepeatClassifier (Table 2).

| Homeolog identification and synteny
Self-versus-self synteny analysis conducted using symap v5 identified 126 syntenic blocks shared between putative Lake Trout homeologs ( Figure 2). Blocks ranged in size from 477,153 bp to 57,126,662 bp.
Fifty-two blocks were longer than 10 Mb and 70 were longer than 5 Mb (Figure 2, inner links). The distribution of local homeolog identity was bimodal and our Gaussian mixture model estimated that the means of these two distributions were 82.72% and 90.64%. The lambda estimates for the model were 0.5848 and 0.4152 suggesting that approximately 41.52% of the Lake Trout genome exhibits a signal of delayed rediploidization (Figure 2d). Loci with elevated homeolog sequence identity were primarily located near the telomeres of metacentric chromosomes (Chr1-Chr8); however, one pair of acrocentric homeologs (Chr23 and Chr34) also exhibited elevated sequence identity.
We identified 50 syntenic blocks shared between Rainbow Trout and Lake Trout and identified homologous Rainbow Trout chromosomes for all Lake Trout chromosomes. Syntenic blocks shared between these two species ranged in size from 1. Fifty-four syntenic blocks were detected between these species that ranged in size from 208,516 bp to 88 Mb. We identified 42 syntenic blocks shared between Dolly Varden and Lake Trout and identified homologues for all chromosomes except Chr 41.
Syntenic blocks ranged in size from 6.8 Mb to 79.9 Mb. Pre-Ss4R ancestral chromosomes were also detected in Northern Pike ( Figures S2-S5).

| Genome annotation
We generated a total of 3.45 billion RNA-seq reads from liver tissue that were subsequently used as input for the NCBI Eukaryotic Genome Annotation Pipeline v8.5 (9 July 2020 release date). An additional 528,760 reads were used from previous Lake Trout gene expression studies. A total of 86% of reads were aligned to the genome assembly, and 12 Lake Trout transcripts from GenBank and 3547 known Atlantic Salmon transcripts from RefSeq were ultimately used as input for the pipeline.

| Recombination rates and centromeres
We were able to map between 1 and 238 centromere-associated In all, 14,438 linkage-mapped contigs were mapped to the genome with mapping qualities greater than 60 (Figure 2e). A total of 11,232 loci were retained for recombination rate estimation after manual curation and filtering using loess model residuals. We determined the mean sex averaged recombination rate to be 1.09 centimorgans/Mb, with recombination rates varying between 0 and 6.58 centimorgans/Mb across the genome. The interpolated recombination map produced by MareyMap is available in Appendix S1 -Recombination Map. Overall, Lake Trout BUSCO scores were most similar to those obtained for Brown Trout; however, the proportion of missing BUSCOs was 1.9% higher for Lake Trout and the proportion of complete duplicated BUSCOs was 2% lower suggesting that some duplicated regions might be missing from the Lake Trout genome. Nonetheless, these two assemblies had the highest percentage of complete BUSCOs and the highest percentage of complete duplicated BUSCOs out of the genome assemblies examined, suggesting that these two assemblies more effectively resolve LORe regions with high sequence similarity.

| DISCUSS ION
Furthermore, the order of loci on the Lake Trout linkage map and the order of loci on Lake Trout chromosomes was shown to be highly concordant; however, it should be noted that the linkage map cannot be considered an independent source of validation. The genome presented here is also highly contiguous, with a contig N50 higher than any published salmonid genome at the time of analysis (but see the recently released assemblies for Arlee Strain Rainbow Trout -GCF_013265735.2 and Atlantic Salmon -GCA_905237065.2).
Interestingly, the PacBio data used for assembly were of similar coverage to the data used for assembling the European Whitefish genome (De-Kayne et al., 2020); however, the Lake Trout genome contig N50 is >3x higher (1.8 Mb vs. 0.53 Mb; Table S3). It is worth noting that this assembly was produced using a different assembler (wtdbg2; Ruan & Li, 2020); however, an assembly with a contig N50 of 211 Kb was also generated from these data using Falcon (De-Kayne et al., 2020). There are at least two reasonable explanations for the pronounced difference in contig N50 between the Lake Trout genome and European Whitefish assemblies produced using Falcon and wtdbg2. First, the European Whitefish genome was assembled using DNA from a wild-caught, outbred individual rather than a double haploid. Second, the European Whitefish genome was not gap filled after scaffolding. Gap filling the Lake Trout genome with PBJelly increased contig N50 by 561,496 bp, which partially explains the difference.
Additionally, our analysis of homeolog sequence identity across the Lake Trout genome indicates that regions exhibiting delayed rediploidization (i.e., LORe regions; Robertson et al., 2017) are primarily associated with metacentric chromosomes and their acrocentric homeologs in Lake Trout. We identified one pair of acrocentric homeologs with elevated sequence identity (Chr23 and Chr34). Similar to Lien et al. (2016), these results suggest that homeologous pairing might not necessarily require one chromosome to be metacentric as suggested by Kodama et al. (2014). Interestingly, the region with elevated homeolog sequence identity on chromosome 4 does not appear to be associated with one of the telomeres. A previous quantitative trait locus mapping study suggested that this chromosome harbours the sex determining gene (SdY) in Lake Trout (Smith et al., 2020).
It is important to note that the assembly presented here does not represent a true haploid assembly even though contigs were assembled using DNA from a double haploid individual. The assembly was error corrected using sequencing reads from a diploid female, the linkage map was generated using families from multiple hatchery strains (Smith et al., 2020), and Hi-C data were generated from a diploid individual from a separate strain. Therefore, the chromosome sequences presented here represent consensus sequences for female Lake Trout (from the Seneca L. strain) rather than haplotypes existing within the DH individual we sequenced. Additionally, PacBio assemblies are known to have an elevated prevalence of short indel errors relative to short read assemblies. These errors can interfere with annotation and necessitate error correction using short-read data (Watson & Warr, 2019). For this assembly, we excluded multimapping reads with low mapping qualities in order to avoid homogenizing variation between Nonetheless, the genome presented here represents a significant improvement compared to existing genomic resources for the genus Salvelinus (Figure 3, Tables S2 and S3). Improvements could probably be made to the assembly using supplementary scaffolding resources such as a higher density linkage map or optical map (Pan et al., 2020). The annotation could also be improved by generating additional RNA-seq data. Nevertheless, the number of annotated genes and pseudogenes . These annotations were produced using RNA-seq evidence from a greater diversity of tissue types, which probably explains this discrepancy. The Lake Trout annotation, as well as annotations for other salmonids, could also be further improved by directly sequencing full length transcripts using long-read sequencing technologies (Workman et al., 2019). We predict that the completeness of the Lake Trout genome annotation will be improved as more gene expression data from a greater diversity of tissue types becomes available for the species (Salzberg, 2019). Nonetheless, the current genome annotation will undoubtably aid in the interpretation of future findings by allowing researchers to link signals of selection and loci associated with phenotypes with putatively causal genes and biological processes. Publicly available gene expression and functional annotation resources, like those being developed by the functional annotation of all salmonid genomes (FAASG) initiative, will also aid in this effort .
The availability of a second high-quality genome assembly for a Salvelinus species will probably benefit comparative genomic research aimed at understanding the evolutionary consequences of genome duplication. Salmonids have long been appreciated as a model system for understanding evolution following whole genome duplication (Ohno, 1970) and a variety of recent studies have utilized the wealth of genomic resources for salmonids to shed light on the evolutionary processes at play following autotetraploid genome duplication events (see Gillard et al., 2021;Gundappa et al., 2021). Additionally, multiple recent studies have highlighted the importance of structural genetic variation for promoting adaptive diversification within salmonid species (Bertolotti et al., 2020;Pearse et al., 2019), and chromosome-anchored genome assemblies are typically needed for detecting and genotyping structural variants (Mérot et al., 2020).
Genomic methods have dramatically increased the precision of population genetic analyses and have enabled researchers to address qualitatively unique questions that require some knowledge of genome structure and function (Waples et al., 2020). The genome assembly presented here will enable researchers to identify loci and candidate genes associated with phenotypic differentiation and reproductive isolation among Lake Trout ecotypes. Additionally, this resource will allow for the identification of loci associated with variation in fitness between Lake Trout hatchery strains in contemporary Great Lakes environments (Larson et al., 2021;Scribner et al., 2018) and loci that are adaptively diverged between hatchery strains. This information could help fisheries managers to maximize adaptive genetic diversity in reemerging wild populations and prioritize hatchery populations for continued propagation. Overall, the availability of a high-quality reference genome for Lake Trout will probably have important implications for ongoing conservation projects in the Great Lakes region and elsewhere.

ACK N OWLED G EM ENTS
We would like to thank hatchery personnel at Iron River National Fish

CO N FLI C T O F I NTE R E S T
The authors declare no conflicts of interest. carried out long-read data quality control and long-read contig assembly. P.M.N. carried out Hi-C scaffolding. C.M.P. generated RNA sequencing data and assisted with drafting the manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
The Lake Trout whole genome assembly project has been deposited at DDBJ/ENA/GenBank under the accession JAEAGN000000000. The version described in this manuscript is version JAEAGN010000000.
The GenBank assembly accession is GCA_016432855.