Construction and Annotation of a High Density SNP Linkage Map of the Atlantic Salmon (Salmo salar) Genome

High density linkage maps are useful tools for fine-scale mapping of quantitative trait loci, and characterization of the recombination landscape of a species’ genome. Genomic resources for Atlantic salmon (Salmo salar) include a well-assembled reference genome, and high density single nucleotide polymorphism (SNP) arrays. Our aim was to create a high density linkage map, and to align it with the reference genome assembly. Over 96,000 SNPs were mapped and ordered on the 29 salmon linkage groups using a pedigreed population comprising 622 fish from 60 nuclear families, all genotyped with the ‘ssalar01’ high density SNP array. The number of SNPs per group showed a high positive correlation with physical chromosome length (r = 0.95). While the order of markers on the genetic and physical maps was generally consistent, areas of discrepancy were identified. Approximately 6.5% of the previously unmapped reference genome sequence was assigned to chromosomes using the linkage map. Male recombination rate was lower than females across the vast majority of the genome, but with a notable peak in subtelomeric regions. Finally, using RNA-Seq data to annotate the reference genome, the mapped SNPs were categorized according to their predicted function, including annotation of ∼2500 putative nonsynonymous variants. The highest density SNP linkage map for any salmonid species has been created, annotated, and integrated with the Atlantic salmon reference genome assembly. This map highlights the marked heterochiasmy of salmon, and provides a useful resource for salmonid genetics and genomics research.

Linkage maps are valuable tools for the investigation of the genetic basis of complex traits in farmed animal species. For several decades, linkage maps have enabled the mapping of quantitative trait loci (QTL), and formed the basis of attempts at positional cloning of these QTL in both terrestrial (Goddard and Hayes 2009) and aquatic farmed species (Danzmann and Gharbi 2001). High throughput sequencing technologies have now expedited the discovery of millions of single nucleotide polymorphism (SNP) markers (Liu 2010). These SNPs form the basis of modern, high-resolution genetics studies, and underpin genomic selection for faster genetic improvement in terrestrial livestock, and, laterally, aquaculture breeding programs (Meuwissen et al. 2001;Goddard et al. 2010;Sonesson 2010;Yáñez et al. 2014Yáñez et al. , 2015. Scoring of genome-wide SNPs in large populations is achieved either through genotyping by sequencing (Davey et al. 2011), or by creation and application of SNP arrays (e.g., Houston et al. 2014a;Yáñez et al. 2016). High density linkage maps based on these SNP datasets can aid in high resolution mapping of loci underpinning complex traits in farmed animals (e.g., Shi et al. 2014;Wang et al. 2015), improvements in assembly of reference sequences (Fierst 2015), and knowledge of the recombination landscape of the genome (e.g., Groenen et al. 2009;Tortereau et al. 2012).
Reference genome assemblies are now available for several aquaculture species, including Atlantic salmon (Davidson et al. 2010;Lien et al. 2016). Once anchored and annotated, these genome assemblies provide invaluable physical maps of the genome. Due to a recent whole genome duplication, and the relatively high frequency of long and diverse repeat elements (de Boer et al. 2007;Davidson et al. 2010;Lien et al. 2016), assembly of the Atlantic salmon genome has been challenging, with 22% of the current assembly (NCBI GCA_000233375.4) yet to be assigned to chromosome. Salmonid species exhibit marked heterochiasmy, with males showing very low recombination rates across much of the genome, but with much higher recombination rates thought to occur in telomeric regions (e.g., Sakamoto et al. 2000;Lien et al. 2011;Miller et al. 2011;Brieuc et al. 2014;Gonen et al. 2014). This phenomenon may be related to the pairing and recombination between homeologous regions of the genome, particularly in males (Wright et al. 1983;Allendorf and Thorgaard 1984;Allendorf et al. 2015). Several high density SNP arrays exist for Atlantic salmon (Houston et al. 2014a;Yáñez et al. 2016), and integrated linkage maps based on those arrays would facilitate detailed interrogation of the unusual recombination landscape. Further, while the high density SNP arrays have been applied for genome-wide association study (GWAS) and genomic prediction (Ødegård et al. 2014;Correa et al. 2015;Tsai et al. 2015;Tsai et al. 2016), such studies would be enhanced by annotation of the SNPs according to their genomic position, nearby genes, and their predicted effects.
Therefore, the purposes of this study were: (i) to construct a linkage map of the SNPs contained on the publicly available high density Affymetrix Atlantic salmon SNP array 'ssalar01' ); (ii) to align and compare the linkage map to the latest Atlantic salmon reference genome assembly (GenBank assembly accession GCA_000233375.4); (iii) to assign previously unmapped reference genome contigs and genes to chromosomes; (iv) to investigate and compare patterns of male and female recombination across the genome; and (v) to annotate the SNPs according to their position relative to putative genes, including prediction of variant effects.

Animals
The population used for the linkage analysis was a subset of those described in Gharbi et al. (2015), purchased from Landcatch Natural Selection (LNS, Ormsary, UK). The juvenile fish used in the current n Table 1 The characteristics of the physical and genetic maps of the 29 Atlantic salmon (pairs of) chromosomes (GenBank reference GCA_000233375.4; Davidson et al.

2010)]
, and 'unassigned contigs' are those that were unplaced on the reference assembly but mapped to the chromosome in the linkage map. b The correlation between the genetic distance of SNPs (cM) on the linkage map and the physical distance (bp) according to the reference genome assembly.
study were from the 2007 year group of the LNS broodstock, and were from 60 full sibling families (28 sires and 60 dams) comprising at least six progeny per family. The trial (which focused on resistance to sea lice) was performed by Marine Environmental Research Laboratory (Machrihanish, UK), and under approval of ethics review committee in the University of Stirling (UK). Full details of the trial, and the population used, have been described previously (Houston et al. 2014b;Gharbi et al. 2015;Tsai et al. 2015Tsai et al. , 2016.

SNP array genotyping
Genomic DNA from each sample was extracted (Qiagen, Crawley, UK), and genotyped for the 'ssalar01' Affymetrix Axiom SNP array containing 132,000 validated SNPs. Details of the creation and testing of the SNP array are given in Houston et al. (2014a). Details of the quality control filtering of the genotypes are given in Tsai et al. (2015). Briefly, the Plink software was used to filter the validated SNPs by removing individuals and SNPs with excessive (. 1%) Mendelian errors, and SNPs with minor allele frequency (MAF) , 0.05 in this dataset. In total, 111,908 SNPs were retained for 622 fish (534 offspring, 28 sires, and 60 dams). Details of all the SNP markers are available at dbSNP (Sherry et al. 2001) (NCBI ss# 947429275-947844429.)

Linkage analysis
Lep-Map2 (Rastas et al. 2016) was used to construct the linkage maps. The 'Filtering' function was applied to the initial input dataset, with 'MAFLimit' set at 0.05 (consistent with filtering described above), and 'dataTolerance' set at 0.001 to remove markers exhibiting significant segregation distortion. The 'SeparateChromosomes' function was applied to cluster markers into linkage groups, with the LOD threshold of 36 applied (chosen because this is the level at which 29 groups were formed, consistent with the expected karyotype of European Atlantic salmon). The function 'JoinSingles' was applied to assign additional single SNPs to existing linkage groups. Subsequently, the function 'OrderMarkers' was applied to estimate the marker order within each linkage group. Using parallelized computing, this step was repeated several times to assess consistency of marker order between replicates. Sex-specific linkage maps were generated because of the known difference in recombination rate between male and female Atlantic salmon (Gilbey et al. 2004;Moen et al. 2004;Lien et al. 2011;Gonen et al. 2014).
To compare the genetic and physical maps, the flanking sequence for each SNP locus (35 bp either side) was aligned with the Atlantic salmon reference genome assembly (GenBank assembly GCA_000233375.4) (Davidson et al. 2010), and only complete and exact matches to the reference genome (e-value = 3 · 10 229 ) were retained. In cases where the SNP flanking sequence aligned exactly with . 1 genomic region, the alignment corresponding to the chromosome that was consistent with the linkage mapping of the SNP was retained.

RNA sequencing
Atlantic salmon fry samples from two different families from the Scottish breeding nucleus of Landcatch Natural Selection Ltd were selected for RNA sequencing, corresponding to families 'B' and 'S' in Houston et al. (2010). Full details of the library preparation and sequencing are given in Houston et al. (2014a) (although for the current study, only two of the three families previously sequenced were used for assembling the transcriptome. This was because the third family, 'C', had large variation in sequence coverage between samples). Briefly, a total of 48 individual fry were homogenized in 5 ml TRI Reagent (Sigma, St. Louis, MO) using a Polytron mechanical homogenizer (Kinemetica, Switzerland). The RNA was isolated from 1 ml of the homogenate, using 0.5 vol of RNA precipitation solution (1.2 mol/l sodium chloride; 0.8 mol/l sodium citrate sesquihydrate), and 0.5 vol isopropanol. Following resuspension in nuclease-free water, the RNA was purified using the RNeasy Mini kit (Qiagen, UK). The RNA integrity numbers from the Bioanalyzer Figure 1 Comparison of the number of SNPs in corresponding chromosomes and physical length retrieving from recent reference assembly (GenBank assembly reference GCA_000233375.4, Davidson et al. 2010). The correlation was approximately 0.95.

Figure 2
A comparison between genetic and physical maps of a representative chromosome (Chr 22), reflecting the recombination pattern difference between males and females. Details of genetic distance and physical distance for all mapped loci are given in File S1.
2100 (Agilent, Santa Clara, CA) were all over 9.9. Thereafter, the Illumina Truseq RNA Sample Preparation kit v1 protocol was followed directly, using 4 mg of RNA per sample as starting material. Libraries were checked for quality, and quantified using the Bioanalyzer 2100 (Agilent), before being sequenced in barcoded pools of 12 individual fish on the Illumina Hisequation 2000 instrument (100 base paired-end sequencing, v3 chemistry); all sequence data were deposited in the European Nucleotide Archive under accession number ERP003968.

Transcriptome assembly
The quality of the sequencing output was assessed using FastQC (http:// www.bioinformatics.babraham.ac.uk/projects/fastqc/; version 0.11.2). Quality filtering and removal of residual adaptor sequences was conducted on read pairs using Trimmomatic v.0.32 (Bolger et al. 2014). Specifically, residual Illumina specific adaptors were clipped from the reads, leading and trailing bases with a Phred score less than 15 were removed, and the read trimmed if a sliding window average Phred score over four bases was less than 20. Only paired-end reads where both sequences had a length greater than 36 bases postfiltering were retained. The most recent salmon genome assembly (ICSASG_v2, NCBI assembly GCA_000233375.4) was used as a reference for read mapping. Filtered reads were mapped to the genome using Tophat2 v. 2.0.12 (Kim et al. 2013), which leverages the short read aligner Bowtie2 v.2.2.3 (Langmead and Salzberg 2012), allowing a maximum of two mismatches. Using Cuffdiff v.2.2.1 (Trapnell et al. 2012), the aligned reads were merged into a transcriptome assembly. The transcriptome was annotated against NCBI nonredundant protein and nucleic acid databases using local Blast v.2.3.0+ (Altschul et al. 1997) with a cut-off e-value of 10 25 . The completeness of the salmon transcriptome was evaluated using Blast searches with a cut-off e-value of 10 225 against a set of 248 core eukaryotic genes (Parra et al. 2007).

SNP annotation
For every gene, the most highly expressed transcript variant was selected to identify candidate coding regions using Transdecoder v.2.0.1 (http:// transdecoder.sourceforge.net/). Open reading frames (ORF) were predicted for every transcript, requiring a minimum of 100 amino acids (to reduce the number of potential false positives). All the predicted proteins were aligned against the manually curated UniRef90 database using local Blast v.2.3.0+ (Altschul et al. 1997) with a cut-off e-value of 10 25 , discarding ORFs without positive matches. Finally, the longest ORF was selected as the canonical protein for each transcript. The final set of coding regions was used to build a genome annotation file which was used to predict the functional significance of all the SNPs on the 'ssalar01' SNP array using SnpEff v.4.2 (Cingolani et al. 2012).

Data availability
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.  (Table 1), and the number of SNPs per chromosome was dependent on chromosome sequence length ( Figure 1). SNP density for the successfully genotyped and mapped markers from the 'ssalar1' array is relatively constant across the genome, with an average of 1 SNP per 23 kb in the assembled chromosomes, and 1 SNP per 0.05 cM (male) and 0.07 cM (female) in the full linkage map. The most recent Atlantic salmon reference genome assembly (GCA_000233375.4) contains 2240 MB of sequence contigs anchored to chromosomes (78% of total assembly), and 647 MB of contigs that are not yet assigned to chromosome (22% of total assembly). Linkage mapping using high density SNP arrays was applied to orientate reference genome contigs and scaffolds, and to identify putative misassemblies in the recently published salmon genome paper (Lien et al. 2016). However, those linkage maps are unpublished. In the current study, a total of 4581 previously unassigned contigs comprising 41.9 MB of sequence was tentatively mapped to the 29 salmon chromosomes (Table 1 and File S2). While additional experiments would be required to confirm the correct position of these genome contigs, this linkage map has enabled an additional 1% of the entire reference genome assembly to be tentatively mapped to chromosomes, corresponding to 6.5% of the previously unassigned genome assembly. These contigs were spread across all 29 chromosome pairs (Table 1, and details given  in File S1). Novel potentially misassembled regions were also identified in the reference sequence via regions of discordance between the linkage and physical maps, an example of which is between 11.5 MB and 11.8 MB on Chromosome 26 (File S3). There were substantial differences in the patterns of recombination between the sexes. The female linkage map covered 7153 cM (ranging from 72.6 to 689.0 cM per chromosome), whereas the male linkage map covered 4769 cM (ranging from 70.2 to 428.8 cM per chromosome) (Table 1). Overall, the female map was 1.5 · longer than the male map, consistent with previous Atlantic salmon SNP linkage maps (Lien et al. 2011;Gonen et al. 2014). The pattern of recombination across the genome was notably different between the sexes, with female recombination rates being higher across much of the genome, except for some subtelomeric regions where male recombination was substantially higher (e.g., Figure 2). This phenomenon has been observed in several previous salmonid linkage maps (Sakamoto et al. 2000;Lien et al. 2011;Miller et al. 2011;Brieuc et al. 2014;Gonen et al. 2014), but the availability of the reference genome enables a more detailed investigation. Therefore, linkage and physical maps were aligned, and a proxy of recombination rate (number of centimorgans per megabase) was estimated at regular intervals on each chromosome, with each interval corresponding to 2% of the total chromosome's physical length. The average recombination rate for each corresponding interval on the 29 chromosomes was calculated and graphed against the distance from the nearest telomere (Figure 3). The results highlight the phenomenon of markedly high male recombination in some subtelomeric regions, on average 10 · higher than regions of the genome nearer the middle of the chromosome (Figure 3).

Transcriptome assembly and annotation
To annotate the mapped SNPs and predict their function according to their position relative to putative genes, an annotated reference transcriptome was created. RNA-seq of 48 individual salmon fry yielded n Table 3 Predicted numbers, location and effect of the mapped SNPs according to their position on the annotated reference genome Intergenic  57,582  Genic  48,842  UTR  8091  59  1867  39  6224  Intron  34,534  Splice region  483  Nonsplice region  34,051  Exon  5856  Synonymous  3352  Nonsynonymous 2465 Gain or loss of start/stop codon 39

Summary of annotated SNPs
n  (Table 2 and File S4). The average length of the transcripts was 4127 bp, with an N50 of 5710, an N90 of 2323, and . 90% of transcripts longer than 500 bp. The assembled transcripts were annotated using BLASTx and BLASTn searches against the NCBI nonredundant protein and nucleic acid databases, respectively. Of the 65,803 total putative genes, 58,416 (88.8%) showed significant similarity to known proteins, while an additional 2732 (4.2%) showed significant similarity to nucleotide entries in the NCBI nonredundant nucleotide database (File S5). The proportion of unannotated genes was higher for the shorter transcript sequences (File S6), but all transcripts were retained (since a relevant minimum size threshold was not apparent). The completeness of the transcriptome was evaluated against a set of 248 core eukaryotic genes described in Parra et al. (2007); 247 of these genes were found in our transcriptome (BLASTn e-value , E10 225 ), 222 of which had at least 90% coverage, and 153 of which were fully covered. A total of 53,950 identified genes was located within chromosomes on the Atlantic salmon genome assembly, while the remaining 11,853 were aligned to unassigned contigs. Of these 11,853 genes, 1647 (13.9%) were located in contigs assigned to chromosomes using the linkage map of the current study (Table 1 and File S7).

SNP annotation
The RNA-seq based transcriptome described above was used to predict ORFs and protein sequences in order to annotate the SNPs present on the 'ssalar01' array (Table 3 and File S8). A total of 106,424 SNPs (95%) matched a single genome location, while 2857 SNPs matched two different genomic positions, related in part to the salmonid specific genomic duplication. An additional 880 SNPs mapped to three or more genome locations, indicative of repetitive elements or protein domains. It should be noted that filtering of SNPs during the design process for the array would have removed the majority of SNPs mapping to two or more locations (Houston et al. 2014a). The tentative annotation of all SNPs is given (File S6), but only those mapping to unique genomic regions are described below. Of these 106,424 unique SNPs, 48,842 (45.9%) were located in putative genes, with the remainder mapping to intergenic regions. Of the genic SNPs, the majority were in putative intronic regions (34,534; 70.7%), although 483 of these were associated with splicing regions, and therefore have a higher likelihood of being functionally relevant. The remaining genic SNPs were mapped to putative untranslated regions (UTRs; 8091), with a larger amount of SNPs in the 39-UTR as expected (6224 vs. 1867 in the 59-UTR); and to putative exons (5856). A total of 2465 putative nonsynonymous SNPs was identified, in addition to 39 SNPs predicted to cause gain/loss of start/stop codons, which have a high likelihood of functional consequences (File S8). As an example, a premature stop codon was found in phospholipase D, an enzyme which produces the signal molecule phosphatidic acid, which is also a precursor for the biosynthesis of many other lipids (McDermott et al. 2004). The distribution of the SNP functional categories across the 29 chromosome pairs is given in Table 4. It is important to note that these predicted SNP effects will contain a proportion of false positives due to inevitable errors in the predicted structure of the genes. Nonetheless, their annotation combined with their linkage and physical mapping provides a valuable resource for users of the high density 'ssalar01' array in particular, and for salmonid genomics researchers in general.

Conclusion
A linkage map comprising . 96,000 SNPs from the 'ssalar01' array was created, annotated, and integrated with the reference genome assembly. This represents the highest density SNP linkage map for any salmonid species. Alignment of the linkage and physical maps revealed good agreement between genetic map, and the mapping allowed a further circa 1% of the salmon reference genome assembly to be tentatively assigned to chromosomes. Marked heterochiasmy was observed, with male recombination rate substantially lower than females across much of the genome, but with a notably high level in some subtelomeric regions. Finally, the mapped SNPs were annotated and categorized according to their predicted function. The map will be another useful resource for salmonid genomics research.