Sequencing improves our ability to study threatened migratory species: Genetic population assignment in California's Central Valley Chinook salmon

Abstract Effective conservation and management of migratory species requires accurate identification of unique populations, even as they mix along their migratory corridors. While telemetry has historically been used to study migratory animal movement and habitat use patterns, genomic tools are emerging as a superior alternative in many ways, allowing large‐scale application at reduced costs. Here, we demonstrate the usefulness of genomic resources for identifying single‐nucleotide polymorphisms (SNPs) that allow fast and accurate identification of the imperiled Chinook salmon in the Great Central Valley of California. We show that 80 well‐chosen loci, drawn from a pool of over 11,500 SNPs developed from restriction site‐associated DNA sequencing, can accurately identify Chinook salmon runs and select populations within run. No other SNP panel for Central Valley Chinook salmon has been able to achieve the high accuracy of assignment we show here. This panel will greatly improve our ability to study and manage this ecologically, economically, and socially important species and demonstrates the great utility of using genomics to study migratory species.

Accurately identifying distinct populations, the individuals that belong to those populations, and their habitat use patterns during migration requires tools that match the scale and precision desired for a given question. Telemetry has historically been used to study the movement of migratory animals (e.g., Bonfil et al., 2005;Croxall, Silk, Phillips, Afanasyev, & Briggs, 2005), but is impractical for large-scale applications, often not feasible for small individuals, and can be overly invasive for species of conservation concern (Adams, Rondorf, Evans, & Kelly, 1998;Hebblewhite & Haydon, 2010;Jepsen, Koed, Thorstad, & Baras, 2002). Genetic assignment, however, can provide a more reliable method for tracking individuals and movement across lifehistory stages and can be accomplished at large scales. New advances in genome sequencing provide an opportunity to scan thousands of markers to identify a select subset needed for accurate population assignment (Amish et al., 2012;Hess et al., 2014;Lao, Duijn, Kersbergen, de Knijff, & Kayser, 2006). This approach expands our ability to track the movements, population structure, habitat use, and impacts on a large numbers of individuals from migratory populations, while minimizing handling and sampling stress (Davey et al., 2011).
Chinook salmon (Oncorhynchus tshawytscha, Walbaum 1792, Figure 1) are an important migratory keystone species and provide great economic and social value via recreational, commercial, and heritage fisheries (Cone, 1995;Lichatowich, 1999;Williams, 2006). They also typify the necessity for identifying populations and the individuals that belong to them to facilitate protection and management. Chinook salmon are anadromous-eggs are laid and hatch in freshwater where juveniles rear before migrating to the ocean to grow and mature, after which they typically migrate back to their natal freshwater streams to spawn and die. Spatial and temporal variation has led to the evolution of distinct "runs" that each take advantage of unique environmental conditions. Runs are named for the timing of their spawning migration (e.g., Fall run, Spring run), and a single river system can support multiple unique Chinook runs. The runs mix as adults and juveniles, sharing common migratory pathways during both spawning and seaward migrations, as well as sharing juvenile rearing habitat. There are no obvious morphological characteristics that identify an individual's run (del Rosario et al., 2013;Williams, 2006), but management goals are mandated for each run separately (Good, Waples, & Adams, 2005).
The Great Central Valley of California includes the southernmost edge of the Chinook salmon species range and supports the most diverse assemblage of life-history types anywhere, including four runs: Spring, Fall, Late fall, and the endemic Winter run (Fisher, 1994;Yoshiyama, Fisher, & Moyle, 1998). The Central Valley was once one of the most productive U.S. Pacific salmon systems, yet now all four Chinook salmon runs have declined to a fraction of their historical abundance (Yoshiyama et al., 1998), with Spring and Winter runs being listed as threatened and endangered, respectively, under the federal Endangered Species Act and Fall run and Late fall as species of concern (NOAA- NMFS 1999NMFS , 2005. Each run is managed as a separate evolutionary significant unit (ESU), warranting species level protection, with the exception of Fall and Late-fall run, which are listed as a single ESU. Differentiating between runs is critical for limiting negative impacts to the ESUs, identifying and quantifying mortality, monitoring populations, targeting restoration efforts, and managing the fishery. Yet, despite efforts to target management, monitoring, and conservation to run, there has been somewhat limited success using genetic or morphometric methods in identifying Central Valley Chinook back to their source runs and populations (Harvey, Jacobson, & Banks, 2014).
The primary method currently used to determine the run of Chinook salmon as they migrate through the Central Valley is phenotypic-it is based on size class projections over time and termed the "length-at-date" criteria (del Rosario et al., 2013;Harvey & Stroble, 2013; -for detailed explanation of the criteria see Harvey et al., 2014).
Unfortunately, this method of run assignment has been shown to be very inaccurate, with over 50% of the individuals classified as Winter run belonging to Spring, Fall, or Late-fall runs (Harvey et al., 2014).
Genetic resources to differentiate Central Valley Chinook salmon populations thus far have been limited, with Winter run being the only run that can reliably be distinguished (reviewed in Lindley et al., 2004).
We developed a high-resolution SNP assay panel to identify and study Chinook salmon in the Central Valley. Using restriction site-associated DNA sequencing (RADseq, Miller, Dunham, Amores, Cresko, & Johnson, 2007;Baird et al., 2008), we scanned the genomes of Chinook salmon sampled from across the Central Valley and identified thousands of new SNPs distributed across the genome. We then developed a genetic linkage map and a small subset of ancestryinformative markers (AIMs; Rosenberg, Li, Ward, & Pritchard, 2003) for differentiating Chinook salmon run timing via Fluidigm SNP Type assays, to allow for high-throughput and rapid run identification and identification of unique populations within Spring run. Genetic resources that rapidly and reliably distinguish runs and populations of Central Valley Chinook will be immensely valuable for genetic management and monitoring, as well as studies of life-history trait evolution, genomewide association, habitat use, and other questions of ecological or evolutionary interest in this threatened migratory species.

| Sample collection
We obtained adult Chinook salmon fin clip tissues from the California Dept. of Fish and Wildlife Anadromous Resources Tissue Archive.
Individuals were collected during spawning migrations, from all major tributaries in the Central Valley with known spawning populations for all four Chinook salmon runs across multiple years (Table 1, Figure 2).

| Molecular biology
We constructed RAD libraries using individuals from all the major tributaries in the Central Valley that have consistent Chinook spawning populations and the main stem of the Sacramento River (Table 1; Figure 2). Feather River Hatchery samples were sequenced but not included in our AIM panel selection because past hatchery practices have led to introgression between the Fall and Spring run hatchery populations (California HSRG, 2012;Garza et al., 2008). We extracted genomic DNA from all samples using the Qiagen DNeasy 96 Blood T A B L E 1 Sample numbers from each population of interest and location abbreviations. The AIM panel column shows the number of individuals used in the blind assignment test of the AIM assay SNP panel (the "holdout" set). The number of samples used in the blind assignment test are those that genotyped at >70% of the Fluidigm AIM panel loci and normalized all samples to 25 ng/μl. RAD libraries were constructed using the SbfI restriction enzyme following the protocol described by Lew et al. (2015). We ligated each sample with a unique were resequenced due to low read coverage (fewer than 1,000,000 sequences) in the first sequencing run.

| SNP discovery and genotyping
We performed SNP discovery following the bioinformatics methods detailed in Miller et al. (2012). One hundred base pair reads were trimmed from the 3′ end to 92 bp, and reads were filtered to eliminate those with a >20% probability of sequencing error, or those that contained one or more ambiguous base calls. The six base sample-specific barcode and the partial SbfI site sequence were also removed from the sequence.
We conducted three rounds of SNP discovery, using a different ascertainment panel in each. To discover SNPs that are polymorphic Locations of dots do not represent exact location of samples collected, but rather the existence of a particular run in a given river read, and an alignment score threshold of 120. Using the output of this alignment, we then filtered based on the following criteria: Only loci with a single SNP were retained, sequences with less than two external alignments were ignored, only alignments with two alleles were retained, the minimum samples per allele was one, the minimum sample count per locus was three, and the maximum Novoalign alignment score between any sequences in a locus was 30.
We combined the identified loci across all three discoveries, removing duplicate loci, and then genotyped all 222 sequenced individuals at all the identified loci. We used the program Bowtie (Langmead, Trapnell, Pop, & Salzberg, 2009) to index this file and align the qualityfiltered reads from all individuals, allowing only one mismatch in the alignment. We counted the number of reads with a perfect match to each allele for each individual. If the sum of the total reads for a locus was less than six for an individual, we scored the genotype as unknown. To convert read counts to genotypes, we took the log ratio of the number of reads of allele 1 to number of reads of allele 2 at each locus (Lew et al., 2015). We then excluded those that only had counts for one allele (obvious homozygotes) and calculated the mean and standard deviation of the remaining log ratio distribution. We called everything within 1.5 standard deviations from zero a heterozygote.
Homozygotes were called when 0.85 or greater proportion of the reads were for a single allele. Everything that fell between the homozygote and heterozygote ranges was recorded as unknown.

| Quality filtering
We filtered the dataset to remove low-coverage SNPs and potential paralogs. We removed loci that were not genotyped in at least ten individuals from each run and excluded individuals that were genotyped at an unusually low number of loci (less than two standard deviations below the mean number of genotyped loci per individual).
We then filtered out loci that were typed at <70% of the remaining individuals. We removed loci that had a minor allele frequency of <0.01. We also removed loci with observed heterozygosity >0.55, given that the expectation for biallelic SNPs is a maximum observed heterozygosity of 0.5. We removed all the SNPs located in the last two base pair positions of a sequence, as there were considerably more SNPs in these positions than the other base pair positions. This is likely driven by the fact that Illumina sequencing is more error prone toward the sequence terminal positions (Minoche, Dohm, & Himmelbauer, 2011).

| Linkage mapping
We created a linkage map to evaluate the location of discovered Linkage groups were formed using a minimum LOD score of 7.0.
Within linkage groups, loci were ordered using regression mapping with a recombination frequency threshold of 0.4, a LOD threshold of 1.0 and a jump threshold of 5.0. A ripple was performed after each locus addition, and distance was calculated using Kosambi's mapping function. We independently created linkage groups for each family and then combined them into a consensus map. Linkage groups were assigned to Chinook salmon chromosomes (Phillips, Park, & Naish, 2013) by mapping loci to previously mapped SNP loci (Brieuc, Waters, Seeb, & Naish, 2014;McKinney et al., 2016) with the NCBI MegaBLAST algorithm (Zhang, Schwartz, Wagner, & Miller, 2000) using default parameters. A positive match occurred if sequence similarity was >98% (i.e., ≤1 mismatch, corresponding to SNP location).

| AIM selection and assay development
To choose the best loci for the AIM panel, we calculated the pairwise allele frequency difference for each locus among all the different runs, and locations within Spring run (DER, MIL, BUT). Prior to calculating this, we removed ten randomly chosen individuals from each group and set them aside to be used in the initial testing of assignment accuracy. We determined the AIM panel by first choosing the top twenty loci that had the highest allele frequency differences for each pairwise comparison. We also ranked loci based on F ST and informativeness (I n ) (Rosenberg et al., 2003), but found they did similarly (data not shown), so proceeded with the allele frequency difference ranking, as this ranking resulted in slightly higher assignment accuracies. We tested for linkage disequilibria in all pairwise comparisons within each run and locations within Spring run using the program GENEPOP (Raymond & Rousset, 1995;Rousset, 2008). We removed loci from the AIM panel that were out of linkage equilibrium in more than one comparison. We also evaluated marker location on the Chinook linkage map to ensure no loci that we chose for the panel were tightly linked and to ensure coverage across linkage groups. We obtained sequence reads long enough to allow assay design by aligning the forward and reverse reads of the paired-end library. Using custom Perl scripts (available upon request), we required 30 bases of identical overlap between the end of the forward read and the beginning of the reverse read to ensure accurate alignment. If a longer sequence for a particular locus could not be constructed, or if the alignment of the forward and reverse reads was ambiguous (with a forward or reverse read aligning to more than one locus), the locus was excluded.
We also removed any loci where the SNP was too close to the sequence end for primer design, as well as any loci that did not meet Fluidigm's design standards. Anytime a locus was removed from the AIM panel for any of the above reason, we replaced it with the next highest ranked locus for the comparison for which it was informative.
We developed candidate loci into Fluidigm SNP Type assays using Fluidigm's D3 Assay Design system. Prior to ordering the assays, we tested assignment accuracy of the final AIM panel by assigning the subset of individuals we had removed from the RADseq dataset (prior to calculating allele frequencies) back to the remaining RADseq dataset (data not shown).

| Validation of AIM panel
We validated the Fluidigm SNP Type assays by following the manufacturer's protocols for genotyping. We first genotyped 32 individuals sampled from across the different sampling locations that had been included in sequencing effort, to ensure the genotypes determined by the assays were congruent with our RADseq genotypes. We removed any loci that had more than two mismatches in genotypes, resulting in a panel with <5% mismatches in allele calls between the two methods. We then employed a "training-holdout" procedure to test AIM panel accuracy by evaluating the performance of the AIM panel when assigning additional individuals (the "holdout" set), which were not used in the "training set" to rank SNPs, thus avoiding highgrading bias (Anderson, 2010). We genotyped at least 40 additional individuals per run and per Spring run location (Table 1) for the "holdout" set, using the SNP Type assays. This dataset also included Spring and Fall run individuals from the Feather River Hatchery, to assess how well our AIM panel could distinguish the Spring run from the Fall run fish in this hatchery, despite the past hatchery practices that have lead to introgression. We removed any individuals from the test that were not genotyped in at least 70% of the loci. We assigned individuals back to run using the full, filtered RADseq dataset as the baseline and the program ONCOR (Kalinowski, Manlove, & Taper, 2008). We removed 23 individuals from the baseline whose run was clearly misidentified at the time of sampling, based on structure analysis using the full SNP dataset (data not shown). We tested assignment accuracies for Spring run samples by assigning them to the individual tributaries, to Butte and Mill/Deer Creek reporting groups, as well as to a larger Spring run reporting group. An 0.80-0.90 assignment probability cutoff is commonly used to assign individuals to populations (e.g., Clemento et al., 2014;Daly et al., 2012), yet the higher the probability of assignment required, the fewer samples can be assigned.
To investigate the effects of using different assignment probability cutoffs, we used cutoffs of both 0.08 and 0.06 probabilities of assignment. Assignment accuracy was calculated as the proportion of correct assignments to the total number of assignments made per group, excluding those that could not be assigned due to low assignment probabilities. To characterize the effect of low assignment probabilities on the performance of the AIM panel, we also calculated the proportion of the samples that could not be assigned due to low assignment probabilities, allowing future users to weigh the trade-offs between using different assignment probability cutoffs. Additionally, we evaluated the utility of the AIM panel for conducting mixed stock analyses-a scenario that is relevant for ocean fisheries to determine what proportion of a fishery catch belongs to each reporting group. To do this, we implemented the realistic fishery simulation in ONCOR. We ran 1,000 mixture simulations of 200 individuals per mixture. Table 2 shows the proportions of each group used in the mixture simulation.

| SNP discovery and genotyping
The sequencing effort resulted in an average of 232,000,898 se-

| Linkage mapping
A linkage map containing 34 linkage groups was constructed using a total of 4,600 SNP loci. We were able to align 2,560 of our loci to previously mapped loci. The total length of the linkage map was  Table S1).

| AIM selection
Based on the allele frequency difference rankings and the linkage map, we developed a suite of 114 Fluidigm SNP Type assays for the AIM panel. We excluded 34 of these assays from the final AIM panel because they either failed to properly amplify as assays or the genotypes did not meet matching criteria between the sequencing and assay T A B L E 3 Assignment accuracies with the AIM panel, using 0.80 and 0.60 assignment probability cutoffs. The first column is the putative run (as identified at time of sampling), with number of samples assigned in parentheses. The columns are the proportion of those samples that were assigned to the different groups. The last column displays the percent of the total number genotyped that could be assigned at the given threshold. Shaded boxes highlight the putative correct assignments. S_FRH is not shaded because past hatchery practices have lead to introgression between Fall and Spring run in the hatchery population, making the "correct" assignment unclear. See Table 1  Winter ( efforts. All mismatches were heterozygote versus homozygote mismatches, so most likely allelic dropout in one of the genotypes (with the RADseq data producing the homozygote and the SNP Type assay producing the heterozygote most often). Therefore, our final AIM assay panel contained 80 SNP assays. These markers covered at least 25 of the 34 chromosomes, with several loci spread across each linkage group (see Table S2).

| Validation of AIM panel
The validation of the AIM panel showed high accuracy of assigning individuals to their run types (Table 3). When using the 0.80 probability of assignment requirement, assignment accuracies were >90% for Fall run and averaged 76% for Late fall. Spring run populations assigned to a Spring run reporting group with an average of 96% assignment accuracy. Assignment accuracy of Winter run was 100%. The distinction between Mill and Deer Creek Spring run was not strong enough to get useful assignment accuracies to specific tributaries (33%-54%).
However, collapsing Mill and Deer down to a single Mill/Deer reporting group allowed assignment accuracy of 77%. The majority of both the Fall and Spring run from the Feather River Hatchery assigned to Fall run (≥74% and 70%, respectively). We also tested whether the Spring run from the Feather River Hatchery could be assigned to its own reporting group. This, however, achieved very low assignment accuracy (data not shown).
We were able to assign a higher proportion of samples when we used an assignment probability threshold of 0.60 compared to 0.80 (94% and 84%, respectively, when using Run as the reporting group, Table 3). Assignment accuracies were generally higher when using the more stringent 0.80 probability of assignment requirement, with the difference in accuracies between the 0.80 and 0.60 thresholds ranging from −0.07 to 0.06.
The realistic fishery simulation showed the AIM panel accurately estimates mixture proportions, with the estimated proportion being within 0.0089 on average of the simulated proportion (range: 0.0003-0.0212, Table 2).

| DISCUSSION
Here, we show how combined use of next-generation sequencing data and targeted marker development can be used to distinguish individuals as they mix throughout their migratory pathway. We discov-  (Lindley et al., 2006;Williams, 2006). Now, most of these populations have been extirpated and there are only the three watersheds supporting continuous, yet small, populations of Spring run (Lindley et al., 2007). Protecting, recovering, and possibly expanding these remaining populations requires proper management of distinct within-run genetic resources to preserve the portfolio of Central Valley Spring run (Carlson & Satterthwaite, 2011). Our SNP panel also allows users to fine-tune the parameters they set for assignment (i.e., different required assignment probability cutoffs for assignment).
Given the small observed difference in assignment accuracies when using 0.80 and 0.60 assignment probability cutoffs, it may be desirable in some studies to use the lower threshold when the trade-off between assignment accuracy and numbers assigned favors assigning as many individuals as possible. Additionally, practitioners will want to consider the sources of error and misallocation in any application (Seeb et al., 2007).  (Hecht, Matala, Hess, & Narum, 2015). Climate change is projected to change thermal regimes of Central Valley rivers, reduce available spawning and rearing habitat, and vary in the level of effects on the different Chinook salmon populations (Lindley et al., 2007;Thompson et al., 2012;Yates et al., 2008). Therefore, this new SNP panel will allow researchers to finetune our ecological understanding of the adaptive differences among the unique populations of Chinook salmon in the Central Valley and target restoration activities to aid those populations that are most imperiled by the effects of climate change.

| CONCLUSIONS
Conservation and management of threatened migratory species has been greatly limited by an inability to distinguish individuals from distinct populations as they mix along the migratory corridor.
Management plans for migratory fish species that account for genetic population structure can decrease the probability of overfishing vulnerable populations and result in increased fishery catches due to the ability to manage for the optimal demographic structure (i.e., number of breeders) per population (Spies & Punt, 2015). Managing populations based on genetic distinction can also increase species resilience as well as stabilize ecosystem productivity (Hilborn et al., 2003;Schindler et al., 2010). Therefore, it is important that we have genetic tools that allow us to differentiate individuals based on the population structure present in the system.
This study demonstrates how useful genomic data can be for the management and conservation of organisms, and particularly for migratory species where distinct genetic lineages mix along migratory pathways.

ACKNOWLEDGMENTS
We would like to thank Lea Koerber and the California Dept. of Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 Instrumentation Grants S10RR029668 and S10RR027303.