From promise to practice: pairing non-invasive sampling with genomics in conservation

Conservation genomics has become an increasingly popular term, yet it remains unclear whether the non-invasive sampling that is essential for many conservation-related studies is compatible with the minimum requirements for harnessing next-generation sequencing technologies. Here, we evaluated the feasibility of using genotyping-by-sequencing of non-invasively collected hair samples to simultaneously identify and genotype single nucleotide polymorphisms (SNPs) in a climate-sensitive mammal, the American pika (Ochotona princeps). We identified and genotyped 3,803 high-confidence SNPs across eight sites distributed along two elevational transects using starting DNA amounts as low as 1 ng. Fifty-five outlier loci were detected as candidate gene regions under divergent selection, constituting potential targets for future validation. Genome-wide estimates of gene diversity significantly and positively correlated with elevation across both transects, with all low elevation sites exhibiting significant heterozygote deficit likely due to inbreeding. More broadly, our results highlight a range of issues that must be considered when pairing genomic data collection with non-invasive sampling, particularly related to field sampling protocols for minimizing exogenous DNA, data collection strategies and quality control steps for enhancing target organism yield, and analytical approaches for maximizing cost-effectiveness and information content of recovered genomic data.


INTRODUCTION
There has been much discussion on the transition of conservation genetics to conservation genomics (Helyar et al., 2011;McMahon, Teeling & Hoglund, 2014). Genomic analysis provides the advantage of assessing natural selection and adaptive genetic variation (Schoville et al., 2012), accurately estimating levels of genome wide diversity (Vali et al., 2008), and providing novel information for delineating conservation units (Funk et al., 2012) and informing management strategies (Hoffmann et al., 2015). Yet, it remains unclear the degree to which the conservation community as a whole has embraced genomics as a useful tool, suggesting significant gaps in methodology and analysis that must be overcome before the technology is ready for real-world applications (Shafer et al., 2015). One methodological aspect that has yet to be formally considered is the sample source of DNA; many population genetic studies of elusive or endangered species must rely on non-invasively collected samples. There is an expansive literature on the use of DNA from hair, feces, feathers and other non-invasively sampled materials for anonymous, dominant markers, which precluded the identification of genes responsible for the observed adaptive divergence. Single nucleotide polymorphisms (SNPs), with their broad genomic coverage and better understood mutation models, would overcome many of these limitations if they can be effectively genotyped within the constraints imposed by this system and others involving elusive and endangered species.
In the current study, we used nextRAD (Nextera-tagmented, reductively-amplified DNA) genotyping to collect SNP data from American pikas sampled along parallel elevational gradients to: (1) evaluate the feasibility of using DNA from non-invasively collected hair samples to simultaneously identify and genotype SNPs in an elusive species; and (2) provide preliminary insights into patterns of neutral and adaptive population divergence within this system.

Sample collection
This study was conducted in the North Cascades National Park, Washington, USA (Fig. 1). Sites within this national park present the opportunity to sample American pika along steep elevational transects where climates change rapidly over short linear distances, while controlling for other environmental and historical factors. Additionally, while pika are currently abundant in the park, this area has been disproportionally affected by climate change (Karl et al., 2009). Pika populations were sampled along two elevational transects (Pyramid Peak (PP) and Thornton Lakes (TL)) between July and August 2013. Sites within transects ranged from 450 m to 1,700 m, representing an approximate 6 • C gradient in mean annual temperature (Briggs et al., 1997) over less than 6.5 km distance (Fig. 1).
Non-invasive snares were used to obtain hair samples from 12 individuals at four sites along each of the two transects (n = 96) following Henry & Russello (2011). To minimize resampling the same animal, snares were set a minimum of 15 m apart and only one sample from each snare was used. Subsequent genetic data were used to test the assumption that each sample possessed a unique genotype (see below). All samples were collected under United States Department of Interior National Park Service permit #NOCA-2014-SCI-0022 and in accordance with animal care protocol (A11-0371) as approved by the University of British Columbia's Animal Care & Biosafety Committee.

DNA isolation, genomic data collection and SNP discovery
Total genomic DNA was extracted using the DNA IQ Tissue and Hair Extraction Kit (Promega, Madison, WI, USA) following the manufacturer's protocol. Each sample contained 60 hair follicles with the majority of the hair shaft removed under a dissecting microscope to reduce protein and other contamination. All DNA extractions were conducted in a separate laboratory free of concentrated PCR products. Negative controls were included in each extraction to monitor contamination. DNA quantifications were conducted using real-time PCR fluorescence measurements of double stranded DNA (Blotta et al., 2005) and the Quant-it kit (Life Technologies, Foster City, CA). Genomic DNA was converted into nextRAD genotyping-by-sequencing libraries (SNPsaurus, LLC). The nextRAD method uses selective PCR primers to amplify genomic loci consistently between samples. Genomic DNA (10 ng or less depending upon extraction yield) was first fragmented with Nextera reagent (Illumina, Inc), which also ligates short adapter sequences to the ends of the fragments. Fragmented DNA was then amplified, with one of the primers matching the adapter and extending 9 arbitrary nucleotides into the genomic DNA with the selective sequence. Thus, only fragments starting with a sequence that can be hybridized by the selective sequence of the primer will be efficiently amplified. The resulting fragments are fixed at the selective end, and have random lengths depending on the initial Nextera fragmentation. Because of this, amplified DNA from a particular locus is present at many different sizes and careful size selection of the library is not needed. For this project, an arbitrary 9-mer was chosen from those previously validated in smaller genomes, which did not appear to target repeat-masked regions in the publically available American pika genomic scaffolds (Ensembl, release 74, Ochotona princeps.74.dna sm.toplevel.fa) and that would approximate the results of standard RAD sequencing projects using SbfI (Baird et al., 2008).
Since these samples were collected non-invasively, it was important to assess the proportion of sequence reads in each sample that originated from the target organism relative to other environmental sources prior to conducting genotyping analysis. This was done using a custom script (SNPsaurus, LLC) that randomly sampled 1,000 high-quality reads from each sample and aligned those to the publically available American pika genomic scaffolds as well as subjected them to a  (Altschul et al., 1997) search of all sequences in the NCBI non-redundant database. Only samples that had greater than 50% sequencing reads that mapped to Ochotona princeps were retained for genotyping analysis.
The genotyping analysis used custom scripts (SNPsaurus, LLC) that created a reference from abundant reads present between 500 and 2,000 times across the combined set of samples, mapping all of the reads to the reference allowing two mismatches. The identified variants were then filtered by removing loci that had more than the expected maximum of two alleles and those that were present in less than 10% of all samples.
Following assembly, mapping and variant detection, the data were further filtered to maximize data quality. We retained only those loci that were genotyped in ≥50% of individuals from each transect, had a minor allele frequency ≥0.05, and a minimum coverage of 6X for homozygotes (affording 95% confidence in the genotype) while heterozygotes were required to have a minimum of 2X coverage per allele for each individual. These values were chosen to minimize null alleles and sequencing errors from biasing homozygote and heterozygote genotype calls, respectively. We then removed loci that displayed significant deviation from Hardy-Weinberg equilibrium (HWE) in more than two sites per transect as assessed using the method of Guo & Thompson (1992) as implemented in G 4.3 (Raymond & Rousset, 1995;Rousset, 2008).
To ensure that only non-redundant samples were included in subsequent analyses, we conducted genotype matching across a random subset of 100 loci. We conducted the match analysis and calculated the multi-locus probability of identity (Waits, Luikart & Taberlet, 2001) for the 100 randomly chosen loci using GenAlEx (Peakall & Smouse, 2006). Only samples with unique genotypes were retained.

Outlier locus detection and annotation
Polymorphic loci were screened for statistical outliers using the Bayesian simulation method of Beaumont & Balding (2004) as implemented in B 2.1 (Foll & Gaggiotti, 2008). This analysis was run independently for each transect, with all samples coded by site (PP1-PP4, TL1-TL4). We used a prior odds value of 10, with 100,000 iterations and a burn-in of 50,000 iterations. We identified loci that were significant outliers at a q-value of 0.20. A q-value is a false discovery rate (FDR) analogue of the p-value, with the former only defined in the context of multiple testing, whereas the latter is defined on a single test. Consequently, a 20% threshold for q-values is much more stringent than a 20% threshold for p-values in classical statistics. To test for non-random association of genotypes, linkage disequilibrium was assessed between all pairs of outlier loci in each population using the exact test of Guo & Thompson (1992) and 10,000 dememorization steps, 100 batches, and 10,000 iterations per batch as implemented in G 4.3 (Raymond & Rousset, 1995;Rousset, 2008). In addition, each haplotype from all nextRAD-tags that contained outlier loci were subject to a  (Altschul et al., 1997) search of all sequences in the NCBI non-redundant database (word size = 11; mismatch scores = 2, −3; maximum e-value = 10-15). To reduce annotations to repetitive sequences in the database, we required either a unique  hit or a top hit with an e-value that was at least an order of magnitude lower than the next closest hit.

Population genetic analyses
We segregated loci into two datasets including: (1) all loci identified as an outlier ("outlier dataset"); and (2) all loci not identified as an outlier ("neutral dataset"). The neutral dataset was used to conduct standard population genetic analyses for quantifying the extent and distribution of variation within and among sites. Within sites, proportion of polymorphic loci, observed (H o ) and expected (H e ) heterozygosity, and gene diversity (N g ) were calculated using A 3.5 (Excoffier & Lischer, 2010). Global tests for heterozygote deficit were conducted using Fisher's method and 10,000 dememorization steps, 100 batches, and 10,000 iterations per batch as implemented in G 4.3 (Raymond & Rousset, 1995;Rousset, 2008). The inbreeding coefficient, F is , was calculated for each site as implemented in G (Belkhir et al., 2004), with significance assessed using 1,000 permutations. To evaluate whether site-level genetic diversity was correlated with elevation and sample size, we conducted linear regression analyses implemented in R v. 3.1 (R Development Core Team, 2011).
Levels of genetic differentiation among groups were estimated by pairwise comparisons of θ (Weir & Cockerham, 1984), as calculated in G (Belkhir et al., 2004), and evaluated using 1,000 permutations. The hierarchical organization of genetic variation within and among transects was calculated using an analysis of molecular variance () as implemented in A 3.5 (Excoffier & Lischer, 2010), with significance assessed using 1,000 permutations. In addition, the model-based clustering method implemented in S 2.3.4 (Pritchard, Stephens & Donnelly, 2000) was used to infer the number of discrete genetic units across both transects. Run length was set to 100,000 MCMC replicates after a burn-in period of 100,000 using correlated allele frequencies under a straight admixture model. We varied the number of clusters (K) from 1 to 10, with 10 replicates for each value of K. The most likely number of clusters was determined by plotting the log probability of the data (ln Pr(X|K)) (Pritchard, Stephens & Donnelly, 2000) across the range of K values tested and selecting the K where the value of ln Pr(X|K) plateaued as suggested in the S manual. We also employed the ΔK method (Evanno, Regnaut & Goudet, 2005) as calculated in S H (Earl, 2011). Results for the identified optimal values of K were summarized using  (Jakobsson & Rosenberg, 2007) and plotted using  (Rosenberg, 2004). In order to test for unrecognized substructure in the broader S analysis, we repeated the above analysis for each transect separately using neutral and outlier loci.

Data quality
The mean starting DNA concentration recovered from the non-invasively collected hair samples was 0.55 ng/µl with as little as 1 ng total for some samples. The mean number of sequencing reads per sample was 1,863,634. Ten samples yielded less than 100,000 sequencing reads, likely due to the degraded quality and very low quantity of starting DNA. Nineteen additional samples had less than 50% of their sequencing reads mapping to O. princeps. Sixteen of these samples had high proportions of sequence reads matching with two small mammals that likely co-occur in the sampling area (Mus musculus (n = 13) and Spermophilus (n = 3)), with others matching Homo sapiens (n = 2) and Zea mays (n = 1). The above samples (n = 29) with low overall sequence reads or a low proportion mapping to O. princeps (or both) were removed, leaving 67 samples from eight sites across two elevational transects that were subject to all downstream analyses ( Fig. 1 and Table 2).
We identified 9,825 SNPs that met the minimum parameters for recovering genotypes. To minimize linkage, we retained the highest coverage SNP from each contig, resulting in 3,830 SNPs. Twenty-seven SNPs deviated from HWE in two or more sites per transect and were removed from the dataset. Consequently, all downstream analyses were based on genotypic data at 3,803 SNPs. All 67 of the retained samples possessed unique genotypes at a random subset of 100 loci (average probability of identity within each sampling site = 1.1 × 10 −23 ), suggestive of unique individuals.

Outlier detection and annotation
Outlier detection identified 37 loci along the TL transect and 18 loci along the PP transect, none of which were shared. There was no evidence of significant deviation from linkage equilibrium for any pairwise comparison of outlier loci across populations. Fourteen outlier loci unambiguously matched sequences from the NCBI nr database, five of which annotated to genes of known functions (Table 1). Locus 57863 76 identified from the PP transect mapped to the receptor tyrosine kinase-like orphan receptor 2 (ROR2) gene that is part of a conserved family that function in developmental processes including skeletal and neuronal development, cell movement and cell polarity (Green, Kuntz & Sternberg, 2008). Likewise, locus 108547 114 identified from the TL transect annotated to another gene encoding a cell surface tyrosine kinase receptor (beta-type platelet-derived growth factor receptor), but for members of the platelet-derived growth factor family (Shim et al., 2010). Locus 28594 45 was similar to the laminin alpha 3 gene in humans that codes for a protein that is essential for formation and function of the basement membrane, with additional functions in regulating cell migration and mechanical signal transduction (Hamill, Paller & Jones, 2010). Lastly, locus 23486 75 was annotated to the hephaestin-like 1 (HEPHL1) gene that may function as a ferroxidase and may be involved in copper transport and homeostasis, while locus 33398 46 mapped to thioredoxin-related transmembrane protein 4 (TMX4) that may act as a reductase in the calnexin folding complex (Sugiura et al., 2010).

Population genetic analyses
The proportion of polymorphic loci varied across the sampling sites, with the lower elevation sites (PP1, PP2, TL1) exhibiting substantially lower numbers (P = 0.661-0.777) than found at the mid-and high-elevation sites (P = 0.837-0.947) for both transects (Table 2). Similar trends were seen for gene diversity along both transects where the low and mid-elevation sites recorded the lowest values relative to higher elevation sites

Notes.
Elevation in meters; sample size (n); proportion of polymorphic loci (P); observed heterozygosity (H o ); unbiased expected heterozygosity (H e ); gene diversity (N g ); inbreeding coefficient (F is ). * p < 0.05. (Table 2). Indeed, both measures of site-level genetic variation were significantly correlated with elevation (P: r 2 = 0.557, p = 0.034; N g : r 2 = 0.738 p = 0.006; Fig. 2). Although P significantly correlated with sample size (r 2 = 0.635, p = 0.018), this was not the case for gene diversity (r 2 = 0.0813, p = 0.493) or elevation (r 2 = 0.184, p = 0.289). This general trend of increasing variation with elevation seemed to hold for observed heterozygosity along the PP transect, but were stable across TL sites (H o : 0.336-0.368; Table 2). Yet, all low (PP1, TL1) and mid-low (PP2, TL2) sites exhibited significant, genome-wide evidence of heterozygote deficit. Interestingly, significant inbreeding was also detected at the low and  mid-low elevation sites along PP, with no such evidence at the higher elevation sites (PP3, PP4; Table 2). All sites along the TL transect exhibited evidence of inbreeding ( Table 2). The AMOVA revealed that a significant amount of variation (p < 0.0001) was exhibited both among transect (4.14%, d.f . = 1) and among sites within transect (2.01%, d.f . = 6), with the remaining found within populations (93.05%, d.f . = 126). These patterns were congruent with those from pairwise θ estimates, with the highest values generally displayed by among transect comparisons, but where all comparisons were significant (Table 3).
The Bayesian clustering analyses based on 3,748 neutral loci revealed strong evidence for two clusters within the dataset (ΔK 2 = 249.0), corresponding to the two transects

Figure 3 STRUCTURE bar plots depicting the model-based clustering results for Thornton Lake (TL) and Pyramid Peak (PP) sites based on outlier loci (above) and neutral loci (below).
Analyses for the TL transect revealed evidence for both K = 2 (ΔK = 473.3) and K = 3 (ΔK = 314.6; plot shown) based on 37 outlier loci, and K = 1 (K = 2 plot shown for display purposes) based on 3,748 neutral loci. Analyses for the PP transect revealed evidence for K = 2 (=123.1) based on 18 outlier loci, and K = 2 (ΔK = 33.1) based on 3,748 neutral loci. (Fig. 1). When analyzing the PP transect separately, additional substructure (K = 2; Fig. 3) was found using neutral (ΔK 2 = 33.1) and outlier (ΔK 2 = 123.1) loci. In both cases, the low elevation site (PP1) represents a largely distinct cluster relative to all other sites. Similarly, substructure was found along the TL transect when using outlier loci, with strong evidence for two (ΔK 2 = 473.3) and three (ΔK 3 = 314.6) clusters. In the K = 2 plot, TL2 represented a distinct cluster, while in the K = 3 plot, the lower elevation sites (TL1, TL2) were each separate clusters relative to the high elevation sites. Although the Evanno, Regnaut & Goudet (2005) method would favor K = 2, the method described by Pritchard, Stephens & Donnelly (2000) for inferring the optimal number of clusters would suggest K = 3 given that ln Pr(X|K) clearly plateaus at this value (Table S1). No substructure was found along the TL transect based on neutral loci.

DISCUSSION
Conservation genomics has become an increasingly popular term in the literature, yet practical examples are limited (Shafer et al., 2015), including explicit consideration of the efficacy of genomic data collection from non-invasively collected starting materials. Here, we demonstrated the ability to identify 3,803 high confidence SNPs and recover genotypic data from low quantity DNA originating from non-invasively collected American pika hair samples. These data allowed us to detect outlier loci across elevational transects, identifying several candidate gene regions that exhibit putative signatures of divergent selection and that can be investigated in future studies for formulating mechanistic hypotheses. Moreover, the broad-scale genomic coverage enabled precise estimation of population-level parameters, including standard diversity indices, inbreeding, and structure within and among sampled transects.
We found genetic variation to be significantly correlated with elevation (Fig. 2), with sites at the lower fringe of American pika distribution in North Cascades National Park exhibiting substantially lower levels of gene diversity. No such associations were found in a previous microsatellite-based study conducted across elevationally-distributed sites in British Columbia, Canada (Henry, Sim & Russello, 2012). The detection of significant genome-wide evidence of heterozygote deficit at low elevation sites in both transects further suggests inbreeding may be leading to the observed patterns (Table 2), a particular concern for PP1, TL1 and TL2 given their apparent distinctiveness from higher elevation sites (Fig. 3). Due to their specific habitat requirements, patchy distribution, and life history, American pikas were long thought to regularly interbreed with close relatives based on observational studies (Smith & Ivins, 1983). Yet, molecular marker based studies have altered our understanding of American pika breeding behavior, revealing evidence for mate choice based on intermediate relatedness in one case (Peacock & Smith, 1997b), while another found no evidence for inbreeding across elevationally-distributed sites (Henry, Sim & Russello, 2012). This latter study conducted in Tweedsmuir South Provincial Park, British Columbia, Canada also found evidence for broad-scale and fine-scale population structure, detecting restricted gene flow among transects as well as among sites within transect and potentially driven by climatic factors (Henry, Sim & Russello, 2012). Although conducted at a different scale, Castillo et al. (2014) found a high degree of connectivity among geographically proximate sites in Crater Lake National Park, Oregon, USA, but restricted gene flow at a broader scale likely driven by topographic complexity and water. Here, we detected similar coarse-level patterns, detecting strong population genetic structure across transects but some evidence of connectivity between sites within transects in North Cascades National Park. It is worth noting that the three studies spanned the distribution of the Cascades lineage of American pika (Galbreath, Hafner & Zamudio, 2009), conducted in the north (Henry, Sim & Russello, 2012), south (Castillo et al., 2014 and central (this study) portions of the range. Given that pikas are considered by some to be sentinels of climate change (Hafner, 1993;Smith, 1974), further investigation is warranted to infer underlying mechanisms associated with dispersal ability in pikas that may be further enhanced by comparative analyses of elevationallyand latitudinally-distributed sites.
More broadly, our results highlight a range of issues that must be considered when pairing genomic data collection with non-invasive sampling. First, sampling protocols must endeavor to minimize non-target DNA during the collection process. In our case, the use of tape-based, non-invasive hair snares allowed us to collect genomic data from 67 individuals of American pika, but also yielded 19 samples that were almost entirely composed of DNA sequence reads from non-target organisms, primarily other small mammals that were likely using the same talus habitat. Precautions to avoid such contamination will vary according to the methods in which samples are obtained, but are of critical importance given the non-targeted nature of NGS approaches. Depending upon study questions, the use of exon-capture or other approaches for preferentially targeting the DNA of study organisms within a mixed sample may help to minimize contamination and maximize cost-effectiveness of downstream sequence data (Avila-Arcos et al., 2011;Carpenter et al., 2013;Good, 2011). Exon-capture, in particular, has been effectively applied to historical DNA collected from museum specimens (Bi et al., 2013), which typically yield DNA of lower quantity and quality similar to non-invasively collected starting materials. Yet, these approaches are substantially more costly and, in the case of exon-capture, limited to expressed regions of the genome. However, for some non-invasively collected DNA such as feces, the use of capture approaches may be obligatory (Perry et al., 2010).
Additionally, rigorous assessments of resulting DNA sequence data must be undertaken to ensure quality control. In the current study, we used a genotyping-by-sequencing approach for reduced representation genomic data collection. We had the advantage of publicly available American pika genomic scaffolds that allowed us to initially filter our data based on SNPs assembling to these references. At present, such resources may not be available for many organisms of conservation interest. In such cases, we recommend using the closest available genome to inform reference assembly of identified SNPs (in our case, this would have been the European rabbit; Lindblad-Toh et al., 2011). If no suitable reference genome is available, investigators may want to consider capture approaches for genomic data collection (as discussed above).
Analytical frameworks must also be carefully considered in relation to recovered sequence coverage depth in studies using non-invasively collected samples. In our case, we used explicit parameters related to coverage and amounts of allowable missing data to confidently reconstruct genotypes from our sampled individuals. While there is no clear standard in the literature, choice of such parameters is a balance between maximizing the number of loci and minimizing null alleles when reconstructing genotypes. Yet, reconstructed genotypes may not be necessary for all study questions, especially those primarily focused on estimating population-level parameters rather than individual-based measures (e.g., admixture coefficients, individual identification, parentage probabilities; but see Buerkle & Gompert, 2013). In such cases, low density genomic scans based on more individuals or sites in the genome may provide highly accurate and precise population parameter estimates, even at as low as 1X coverage (Buerkle & Gompert, 2013;Fumagalli, 2013). Analytical pipelines continue to be developed that implement population genetic analysis methods that account for the statistical uncertainty of NGS data (Fumagalli et al., 2014), with empirical examples now found in the literature (Cahill et al., 2013).
Overall, NGS data and population genomic analyses hold great promise for informing conservation-related studies, substantially increasing the number of markers to allow for more accurate and precise estimates of population structure and demographic parameters (Primmer, 2009), as well as the ability to detect adaptive genetic variation for informing conservation unit delimitation (Funk et al., 2012) and decision frameworks aimed at reducing the long-term impacts of climate change on biodiversity (Hoffmann et al., 2015). Here, we have shown that with careful consideration, genomic data collection is compatible with the non-invasive sampling required in practice for many conservation-related studies.