Development of diagnostic microsatellite markers from whole-genome sequences of Ammodramus sparrows for assessing admixture in a hybrid zone

Studies of hybridization and introgression and, in particular, the identification of admixed individuals in natural populations benefit from the use of diagnostic genetic markers that reliably differentiate pure species from each other and their hybrid forms. Such diagnostic markers are often infrequent in the genomes of closely related species, and genomewide data facilitate their discovery. We used whole-genome data from Illumina HiSeqS2000 sequencing of two recently diverged (600,000 years) and hybridizing, avian, sister species, the Saltmarsh (Ammodramus caudacutus) and Nelson's (A. nelsoni) Sparrow, to develop a suite of diagnostic markers for high-resolution identification of pure and admixed individuals. We compared the microsatellite repeat regions identified in the genomes of the two species and selected a subset of 37 loci that differed between the species in repeat number. We screened these loci on 12 pure individuals of each species and report on the 34 that successfully amplified. From these, we developed a panel of the 12 most diagnostic loci, which we evaluated on 96 individuals, including individuals from both allopatric populations and sympatric individuals from the hybrid zone. Using simulations, we evaluated the power of the marker panel for accurate assignments of individuals to their appropriate pure species and hybrid genotypic classes (F1, F2, and backcrosses). The markers proved highly informative for species discrimination and had high accuracy for classifying admixed individuals into their genotypic classes. These markers will aid future investigations of introgressive hybridization in this system and aid conservation efforts aimed at monitoring and preserving pure species. Our approach is transferable to other study systems consisting of closely related and incipient species.


Introduction
Interspecific hybridization is common in nature (Mallet 2005;Abbott et al. 2013), especially among species in early stages of speciation or in secondary contact (Via 2009;Ellegren et al. 2012). Wild hybrids are a mosaic of phenotypes and genotypes, creating challenges for their taxonomic identification and confusion about their conservation status (Stronen and Paquet 2013). Accurate identification of admixed individuals in wild populations aids evolutionary investigations of introgressive hybridization as well as conservation management.
Studies of genetic admixture are most powerful when they use diagnostic species-specific markers, that is markers that are highly differentiated between the two parental species (Moccia et al. 2007;Hohenlohe et al. 2011). Yet, diagnostic markers are infrequent in the genomes of closely related species, and they are rarely found by anonymous marker development approaches. Current sequencing technologies present solutions to the challenges of identifying diagnostic markers, through efficient development of large genomewide panels of SNPs or microsatellite loci. Despite the advent and potential power of large SNP panels, there are many research questions, including those involving genetic hybrid indices, that can be effectively addressed with a carefully selected suite of highly informative microsatellite markers (Guichoux et al. 2011;Wei et al. 2014;Vukosavljev et al. These young species diverged~600,000 years ago (Rising and Avise 1993), as evidenced by weak genetic divergence (1.2% differentiation at the COI gene and F ST of~0.15 for neutral microsatellite markers; Shriver et al. 2005;Walsh et al. 2011). They co-occur and hybridize in tidal marshes of the New England coast, where they are now in secondary contact.
Hybrid A. caudacutus-nelsoni sparrows are prevalent within the overlap zone and reveal a complex and poorly understood pattern of morphological and genetic introgression (Hodgman et al. 2002;Shriver et al. 2005;Walsh et al. 2011). Currently available microsatellite markers yield low levels of differentiation within and between species (Shriver et al. 2005;Walsh et al. 2012) and lack the resolution to differentiate genotypic classes of admixed individuals (e.g., F1, F2, and backcrossed to each parental species). Difficulties in distinguishing pure and admixed individuals hinder efforts to evaluate the productivity and viability of populations in the hybrid zone, as well as to fully evaluate the geographic extent of introgression. Diagnostic markers are germane for addressing these concerns as well as for investigating patterns and mechanisms of introgressive hybridization.
The aim of this study was to use whole-genome sequence data of A. caudacutus and nelsoni for de novo development of a suite of species-specific diagnostic microsatellite markers with high resolution for identifying pure and hybrid genotypic classes (F1, F2, and backcrosses). To do so, we identified putative diagnostic markers by in silico comparison of repeat sequences in the two genomes, and we screened 37 of them in the laboratory on individuals of both species. We then developed a panel of the 12 most diagnostic markers, which we found through additional screening to be highly suitable for a genetic hybrid index. We evaluated the power of the markers for accurate assignments of simulated individuals to their appropriate hybrid genotypic classes. Our approach and PERL script for identifying diagnostic repeats between two genomes are readily transferrable to other study systems consisting of closely related and incipient species.

Materials and Methods
Sampling and DNA extraction To obtain samples for marker development, we sampled a total of 120 A. caudacutus and nelsoni individuals from multiple putatively allopatric (n = 48 individuals of each species) and sympatric (n = 24 individuals) locations along the northeastern coast of the United States, within and north and south of the species' overlap zone (Table 1). Adult sparrows were captured using 12-m mist nests with size 36-mm mesh. Blood samples (10-20 lL)  were collected from the brachial vein onto Nobuto blood filter strips (Advantec MFS Inc., Dublin CA). For de novo marker development, two additional females, one nelsoni captured from Penobscot, Maine, and one caudacutus captured from Middletown, Rhode Island, were bloodsampled for whole-genome sequencing. These individuals were assumed to be "pure" for each parental species, as they were sampled from locations outside of the currently recognized hybrid zone (Hodgman et al. 2002). DNA was extracted from blood samples using a DNeasy Blood Kit (Qiagen, Valencia, CA).

Genome sequencing and assembly
Illumina TruSeq DNA libraries were generated including electrophoretic, gel-based, manual size selection targeting an average insert size of 300 bp. Whole-genome 100-base pair, paired-end sequencing was performed in two separate lanes on an Illumina HighSeqS2000. This resulted in 213,519,998 and 384,563,744 100-base-pair reads for A. caudacutus and nelsoni genomes, respectively. De Novo assembly of each genome was constructed from the paired reads (after filtering out reads with any ambiguous nucleotides -Ns) using the CLC Genomics Workbench 4.5.1 (CLC Bio, Aarhus, Denmark). Assembly parameters were as follows: kmer size = 26, bubble size = 50, mismatch cost = 2, insertion and deletion costs = 3, length and similarity fractions of 0.5 and 0.8, respectively, and scaffolding set to true. The draft assembly for A. caudacutus is comprised of 237,108 contigs (largest contig is 188,803 bp) and the A. nelsoni assembly is comprised of 142,556 contigs (largest contig is 442,557 bp). N50 contig sizes are 12,145 and 30,931 bases, with 21X and 37X average coverage for the A. caudacutus and A. nelsoni genomes, respectively. Total assembled genome sizes were approximately 1 GB for each species.

Diagnostic loci identification and primer development
We used the program MSATCOMMANDER version 1.0.8 (Faircloth 2008) to identify repeat motifs (tri-and tetranucleotides) within assembled contigs of the A. nelsoni genome that were larger than the N50 contig length. To identify diagnostic repeat sequences, a custom PERL script (Appendix A1) was developed to identify repeat sequences that were common to both species and to compare the repeat numbers between the two genomes. Our script searched the assembled A. caudacutus genome for the same 20-base-pair flanking sequences on either side of the repeat regions identified in the A. nelsoni genome. Reverse complement sequences were similarly searched.
Using this filtering process, we identified 1030 tri-and tetranucleotide loci that were common to both genomes. To increase the probability of finding diagnostic markers, we focused on sequences with at least four matching repeats and that differed by 3-10 repeats between species. This resulted in 79 loci; we narrowed this list down further to include only those loci (n = 42) that differed by 4-10 repeats. Primers were designed with PRIMER 3 version 0.4.0 (Rozen and Skaletsky 2000), using default parameters, for 37 of these putatively diagnostic loci. To assess the distribution of the 37 loci across the genome, we used BLASTn, with an E value of <1e À75 and >80% identity score, to identify the chromosome in the Zebra Finch genome where each repeat sequence was located and annotations when available (Table 2). We use the Zebra Finch because it is a well-annotated genome and synteny is high in avian genomes (Warren et al. 2010;Ellegren et al. 2012).

Genotyping and microsatellite characterization
To test the 37 diagnostic loci for amplification, we chose two individuals of each species. Polymerase chain reactions were prepared in 12.5 lL reactions and contained 2 lL of eluted genomic DNA, 0.4 lmol/L of each primer, 2.5 mmol/L MgCl 2 , 5X PCR buffer (Promega, Madison, WI, USA), 0.2 mmol/L of deoxyribonucleotides, and 1 unit of Taq DNA polymerase (Promega). Cycling conditions were as follows: initial denaturation at 94°C for 4 min, followed by 30 cycles of 94°C for 30 sec, 56°-63°C for 45 sec, 72°C for 1 min, and a final extension step at 72°C for 5 min. PCR products were resolved on a 1% agarose gel. Of the 37 primers, 34 consistently amplified the target regions in both species and were used for an initial screening of 24 individuals from eight allopatric marshes (Table 1). PCR was repeated with the addition of 0.04 mmol/L of fluorescently labeled ChromaTide Alexa Fluor 488-5-dUTPs (Invitrogen, Life Technologies, Grand Island, NY, USA) to allow for the visualization of amplified products on an automated DNA sequencer (ABI 3130 genetic analyzer, Applied Biosystems, Foster City, CA).
To characterize the diagnostic potential of these 34 loci, we counted the number of alleles shared between the species across the 24 allopatric individuals (Table 2). We chose 12 loci with the fewest number of shared alleles and the most variation in the distribution of alleles to screen further as a panel of putatively diagnostic loci. These 12 chosen loci were screened in an additional 96 individuals (36 allopatric and 12 sympatric individuals of each species), using dye-labeled primers (HEX, FAM, or NED) in two multiplex PCRs. The 15 lL polymerase chain reactions contained 3 lL of eluted genomic DNA,   0.1-0.3 lmol/L of each dye-labeled primer, 2.0 mmol/L MgCl 2 , 5X PCR buffer (Promega), 0.1 mmol/L of deoxyribonucleotides, and 1 unit of Taq DNA polymerase (Promega). We used the same cycling conditions described above with a 60°C annealing temperature for all loci. Amplified products were again electrophoresed on an ABI 3130 automated DNA sequencer, and individual genotypes were scored manually using PEAKSCANNER software (ABI). For the 12 diagnostic loci, the number of private alleles, allele frequencies, and estimates of expected and observed heterozygosities were calculated for allopatric individuals in GENALEX, version 6.41 (Peakall and Smouse 2006). The proportion of shared alleles was calculated for each locus as the number of alleles shared between allopatric A. nelsoni and A. caudacutus divided by the total number of alleles. The frequency of the most common allele in each species was calculated in GENALEX. We performed selection tests for the 12 loci using an F ST outlier approach (Beaumont and Nichols 1996) in LOSITAN (Antao et al. 2008). Tests of Hardy-Weinberg equilibrium and linkage equilibrium were conducted in GENEPOP, version 4.2 (Raymond and Rousset 1995). Significance was assessed with a Bonferroni correction for multiple tests. Locus-specific F ST values were also calculated for all pairwise combinations of allopatric and sympatric sparrows in GENEPOP. We used a Bayesian clustering method implemented in the program STRUCTURE v. 2.3.4 (Pritchard et al. 2000) to assess how membership proportions differed between allopatric and sympatric populations of both species. We ran ten replications with K = 2, using the admixture model with correlated allele frequencies and a 100,000 burn-in followed by 100,000 iterations.

Power assessment of diagnostic marker panel
We assessed the power of the panel of 12 diagnostic markers by evaluating the accuracy of each locus in assigning known individuals to hybrid classes. We simulated 100 genotypes for each of six genotypic classes (pure A. nelsoni, pure A. caudacutus, backcrossed A. nelsoni, backcrossed A. caudacutus, F1 hybrids, and F2 hybrids) using the program HYBRIDLAB 1.0 (Nielsen et al. 2006). Simulated individuals were analyzed using the program NEWHYBRIDS 1.1 BETA (Anderson and Thompson 2002); we used the z and s option to identify the 36 pure individuals of each species as known reference individuals. We ran NEWHYBRIDS using the default options with 200,000 sweeps and a 200,000 burn-in. We calculated mean posterior probabilities of the individuals assigned to each category and the percentage of correctly assigned individuals. Individuals were considered correctly assigned when their true category was the category with the highest posterior probability.

Marker development and characterization
Sizes of the repeat regions for the 34 markers ranged from 112 to 284 bp, and loci were variably polymorphic with 2-12 alleles (Table 2; see Appendix A2 for allele frequency data). Mean observed and expected heterozygosities ranged from 0.133 to 0.917. Eighteen loci showed significant deviations from Hardy-Weinberg in one or both species at P < 0.05, and 6 loci showed deviations in one or both species after Bonferroni correction (P < 0.0007; Table 2). These deviations are not unexpected and most likely result from a Wahlund effect (Wahlund 1928), given that samples for each species were collected from a diversity of geographic locations, potentially comprised of distinct populations. The number of shared alleles between species ranged from 0 to 6 across the 34 loci. Across the panel of 12 diagnostic loci, no pairs showed significant deviations from linkage equilibrium. Two loci (Ammo012 and Ammo015) were candidates for positive selection.

Resolution and power of the diagnostic marker panel
The proportion of shared alleles between allopatric A. caudacutus and A. nelsoni at the 12 diagnostic loci ranged from 0.11 to 0.95, with the frequency of most common alleles as high as 1.0 in A. caudacutus and 0.984 in A. nelsoni (Table 3). The number of private alleles ranged from 1 to 12 among allopatric populations. Locus-specific F ST values between allopatric A. nelsoni and A. caudacutus ranged from 0.21 to 0.81 with a global F ST of 0.46 (Table 4). Differentiation between sympatric A. nelsoni and allopatric A. caudacutus was similar to that of the two allopatric populations; however, differentiation between allopatric A. nelsoni and sympatric A. caudacutus and between sympatric populations of each species was slightly lower (Fig. 2). F ST values for within-species comparisons were much lower (0.004 to 0.027 overall; Table 4; Fig. 2). STRUCTURE Q values (proportion of the genome attributed to the parental species, with 1 being pure caudacutus and 0 pure nelsoni) for allopatric individuals were above or below the pure species cutoffs of 0.9 and 0.1, respectively. Introgression was apparent in sympatric individuals, however, with slightly lower Q values, including some above/below the pure species cutoffs, especially for sympatric A. caudacutus (Fig. 3). Based on the 12 diagnostic microsatellite markers, NEWHYBRIDS assigned 92% of all the simulated individuals to their true category. Assignment accuracies varied for the categories, ranging from 76% (F2) to 100% (pure A. caudacutus), with posterior probabilities for the correctly assigned categories ranged from 0.75 (F2) to 0.991 (A. caudacutus; Table 5). Pure individuals had the highest percentage of correct assignments with 98% (A. nelsoni) and 100% (A. caudacutus) of individuals correctly assigned with posterior probabilities of 0.947 and 0.991, respectively. F1 individuals were also assigned with high accuracy (97% and posterior probability of 0.936). F2 individuals were the most difficult to assign, especially with respect to distinguishing them from backcrossed individuals, with 76% of individuals correctly assigned with a mean posterior probability of 0.75. For backcrossed individuals, 91% were assigned at nearly identical mean posterior probabilities of 0.844 and 0.843. The majority of misassignments were between backcrossed and F2 individuals. There were no instances where backcrossed A. nelsoni were assigned as backcrossed A. caudacutus and vice versa.

Discussion
While current sequencing technologies afford the potential for generating tens of thousands of genomewide markers for population genomics research (Davey et al. 2011), not all research and conservation applications will require genomewide data (Allendorf et al. 2010). For such applications, including research questions focused on discerning processes for closely related individualssuch as dispersal, kinship, population structure, and admixturean informative panel of microsatellite markers will remain a valuable tool (Guichoux et al. 2011;Wei et al. 2014;Vukosavljev et al. 2015). In our case study of an avian hybrid zone, we highlight the utility of a carefully selected, high-resolution panel of microsatellite markers for discriminating genotypic classes of pure and admixed individuals. Our strategy for diagnostic marker discovery via in silico screening for microsatellite repeat differences in two species' genomes eliminates the laborious process of manually screening markers in the laboratory. As such, this efficient and highly effective approach should prove useful for other studies requiring diagnostic microsatellite markers for closely related species.
From whole-genome sequence data, we identified 34 polymorphic and diagnostic or partially diagnostic microsatellite markers that amplified in both A. caudacutus and A. nelsoni. We developed a panel of the 12 loci with the fewest shared alleles between species. All markers in this diagnostic panel amplified consistently using the same routine PCR conditions, making them highly conducive  for multiplexing. We demonstrated the power of these loci for the identification of pure and admixed individuals in this avian hybrid zone. After screening the 12 diagnostic markers on 96 sparrows from allopatric and sympatric sites, we found them to be highly informative for species discrimination. This panel of loci had high resolution for classifying pure and admixed individuals into their genotypic classes. The markers were most powerful for distinguishing among F1, backcrossed, and parental groups (with 91-100% accuracy), while F2s were difficult to distinguish from either F1 or backcrossed groups (76% accuracy). For all loci, the most common allele differed between the two species; this allele was typically rare (<0.05%) in the other species. Only one locus, Ammo030, showed a fixed allele in either specieswith a single allele in A. caudacutus and three private alleles in A. nelsoni. An additional four markers had an allele with >90% frequency in one of the two species. While most of the markers exhibited a relatively large portion of shared alleles between species (0.11-0.98), allele frequency distributions differed strongly between the species, and all loci had at least one private allele. Locus-specific F ST s indicated strong divergence (ranging from 0.2144 to 0.819, with overall F ST = 0.4667)  between allopatric populations of each species. In comparison, anonymous neutral loci yielded a between-species F ST of 0.15 (Shriver et al. 2005). The lack of microsatellite loci with fixed differences between the two species is notable, given our whole-genome approach. By comparing all of the microsatellite repeats identified from whole-genome shotgun sequencing, we only found 79 loci to differ in repeat numbers between the two genomes (and of these, only 42 met our criteria of differing by four or more repeats). Our resulting panel of 12 diagnostic loci therefore likely represents the largest microsatellite differences across the genomes of these two species. The overall similarity in microsatellite repeats between the two species exemplifies their close relationship as recently diverged sister species and suggests that high genetic similarity is characteristic not only at the mitochondrial level (Rising and Avise 1993;Walsh et al. 2011), but also potentially on a genomewide level. This finding gives insight into genomic similarity of hybridizing taxa and highlights the challenges of identifying diagnostic markers for recently diverged species, as well as the utility of whole-genome sequencing in highresolution marker development.
The elevated divergence of the diagnostic panel relative to neutral loci previously used in this system (Shriver et al. 2005;Walsh et al. 2012) suggests that these loci may be under selection (Storz 2005;Strasburg et al. 2012). Selection tests identified two of the markers to be under selection in this dataset; however, further research with more targeted sampling schemes may identify additional selected loci. This is supported by the fact that 10 of the 34 (6 of 12 diagnostic) loci aligned with an annotated protein-coding region of the Zebra Finch genome (Table 2). These markers may be associated with a region of the genome with a functional role that diverges between the two species; accordingly, these may be  important portions of the genome with respect to speciation. As allelic changes induced by natural selection occur faster than those due to neutral processes (Nei 1987), high-resolution gene-associated markers are more powerful than neutral markers for applications that require assigning individuals to distinct population or species groupings (Nielsen et al. 2009(Nielsen et al. , 2012. The pattern of between-species divergence that we found using the 12 diagnostic markers in this study was fairly consistent across sympatric and allopatric populations. While F ST s were highest for allopatric comparisons of the species, they were only slightly lower for comparisons that included sympatric populations, suggesting divergence at these loci is maintained in the face of interspecific gene flow in the hybrid zone (Walsh et al. 2011(Walsh et al. , 2015. These markers therefore appear to be associated with gene regions that do not introgress freely between the two species. For within-species comparisons, the F ST s are slightly lower within A. caudacutus than within A. nelsoni, supporting the hypothesis that introgression is biased in the direction of the A. caudacutus genome (Shriver et al. 2005;Walsh et al. 2011).
The low within-species divergence we found in this study is an expected outcome, especially for diagnostic markers. F ST among A. caudacutus in this study is similar to that previously reported by Walsh et al. (2012) using anonymous neutral loci. Despite high levels of gene flow, Walsh et al. (2012) found evidence for fine-scale population structure within A. caudacutus. The sampling scheme in the current study, however, was not designed for evaluating within-species population structure, as pooling across many geographically separate sympatric or allopatric marshes likely masks some of the underlying population differentiation. Nonetheless, the higher within-species F ST found in A. nelsoni compared to A. caudacutus suggests that a finer scale population genetic structure may be characteristic of the former. More pronounced population structure in A. nelsoni relative to A. caudacutus is consistent with differences in the species' distributions and demographynelsoni typically occur in smaller numbers in small marshes that tend to be more spatially disjunct than the larger, more continuous coastal marshes typically occupied by caudacutus (J. Walsh and A. Kovach, pers. obser.). These are the first population genetic data collected on A. nelsoni; future research with a more robust sampling scheme is warranted to characterize population genetic structure in this species. The preliminary data in this study suggest that these markers will be useful for such within-species population comparisons.
In conclusion, our comparative, whole-genome approach has proven useful for identifying high-resolution diagnostic markers in sister species with high genetic similarity. This approach is superior to anonymous mar-ker development, not only because it enables pinpointing species-specific differences, but also because it links the markers to large contigs that can be mapped to genomic regions. The markers identified in this study will aid future research that requires distinguishing pure and admixed individuals in the A. caudacutusnelsoni hybrid zone, as doing so from morphology alone is unreliable (Walsh et al. 2011(Walsh et al. , 2015. A hybrid index based on 12 diagnostic microsatellite markers provides an inexpensive and simple genetic assay. This diagnostic assay for hybrid identification will prove valuable in efforts that seek to track shifts in species distributions, which is of particular relevance to the conservation of threatened A. caudacutus populations (Shriver et al. 2005;Walsh et al. 2011). The diagnostic marker panel will also be useful for studies of evolutionary ecology, such as providing insight into the rates and direction of introgression and estimates of the width and center of the hybrid zone (Barton and Gale 1993). Our marker development approach is easily transferable to other studies, and we provide our PERL script for comparing repeat sequences of two genomes as an appendix.
reads are deposited in the Short Read Archive (SRS893469 and SRS897222). Our PERL script for screening repeat sequences across the two genomes is provided as Appendix A1, and allele frequencies for the panel of 12 diagnostic loci are in Appendix A2. Genotypes of the 96 individuals, simulated genotypes used to evaluate the marker panel, and the assembled N50 contigs of A. nelsoni and A. caudacutus are deposited in Dryad (http:// dx.doi.org/10.5061/dryad.f8m84).
#-target <input.fasta> Name of the target fasta file to scan through with flanking sequences.
#-flank_length <integer> Specifies the number of flanking nucleotides used in the regex search.