Genotype by sequencing identifies natural selection as a driver of intraspecific divergence in Atlantic populations of the high dispersal marine invertebrate, Macoma petalum

Abstract Mitochondrial DNA analyses indicate that the Bay of Fundy population of the intertidal tellinid bivalve Macoma petalum is genetically divergent from coastal populations in the Gulf of Maine and Nova Scotia. To further examine the evolutionary forces driving this genetic break, we performed double digest genotype by sequencing (GBS) to survey the nuclear genome for evidence of both neutral and selective processes shaping this pattern. The resulting reads were mapped to a partial transcriptome of its sister species, M. balthica, to identify single nucleotide polymorphisms (SNPs) in protein‐coding genes. Population assignment tests, principle components analyses, analysis of molecular variance, and outlier tests all support differentiation between the Bay of Fundy genotype and the genotypes of the Gulf of Maine, Gulf of St. Lawrence, and Nova Scotia. Although both neutral and non‐neutral patterns of genetic subdivision were significant, genetic structure among the regions was nearly 20 times higher for loci putatively under selection, suggesting a strong role for natural selection as a driver of genetic diversity in this species. Genetic differences were the greatest between the Bay of Fundy and all other population samples, and some outlier proteins were involved in immunity‐related processes. Our results suggest that in combination with limited gene flow across the mouth of the Bay of Fundy, local adaptation is an important driver of intraspecific genetic variation in this marine species with high dispersal potential.


| 8059
METIVIER ET al. The glacial cycles during the Pleistocene resulted in major range changes, local extirpations, and the establishment of refugial populations for many species of coastal marine invertebrates in the North Atlantic (Hewitt, 1996(Hewitt, , 2000. For species that did not exclusively survive in North American refugia, recolonization events from Europe have influenced, in part, the contemporary species composition of the intertidal section of the Atlantic coast following the LGM, especially for obligate rocky shore species (Ingolfsson, 1992;Wares & Cunningham, 2001). In contrast, species found on soft substrate have a higher level of endemism on the northwestern Atlantic coast, which indicates that these species likely survived the LGM in refugia (Cronin, 1988;Ingolfsson, 1992). The main southern refugial areas for northwestern Atlantic species occurred along the Carolinas, south into Florida and the Gulf of Mexico for marine taxa, with northern refugia suggested for some species (Maggs et al., 2008;Marko, 2004;Wares & Cunningham, 2001;Young, Torres, Mack, & Cunningham, 2002).
One or more refugia have been proposed within the Atlantic Canadian coastline itself, including genetic evidence for a Nova Scotia refugium (Bernatchez, 1997;Maggs et al., 2008;Young et al., 2002). For species surviving in southern refugia, postglacial recolonization is characterized by expansion of a subset of the southern genotypes, resulting in less diversity in the newly colonized northern populations (Hewitt, 1996(Hewitt, , 2000Maggs et al., 2008). However, such genetic differentiation is not expected to be directly correlated with distance between populations in marine environments as a result of dispersal abilities, including planktonic larvae and rafting of marine invertebrates, or recent recolonization events (White et al., 2010).
Dispersal ability of species also plays an important role in determining population structure over time. Panmixia is predicted for species having long-lasting planktonic larval stages as these larvae may disperse with surface currents, although the frequency of longdistance dispersal is dependent on multiple factors, including the size of the parent population, the probability of larvae being carried offshore into the major ocean current systems, and larval mortality (Doherty, Planes, & Mather, 1995;Scheltema, 1971). In addition, dispersal distance depends on the length of time larvae spend in the plankton and the duration of time they are physiologically capable of settling (Siegel, Kinlan, Gaylord, & Gaines, 2003;Watson, Kendall, Siegel, & Mitarai, 2012). A direct correlation has been suggested between planktonic larval duration and larval displacement (Siegel et al., 2003;Waples, 1987). However, pelagic larval duration has been only weakly correlated to F ST (Weersing & Toonen, 2009) likely due to active vertical migration in the water column, temporal variability, and oceanographic features such as mesoscale eddies, alongshore jets, and squirts (Watson et al., 2012;Weersing & Toonen, 2009;White et al., 2010). For species with nonplanktonic larval development, transportation of egg cases and juveniles by rafting on microalgal mats or ice appears to be common (Grantham et al., 2003;MacFarlane, Drolet, Barbeau, Hamilton, & Ollerhead, 2013;Marko, 2004).
The Bay of Fundy (BF) in the northwest Atlantic is an important phylogeographic break for at least two species of marine invertebrates lacking a pelagic larval stage, Hediste diversicolor and Corophium volutator (Einfeldt & Addison, 2013;Einfeldt, Doucet, & Addison, 2014), as well as two species of planktonic dispersers, Macoma petalum and Tritia obsoleta (Einfeldt, Zhou, & Addison, 2017). Restricted gene flow across this boundary provides the opportunity for populations to diverge as a result of both genetic drift and natural selection. The Gulf of Maine (GOM) coastal currents flow in a counterclockwise circular pattern, impeding the mixing of waters in the Gulf of Maine and the Bay of Fundy (Pettigrew et al., 2005;White et al., 2010; Figure 1). However, reversals of flow direction caused by westerly winds are relatively common in the coastal zone, and can spread egg and larval drift in the opposite direction (Brickman, 2014). Patterns of mtDNA variation in all four species studied to date suggest a strong role for limited gene flow and genetic drift in maintaining the genetic break across the mouth of the Bay of Fundy.
In addition to neutral processes driving genetic differences among populations, the genetic split at the Bay of Fundy could also result from local adaptations caused by natural selection. Locally, Chu, Kaluziak, Trussell, and Vollmer (2014) identified putative selection for heat stressmediating genes using next-generation sequencing (NGS) techniques based on thermal tolerance differences in Nucella lapillus, a species lacking a planktonic life stage. They found relatively high levels of gene flow, likely due to rafting in this species, but also genetic splits between the East and West Gulf of Maine Coastal Currents and Nova Scotia.
Semibalanus balanoides, an intertidal barnacle with pelagic larvae, was found to exhibit genetic structure both between the Gulf of Maine and the Gulf of Saint Lawrence (GSL) and within the GSL, and this differentiation within the GSL was attributed to the potential effects of selection due to temperature differences (Holm & Bourget, 1994). Therefore, limited gene flow across the Bay of Fundy/Gulf of Maine biogeographic break may promote the accumulation of genetic divergence driven by natural selection and local adaptation in marine invertebrates.
Macoma petalum's long-lived pelagic larvae, and thus high dispersal potential, lack of recolonization from Europe following the LGM, and endemism to the northwestern Atlantic, make it the ideal candidate to study the evolutionary mechanisms driving the biogeographic break at the Bay of Fundy.
We examined M. petalum from the Gulf of Maine (GOM), the Gulf of Saint Lawrence (GSL), the Bay of Fundy (BF), and Nova Scotia (NS) to study the roles of selection and genetic drift driving patterns of genetic differentiation between these regions. We aimed to build on previous works that used mtDNA and limited nuclear markers to investigate the phylogeographic break at the Bay of Fundy (Einfeldt & Addison, 2013;Einfeldt et al., 2014Einfeldt et al., , 2017. Our main goal was to determine whether this genetic break was primarily a result of limited gene flow and random genetic drift, or whether differential selective pressures within the Bay of Fundy were responsible for driving genetic change. As previous research employed putatively neutral mtDNA variation, we similarly expect small but significant differentiation across the nuclear genome as a result of limited gene flow and genetic drift throughout the region. In contrast, if populations of M. petalum are locally adapted to the mudflats in the Bay of Fundy, then we expect to detect a handful of loci with signatures of strong differentiation between regions. By assessing the levels of genome wide genetic structure, our research will provide insight into the importance of both dispersal limitation and local adaptation to the formation of the biogeographic break observed at the community level across the mouth of the Bay of Fundy.

| Preparing and assembling transcriptome reference
Pooled reads from the M. balthica transcriptome were downloaded from the NCBI SRA database (Accession SRX145744-6; Pante et al., 2012) and imported into CLC Genomic Workbench 8.0.3 (CLC Bio).
We trimmed the raw reads for a maximum of two ambiguous nucleotides, a minimum PHRED score of 20, and a minimum length of 50 bp.
Additionally, we trimmed out short population-specific 5′ repeats and adapters from the original MINT cDNA Synthesis Kit (Evrogen) used to make the cDNA library.
F I G U R E 1 Map of sites sampled for Macoma petalum between May 2011 and July 2014 in four main regions: the Gulf of Saint Lawrence, Nova Scotia, the Bay of Fundy, and the Gulf of Maine (see Table 1 for abbreviations). Pie charts represent the frequency of mtDNA haplotypes (704-bp COI/COIII, n = 194) from each site belonging to the four main lineages described in Einfeldt et al. (2017). Eastern Maine Coastal Current (EMCC) and the Western Maine Coastal Current (WMCC) are indicated, and all current patterns are adapted from Aretxabaleta et al. (2008), Pettigrew et al. (2005), and Drinkwater and Gilbert (2004). Map data ©2016 Mapbox (www.mapbox.com/about/maps/) Trimmed reads were assembled in CLC 8.0.3 with default settings, and were remapped with a length fraction of 0.5 and a similarity fraction of 0.8 as these criteria yielded the most contigs with two or more contributing reads. We used less stringent parameters than Pante et al. (2012) in creating the transcriptome reference to avoid excluding informative SNPs due to genetic differences between M. balthica and M. petalum. We kept contigs with a minimum of two contributing reads and retained high-quality singletons.
We filtered out duplicate contigs in CLC 8.0.3 by performing a BLASTn search of the transcriptome assembly against itself.
Sequences were discarded if they were a perfect match to another, longer sequence. We retained only representative contigs and singletons to create a partial reference transcriptome for analysis of Illumina reads.

| Library construction and purification
We digested 100 ng of genomic DNA from each individual with two restriction enzymes, 5 U/μl of Sau96I and 10 U/μl of MluCI (BioLabs) with 1× buffer in a 20 μl reaction. Digestion was completed in a thermocycler at 37°C (2 hr), 80°C (20 min), and 4°C (hold). Barcoded Illumina adapters were ligated to digested DNA at 22°C (2 hr), 65°C (30 min), and 4°C (hold). We used 60 unique barcodes (0.0125 μmol/L) Four of the five 40 μl PCRs were pooled for each library prior to an additional purification using the QIAquick PCR Purification Kit (QIAGEN), with a final elution volume of 25 μl per library. The remaining PCR product for each pool was used to verify success of PCR in a 1.5% agarose gel and to determine the concentration of DNA after purification.
We ran the two pooled libraries of PCR products in a 1.5% agarose gel (75 V, 1 hr) for a gel extraction. We used the section of the gel between 200 and 700 bp to ensure removal of nonspecific bands

| Processing raw Illumina reads
Raw Illumina reads were downloaded from the Nanuq portal and imported into CLC 8.0.3. We demultiplexed the reads based on Yadapter and unique barcodes, discarding reads lacking barcodes.
Reads were trimmed for a minimum PHRED score of 30, a limit of two ambiguous nucleotides, and for a minimum length of 30 bp. Reads were then mapped to our partial M. balthica transcriptome reference using the CLC Assembly Cell 4.3 algorithm with default parameters, a length fraction of 0.9, and similarity fraction of 0.8. A consensus sequence was extracted from this mapping, with any unmapped regions filled in by the transcriptome reference assembly, and used as the reference for all future bioinformatics analyses (e.g., variant detection and BLASTX to identify outlier loci). Variants were detected for each individual and for the reference as a whole using CLC's Basic Variant detection tool. All variants except single nucleotide variants (SNVs) were excluded, and only those SNVs that had at least two reads of coverage for each of the 60 individuals were kept. Therefore, only biallelic SNPs with coverage on the transcriptome reference as well as in all individuals were kept for future use. SNPs were filtered for a minor allele frequency of ≥0.05 to obtain a final SNP dataset for use in future analysis.

| Assessing population structure
We tested for population structure using the program STRUCTURE 2.3.4 (Falush, Stephens, & Pritchard, 2003;Pritchard, Stephens, & Donnelly, 2000). STRUCTURE probabilistically assigns individuals to k detected clusters based on their allele frequencies. The program was independently run on the total SNP dataset and the dataset containing only putatively neutral unlinked SNPs. Simulations were run with the admixture model, correlated allele frequencies, a burn-in period of 50,000 and 500,000 MCMC repetitions for 1 < k < 5 with five iterations for the full dataset and the neutral dataset, and 10 iterations for the outlier datasets. The Evanno, Regnaut, and Goudet (2005) method was implemented in STRUCTURE HARVESTER (Web 0.6.94; Earl & vonHoldt, 2012) to infer the k clusters that best fit the data.
We also performed an analysis of molecular variance (AMOVA) in Arlequin 3.5.2.2 with 10,000 permutations using the full SNP dataset, the outlier dataset, and the neutral dataset to partition variance among regions, among populations within regions, and within populations.
Populations were assigned to one of the two regions indicated by the structure analysis.
We performed a principle components analysis (PCA) to visualize population structure at a greater resolution. PCA reduces the genetic variation of multivariate data into principle components, which are then visualized in a plot to infer genetic relationships between individuals. We used the adegenet 2.0.0 package (Jombart, 2008;Jombart & Ahmed, 2011) in R (version 3.3.2; R Core Team 2016) to perform the analysis with the dudi.pca algorithm. PCA was run on the combined SNP dataset, the neutral dataset, and the outlier loci datasets. The assembled M. balthica partial transcriptome reference yielded 9,795 contigs, with 60.70% of reads remapped. The N50 statistic for these contigs was 633, meaning that 50% of the bases were found on contigs equal to or >633 bp in length. After filtering for duplicate or highly similar sequences, 9,299 contigs were retained. From this assembly, 90,600 high-quality singletons were also retained and used with the contigs to create the partial transcriptome reference.

| Assembly and descriptive NGS results
The partial M. balthica transcriptome assembled by Pante et al. We detected a total of 1.199 million variable sites in the reference sequence, 454,264 of which had at least two contributing reads, including indels, replacements, and multinucleotide variants. We filtered this down to 14,310 SNPs with a minimum of two reads of coverage per individual, and the average depth of the heterozygous SNPs (no. of heterozygous SNP-detected reads/no. of heterozygous SNPs;

| Summary statistics
Summary statistics for each sampling site and region consisted of the proportion of polymorphic loci, observed heterozygosity, and expected heterozygosity averaged across all loci ( Table 1)

| Outlier detection
Using the results of the STRUCTURE analysis as a guide, we

| Population structure
We analyzed the full (2,583 loci), neutral (1,089 loci), and outlier (93 loci) SNP datasets in STRUCTURE and used the method described by Evanno et al. (2005) to determine the most likely number of k clusters.
The analysis of the LnP(D) for all three datasets indicated that k = 2 was the most likely number of clusters (Figures 2 and S1). AMOVA for all three datasets yielded significant fixation index values (p < .05). The full SNP dataset (2,583 loci) yielded a Φ CT of 0.0122 between the two regions, indicating a small but significant differentiation between BF and elsewhere (Table 3) Table 2).
The results of PCA using the putative directionally selected outliers between BF and all other regions are shown in Figure 3. This outlier dataset clearly showed that the BF was separate from all other regions and that the GOM, GSL, and NS tended to group together. The first two principle components here explained 22.7% of the total variation.
For the neutral dataset of 1,089 SNPs, the BF overlapped greatly with the GOM and NS regions and much less structure overall was present, although the GSL seemed to be less similar to the other three regions ( Figure 4). The first two principle components of the neutral dataset explained only 6.37% of the total variation.

| Outliers and evidence of selection
Outlier detection methods as well as BLASTX were used to identify genes putatively under selection and to determine potential targets of selection in M. petalum (Table 4). The 2,583 SNPs used for analyses were found on 1,415 unique contigs and singletons. Of these unique sequences, 439 returned a BLAST hit with an e-value ≤10 −4 , 279 of these sequences additionally had a match using Interpro scan, and 231 returned a GO ID. The top hit BLAST results came from several species, including multiple hits from Crassotrea gigas, the Pacific Oyster, Strongylocentrotus purpuratus, the purple sea urchin, Branchiostoma floridae, the amphioxus, and Tribolium castaneum, the red flour beetle.
Other hits came from arthropods, vertebrates, and from a gill symbiont.
Ninety-three outliers from 83 contigs and singletons putatively under directional selection were compiled from LOSITAN and BayeScan results between the Bay of Fundy and the other three regions. Of these outliers, 14 contigs returned a BLAST hit and had GO terms assigned (Table S1). From the consensus between LOSITAN and BayeScan, six of the putatively directionally selected outliers returned a BLAST hit.
The six outlier sequences with BLAST results contained nine outlier SNPs and two neutral SNPs (Table 4). Of these, four contigs had a locus-specific F ST > 0.5 and two had GO terms assigned (Table 4). No alternative alleles were fixed between regions.

| DISCUSSION
The recent advances of reduced representation next-generation suggesting that natural selection and local adaptation play a key role in driving genetic change among the regions. Outlier tests indicate that the allele frequencies at a subset of the SNPs appear to be driven by natural selection, in particular for genes whose functions include immunity (Table 4).

| Population genetic structure
Our results are broadly consistent with analyses of mtDNA sampled from the same individuals (Einfeldt et al., 2017). We detected greater pairwise F ST among the northern populations (GSL and NS) than those sampled in the Bay of Fundy and Gulf of Maine (Table 4), suggesting that higher levels of gene flow connect populations in the south.   Figure 1).
Both of these branches of the coastal current exhibit cyclonic circulation (Pettigrew et al., 2005). More importantly, the EMCC cycles counterclockwise from the mouth of the Bay of Fundy along the eastern coast of Maine to the Penobscot Bay (Pettigrew et al., 2005). Circular currents are known to prevent mixing (White et al., 2010). Therefore, the EMCC likely minimizes gene flow between the Bay of Fundy and the Gulf of Maine, as seen in C. volutator, H. diversicolor, and T. obsoleta (Einfeldt & Addison, 2013;Einfeldt et al., 2014Einfeldt et al., , 2017. Tilburg et al. (2012) showed that the EMCC can be a barrier to larval dispersal in the absence of wind-driven waves, but does not completely prevent

| Putative selection in protein-coding genes
In addition to reduced gene flow and genetic drift, our results indicate that natural selection may also be driving differentiation between the  Table   4). The signal of differentiation between the BF and the GOM was present for neutral loci (Figure 2), but is stronger for putatively selected loci, as visualized in the PCAs of neutral-only and outlier-only loci (Figures 3 and 4). Additionally, Φ CT among regions was 20 times greater for outlier SNPs than for neutral SNPs (Table 3).
No alternative alleles were fixed between regions; therefore, allele frequencies were used to investigate the differential presence of the alternate allele between regions. Due to the nature of reduced representation genome sampling with restriction enzymes, wherein null alleles are excluded, and that only SNPs are considered here, allele frequencies are biased to homozygosity (Arnold, Corbett-Detig, Hartl, & Bomblies, 2013;Davey et al., 2013;Puritz et al., 2014). This bias is the result of polymorphisms in the restriction enzyme cut sites being excluded along with structural variants. Therefore, actual frequencies should be considered with caution, and the trend between regions should be the main focus.
The outliers with BLAST results had functions generally related to immunity. These results include asialoglycoprotein receptor subunit 2 (ASGPR2) and 1,2-dihydroxy-3-keto-5-methylthiopentene dioxygenase (membrane-type 1 matrix metalloproteinase cytoplasmic tail binding protein-1 or MTCBP-1). Two top BLAST hits represented predicted uncharacterized proteins, and two were related to transposable elements. Frequencies of the alternative and the reference alleles are available in Table 4. Additionally, a truncated form of MCTBP-1 allows the hepatitis C virus to replicate in nonpermissive cell lines (Yeh, Lai, Chen, Chu, & Liaw, 2001). Overall, ASGPR2 and MTCBP-1 may be under selection in M. petalum against unique infective agents among the sampled regions driving the variation in allele frequencies at these loci.
Two of the outlier contigs represent predicted, uncharacterized proteins. One of these uncharacterized proteins had two SNPs, only one of which was under putative selection with a locus-specific F ST > 0.5 (Table 4). The top BLAST hit organism for this protein was from the Solemya velum, the littoral-dwelling Atlantic awning clam.
Gamma proteobacterial gill symbionts are chemoautotrophs that live intracellularly within the gills of clams in species-specific interactions (Eisen, Smith, & Cavanaugh, 1992). Macoma petalum is unlikely to host such symbionts as this species possesses siphons. The gills of several species of siphon-possessing bivalves were examined and no bacteriocytes have been observed to date, possibly because the siphons of such species provide irrigation of the pallial cavities, removing sulfides, which are the metabolic substrate for these chemoautotrophs T A B L E 3 Variance partitioned between populations of Macoma petalum as determined by a standard analysis of molecular variance (AMOVA) using Arlequin 3. protozoa, bacteria, archea, retroviruses, and plant viruses with a wide range of functions, from gastric digestion, intracellular protein digestion to specific processing of precursor proteins (Davies, 1990;Rao, Erickson, & Wlodawer, 1991). Therefore, the activity of this putative protein in M. petalum likely involves protein digestion or processing of precursor proteins and may be important for a wide range of functions. which are widespread throughout eukaryotic genomes (Ivanov, Melnikov, Siunov, Fodor, & Ilyin, 1991). The jockey-like retrotransposons account for the formation of many pseudogenes (Bebikhov, Postnoc, & Nikinenko, 1998;Rashkova, Karam, & Pardue, 2002).
Transposable elements are biased to regions of the genome in which reduced rates of recombination occur (Dolgin & Charlesworth, 2008;Rizzon, Marais, Gouy, & Biemont, 2002). Therefore, it is likely that the Tcb1 transposase and the RNA-directed DNA polymerase from a jockey-like mobile element are not themselves experiencing selective pressures, but that they are linked to loci that are under putative selection in M. petalum.
A major limitation of using reduced representation genomic methods with restriction enzymes is exclusion of some variable sites.
Mutations in the enzyme cut sites themselves would result in an alternate allele being unknowingly discarded (Arnold et al., 2013;Davey et al., 2013;Puritz et al., 2014). In addition, we only used SNPs to investigate patterns of genetic differentiation when other variable sites, such as indels, are known to be informative in a variety of organisms and for a variety of purposes (Murphy, Pringle, Crider, Springer, & Miller, 2007;Soltis et al., 1997). This study seeks to explore the genomics of M. petalum in the Atlantic and propose some correlations between observed phylogeographic patterns and putative selection, not claim outright the causation of these patterns or to rigorously determine all genes putatively under selection. which are carcinogenic and mutagenic (Nagai, Kano, Funasaka, & Nakamuro, 2002), polychlorinated biphenyls (PCBs) which are carcinogenic (Faroon, Keith, Jones, & de Rosa, 2001), and chlorinated pesticides (Hellou, Haya, Steller, & Burridge, 2005). Toxic metals have also been detected at high levels in BF sediments, including mercury, arsenic, lead, and vanadium (Hung & Chmura, 2006. Bivalves, being filter feeders, may accumulate high concentrations of pathogens, heavy metals, or other pollutants in their tissues (Song, Wang, Qiu, & Zhang, 2010). Therefore, differentiation of M. petalum in the Bay of Fundy could be caused by selection for immunity to such pollutants and their immunological effects, or other disease causing agents, as well as limited gene flow from the EMCC.

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
SLM, JHK, and JAA conceived and designed the study; SLM and JHK performed the experiments and collected the data; SLM, JHK, and JAA analyzed and interpreted the data; JAA contributed resources; SLM and JAA drafted the manuscript; and SLM, JHK, and JAA edited and revised the final draft.