The population genomic basis of geographic differentiation in North American common ragweed (Ambrosia artemisiifolia L.)

Abstract Common ragweed (Ambrosia artemisiifolia L.) is an invasive, wind‐pollinated plant nearly ubiquitous in disturbed sites in its eastern North American native range and present across growing portions of Europe, Africa, Asia, and Australia. Phenotypic divergence between European and native‐range populations has been described as rapid evolution. However, a recent study demonstrated major human‐mediated shifts in ragweed genetic structure before introduction to Europe and suggested that native‐range genetic structure and local adaptation might fully explain accelerated growth and other invasive characteristics of introduced populations. Genomic differentiation that potentially influenced this structure has not yet been investigated, and it remains unclear whether substantial admixture during historical disturbance of the native range contributed to the development of invasiveness in introduced European ragweed populations. To investigate fine‐scale population genetic structure across the species' native range, we characterized diallelic SNP loci via a reduced‐representation genotyping‐by‐sequencing (GBS) approach. We corroborate phylogeographic domains previously discovered using traditional sequencing methods, while demonstrating increased power to resolve weak genetic structure in this highly admixed plant species. By identifying exome polymorphisms underlying genetic differentiation, we suggest that geographic differentiation of this important invasive species has occurred more often within pathways that regulate growth and response to defense and stress, which may be associated with survival in North America's diverse climatic regions.


Introduction
Ecosystem disturbance by human activity has substantially altered the evolutionary trajectories of many plant species. The evolutionary mechanisms that grant some plants the ability to become invasive arise from complex, synergistic processes. These include rapid adaptation to novel environments (Colautti and Barrett 2013;Vandepitte et al. 2014), morphological and other phenological changes (Dlugosch and Parker 2008a), demographic shifts such as bottlenecks and population expansions (Dlugosch and Parker 2008b), changes in genetic structure (Bossdorf et al. 2005), as well as the mode and tempo of introductions from the native range (Baker 1974;Van Kleunen et al. 2010). Introduced genotypes are drawn from the native-range populations' pools of standing genetic variation; thus, assessment of native-range patterns of spatial-genetic variation can potentially enable the reconstruction of the history of introduction of exotic species to new locales (Lavergne and Molofsky 2007;Prentis et al. 2009). When multiple introductions occur from different locations, formerly disparate genotypes have opportunities to form hybrids that may better exploit niches than parent genotypes (Ellstrand and Schierenbeck 2000;Dlugosch et al. 2015).
The North American native common ragweed (Ambrosia artemisiifolia; Asteraceae, Compositae) has experienced data remarkable success in Europe, providing opportunities to study the population genetic differentiation between native and introduced ranges (Hodgins and Rieseberg 2011;Hodgins et al. 2013). Ragweed is an important invasive weed that is costly both in terms of its toll on human health and the financial burden of controlling its spread (Storms et al. 1997). Its extremely allergenic airborne pollen is a major cause of allergic rhinitis during flowering season and is forecast to increase in Europe in the near future (Hamaoui-Laguel et al. 2015). Although today the species is nearly continuously distributed across eastern North America's disturbed landscapes, sedimentary pollen paleo-records indicate this is a recent phenomenon (Grimm 2001). Specifically, the species' range and population density have increased substantially since European settlement of North America, with historical census data documenting its close association with human activity via the clearance of forests and expansion of agriculture in its native range (Martin 2011;Hodgins 2014).
By analyzing nuclear microsatellite and chloroplast SNP data from historical herbarium specimens, Martin et al. (2014) described differential success of ragweed populations during the westward expansion of intensive agriculture and a continental-scale shift in the geography of genetic structure that had occurred since approximately 1940. Modern spatial genetic structure showed weak, but significant differentiation of a formerly expanded genetic cluster in the northeast and another more widespread cluster extending throughout southeastern and midwestern USA regions. This rapid major shift is a curiosity that lacks a satisfying explanation, although Martin et al. (2014) did offer some possibilities. One suggestion was a genetic shift in response to reforestation of eastern USA over the last century. Another was introgression of new successful genotypes from western regions of the USA. A final possibility was that, after a cessation of unidirectional, humanmediated propagule movement, locally adapted western genotypes recovered from dormant seed banks to outcompete introduced eastern genotypes. Regardless of its underlying explanation, Martin et al. (2014) further suggested that prior to common ragweed's introduction worldwide, alteration of native-range genetic structure might have enhanced the species' invasive abilities via the formation of novel genotypes. This "hybridization in native range" hypothesis contrasts with the traditional view that ragweed's invasiveness rapidly evolved upon introduction to new locales (i.e., the "rapid evolution" hypothesis; Hodgins et al. 2013). Thus, several key questions regarding the demographic and evolutionary history of this noxious weed remain unanswered.
In this study, we utilize population genomic data in order to assess native-range genetic differentiation of common ragweed populations in more detail. Through principal components analysis (PCA) and estimation of genetic clustering with genotyping-by-sequencing (GBS) data, we explore the genomic basis of geographic differentiation in native-range common ragweed populations. Our approach thus seeks to evaluate the contemporary population structure inferred by Martin et al. (2014), by estimating the degree of population differentiation in GBS data from a similar sample of individuals collected across eastern USA. Subsequently, we performed an enrichment analysis of gene ontology (GO) terms associated with SNPs highly weighted in principal component space in order to investigate the functional basis for geographic differentiation between populations that developed during the recent shift in spatial genetic structure.

Sample collection and DNA extraction
A. artemisiifolia individuals from 37 locations were sampled in 2009-2010, predominantly from roadsides and agricultural fields across the eastern portion of the species' native range (Table S1). Leaf tissue samples were immediately desiccated by collection into silica gel. Silicadried leaf tissue was pulverized with silicon beads in a Qiagen (Hilden, Germany) TissueLyser â and then extracted using plant-specified reagents with the AutoGen (Holliston, MA, USA) AutoGenprep 965 Automated DNA Isolation System.

Reduced-representation library construction and Illumina sequencing
Genotyping-by-sequencing (GBS) library construction was performed by the Institute for Genomic Diversity and Computational Biology Service Unit at Cornell University (Ithaca, NY) according to a published protocol (Elshire et al. 2011). Briefly, a total of 190 A. artemisiifolia genomic DNA extracts were digested with the restriction enzyme ApeKI (4.5-bp recognition site 5 0 -GCWGC-3 0 ). Following ligation of unique, individual-specific indexing adapters, libraries were pooled along with other samples from a different study and multiplex sequenced on two flowcell lanes (95 samples and one blank per lane) on the Illumina (San Diego, CA, USA) HiSeq2000 platform using 100-bp Single Read chemistry.

Genomic SNP calling and genotyping
Sequence reads were sorted by individual using GBSX (Herten et al. 2015), and demultiplexed sequence data were processed with the Universal Network-Enabled Analysis Kit (UNEAK) pipeline (Lu et al. 2013) within the software TASSEL 3.0 (Bradbury et al. 2007). This pipeline enables variant locus identification and genotyping of GBS data lacking a reference genome via (1) reciprocal mapping of Illumina sequences for de novo discovery of likely allele pairs at the same loci, (2) employing network analysis and filtration (Lu et al. 2013), and ultimately (3) using a binomial likelihood ratio method (Glaubitz et al. 2014) for quantitative genotype calling at polymorphic sites. We required a minimum minor allele frequency of 5% to retain a SNP locus in later analysis.
Although the pipeline identified 250,425 putative diallelic loci containing at most one SNP, the number of loci genotyped per individual was highly variable, as was the number of taxa in which particular SNPs were genotyped (Fig. S1). Consequently, we further filtered the data set of genotypes by removing all individuals for which fewer than 10,000 SNP positions had been genotyped, leaving 60 individuals from 23 sampling locations. We then filtered loci by removing all loci genotyped in at least 50% of the remaining individuals. These filtering steps yielded 6337 SNP loci, genotyped in 60 individuals, and with a per-individual mean read depth of 10.8X (Table S1). This 6337-SNP data set was used in later enrichment tests.
Because the presence of loci in genetic linkage in a genomic data set can lead to the detection of spurious genetic structure, we used PLINK v1.09 (Purcell et al. 2007) to test for linkage disequilibrium (LD) between all pairs of the 6337 filtered SNPs. Within the entire 60-individual data set, we found only a very small portion (0.070%) of SNP pairs in strong LD (r 2 ≥ 0.5). However, a majority of SNPs were in strong LD with at least one of the other 6336 SNPs. Thus PLINK-facilitated LD pruning removed a large portion of the data, and subsequent analyses of genetic structure were performed on a multiple sequence alignment of 2829 SNP sites assessed in 60 individuals spanning eastern North America.

Genetic structure from genomic data
Genetic structure in the LD-pruned genomic SNP data set was initially investigated through principal component analysis (PCA) using SMARTPCA within the software package EIGENSOFT v4.2 Price et al. 2006). Only the first two principal components were considered, because their eigenvalues showed a sharp break from the lesser principal components in a scree plot (Fig. S2). In addition, we performed PCA using the 6337-SNP data set without LD pruning (Fig. S3), identifying the top-weight SNP that were used in later tests of functional enrichment. ADMIXTURE v1.22 (Alexander et al. 2009) was used on the LD-pruned data to identify genetic structuring through simultaneous estimation of population allele frequencies and ancestry proportions. A total of 20 randomly seeded replicate runs with fivefold cross-validation (CV) were each performed using a different assumed number of clusters K from 1 to 15. Normally, the value of K with the lowest CV error should be chosen as the best fit to the data. For our data, however, CV error was uninformative as it increased with the number of clusters. Thus, we applied the STRUC-TURE HARVESTER (Earl and vonHoldt 2012) implementation of the Evanno et al. (2005) method to choose the value of K that shows a modal peak in the absolute value of the second-order rate of change of L(K). For each value of K, the replicate of highest likelihood was chosen for further analysis as well as plotting.

Association of genomic SNPs with transcriptome sequences
Given the geographic signal in our analyses of genetic structure, we thought that SNP loci making large contributions to primary principal components may fall within genes that are geographically divergent in ragweed populations. In order to further explore the genomic basis of geographic differentiation in common ragweed, we tested whether specific top-weight SNPs were carried by transcripts enriched for genes associated with particular molecular functions. Specifically, individual SNP loci were ranked by the absolute value of their loading/weight in the first two principal component vectors from the 6337-SNP PCA, and the 5% highest-weight SNPs were retained for analysis. The 33-to 64-bp nucleotide sequences (hereafter "tags") flanking sequenced ApeKI restriction sites associated with these SNPs were used in a BLASTALL v2.2.26 (Zhang et al. 2000) blastn search against an unannotated, normalized, 33.2-Mbp expressed sequence tag (EST) library assembly derived from a single North American A. artemisiifolia individual collected in Minnesota (Lai et al. 2012), using an E-value threshold of 10 À10 and reporting at most five best hits. Transcript sequences that matched the entire length of restriction site-associated tag sequences with greater than 95% identity were chosen for further analysis. In some cases, the entire length of a query tag sequence did not align to a target transcript. These transcripts were included if there was greater than 95% identity between the two sequences when all unaligned bases were considered mismatches. In cases where tag sequences matched equally well to multiple transcripts, all transcript matches were included in the enrichment analysis. This process resulted in a base set of 1921 tag alignments to the transcriptome assembly, which was used to construct a base reference set for each of the enrichment tests described below.

Tests of functional enrichment from genetic structure
Using blastn, nucleotide sequences of transcript hits were used as queries in a blastx search using against the NCBI nonredundant (nr) protein database with an E-value threshold of 10 À10 and limited to 20 hits per query sequence. The software BLAST2GO pipe v2.5.0 (Conesa et al. 2005; Conesa and G€ otz 2008) was used to associate GO terms with each hit. In the BLAST2GO analysis, we used an annotation rule cutoff of 55 and a GO weight of five. BLAST results were imported into BLAST2GO Pro (v. 3.1.3), and sequence GO term annotations were processed using Annex annotation augmentation, ensuring that each branch's lowest term alone remained in the set of annotations. The transcriptome assembly published by Lai et al. (2012) consists of 62,946 sequences. A total of 38,307 of these could be annotated with GO terms in the absence of an annotated ragweed transcriptome, so further analysis of transcriptome assembly was limited to the annotated sequences. Fisher's exact tests were used to identify enrichment in GO terms associated with transcripts to which various test sets of outlier SNPs (identified in analyses of genetic structure based on both PCA and ADMIXTURE) were well aligned. In each test, a test-specific reference set was constructed by excluding genes common to the base reference set (described above) and the test set, and a multiple-measure false discovery rate (FDR) filter was applied with a significant cutoff of P < 0.05.

Results
Illumina sequencing, SNP discovery, and genotyping Two lanes of Illumina sequencing of the 190 samples produced 381.5M sequence reads passing Illumina filters. Of these, 298.4M contained an unambiguous, individual-specific barcode. The number of sequences per pooled individual was highly variable (Fig. S4), which is consistent with both nonequimolar sample pooling and unequal library performance in the sequencing run. We note that Ambrosia species are known for an abundance of diversity of terpenoid compounds in their tissues (Herz et al. 1969;Martino et al. 2015), so the differential success of samples may have resulted from coextraction of varying amounts of these residual compounds that can interfere with downstream enzyme reactions.
Sequence data processing with the UNEAK pipeline revealed that 205.5M (53.8% of total) reads carried both the ApeKI restriction site and an unambiguous barcode. The number of unique diallelic, single-polymorphism tags identified by TASSEL's network analysis was 4,824,895, which is approximately equal to twice the expected number of ApeKI cut sites in the~1.1-Gbp A. artemisiifolia genome (1.1 Gbp/(4 4.5 ) = 2.15M; Kube sov a et al. 2010). Each of these tags was observed 32.9 AE 1113 (mean AE SD) times, indicating that mean sequencing depth in this data set was low and highly variable. We excluded insertion/deletion-based tags, considering only diallelic, single-SNP tags in further analyses. Thus, after filtering, the dataset consisted of 6337 SNP loci, genotyped in 60 individuals.

Population genetic structure
Consistent with previous observations based on nuclear microsatellite loci (e.g., Genton et al. 2005;Gladieux et al. 2010;Chun et al. 2011;Martin et al. 2014), we observed widespread heterozygote deficiency in the discovered nuclear loci. Considering the 60 individuals as a single meta-population, HWE was rejected for 46.6% of the genomic loci (Fig. S5). Deviations from HWE are commonly observed in A. artemisiifolia populations and have been discussed extensively (e.g., Martin et al. 2014;Genton et al. 2005).
To reduce the dimensionality of our genomic SNP data set and to obtain initial view of genetic structuring in our data, the 2829 filtered, LD-pruned SNP variants from 60 individuals were used to perform a PCA. This analysis revealed that the first two principal component axes reflect the geography of wild populations remarkably well, while accounting for around 6% of the samples' genomic variance (Fig. 1). Individual sample positions on the principal coordinate axes clustered in a pattern that corresponds to their geographic sampling and that could be subjectively divided into three clusters of individuals uniting populations from: Florida and southern Georgia (southeastern); along the Atlantic Coast from Tennessee to New Brunswick (northeastern); all the remaining area (western). Differentiation of the western and northeastern clusters (F ST = 0.010, P < 0.01) was low and half that of the western and southeastern clusters (F ST = 0.012, P < 0.01; Table S2).
Estimation of ancestry proportions with ADMIXTURE showed that a single, panmictic genetic cluster is the most likely, although lower cross-validation errors were achieved with very high values of K approaching the number of individuals in the data set (Fig. S6). Implementation of the Evanno et al. (2005) method revealed a clear peak in ΔK when K = 2, suggesting the most likely scenario involves recent admixture of two ancestral genetic clusters (Fig. S7), although the assigned proportions of genetic ancestry were strongly correlated with geography for K = 2-5 (Figs. 2, 3). When K = 2, individuals segregate into weakly differentiated eastern and western clusters (F ST = 0.0109, P < 0.01) whose diffuse boundary is approximately aligned along the axis of the Appalachian Mountains. The ancestries of individuals from Arkansas, Louisiana, and Tennessee indicate roughly 50% admixture.
When K = 3, the eastern cluster subdivides into northeastern and southeastern clusters, and the three genetic clusters are more highly differentiated (pairwise F ST = 0.012-0.019) than the K = 2 clusters (Table S3). The southeastern cluster is the most divergent and encompasses individuals from Florida, Louisiana, Georgia, and Arkansas, with mixed ancestry appearing in all populations except those in Florida. When K = 4, the western genetic cluster subdivides into southwestern and midwestern clusters that are mixed throughout the western range and more poorly correlated with geography. These four clusters are the most highly differentiated (mean pairwise F ST = 0.022, P < 0.01), with the midwestern cluster most divergent (Table S4). At K = 5, individuals collected in Louisiana are assigned their own ancestral cluster, and at K > 5, geographic correlations appear to break down.

Functional basis of population genomic and geographic differentiation
The 5% top-weight loci from the principal components analysis were represented by 632 tag sequences, 38-64 bp in length. The BLAST analysis aligning these sequences with the native transcriptome assembly identified 222 transcript sequences with unambiguous homology to the query tag sequences. Transcripts from this data set were also enriched for particular GO terms of varied function, including defense response, biosynthesis and metabolism of triterpenoids, catabolysis of glucosinolates (aromatic compounds related to defense against pathogens and herbivores), regulation of growth, RNA metabolism, sulfur/ phosphorus metabolism, response to the plant growth hormone indolebutyric acid, processes relating to protein localization to the mitochondrion, and drug transport/ responses (Table 1).

Discussion
It has been suggested that populations of A. artemisiifolia experienced rapid evolution upon introduction to Europe, Complicating our understanding, we know that human-mediated shifts in A. artemisiifolia genetic structure occurred very recently (Martin et al. 2014). Novel genotypic combinations produced by mating between diverse, divergent populations may produce populations able to arrive rapidly at new fitness optima. By characterizing the correlation of geography and native-range population genetic differentiation, we can begin to understand how genotypes introduced from different sources may fare in novel locales.

Geographic differentiation of ragweed populations
Not surprisingly, we have shown that our genomic SNP data have increased power to detect population genetic structure over a small number of traditional microsatellites and chloroplast SNPs used in previous studies (Genton et al. 2005;Gaudeul et al. 2011;Martin et al. 2014). In this report, we used PCA and population genetic clustering-based analyses to identify weak genetic structure with clear geographic components. The degree of genetic structure we identify is on the order of previous estimates     (Genton et al. 2005;Gaudeul et al. 2011;Martin et al. 2014). Our major findings are that at least two major genetic clusters exist in the native range, and further genetic differentiation makes it possible to identify up to five clusters. This finer-scale genetic differentiation in North American A. artemisiifolia populations at regional scales was supported by both PCA and allele frequency-based clustering analyses using ADMIXTURE. Geographic mirroring of the principal genetic components space has been observed in genomic data, most notably in SNP data from human populations in Europe (Novembre et al. 2008;Leslie et al. 2015). However, this clear geographic signal was somewhat unexpected for a wind-pollinated, pioneer plant with high potential for dispersal, and especially in A. artemisiifolia, which has also experienced extreme migration and population admixture in the recent past.
The geographic division between the genetic clusters defined here fits well the pattern of the Appalachian Mountain discontinuity described by Soltis et al. (2006), which reviewed a number of species whose populations were shown to have east-west phylogeographic breaks defined by the Appalachian Mountain range and the Apalachicola/Chattahoochee River drainages. Indeed, river systems are a major dispersal mechanism for ragweed and thus would be likely to reinforce phylogeographic break points (Fumanal et al. 2007;Lavoie et al. 2007). This overall phylogeographic pattern also has been seen in other plant species, including Virginia pine (Pinus virginiana), Atlantic white cedar (Chamaecyparis thyoides; Mylecraine et al. 2004), and the American groundnut (Apios americana; Joly and Bruneau 2004). Although these patterns were previously explained in the context of expansions out of glacial refugia, we suspect similar genomic patterns of differentiation could develop between populations due to differential environmental stresses, as discussed below.

Investigating the functional basis of geographic differentiation
The low density of genomic markers utilized in our study, as well as our inclusion of SNPs in potential genetic linkage, introduces some caveats for interpreting the enriched functional pathways we identify. Pruning or "thinning" loci in LD is a common way to avoid the detection of spurious genetic structure in data sets that contain linked loci. However, this conservative approach is wasteful in that it discards a large fraction of potentially informative sites (Zou et al. 2010) and could be highly deleterious for inference of functional enrichment in our low-depth GBS study of only filtered 6337 SNPs. This is especially true in the absence of a reference genome, as it is not possible to test directly whether SNP loci are physically clustered along the chromosome and additional stringency in LD pruning is necessary. Perhaps partially due to recent admixture of A. artemisiifolia populations, a substantial portion of this study's SNP markers depart from HWE, and statistical independence of alleles cannot be assumed at loci being tested for LD. But because measures of LD are poorly understood under HWE departure in unphased data, unfortunately common LD tests are not necessarily appropriate for this data set (Schaid 2004).
Our tests detected strong LD in only 0.070% of all SNP pairs, and we found that the substantial correlation of genetic structure with geography is robust to the removal of linked SNPs. Under these considerations, we hypothesized that pruning informative markers could be more detrimental to our study of functional enrichment than the inclusion of several SNPs potentially in genetic linkage. Thus, we elected to perform the functional enrichment analyses using the unpruned data set, and our methods may identify particular SNPs underlying geographic differentiation when selection is actually acting on loci with which they are in genetic linkage. To mitigate this confounding issue, our functional enrichment tests were performed not with all annotated A. artemisiifolia transcripts, but instead with a reference subset of those associated with our 6337 filtered GBS tags.
General stress response (light stress and oxidative stress) and biosynthesis of plant secondary compounds (defense against pathogens and herbivores) were previously identified as major factors in gene expression differences between introduced ragweed populations and native populations sampled from southeastern Canada and midwestern USA (Hodgins et al. 2013). In agreement, our analysis of SNPs associated with extreme geographic genetic variation identified transcripts involved in diverse processes, especially metabolism of glucosinolates and terpenoids, defense response, and regulation of growth and development, implying that genetic differentiation in genes related to these pathways partly underlie geographic variation between A. artemisiifolia populations in their native range. Overrepresentation of GO terms relating to processes within the mitochondrion indicates that differentiation in basic energetics also may be at play.
As Hodgins et al. (2013) found enrichment of RNA metabolism-associated transcripts across all of their lightand nutrient-stress treatments, our observations of significant over-representation of transcripts involving RNA processing and metabolism were not surprising. Major fluctuations in RNA stability are known to occur in plant regulatory responses to stress (Narsai et al. 2007 this species' eastern North American range. These extremes are particularly pronounced in the Great Plains region, which, with its shorter growing season and higher variance in precipitation and temperature, likely presents a more challenging environment for common ragweed (Grimm 2001;Hodgins et al. 2013). We suspect that, in addition phylogeographic patterns formed by mountains and river drainages, adaption to local environments is a major driver of the genetic differentiation among geographically distinct A. artemisiifolia populations. The results from Martin et al. (2014) suggest that the current genetic structure has developed just in the past 100 years, following an extensive westward expansion of a historical northeastern genetic cluster associated with the peak of deforestation in the US around the year 1900. This rapid change is interesting and calls for future work with genomic data from herbarium collections. Those samples may reveal further population genetic structure in past populations, which could further elucidate the role of human-mediated admixture in the history of this successful invasive species.

Final remarks
Here, we provide a preliminary catalog of SNP loci involved with local differentiation in ragweed's native range. These SNPs should be targets for further investigation in larger population samples from both North America and introduced ranges around the world. However, in this study, we analyzed only approximately 6300 SNPs, many of which do not fall within coding regions. Much more information could be gleaned from the entire catalog of transcriptome SNPs. An expanded approach that sequences more genomic loci from larger population samples would likely yield very interesting results with regard to the functional basis of the differential success of these populations.
Our study provides clues about the basis for the large phylogeographic shift observed in North American common ragweed populations over the last 80 years (Martin et al. 2014). Our enrichment analyses of GO terms suggest that environmental selection (local adaptation; Savolainen et al. 2013) on genes within pathways involved in responses to stress and defense factors may have been a major determinant of geographic genetic variation in present-day populations. These important factors might be both biotic and abiotic, including local differences in herbivory, microbial pathogenesis, and the considerable diversity of eastern North American climate regimes. However, the enriched GO terms we report will not always indicate adaptation to local conditions because of the potential effects of genetic linkage, population differentiation is often nonadaptive, and local adaptation in the native range has not been established in ragweed. As a result, the evolutionary processes and genomic loci we identify should be investigated with whole-genome sequencing or genomic data sets of higher density and sequencing depth.
Still we lack a definitive explanation for the curious historical distribution of a human-associated ragweed genetic cluster that apparently expanded from the northeastern USA. Other researchers should seek to obtain genomic data from historical collections of this species in order to identify high-resolution temporal trends in population genetic structure. Future efforts should characterize loci that may have been under differential selection in the past, especially regarding the human-associated cluster that may have gone on to invade Western Europe.

Supporting Information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1. The number of genotyped loci per sample after analysis of sequencing data with the UNEAK pipeline. Figure S2. Eigenvalues for each principal component axis of the filtered genomic SNP genotype alignment dataset. Figure S3. Multivariate analysis of genomic 6337 SNPs segregates samples by geographic sampling location. Figure S4. Sample-specific bias in the number of generated sequence reads with expected barcodes. Figure S5. Distribution of exact tests for Hardy-Weinberg Equilibrium at all 6337 genomic loci using 1000 MCMC replicates. Figure S6. Mean cross-validation (CV) error for different numbers of assumed ancestral population clusters (K). Figure S7. Implementing the Evanno et al. (2005) method for determining the optimal value of K genetic clusters. Table S1. Provenance of A. artemisiifolia individuals included in this study. Table S2. Pairwise F ST estimated between three genetic clusters defined by principal components analysis.  Table S3. Pairwise F ST estimated between genetic clusters defined by ADMIXTURE analysis with K = 3. Table S4. Pairwise F ST estimated between genetic clusters defined by ADMIXTURE analysis with K = 4.