Plastid genome sequencing reveals biogeographical structure and extensive population genetic variation in wild populations of Phalaris arundinacea L. in north‐western Europe

New and comprehensive collections of the perennial rhizomatous reed canary grass (Phalaris arundinacea) were made in NW Europe along north‐to‐south and east‐to‐west clines from Denmark, Germany, Ireland, Poland, Sweden and the United Kingdom. Rhizome, seed and leaf samples were taken for analysis and genetic resource conservation. A subsample covering the geographical range was characterized using plastid genome sequencing and SNP discovery generated using a long‐read PCR and MiSeq sequencing approach. Samples were also subject to flow cytometry and all found to be tetraploid. New sequences were assembled against a Lolium perenne (perennial ryegrass) reference genome, and an average of approximately 60% of each genome was aligned (81 064 bp). Genetic variation was high among the 48 sequenced genotypes with a total of 1793 SNPs, equating to 23 SNPs per kbp. SNPs were subject to principal coordinate and Structure analyses to detect population genetic groupings and to examine phylogeographical pattern. Results indicate substantial genetic variation and population genetic structuring of this allogamous species at a broad geographical scale in NW Europe with plastid genetic diversity organized more across an east‐to‐west than a north‐to‐south cline.


Introduction
Phalaris arundinacea, or reed canary grass, is a C 3 photosynthetic perennial patch-forming species with circumpolar boreotemperate distribution (Anderson, 1961;Casler et al., 2009a). It is important for grazing but has also come under scrutiny as a biomass and bioenergy crop (Gyulai et al., 2003;Lewandowski et al., 2003;Sanderson & Adler, 2008;Wrobel et al., 2009;Kukk et al., 2011). It is native to Asia, Europe and North America and is believed to have been cultivated in Europe since the mid-18th century, starting in Scandinavia (Alway, 1931;Sahramaa, 2004). It is also widely naturalized outside its native range Voshell & Hilu, 2014) and populations from Europe were introduced to North America after European colonization and have mixed with native populations (Casler et al., 2009a;Waggy, 2010). In fact, agricultural expansion combined with hybridization of multiply introduced Eurasian genotypes has increased genetic variability of the species in North America (Lavergne & Molofsky, 2007;Lavergne et al., 2010).
Reed canary grass is an outbreeding species with good dispersal ability (Nelson et al., 2013;Voshell & Hilu, 2014). In addition to seed dispersal, it can also spread via extensive rhizome systems and can dominate large areas with a dense broad-leaved canopy. It is also very deep rooted, a trait that allows it to occupy apparently dry areas in open ground. It tolerates flooding and can grow in shallow water due to increased aerenchyma in its roots (Cope & Gray, 2009;Jørgensen et al., 2012). Its closest relatives are P. caesia, P. rotgesii and P. lindigii (which are sometimes considered conspecific), but it is also closely allied with P. aquatica, P. minor, P. maderensis, P. coerulescens, P. appendiculata and P. paradoxa in a lineage with x = 7 as its basic chromosome number compared with some other Phalaris that have x = 6 (Ferreira et al., 2002;Quintanar et al., 2007;Voshell et al., 2011;Voshell & Hilu, 2014).
Phalaris arundinacea has traditionally been used for grazing, hay production and soil conservation and has a popular ornamental variegated variety (var. picta). It has attracted attention as a bioenergy feedstock because of its high biomass yields and broad adaptation to a wide tures, damp woodland, forest margins and disturbed areas. It thrives in a range of soils but particularly in clays and is rarely found in soils below a pH of 5 (Cope & Gray, 2009). It is therefore a prime candidate species for growth on marginal land that is unsuitable or unproductive for other agricultural uses. The challenge is to develop new high yielding genotypes suitable for growth in a range of habitats and on marginal land (Jing et al., 2012;Nijsen et al., 2012;Hodkinson et al., 2015).
Biomass yield and stress tolerance are key limiting factors for the production of P. arundinacea as a bioenergy feedstock (Casler et al., 2009b;Clifton-Brown et al., 2011;Jones et al., 2015). It generally has better cold and water-logging tolerance than some other potential bioenergy crops such as Miscanthus and has been utilized in Scandinavia for bioenergy and fodder (Heinsoo et al., 2011). One limitation is that it is considered invasive in some countries (Galatowitsch et al., 1999;Green & Galatowitsch, 2001;Lavergne & Molofsky, 2004;Barney & DiTomaso, 2011;He et al., 2011). However, there is huge potential to utilize natural genetic variation of this phenotypically variable species in breeding programmes (Brodersen et al., 2008;Cogliatti et al., 2011;Olmstead et al., 2013;Ramstein et al., 2015).
Several methods exist for plastid genome sequencing (Diekmann et al., 2009;Stull et al., 2013). However, we examined the utility of a long-read PCR and MiSeq DNA sequencing approach developed by Uribe- Convers et al. (2014) to sequence a large proportion of the plastid genome in the sampled populations of Phalaris included here as part of an EU funded GrassMargins project (grassmargins.com) that aimed to develop new bioenergy crops suitable for marginal lands. No reference plastid genome for Phalaris is available so we aligned the data to the related grass Lolium perenne (Diekmann et al., 2009). We conducted extensive sampling of new germplasm across north-west Europe over latitudinal and longitudinal distances of 3000 and 2500 km, respectively. Populations were sampled for ex situ conservation from a broad range of habitats and assessed for plastid genome variation, diversity, phylogeographic pattern and gene pool structure.

Sample collection and ploidy measurement
Samples of P. arundinacea were collected in Denmark, Germany, Ireland, Poland, Sweden and the United Kingdom (Table 1, Fig. 1) over a distance of 3000 km along a latitudinal gradient (51°37 0 41.92″N to 65°59 0 58.04″N) and over a distance of 2500 km along a longitudinal gradient (8°45 0 42.29″W to 23°11 0 13.20″E). Ten populations per country were collected, each comprising 30 accessions. Seed from 30 plants, rhizome from 15 plants and representative herbarium specimens were taken at each site. Seed was stored in paper envelopes and rhizomes kept fresh in approximately 50 mL of potting compost in an aerated ziplock bag until they could be grown up from rhizome. Plants were screened by flow cytometry for ploidy estimation compared to P. arundinacea genotypes with known tetraploidy determined by standard Feulgen staining of mitotic root tip meristems (following Hodkinson et al., 2002) and light microscopy at 91000 magnification (Philpot, 2011). Fresh leaves of 80 genotypes and four Phalaris standards were sent by courier to a commercial company, Plant Cytometry Services, the Netherlands (www.plantcytometry.nl), for flow cytometry analyses. DNA content of samples was compared to known internal controls (Vinca major) with DAPI fluorochrome staining and the resulting ratios used to determine ploidy with reference to Phalaris standards with known ploidy. Because of time and money constraints, not all samples were sequenced to assess plastid DNA variation. Instead, a subset of 48 accessions grown from rhizome were used for sequencing with approximately one to two plants per sampled population based on survival and health of the rhizome grown material while maximizing population number (Table 1).
Long-read PCRs and next-generation sequencing DNA was extracted from fresh or dried leaf tissue with a DNeasy Plant Extraction kit (Qiagen, Valencia, CA, USA), and quantity and quality of DNA of each sample was checked using a NanoDrop 2000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA) and diluted to between 30 and 100 ng lL À1 . The primers and the long-read PCR protocol were as in Appendix 1 of Uribe-Convers et al. (2014). All regions (named 1-16; Fig. S1) described in the paper were tested and amplified on four samples of P. arundinacea. After two rounds of optimization (increasing/decreasing the number of cycles, the annealing temperature and/or the time of the elongation step), it was decided to leave out regions 2, 5, 11 and 14 because of inconsistent amplification. The other regions were successfully amplified in the 48 samples with or without optimization (Table S1).
PCR products were checked on a 1% agarose gel and purified by precipitation in a 20% polyethylene glycol 8000 (PEG)/

Data analyses
The Galaxy public online server was used to analyse the MiSeq data (http://usegalaxy.org/) (Giardine et al., 2005;Blankenberg et al., 2010;Goecks et al., 2010). FastQ files were imported into the server. The read quality was checked using FastQC (Andrews, 2010). Reads were then cleaned at high stringency (minimum quality = 30, maximum number of bases allowed outside of quality range = 5) and assembled against a reference genome (L. perenne L., GenBank accession no. NC009950, Diekmann et al. (2009)) using BWA with default parameters (Li & Durbin, 2009). SNPs and INDELs were called using MPileup tool , visually checked on IGV V2.3 (Broad Institute, www.broadinstitute.org) then filtered (minimum coverage depth = 5) using SnpSift filter tool (Cingolani et al., 2012). The individual files were combined, filtered again (maximum-likelihood estimate of the 1st alternative allele frequency AF1 = 1) and then recoded into Excel. Genetic structure was investigated using STRUCTURE v2.3.4. (Pritchard et al., 2000;Falush et al., 2003), which applies the Markov chain Monte Carlo (MCMC) algorithm. This procedure clusters individuals into populations and estimates the proportion of membership in each population for each individual. An admixture model with correlated allele frequencies was used, the K value was set from 2 to 15, and 10 runs were performed for each value of K. The length of the burn-in period was set to 3000, and the MCMC chains after burn-in were run for an additional 2000 times. The optimal value of K was determined by examination of the DK statistic (Evanno et al., 2005) using Structure Harvester (Earl & vonHoldt, 2012). The frequency distribution for the most probable number of K clusters was mapped into ARCGIS 10.1 (ESRI) for each sample. Finally, principal coordinate analyses were run into GENALEX 6. 5 (Peakall & Smouse, 2012).

Results
All genotypes screened for DNA content by flow cytometry were uniformly tetraploid (Table S2). DNA content ratios relative to the internal standard were 1.41, 1.42, 1.44 and 1.46 for the known tetraploid plants and ranged from 1.42 to 1.50 for the other 80 Phalaris samples included here. All samples screened for ploidy and collected in NW Europe were therefore confirmed as tetraploid. Long-read PCR was successful for most primer combinations of Uribe-Convers et al. (2014). With just 10 PCRs per sample, it is possible to amplify approximately twothirds of the plastid genome. The amplicons were quantified, combined and sequenced using an Illumina MiSeq. Of 10 106 895 reads, 9 762 130 passed the filter and 97.75% of the reads were assigned to an individual, for an average of 198 802 reads per individual. The reads were not trimmed due to excellent per base sequence quality (from the FastQC reports). Over 89% of the reads passed the high-stringency filter (8 509 126 reads). Reads were assembled to the reference genome, and on average, 57.9% of the genome (78 294 bp) was aligned (ranging from 41 981 to 97 555 bp, Table 2). The resulting assemblies had an average depth of coverage of 979 (minimum 309-maximum: 2419). Sequences can be found in GenBank at Bioproject ID PRJNA301092.
A total of 93.3% of the SNPs passed the filter for a total of 2385 SNPs and INDELS. After removing the SNPs only found between Phalaris and the reference, 1793 SNPs were discovered among the Phalaris samples (Table 3), with an average of 23 SNPs per kbp. The average number of SNPs in coding regions per kbp was 13. SNP content per gene was highly variable ranging from 0 in the photosystem II reaction centre protein J Fig. 1 Sampling locations of Phalaris arundinacea in NW Europe. Plants were collected across north-to-south (3000 km) and east-towest gradients (2500 km). (psbJ), the ribosomal protein L2 (rpl2), the ribosomal protein S18 (rps18) and the transfer RNA gene trnI-GAU to 113 in the ribosomal protein S12 (rps12). The SNP data for Phalaris genotypes were subject to principal coordinate and Structure analyses to define cytoplasmic gene pools and examine clustering of individuals in a phylogeographic context. Axes 1 and 2 of the principal coordinate analysis (PC1 and PC2) explained 19.8% and 7.1% of the variation, respectively, and, although not entirely discrete (separated), a number of general groupings can be determined. The samples from the eastern part of the sampling range (Poland and Germany) are clustered to the right part of the plot and the western samples from Ireland and Britain towards the left part of the plot. Some rough groupings by country were also resolved. Structuring is more evident across an east-to-west than north-to-south cline.
The Structure analysis revealed four clusters ( Fig. S2) with the DK statistic (Evanno et al., 2005), and membership of genotypes to these clusters follows some but not strong geographical pattern (Figs 3 and 4). One cluster type is more common in the east (blue in Fig. 3) and another more common in the west and Sweden (red in Fig. 3). Some structure can be seen when cluster membership is mapped in ARCGIS (Fig. 4) especially for the Polish and Swedish material. The yellow and blue cluster types are less common in the west and the blue type particularly prevalent in the east.

Discussion
The utility of the long-read PCR MiSeq whole-genome plastid DNA sequencing method for Phalaris A number of methods exist for whole-genome sequencing of plant plastid DNA (Diekmann et al., 2009;Uribe-Convers et al., 2014). We applied an amplification-based method of  plastid DNA isolation and enrichment using the long-read PCR primers of Uribe-Convers et al. (2014). We found that amplification and MiSeq sequencing was highly effective using the protocols and 'universal' primers described in Uribe-Convers et al. (2014) with nearly 90% of reads passing the high-stringency filter. The universal primers were designed for the eudicot and showed good application to the monocot grass species studied here. In addition, when the MiSeq reads were assembled to the reference genome for each genotype, an average of 57.9% of the genome (78 294 bp) was aligned (ranging from 41 981 to 97 555 bp, Table 3). The resulting assemblies had an average depth of coverage of 979 (minimum 309, maximum: 2419). Previous studies have used molecular markers to study Phalaris diversity including microsatellites (Li et al., 2011), minisatellites (Gabor et al., 2014), ISSRs (McRoberts et al., 2005), and AFLP (Casler et al., 2009a), but these are not as easily transferrable among laboratories as the SNP markers developed here. We have submitted the sequences to GenBank with the Bioproject ID PRJNA301092, and the data are available for future comparisons and analyses. It would, for example, be useful to extend the sampling to southern and eastern Europe, and the data can be used directly for that purpose. No complete plastid genome has been published for P. arundinacea, and we had to align our data to a related grass species L. perenne (Diekmann et al., 2009). However, we have sequenced a maximum of 85% of the plastid genome of Phalaris in this manuscript. Such data will be highly valuable for various aspects of comparative genomics in the future with the added value of having population level data (not just a single genotype).
Nuclear SNP markers have also been developed and used for P. arundinacea by Ramstein et al. (2015) and M. Klaas, S. Barth & T.R. Hodkinson (in preparation) using genotyping by sequencing (GBS). Ramstein et al. (2015) used GBS to generate 5228 nuclear SNP markers and applied them in a genomewide association study with 35 traits, but they did not assess plastid genome variation nor assess population genetic variation at broad geographical scale. Populations of P. arundinacea have been assessed by M. Klaas, S. Barth & T.R. Hodkinson (in preparation) with approximately the same population of plants included here. It will be useful to compare and interpret our plastid DNA results with those of nuclear SNP analysis in future studies. Combined they will provide an excellent resource to organize and manage the genetic resources of Phalaris and to study its molecular ecology including its invasion dynamics.

Diversity and plastid DNA variability
The extent of SNP discovery of the 48 sampled plants from the European populations indicates that it is not necessary to amplify and sequence the entire plastid genome of each P. arundinacea genotype because sufficient diversity is recorded in the partial genomes sequenced in this study (approximately 60% of the genome). Plastid DNA is maternally inherited and each locus would be expected to be linked and provide similar population genetic and phylogeographic signal, apart from some differences that could be attributed to variation in mutation rate. In fact, mutation rate is high and not uniform among genes (Table 3). The samples of P. arundinacea were collected in a broad range of habitats separated by large geographical distances. Genotypes were sampled from Poland, Germany, Denmark, Sweden, Ireland and the United Kingdom over a distance of 3000 km along a latitudinal gradient and over a distance of 2500 km along a longitudinal gradient ( Fig. 1; Table 1). The sampling strategy aimed to encompass maximum geographical variation with plants collected from waterlogged and dry well-drained sites, from freshwater and saline habitats and from large and small stands ( Table 1). The results of our plastid DNA study indicate that a high genetic diversity of Phalaris has been collected.
High levels of plastid variability have also been recorded in other grasses such as L. perenne using nuclear SSR markers by McGrath et al. (2007). Lolium is also an outbreeding wind-pollinated species but differs in its habitat preferences to P. arundinacea. High variability was found in genic as well as nongenic regions. It is not known what adaptive significance these variants might have in European ecosystems, but it is clear that the breeders have a high genetic variability of cytotypes to work from. The plastid types will, but not always because of uniparental inheritance, plastid capture and lower mutation rates (Hodkinson et al., 2002), reflect patterns of nuclear genetic variability.

Population structure and differentiation
The principal coordinate analysis (Fig. 2) showed some geographical patterning of the genotypes despite the percentage of variation explaining each axis being low (PC1 20% and PC2 7%). Groupings can be seen in the PC1 vs. PC2 plot, but there is also considerable overlap among country groups in ordination space reflecting a relatively high degree of gene flow. The samples from the eastern part of the sampling range (Poland and Germany) are grouped to the right-hand part of the plot and the western samples from Ireland and Britain towards the left of the plot. Structuring is more evident across an east-to-west than north-to-south cline. This might reflect the high proportion of more northern samples that had an oceanic (Britain, Ireland, Denmark, Sweden) rather than a more continental climate of eastern Germany and Poland.
The structure analysis identified four major clusters (Figs 3 and S2), and similar pattern to the principal coordinate analyses is recovered (Figs 3 and 4). One cluster type is more common in the east including Poland and Germany (shown in blue in Fig. 3) and another is more common in the north Europe including Ireland, Sweden and the United Kingdom (red in Fig. 3). Considerable admixture is also evident in some genotypes but not in others. The third and fourth clusters (green and yellow) do not relate to geographical distribution and are relatively randomly distributed across the range (Fig. 3) except that yellow is rarer in Ireland and the United Kingdom.
Dilution of population genetic structuring relative to geographical proximity is expected because reed canary grass is a common outbreeding, wind-pollinated species with a gametophytic self-incompatibility system (Nelson et al., 2013). It also has good dispersal ability via its  Estimated population structure of Phalaris arundinacea using plastid SNP data and Structure analysis. Each individual is represented by a vertical bar. SNP variation partitioned into coloured segments that represent the individual's estimated membership fractions in each of the K = 4 clusters (cluster 1 green; cluster 2 red; cluster 3 blue; cluster 4 yellow) determined as optimal by the DK statistic of Evanno et al. (2005). Samples arranged according to country. seeds that are produced in profusion and are dispersed over the water and germinate in spring (Cope & Gray, 2009;Voshell & Hilu, 2014). It can also regenerate and spread via extensive rhizome systems over large areas.
It is also possible that genotypes have been deliberately or inadvertently moved by humans for forage or habitat restoration. Unlike many other perennial rhizomatous grasses such as Dactylis, Lolium and Festuca, P. arundinacea is not extensively planted and moved around Europe by seed. We sampled populations with no known history of cultivation and do not expect to have sampled commercial cultivars or material introgressed via cultivar/wild population hybridization. In a separate study as part of the grassmargins.com project, we sampled Dactylis glomerata across the same geographical range and did not detect the same extent of population genetic structuring relating to geographical location (data not shown) possibly due to the higher level of planting and human-mediated seed movement in that forage species.
Polyploidy does not explain any of the witnessed structuring of genotypes as all samples were recorded as tetraploid (Table S2). A range of ploidy levels have been recorded in European P. arundinacea, but the most common type is tetraploid. Our wide sampling of north-western European material indicates that tetraploid genotypes are indeed the dominant type and perhaps the only type. Phalaris is variable morphologically in its vegetative and reproductive characters, and because of this, some taxonomists recognize infraspecific categories, including subspecies and varieties (Anderson, 1961). Cytologically, diploid (2n = 2x = 14), tetraploid (2n = 4x = 28) and hexaploid (2n = 6x = 42) cytotypes have been recorded (Anderson, 1961;McWilliam & Neal-Smith, 1962;Ferreira et al., 2002). These three cytotypes were recognized as different taxa (Baldini, 1993(Baldini, , 1995 on the basis of morphological differences, chromosome number, ecology and geography: P. rotgesii (Husn.) Baldini (2x), restricted to Corsica and Sardinia, P. arundinacea L. (4x), with holarctic distribution, and P. caesia Nees (6x), originated in the Mediterranean area and spread to Eastern and Southern Africa. Also, several studies indicate the presence of aneuploids and other meiotic irregularities (Ferreira et al., 2002).

Conclusions
Our analysis of near-whole-plastid genome data in P. arundinacea clearly defines groupings for genetic resource characterization and has helped define useful Fig. 4 Cluster membership (K = 4) for the Structure analysis mapped in ARCGIS. Coloured segments that represent the individual's estimated membership fractions in each of the K = 4 clusters (cluster 1 green; cluster 2 red; cluster 3 blue; cluster 4 yellow) determined as optimal by the DK statistic of Evanno et al. (2005). breeding material for future evaluation. It has provided valuable data on plastid genome variation of P. arundinacea across north-west Europe that can be compared to nuclear GBS data in future studies. Plastid genetic SNP variation is high and genotypes show some broad-scale population genetic structuring particularly in an east-towest cline despite the good dispersal ability of P. arundinacea via seed and its allogamous wind-pollinated breeding system.

Supporting Information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1. Long-read primers used for plastid genome sequencing. Figure S2. Optimal value of K determined by the DK statistic (Evanno et al., 2005) using Structure Harvester (Earl & vonHoldt, 2012). Table S1. Parameters used for long-read PCR for each primer pair of Uribe-Convers et al. (2014). Table S2. Flow cytometry results for Phalaris arundinacea ploidy determination.