Demographic history and local adaptation of Myripnois dioica (Asteraceae) provide insight on plant evolution in northern China flora

Abstract The flora of northern China forms the main part of the Sino‐Japanese floristic region and is located in a south–north vegetative transect in East Asia. Phylogeographic studies have demonstrated that an arid belt in this region has promoted divergence of plants in East Asia. However, little is known about how plants that are restricted to the arid belt of flora in northern China respond to climatic oscillation and environmental change. Here, we used genomic‐level data of Myripnois dioica across its distribution as a representative of northern China flora to reconstruct plant demographic history, examine local adaptation related to environmental disequilibrium, and investigate the factors related to effective population size change. Our results indicate M. dioica originated from the northern area and expanded to the southern area, with the Taihang Mountains serving as a physical barrier promoting population divergence. Genome‐wide evidence found strong correlation between genomic variation and environmental factors, specifically signatures associated with local adaptation to drought stress in heterogeneous environments. Multiple linear regression analyses revealed joint effects of population age, mean temperature of coldest quarter, and precipitation of wettest month on effective population size (Ne). Our current study uses M. dioica as a case for providing new insights into the evolutionary history and local adaptation of northern China flora and provides qualitative strategies for plant conservation.


| INTRODUC TI ON
The Sino-Japanese floristic region (SJFR) is one of the oldest and richest area of temperate floral elements in the North Hemisphere, and the northern China flora composes the main part of the SJFR Gao et al., 2003;Lu et al., 2018;Wu & Wu, 1996).
Northern China has served as an important region composing the north-south vegetation transects from tropical to cold forests and taiga, and paleovegetation evidence suggests that the current components of the northern China flora are mainly derived from the Neogene-Quaternary period (Gao et al., 2003;Manchester et al., 2009;Wang, 1999). Covering a relatively wide geographic location, along with past climatic oscillations and geological changes, a high number of species are found in northern China (Wu et al., 2005;Wu & Wu, 1996;Yu et al., 2000), of which are an abundance of endemic species, such as Xanthoceras sorbifolium Bunge (Sapindaceae), Opisthopappus taihangensis (Ling) Shih (Compositae), and Taihangia rupestris Yu et Li (Rosaceae) (Wang, 1999). Genetic evidence has suggested that plants in the northern China region contracted to southern glacial refugia during the last glacial maxima (LGM) and then expanded northwards during postglacial migration. This hypothesis is supported in present-day phylogroups with northern region populations originating from recent range expansions of a southern refugium, which may present founder and bottleneck effects with reduced genetic diversity within and between populations (Cao et al., 2015;Chen et al., 2008;Hao et al., 2018;Tang et al., 2018;Tian et al., 2009). Alternative hypotheses include the remnants of multiple geographically isolated refugia occurring in northern China causing population differentiation of many plant species in the region (Hou et al., 2020;Wang et al., 2017;Ye et al., 2019). Studies addressing both hypotheses have predominately been limited to using several loci and almost exclusively conifer species, which inadequately address the questions. Thus, genome-level data and plants with shorter life cycles can help to better demonstrate the evolution of plants making up the northern China flora.
Physical barriers containing multiple mountains and drainage systems have acted on species differentiation in northern China, such as the Taihang Mountains promoting intraspecies divergence (Hou et al., 2014;Ye et al., 2018). In addition, the uplift of the Qinghai-Tibetan Plateau (QTP) has aggravated monsoon intensity and enhanced aridity in the Asian interior, which profoundly influenced the climate of northern China (Guo et al., 2008;Li et al., 2001;Shi et al., 1999). With the formation of an east-west arid belt, this ecological barrier played a significant role in plant divergence and can be tested with phylogeographic studies (Bai et al., 2016;Guo et al., 2008). Thus, northern China is a model region for studying how these intricate factors together influence the evolution of plants.
In response to selective pressures from environmental heterogeneity, species undergo local adaptation that can be identified using population genetic analyses based on genome-environment association data (Joost et al., 2007;Li et al., 2017). Many cases based on nonmodel plant species have illustrated environmental factors including soil, temperature, and precipitation have played important roles in driving differentiation in locally adaptive species (Chao et al., 2020;De Kort et al., 2014;Jones et al., 2013). The regional climate impacting the local flora of northern China is arid with small quantities of precipitation that may be the major environmental factors effecting adaptive evolution. Previous studies in Phyteuma hemisphaericum L.
(Campanulaceae), Campanula barbata L. (Campanulaceae), and Carex sempervirens Vill. (Cyperaceae) (Basu et al., 2016;Jones et al., 2013;Manel et al., 2012) identified genetic variation that is highly related to precipitation factors, emphasizing local adaptation within species to cope with drier climatic conditions. However, few environmentalgenetic studies have been reported for the flora of northern China, and the link to whether/how the evolution of plant populations is driven by adaptation to drought in this region has not fully been investigated. In this study, we use genome-wide data to find correlations between loci and ecological factors based on a representative species to explore adaptive evolution pattern of regional vegetation in the northern China flora.
Myripnois dioica Bunge is a temperate, deciduous shrub species ranging from 35°N to 45°N (Figure 1), geographically overlapping the arid belt. The species belongs to the daisy family and is wind-dispersed with seeds having long pappi, but is distinguishable from other closely related species by its shrub life form (Gao & Nicholas, 2011). There is little known about this long-neglected species except for its ornamental value (Xie & Zhang, 2012;Xu et al., 2007). The species is widely distributed along the Taihang Mountains in habitats of uneven altitude from 200 m a. s. l. to 1,500 m a. s. l. and can survive in dry, cold, and rocky conditions (Gao & Nicholas, 2011;Xie & Zhang, 2012). Since M. dioica is one of the dominate components of shrub community in northern China, the species has undergone complicated geographic and climatic changes together with extreme ecological adaption. Thus, this species is an ideal system to investigate the joint effects of Pleistocene climatic oscillations, biogeographic barriers, and patterns of ecological adaption in the flora of northern China. Given the complex factors contributing to the evolution of M. dioica in northern China, efforts in identifying and quantifying important factors will undoubtedly lead to better conservation strategies, providing an evolutionary diversity reference for the framework of conservation genetics (Jarzyna & Jetz, 2016;Moritz & Potter, 2013). Phylogeography has been shown to be useful in plant conservation by investigating historical and evolutionary questions utilizing genetic structure, spatial, and demographic perspectives (Médail & Baumel, 2018). Studies in plants have also suggested that both expansion history and abiotic climate directly impact effective population size variation (Braasch et al., 2019;Micheletti & Storfer, 2015). In this case, inherited mechanisms for how specific historical and climatic factors affect species evolution and population size changes could protect plant species from extinction under disturbances by human encroachment and global climate change (Pauls et al., 2013;Polechová & Storch, 2008;Pritchard et al., 2000).
Genome-wide data derived from restriction enzyme-associated DNA sequencing (RAD-seq) have been effective and well-suited for phylogeographic studies and investigating adaptive evolution

| Sampling, DNA extraction, and RAD sequencing
We sampled 77 individuals from 16 populations spanning the geographic range of M. dioica. Fresh leaves from each individual were dried in silica gel. Total genomic DNA was extracted using a modified cetyltrimethylammonium bromide (CTAB) method (Doyle et al., 1996). The quality of DNA was visualized on a 1% agarose

| Bioinformatics pipeline for SNP calling
All 77 individuals from 16 populations were used for de novo SNP calling (Table S1 and Figure 1). Only the forward reads of the paired ends were used for analyses due to low coverage of the reverse reads.
Reads were filtered for quality and demultiplexed using process_radtags with default parameter in Stacks 2.0 (Catchen et al., 2013). We then used the denovo_map.pl wrapper to identify SNPs by clustering reads with a minimum number of four raw reads and allowing two mismatches between loci. Given the limited number of individuals within each population, the populations command was used to filter loci so that polymorphic loci in at least 50% of the individuals and twelve populations were retained. We further removed loci with minor allele frequencies <0.05 and only included the first SNP per locus in the final analysis to avoid linkage bias. Finally, a maximum observed heterozygosity was set to 0.5. PGDSPIDER 2.1.1.5 and VCFtools 0.1.15 were used to generate input files for downstream analyses (Danecek et al., 2011;Lischer & Excoffier, 2012).

| Summary statistics and population structure
We used the populations command of Stacks to estimate nucleotide diversity (π), observed heterozygosity (H O ), expected heterozygosity (H E ), and fixation index (F IS ) for each population. We performed pairwise population F ST values using an analysis of molecular variance (AMOVA) with 1,000 permutations as implemented in ARLEQUIN 3.0 (Excoffier et al., 2005). Population structure analyses were performed to infer the most likely number of ancestral populations using ADMIXTURE 1.23 (Alexander et al., 2009) by determining the optimal partitioning of the populations according to cross-validation error value of clusters 1-10. A principal component analysis (PCA) using the genome-wide complex trait analysis (GCTA) was used to explore the genetic structure and visualized in R (Pluess et al., 2016;RC, 2013). RAxML v8.2.10 (Stamatakis, 2014) was used to construct a maximum-likelihood (ML) phylogeny of the 16 populations from unlinked SNPs using the ascertainment bias correction for the GTRGAMMA model (-m ASC_GTRGAMMA, -asc-corr = lewis) with 1,000 bootstrap replicates.

| Demographic history and species distribution modeling
Effective population size (Ne) for each population was estimated using the molecular co-ancestry method implemented in NEESTIMATOR 2.01 (Do et al., 2014). To explore the historical demography of M. dioica, we employed DIYABC v. 2.0 software based on an approximate Bayesian computation algorithm (ABC) (Cornuet et al., 2014). The 16 populations samples were divided into two groups (group N and group S) based on genetic structure, PCA, and phylogenetic relationships of M. dioica, and we only used a single SNP per locus and neutral SNPs and without missing data for the ABC simulation. Most populations are distributed in the northern Taihang Mountains, and we first tested four possible divergence scenarios to estimate whether M. dioica originated from the northern region ( Figure S1).
We employed a uniform prior probability and selected all summary statistics to generate a reference table based on 4 × 10 6 simulated datasets with 1% of the simulated datasets closest to the observed data to estimate the relative posterior probabilities for all scenarios.
Based on our field observations and closely related species (Zhao & Gong, 2015), we set a conservative estimate for generation time to 5 years to estimate M. dioica demographic history. Additionally, we investigated demographic scenarios of changes in population size in two groups of M. dioica. Five models of population size changes were used with the same parameter settings strategy as above to choose the best fit demographic scenario and parameter estimation ( Figure S1; Fan et al., 2018;Wang et al., 2016). Gene flow patterns among different populations were analyzed with setting the number of migration events from 1 to 3 in the TreeMix 1.11 software (Pickrell & Pritchard, 2012). To further test the effects of glaciation on M. dioica, species distribution modeling was generated with MAXENT 3.2 (Phillips et al., 2006) using occurrence data gathered from field collections and herbarium specimens. After eliminating highly correlated variables, six bioclimatic variables from WorldClim (http://www.world clim.org) (Hijmans et al., 2005) were used as environmental predictors for species distribution modeling (SDM). The following two layers from the Model for the Community Climate System Model (CCSM, (Hasumi & Emori, 2004)) were employed for distribution modeling: the current layer and the last glacial maximum (LGM: c. 21 ka) layer. For evaluation, models were calibrated on 75% of the data and evaluated on the remaining 25% using area under the curve (AUC) and replicated 10 times.

| Genomic variations and outliers related to climate-driven local adaptation
To assess the correlation between geographic distance and genetic distance, we conducted isolation-by-distance (IBD) analyses using ARLEQUIN (Excoffier et al., 2005). We also tested the contribution of environmental differences to population differentiation (IBE) by comparing pairwise F ST values and environmental distances among populations using a Mantel test in GenAlEx 6.3 with 1,000 permutations (Peakall & Smouse, 2006). A total of 19 bioclimatic variables for all populations under current conditions were obtained from of WorldClim at 30-arc seconds resolution , http://www.world clim.org). We first conducted a principal component analysis (PCA) to obtain the first two principal components of the 19 bioclimatic variables (Clim_PC1 and Clim_PC2), and these values were used as points in two dimensions to calculate a pairwise distance matrix (Pluess et al., 2016).
To evaluate genomic variation and its association with geographic and environmental factors, redundancy analyses (RDA) were used to partition the variance of genomic variation explained by variables (Peres-Neto et al., 2006) as implemented in the VEGAN 2.5.7 package in R (Oksanen, 2020). Our initial model was as follows: Y (individual genotype) ~ Latitude + Longitude + Clim_PC1 + Clim_ PC2, and we also tested the significance of the estimated variance explained by all variables and separate variables with 999 permutations. Furthermore, a gradient forest (GF) analysis was performed using the R package gradientForest 0.1-17 (Ellis et al., 2012), which investigated the relationships between genomic variables and the six irrelevant bioclimatic variables.
To detect signatures of natural selection, we first used BayeScan 2.10 following the Bayesian-likelihood method of reversible-jump MCMC, which is based on a Dirichlet distribution decomposing locus-population F ST coefficients into a populationspecific component (beta) (Foll & Gaggiotti, 2008). Moreover, a burn-in of 50,000 iterations followed by 50,000 iterations was used for estimation using a thinning interval of 20. We then used BayPass 2.1 to identify climatic associations with all SNPs (Gautier, 2015) and to detect signature of adaptive selection associated with population-specific covariates (Clim_PC1 and In order to estimate divergence time among populations, molecular dating was performed in BEAST 1.6 using uncorrelated log-normal (UCLN) relaxed-clock model (Drummond & Rambaut, 2007). The root node was constrained using a normal prior distribution using secondary calibration based on the results of Funk et al. (2014).
Sister populations sharing one node were identified as having a con-

| RAD processing
After demultiplexing, and filtering low-quality reads of 16 populations (Table S1 and Figure 1), we obtained an average of ~1.6 million reads per sample with detailed information presented in Table S2.
Our Stacks pipeline with subsequent filtering generated 22,868 SNPs within M. dioica.

| Population structure
Based on the 22,868 SNPs, the average within-population nucleotide diversity (π) was 0.0347 ranging from 0.0303 to 0.0384 in populations JL and JW, respectively (Table 1) Table S3). Analysis of all SNPs with admixture, ML phylogeny, and PCA revealed a well-supported split at K = 2 corresponding to geography, and the populations included in the southern group (BA, GSJ, and HX; hereafter group S) and northern group (the rest populations; hereafter group N) were separated by the Taihang Mountains (Figures 1 and S2).

| Demographic history and species distribution modeling
Among populations, N e estimates ranged from 1.9 to 15.6 with the lowest and highest N e detected in population GSJ and HY, respectively (Table 1). In southern populations, N e estimates ranged from 1.9 to 4.7 (populations GSJ and HX), while in northern populations the values ranged from 3.3 to 15.6 (populations JH and HY). DIYABC estimations of the divergence history of M. dioica indicated that the scenario 3 had the higher posterior probability (posterior probabilities = 0.805, 95% CI: 0.579-1.000) ( Figure S1 and Table S4). For demographic history, the best fit scenario for both group N and group S was Scenario 4 of M. dioica (Figures 3,   S1 and Table S4). Group N was found to be the ancestral population and started to expand its distribution at c. 0.62 Ma (95% CI: 0.14-4.54 Ma), followed by a bottleneck at c.

| Genomic variation and outliers related to local adaptation
The IBD analyses indicated a significant correlation between geographic distance and genetic distance across populations (r = 0.33, p < 0.001) (Figure 4a). The first two principal components (Clim_PC1 and Clim_PC2) for all populations summarized 59.61% and 28.27% of the variation in the 19 climatic variables used in this study. A Mantel test among populations also determined significant correlation between genetic distance and environmental distance (r = 0.46, p < 0.001, Figure 4b). Our full RDA model indicated a significant role of geographic and environmental conditions in shaping the distribution of genotypes (p = 0.001; R 2 = 0.16). The first two RDA axes were significant and explained 12.20% of the constrained variance (Table S5). When single variable effects were conditioned by other variables effect in partial RDA analysis, our results revealed a significant effect of Clim_PC1 and Latitude on genomic variations (Table S6). Gradient forest analyses indicated that the most two important predictors were temperature seasonality and precipitation seasonality ( Figure 5). Our results indicated a shift temperature seasonality (standard deviation × 100) was 101 and precipitation of driest month was 5.5 mm ( Figure 5).
A total of eight and 21 F ST outliers were identified in M. dioica using BayeScan and BayPass, respectively, with no overlapping outlier loci found between the two methods ( Figures S4-S5). The BayPass analysis indicated a total of one and four loci significantly associated with Clim-PC1 and Clim-PC2, respectively ( Figure S5).
Based on the H. annuus genome, 22 loci associated with environment variables were successfully aligned to reference genome. A majority of the genes closest to the outlier loci were related to environmental stress response under drought tolerance, such as DREB and EAR motif protein 3, lipid transfer protein 3, and F-box family protein (Table S6).

| Joint effects of expansion dynamics and climatic factors on Ne
No association between the number of individuals sampled and N e was observed (r = −0.50, p = 0.07). We independently employed climate factors and population age for predicting N e in M. dioica populations. Our results showed no significant correlation between N e and population age (r = −0.24, p = 0.41, Figure S6).
Although principal component analysis revealed that the first two cumulative contribution rates of variance for bioclimatic factors explained the majority of the variance, no significant correlation was found between PC1 of all climate factors and N e (r = −0.24, p = 0.42, Figure S6). In addition, there was no significant correlation between PC1 of energy/moisture and PC2 of energy/moisture to predict N e ( Figure S7). The above results suggest that a single factor within either population history or climatic factors could not predict N e using a linear model. Thus, we further tested the multiple linear regression model of all the climate factors and

| Phylogeographic structure and demographic history
Our genomic data of M. dioica indicated a strong genetic population structure consisting of the northern and southern groups representing two distinct lineages geographically (Figure 1). Our best-fitting ABC model identified the northern lineage as being ancestral and the southern group having been derived from it, which suggests that M. dioica originated north of the Taihang Mountains and then expanded south of the Taihang Mountains ( Figure 3). The estimated divergence times for the southern and northern lineage occurring during the Pliocene to Pleistocene coincides with the intense uplift of the Taihang Mountains during the Late Pliocene to Pleistocene (Gong, 2010;Wu et al., 1999). With no detected migration events between the two groups, the division is inferred to be associated with restricted gene flow caused by long-term isolation of geographic barrier. Visible mountains as barriers driving population subdivision and divergence have been reported in the Qinghai-Xizang Tibet Plateau and the Hengduan Mountains (Gao et al., 2003;Wu & Wu, 1996). significantly lower than those during the Pliocene and also became drier (Shi et al., 1999). This altered climate condition could be con- Note: "*" is corresponding to significant to Ne. Bio11 = mean temperature of coldest quarter; Bio13 = precipitation of wettest month.

| Genomic variations and climatedriven adaptation
Our results based on IBE analyses indicated a significant effect of climatic factors on genetic differentiation of M. dioica, highlighting the contributions of environmental variables to the landscape genomics. Meanwhile, our RDA results indicated that the environment explained more genomic variations than geography, which is consistent with previous results (Orsini et al., 2013;Sexton et al., 2014). This also suggests that adaptation to the environment plays a key role in shaping plant divergence in the flora of northern China, and we used a landscape genomics approach to detect candidate genes for local adaption. A total of 22 SNPs were annotated to associate with climatic variables and a high proportion of located genes (over 50%) are involved in abiotic stress response of drought.
Similar results have been presented in previous physiological studies in which M. dioica showed superior drought resistance of the root system (Dai et al., 2014;Xu et al., 2007). Among the outlier loci, we found genes involved in regulation of reduced lateral root formation. Plants decrease the metabolic cost of soil exploration, improving water acquisition and plant growth, which has been demonstrated to improve drought tolerance in Zea mays L. (Poaceae) and Oryza glaberrima Steud. (Poaceae) (Inukai et al., 2005;Lynch, 2013).   (Table 2). First, we recognized population age of M. dioica to be one of the direct contributing factors for estimating population change (Braasch et al., 2019;Excoffier & Ray, 2008

DATA AVA I L A B I L I T Y S TAT E M E N T
The raw data were deposited in the NCBI Short Read Archive (SRA) (Bioproject No. PRJNA607823).