AlleleShift: an R package to predict and visualize population-level changes in allele frequencies in response to climate change

Background At any particular location, frequencies of alleles that are associated with adaptive traits are expected to change in future climates through local adaptation and migration, including assisted migration (human-implemented when climate change is more rapid than natural migration rates). Assuming that the baseline frequencies of alleles across environmental gradients can act as a predictor of patterns in changed climates (typically future but possibly paleo-climates), AlleleShift provides a methodology for predicting changes in allele frequencies at the population level.

Methods The prediction procedure involves a first calibration and prediction step through redundancy analysis (RDA), and a second calibration and prediction step through a generalized additive model (GAM) with a binomial family. As such, the procedure is fundamentally different from an alternative approach recently proposed to predict changes in allele frequencies from canonical correspondence analysis (CCA). The RDA step is based on the Euclidean distance, which is also the typical distance used in Analysis of Molecular Variance (AMOVA). Because the RDA step and the CCA approach sometimes predict negative allele frequencies, the GAM step ensures that allele frequencies are in the range of 0 to 1.

Results AlleleShift provides data sets with predicted frequencies and several visualization methods to depict the predicted shifts in allele frequencies from baseline to changed climates. These visualizations include ‘dot plot’ graphics (function shift.dot.ggplot), pie diagrams (shift.pie.ggplot), moon diagrams (shift.moon.ggplot), ‘waffle’ diagrams (shift.waffle.ggplot) and smoothed surface diagrams of allele frequencies of baseline or future patterns in geographical space (shift.surf.ggplot). As these visualizations are generated through the ggplot2 package, generating animations for a climate change time series is straightforward, as shown in the documentation of AlleleShift and in the supplemental videos.

Availability AlleleShift is available as an open-source R package from https://cran.r-project.org/package=AlleleShift and https://github.com/RoelandKindt/AlleleShift. Genetic input data is expected to be in the adegenet::genpop format, which can be generated from the adegenet::genind format. Climate data is available from various resources such as WorldClim and Envirem.
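The two-step logic described under Methods can be illustrated with a minimal base-R sketch. It does not use the AlleleShift functions: a linear model stands in for the RDA step (RDA fitted values are those of a linear regression on the environmental variables), and a binomial GLM stands in for the binomial GAM. All data and object names below are hypothetical.

```r
# Hypothetical toy data: 8 populations along one climate variable, one allele.
set.seed(1)
env.baseline <- seq(5, 75, by = 10)        # baseline climate values
env.future   <- env.baseline + 25          # changed climate values
n.alleles    <- rep(2000, 8)               # allele copies sampled per population
p.true       <- plogis((env.baseline - 40) / 8)
count.A      <- rbinom(8, size = n.alleles, prob = p.true)
freq.A       <- count.A / n.alleles

# Step 1 (linear stand-in for RDA): calibrate on baseline, predict for future.
step1       <- lm(freq.A ~ env.baseline)
freq.e.base <- fitted(step1)
freq.e.fut  <- predict(step1, newdata = data.frame(env.baseline = env.future))

# Step 2 (binomial GLM as a stand-in for the binomial GAM): model observed
# allele counts as a function of the step-1 predictions.
step2 <- glm(cbind(count.A, n.alleles - count.A) ~ freq.e.base,
             family = binomial)
freq.pred.fut <- predict(step2,
                         newdata = data.frame(freq.e.base = freq.e.fut),
                         type = "response")

# Step-1 predictions can leave the 0 to 1 range; step-2 predictions cannot.
range(freq.e.fut)
range(freq.pred.fut)
```

In this toy setting the linear step extrapolates beyond 1 for the warmest future sites, whereas the binomial step maps those values back into the valid range of frequencies.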

The environmental data is first transformed to a range of 0 to 100. Environmental centres along the gradient for baseline populations range between 5 and 75. The populations shift +25 along the range in the future conditions.
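As a minimal sketch of this setup, the rescaling and the population centres could be written as follows. The helper name rescale.0.100 and the raw values are hypothetical, and a simple linear min-max transformation is assumed.

```r
# Rescale a numeric vector linearly to the range 0 to 100 (assumed min-max).
rescale.0.100 <- function(x) {
  100 * (x - min(x)) / (max(x) - min(x))
}

set.seed(2)
env.raw    <- rnorm(50, mean = 0, sd = 10)   # hypothetical raw environmental values
env.scaled <- rescale.0.100(env.raw)

# Centres of the 8 baseline populations, and the same centres shifted by +25.
baseline.centres <- seq(5, 75, by = 10)
future.centres   <- baseline.centres + 25
```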

str(tutorial.C)
## num [1:50] -3.96 -5.12 3.35 -16.99 16.47

Plots show the counts for the 'A' alleles along the gradient separately for each locus. Whereas the frequency of the A alleles increases clearly along the gradient, there is considerable overlap between occurrences of both alleles over most of the range.

Select alleles that are most clearly associated with gradient
Via a generalized linear model, the five loci with highest percentages of explained variance are selected.
locus.explained <- numeric(length=50)

Plots of the ranges show that considerable overlap remains between the two alleles. For several loci, both alleles are observed in the lowest section of the range, not only the B allele.
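The selection step can be sketched along the following lines. This is an illustration under stated assumptions, not the original code: genotypes are simulated here, a binomial GLM is fitted per locus, and "explained variance" is taken to mean the percentage of explained deviance. The objects geno, env and selected.loci are hypothetical.

```r
set.seed(3)
n.ind  <- 300
n.loci <- 50
env    <- runif(n.ind, 0, 100)   # hypothetical positions on the gradient

# Hypothetical genotypes: count of 'A' alleles (0, 1 or 2) per individual,
# with the strength of association increasing from locus 1 to locus 50.
effect <- seq(0, 0.05, length.out = n.loci)
geno   <- sapply(effect, function(b) {
  rbinom(n.ind, size = 2, prob = plogis(b * (env - 50)))
})

# Percentage of explained deviance of a binomial GLM, per locus.
locus.explained <- numeric(length = n.loci)
for (i in seq_len(n.loci)) {
  A.count <- geno[, i]
  fit <- glm(cbind(A.count, 2 - A.count) ~ env, family = binomial)
  locus.explained[i] <- 100 * (fit$null.deviance - fit$deviance) / fit$null.deviance
}

# Keep the five loci with the highest percentages.
selected.loci <- order(locus.explained, decreasing = TRUE)[1:5]
```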

Create new simulated data sets
For the simulation study here, 8 populations are simulated. In the baseline climate, populations have centred values on the environmental gradient that range from 5 to 75, with a distance of 10 between populations. In the future climate, populations have centred values that range from 30 to 100, with the same distance of 10. A resampling procedure randomly selects 1000 individuals for each population. These individuals are selected from the subset of the allele data set that lies within a range of 20 around the centres of the baseline and future populations.
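A minimal sketch of such a resampling procedure is given here. It rests on assumptions that the text does not settle: the "range of 20" is read as the centre plus or minus 10, sampling is done with replacement, and the individual-level table ind.data is hypothetical.

```r
set.seed(4)
# Hypothetical individual-level data: an ID and a position on the 0 to 100 gradient.
ind.data <- data.frame(id = 1:5000, env = runif(5000, 0, 100))

centres    <- seq(5, 75, by = 10)   # baseline centres of the 8 populations
half.width <- 10                    # assumption: "range of 20" = centre +/- 10
n.sample   <- 1000

for (p in seq_along(centres)) {
  # Individuals whose environmental value lies within the window of this centre.
  candidates <- ind.data[abs(ind.data$env - centres[p]) <= half.width, ]
  # Assumption: sample with replacement, as a window may hold < 1000 individuals.
  pop.data <- candidates[sample(nrow(candidates), n.sample, replace = TRUE), ]
  pop.data$pop <- paste0("P", p)
  if (p == 1) {
    baseline.ind <- pop.data
  } else {
    baseline.ind <- rbind(baseline.ind, pop.data)
  }
}
```

Future populations would follow from the same loop with centres shifted by +25.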

Baseline populations
    baseline.ind <- pop.data
  }else{
    baseline.ind <- rbind(baseline.ind, pop.data)
  }
}
# convert to genind object
baseline.genind <- genind(baseline.ind)
baseline.genind@pop <- factor(rep(paste0("P", c(1:8)), each=1000), levels=c(paste0("P", c(1:8))))
poppr::poppr(baseline.genind)

Plots of the frequency of the A allele along the environmental gradient show that the frequency generally increases along the gradient. However, the increase is not monotonic for four out of five of the loci. For example, for the second locus, the frequency of the A allele is lower for the fourth population than for the third population.

As a consequence of the overlap in occurrence of A and B alleles along the gradient, frequencies lower than 0.15 are not observed.
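The check for a monotonic increase in frequencies can be sketched as follows, using a small hypothetical data set of A-allele counts per diploid individual (not the tutorial data).

```r
# Hypothetical A-allele counts (0, 1 or 2 per diploid individual) for one locus,
# with five individuals in each of four populations ordered along the gradient.
pop   <- factor(rep(paste0("P", 1:4), each = 5), levels = paste0("P", 1:4))
A.cnt <- c(0, 0, 1, 0, 1,
           1, 1, 2, 1, 1,
           1, 0, 1, 1, 1,
           2, 2, 1, 2, 2)

# Frequency of the A allele per population: A copies / (2 * individuals).
freq.A <- sapply(split(A.cnt, pop), function(x) sum(x) / (2 * length(x)))

# The increase is monotonic only if every successive difference is positive.
monotonic <- all(diff(freq.A) > 0)
```

In this toy example the third population has a lower frequency than the second, so monotonic is FALSE even though the overall trend is upward.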

Future populations
The same procedure is used to select individuals for future populations.

future.actuals <- future.means
future.genind@pop <- factor(rep(paste0("P", c(1:8)), each=1000), levels=c(paste0("P", c(1:8))))
poppr::poppr(future.genind)

Graphs of future populations also show a general increase in the frequencies of the A allele. However, as a result of high frequencies of heterozygous individuals at the highest values of the environmental gradient for some loci, there is a marked drop in frequencies for the eighth population, or for the seventh and eighth populations, for loci 1 and 3.

Comparisons of input frequencies with predicted frequencies show that in most cases differences between predicted and input frequencies are smaller than 0.05. Predictions are generally worse for the fourth population and the first allele, although absolute differences are typically not larger than 0.08.

Compare predicted and simulated future frequencies
The model calibrated in this section will only be used to estimate the actual future allele frequencies. In the following figure, these actual allele frequencies are plotted against the frequencies predicted by the model that was calibrated with baseline frequencies. The comparison between projected frequencies and those sampled from the environmental gradient generally shows that projections match the expected frequencies well.
As a consequence of peculiarities of the simulated data from LEA, matches are worse for populations 7 and 8, where simulated data had lower frequencies for several alleles. Predictions for these populations were close to 1.0, which conforms more to expected frequencies of alleles that are positively correlated with the environmental gradient.

## selected populations
## [1] "P1" "P3" "P4" "P5" "P7" "P8"
plotB1

Predicted frequencies for populations 2 and 6 match the expected frequencies well. The lower percentages of explained variation result from the expected frequencies being close to each other, especially for population 6.
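Why closely spaced expected frequencies lower the percentage of explained variation can be seen in a small numerical sketch. It assumes that explained variation is computed as 1 minus the ratio of residual to total sum of squares; this definition, the helper explained and all values are hypothetical.

```r
# Explained variation as 1 - SS(residual) / SS(total); an assumed definition.
explained <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Identical prediction errors applied to two hypothetical populations.
err <- c(0.01, -0.01, 0.01, -0.01)

actual.wide   <- c(0.30, 0.50, 0.70, 0.90)   # frequencies spread far apart
actual.narrow <- c(0.88, 0.90, 0.92, 0.94)   # frequencies close to each other

r2.wide   <- explained(actual.wide,   actual.wide   + err)
r2.narrow <- explained(actual.narrow, actual.narrow + err)
```

With the same absolute errors, r2.narrow is lower than r2.wide because the total variation among the closely spaced frequencies is much smaller.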

Plot predicted shifts in allele frequencies
Graphs show that increases in frequencies are expected for all populations and all loci. This is the expected trend for the simulated data.

Discussion
The simulation procedure was able to retrieve the expected future trend of increasing allele frequencies associated with a shift of populations along the environmental gradient.
This positive trend was predicted for all populations, even though allele frequencies in the simulated data sets did not always increase monotonically along the gradient.