Population differentiation of polygenic score predictions under stabilizing selection

Given the many small-effect loci uncovered by genome-wide association studies (GWAS), polygenic scores have become central to genomic medicine, and have found application in diverse settings including evolutionary studies of adaptation. Despite their promise, polygenic scores have been found to suffer from limited portability across human populations. This at first seems in conflict with the observation that most common genetic variation is shared among populations. We investigate one potential cause of this discrepancy: stabilizing selection on complex traits. Counterintuitively, while stabilizing selection constrains phenotypic evolution, it accelerates the loss and fixation of alleles underlying trait variation within populations (GWAS loci). Thus even when populations share an optimum phenotype, stabilizing selection erodes the variance contributed by their shared GWAS loci, such that predictions from GWAS in one population explain less of the phenotypic variation in another. We develop theory to quantify how stabilizing selection is expected to reduce the prediction accuracy of polygenic scores in populations not represented in GWAS samples. In addition, we find that polygenic scores can substantially overstate average genetic differences of phenotypes among populations. We emphasize stabilizing selection around a common optimum as a useful null model to connect patterns of allele frequency and polygenic score differentiation. This article is part of the theme issue ‘Celebrating 50 years since Lewontin's apportionment of human diversity’.


Introduction
Lewontin's foundational early work found that most common genetic variation is shared among human populations [1]. Thus, it strongly refuted the view that genetic variation is partitioned among mostly invariable populations and has become a classic work discrediting discrete human races. Lewontin's finding is reflected by low estimates of F ST among populations [2], i.e. only a small proportion (approx. 10%) of the total allelic variance is attributable to differences in frequency between populations (a finding that has been replicated for many different marker types genome-wide; [2][3][4][5][6][7]). While some loci that are highly differentiated among human populations have been uncovered (e.g. underlying loci in skin pigmentation and infectious disease immunity; see [8] for a review), there are relatively few such strongly selected loci [9,10].
Lewontin's results also have implications for our a priori expectations of the partitioning of phenotypic variation within and among human populations. For phenotypes evolving neutrally in diploids, we expect the proportion of additive genetic variance attributable to among-population differences to be ≈2 F ST when all individuals are measured in a common set of environments [11][12][13][14][15][16]. Thus in humans, we expect only approximately 18% of additive genetic variance to be due to among-population differences when measured in a common set of environments. Compared to this null, phenotypes subject to divergent selection among populations are expected to be over-dispersed, while phenotypes subject to stabilizing selection with the same selective optimum are expected to be less differentiated among populations. To distinguish among potential contributors to genetic differences among populations, researchers often turn to settings like common gardens in an effort to eliminate among-population environmental variation. However, it is not feasible to measure human phenotypes in a common environment (a major drawback of studies that investigated population-level phenotypic variation; [17,18]). Thus for the majority of complex traits, we do not know the role of genetics, let alone natural selection, in explaining phenotypic differences among human populations.
These questions have received renewed interest in human genetics due to genome-wide association studies (GWAS), which have found common variation associated with many phenotypes within populations. GWAS have revealed that most phenotypes are highly polygenic within populations [19][20][21] and confirmed that much of the genetic variance is additive (reviewed in [22]). These observations have motivated phenotype prediction through additive genetic values, the additive contribution of polymorphisms to phenotypic differences among individuals. One common approach to predict an individual's additive genetic value is based on a polygenic score, the sum across trait-associated loci of genotypes weighted by their estimated effects. Predictions based on polygenic scores are being explored in a number of clinical settings and more generally as a tool for understanding the genetic basis of disease and phenotypic variability. However, the generalizability of polygenic scores across populations is concerning because GWAS samples are strongly biased towards European populations and studies of other populations are much smaller [23][24][25][26]. There is wide agreement that these portability issues must be addressed so that the future clinical use of polygenic scores does not further compound inequalities in healthcare [23,24,26].
Currently, for most complex traits, polygenic scores poorly predict additive genetic values, and therefore phenotypes. This issue stands even in samples closely related to the GWAS population, because polygenic scores aggregate across many loci with slightly mis-estimated effects. These problems increase as we move to populations that are genetically and environmentally more distant from the GWAS populations. For one, the associated loci are usually not the causal loci underlying trait variation; instead they tag the effects of linked causal sites. The interpretation of the effect size of an associated variant can be tricky because of: (i) linkage disequilibrium (LD), whereby it absorbs the effects at correlated causal sites [27][28][29][30][31]; (ii) population stratification, whereby it absorbs the effects of covarying environments [32][33][34][35][36]; and (iii) geneby-environment (GxE) or gene-by-gene (GxG) interactions, whereby its estimate is averaged over the interacting environmental contexts or genetic backgrounds in the sample [28,[37][38][39][40][41][42][43][44][45][46][47]. As all of these factors can and will vary across populations, the effect sizes of alleles will differ among them and so polygenic scores will have lower prediction accuracy, i.e. imperfect portability, across populations.
Even with perfectly estimated effects at the causal loci with significant trait associations, the prediction accuracy of polygenic scores will be limited because GWAS only identify loci with common alleles that contribute enough variance to exceed some significance threshold determined by the sample size. Therefore, an allele that is rare in the GWAS sample but common elsewhere will not be discovered. This would lead to a greater reduction in the phenotypic variance accounted for, or prediction accuracy, in populations not represented in the GWAS sample (hereafter 'unrepresented populations'; [30,40,44,[48][49][50][51][52]). Indeed, many variants contributing to trait variation in European GWAS samples are not at a high enough frequency to be detected in other populations, suggesting that different sets of polymorphisms contribute to the trait variance in different populations [53,54]. Genetic differentiation likely contributes to the reduction in the prediction accuracy of polygenic scores in unrepresented populations, as groups with increasing genetic distance from GWAS samples experience a greater loss in prediction accuracy [55][56][57][58].
The factors that reduce the utility of polygenic scores for individual-level prediction may also complicate the interpretation of average polygenic score differences across populations. Such issues arise in studies of adaptation that use polygenic scores to assess the contribution of selection to the genetic basis of phenotypic differentiation among human populations. An early application of this approach found polygenic signals of selection on height within Europe [59,60], where polygenic scores were over-differentiated among populations compared to the neutral prediction based on F ST . Importantly, the null distribution of this test of neutrality at the level of the phenotype does not rely on accurate polygenic scores and so changes in LD, GxG and GxE should not cause false signals. However, the results are very sensitive to slight biases in estimated effect sizes due to population structure, and indeed the signal of polygenic selection on height turned out to be almost entirely due to stratification [61][62][63]. More generally, the imperfect portability of polygenic scores and the fact that the mean environmental contribution to phenotypes can vary greatly between populations raise concerns about over-interpreting differences in mean polygenic scores as genetic differences in the average phenotype among populations [60,[64][65][66][67].
The consequence of genetic differentiation on the low generalizability of GWAS results for phenotype prediction may appear to contradict Lewontin's observation of minor allele frequency (MAF) differences between populations. However, allele frequency differentiation for complex traits under selection can occur at a rate faster than drift. This would lead to noisy estimates of additive genetic values, and so reduced prediction accuracy of polygenic scores, for unrepresented populations. In order to understand the impact of this turnover on portability, we need models of allele frequency differentiation that are informed by plausible forms of natural selection on complex traits.
Stabilizing selection with a constant fitness optimum is a sensible null model for the evolution of complex traits. Indeed, studies of its influence on allelic dynamics have set a conceptual foundation for interpreting and designing GWAS within populations [68]. Under stabilizing selection, intermediate trait values have the highest fitness, with decreasing fitness with distance from that optimum (figure 1a). Many quantitative traits have been shown to experience stabilizing selection in humans (e.g. [69]) as well as across many other species (e.g. [70][71][72]). Stabilizing selection also contributes to the lack of variation in morphological traits within and between closely related species and the morphological constancy of traits in the fossil record [73][74][75][76][77].
Here, we show how stabilizing selection on complex traits reduces portability and increases the chance of false signals of directional polygenic adaptation, despite low overall genetic differentiation and no genetically based trait differentiation among populations. Under parameter ranges estimated from empirical studies, we combine simulations of stabilizing selection on complex traits with analytical models of its effect on genetic differentiation to investigate how stabilizing selection drives these results. We focus only on scenarios in which the trait optimum is shared among populations, leading to levels of trait differentiation among populations much smaller than neutrality. In our baseline scenario, stabilizing selection occurs on a single additive trait, where alleles have the same effect in different populations (reflecting no GxG or GxE interactions and no pleiotropy; we relax these assumptions later).
We study differentiation between a pair of populations, either an ancestral and descendant population or a pair of contemporary populations, in which the results of a GWAS in one of those populations is used to make trait predictions in the other. In doing so we provide a polygenic score perspective on earlier investigations into the relationship between population structure, stabilizing selection and quantitative trait variation (e.g. [78][79][80][81][82][83]). We show how these factors reduce the prediction accuracy of polygenic scores and can readily lead to patterns of polygenic score differentiation rife with the potential for misinterpretation.

Model background
An individual's additive genetic value G i is the sum of the additive effects of all alleles they carry, where a l represents the additive effect of an allele relative to another at locus l and g il is the number of copies of that allele carried by the individual. All of these loci denoted by l are polymorphic within a specified population from which the individual was drawn. Note that the additive genetic value does not represent an absolute measure of an individual's phenotype, and instead represents the additive contribution of the polymorphisms they carry to their deviation from their population's mean. Under a constant selective environment, stabilizing selection keeps the population mean phenotype close to the optimum and decreases the phenotypic variance in the population because individuals on both tails of the distribution have lower fitness (figure 1a,b; [12,84,85]). To understand the process by which stabilizing selection reduces the phenotypic variance, we focus on the additive genetic variance ð2:2Þ in which the first component refers to the additive genic variance (V a ) and the second component accounts for the contribution of linkage disequilibrium (LD) among loci that contribute to the variance. In the short term, stabilizing selection reduces the phenotypic variance by generating negative LD between like-effect alleles, thereby limiting extreme phenotypes (known as the Bulmer effect; [86]). The additive genetic variance quickly reaches an equilibrium reflecting a balance between selection producing negative LD, and recombination and chromosome segregation breaking up that generated LD [87]. The long-term genetic response to stabilizing selection is driven by a reduction in the additive genic variance, in particular the variance in genotypes at a locus (Var(g l ); the expected heterozygosity). Yet if the genetic basis of trait variation was truly infinitesimal, i.e. made up of loci of infinitely small effect, only genetic drift would erode variation in the long term. However, while the loci discovered by GWAS contribute very small effects, these effects are not infinitesimally small and so they can be directly acted on by selection [68,88]. Selection at the individual loci underlying trait variation is in many cases well approximated by underdominant selection, in which the more common allele fixes and the minor allele is lost (figure 2a; [84,89]). Owing to underdominant selection at the locus level, stabilizing selection removes polymorphisms at a faster rate than neutrality, with selection coefficients proportional to their squared effect sizes (see [68,88,90] for recent applications of such models to understand GWAS variation within populations). Meanwhile, under moderate strengths of stabilizing selection and a constant environment, the population mean phenotype stays very close to the optimum through rapid, small fluctuations at many individual loci. We note that the full analysis of models of stabilizing selection is challenging because changes in allele frequencies and covariances must be tracked over many loci (e.g. [87,91]). Indeed, when we simulated physically linked loci with low recombination, we found that selection on an allele was weaker than we predicted due to the persistence of selection-generated LD between alleles with opposing effects (the Bulmer effect; electronic supplementary material, figure  S1). To understand the long-term turnover in the genetic basis of trait variation, from here on we focus on the additive genic variance, V a , under the assumption that loci contributing to the variance are physically unlinked.
We begin by thinking of an ancestral population at equilibrium and the loss of ancestral phenotypic variance over time. When the population is at mutation-drift-selection equilibrium, stabilizing selection and genetic drift remove variation such that as time goes on less and less of the variance in the descendant population is contributed by ancestral polymorphisms. Instead, the variance contributed by new mutations that are private to the descendant population will increase, replenishing what was removed by drift and selection.
We assume a model of Gaussian stabilizing selection where the width of the fitness peak is determined by w (w 2 is equivalent to V S in other models of stabilizing selection, e.g. [92]), which will approximate any symmetric quadratic stabilizing selection model when the mean trait is close to its optimum. We also assume thousands of unlinked loci contributing to trait variation and explore dynamics resulting from three different mutation effect size distributions (figure 2c); see appendix A, §(a) for simulation details.
We can predict the reduction in additive genic variance contributed by ancestral variants (anc) in the descendant population (desc) after t generations for alleles with effect size a using a common approximation for the per-generation loss (e.g. [93]) where N is the population size, V P is the total population trait variance and V P ≪ w 2 , S = 2Na 2 /w 2 is the population-scaled selection coefficient of the allele, and F ST ≈ t/2N for neutral polymorphisms. On the right-hand side of equation (2.3a), the first bracketed term is the per-generation reduction due to drift and the second term is the reduction due to stabilizing  royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20200416 selection. Looking at the exponent in equation (2.3c), we see that the decay of the variance (heterozygosity) contributed by alleles depends on F ST , but will be increased for alleles whose effect sizes are large enough such that their population-scaled selection is appreciable (S > 1). Equations (2.3) offer good intuition; however, in the remainder of the main text we show results from a diffusion approximation that we developed that is better for our purposes (extending from [68], see appendix A, §(c)).
The reduction in additive genic variance contributed by a particular polymorphism with effect size a is the same as the reduction in heterozygosity at that site. Selection causes a stronger reduction when the fitness peak is narrower (i.e. when w is smaller) and when the allele's effect is larger (figure 2a; see electronic supplementary material, figure S2 for the approximation in equation (2.3a)). We can average the reduction in heterozygosity across all sites, weighting by the distribution of effect sizes and genic variance contributed by a given effect size, to predict the total remaining variance and thus total reduction in V a over time (figure 2b; equation (A 17)). In the example shown in figure 2, this total reduction tends to be weaker than what we see for the largest effect polymorphisms, because most sites that contribute to the variance are of small effect. This is because (i) under our mutational distribution most alleles have small effects, and (ii) under stabilizing selection, the equilibrium distribution of observed effects is narrower than the distribution of mutation effects.
The form of the decay in ancestral variance strongly depends on the distribution of mutation effects. Consider a case in which a higher proportion of introduced mutations are strongly selected due to their large effects (S > 10). Some of these alleles are still capable of drifting to intermediate frequencies, and so a higher proportion of the ancestral variance will be contributed by larger effect polymorphisms [68]. Therefore, stabilizing selection will cause a steeper reduction in the ancestral variance (figure 2c). Moreover, if most of the other mutations are nearly neutral, the early and steep decline in ancestral variance will be followed by a decline more consistent with neutrality (yellow lines in figure 2c; see [68] for a discussion of selection regimes). The extent to which ancestral variance is depleted determines the amount of shared additive genetic variance between diverging populations and thus the portability of polygenic scores.

Accuracy of polygenic score predictions
To understand the effect of stabilizing selection and drift on the prediction accuracy of polygenic scores in isolation from other sources of bias, we make the simplifying assumption that GWAS identify associations between polymorphisms and trait variation only at causal loci. For a GWAS within a population to identify a locus as being associated with the trait, the locus has to be polymorphic in that population and its phenotypic association has to achieve some level of statistical significance (i.e. contribute above some level of variance to the trait). For those causal loci with significant associations, we also assume that their effects are estimated perfectly and for the moment that these true effects do not vary within the sample, the population the sample was drawn from, or across populations (i.e. populations experience the same set of environments). We also assume that their effects are strictly additive. At the trait-associated loci, one can sum the additive effects of all alleles that an individual carries at a predefined set of markers to form a polygenic score.
We are interested in the reduction in prediction accuracy for a population not represented in the GWAS sample, relative to the prediction accuracy of those represented populations. To explore this we consider the genetic differentiation between a pair of populations, A and B, in which the GWAS sample is drawn from population A but not population B. When the effects of alleles do not vary between populations, we can quantify the reduction in phenotype prediction accuracy (r 2 ) of polygenic scores, compared to using additive genetic values, constructed for any population as where S i is an individual's polygenic score and G i is their additive genetic value. Since an individual's polygenic score is only part of their additive genetic value when a GWAS estimates true effects, this reduction can be understood as the proportion of the total additive genetic variance, or proportion of the heritability, explained by GWAS-significant sites (appendix A, §(b)). The ratio of r 2 S =r 2 G in population B to population A quantifies the reduction in prediction accuracy due to a lack of GWAS representation for population B. If these populations experience the same selective environment, we expect the same Var(G) for each population, so this reduction due to a lack of representation would simply be the ratio of the variances explained by polygenic scores.
Note that our definition of the prediction accuracy in population B is the squared correlation of the deviation of an individual's polygenic score with its deviation in phenotype, where both of these deviations are with respect to the mean phenotype in population B. This definition matches typical polygenic score practices where predictions are statements about the departure of an individual from their ancestry group's mean genetic value, rather than a prediction of their departure from the mean genetic value of the GWAS population. However, this definition of prediction accuracy does not include discrepancies from the evolution of mean polygenic scores and phenotypes among populations, meaning it does not account for a systematic shift in polygenic scores between populations. We turn to this point in §4.

(a) Ascertainment of all causal loci in the GWAS sample
We begin with the simplified case in which all causal loci that are polymorphic in population A have been identified, such that polygenic scores equal the additive genetic values in population A. As polygenic scores from ancient DNA are being used to investigate the phenotypic diversity of ancient human populations [94], we first consider the prediction accuracy in a population ancestral to population A of a polygenic score constructed using variants found in population A. The reduction in polygenic score prediction accuracy for this ancestral population is approximately the reduction in variance contributed by that ancestral population to the present (figure 2b,c; electronic supplementary material, figure S4; see appendix A, §c(ii) for an explanation). While ancient individuals were likely not drawn from populations directly ancestral to present-day populations, we should observe the same general patterns with genetic differentiation royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20200416 between the ancient and present-day population (see also [95] for approximations of this decline).
For the remainder of this article, we consider A and B to be contemporary populations; we assume for simplicity that they split from a common ancestral population without subsequent gene flow. In figure 3a, we use simulations and analytical predictions to show how the prediction accuracy in population B decreases with increasing time since its common ancestor with population A. The span of neutral genetic differentiation was chosen to reflect a scale along which various human populations could fall. We see slightly weaker reductions in prediction accuracy with time when the variance of the mutation effect size distribution is quartered and stronger reductions when the mutation effect size distribution has a heavy tail (figure 3b; electronic supplementary material, figure S5). All variance-contributing loci were ascertained in population A, so the only reason the full genic variance in population B was not captured is because private polymorphisms contribute to the phenotypic variance in each population. At the time of divergence between the pair of populations, they entirely share their genetic basis of trait variation. Then as stabilizing selection and drift remove polymorphisms (at equilibrium), new mutations replenish them at different sites in each population, leading to the same total variance but different genetic bases of that variance (assuming a very large mutational target). Without gene flow between the pair of populations, the polymorphisms that arose since their common ancestor will remain private to each population and will thus not contribute to the variance in the other population. Therefore, the polymorphisms in population A that also contribute to the variance in population B must be ancestrally shared polymorphisms, specifically at those loci in which ancestral polymorphisms were not removed by drift or selection in either descendant population. Note that these shared polymorphisms on average contribute less variance in the descendant populations than in their ancestor. Thus we see a loss in the additive genetic variance shared between populations over time due to a reduction in both the number of shared ancestral polymorphisms and the variance they each contribute. In appendix A, §c(ii), we describe our analytical predictions for this process, which match well to simulations.

(b) Ascertainment of a subset of causal loci in the GWAS sample
To more realistically explore the impact of GWAS ascertainment, we explore power-and frequency-based variant discovery. In the main text, we focus on a case in which a GWAS uncovered the top 5% of variance-contributing polymorphisms in population A. Under our choice of effect size distribution and strengths of stabilizing selection, these polymorphisms explain just under 50% of the additive genic variance in population A (similar to that of height in Europeans; [96][97][98]). We also consider a case in which a GWAS uncovered all causal loci with an MAF that exceeds 1%. As expected, under any ascertainment scheme we observe a substantial drop in the variance explained in population B compared to the case in which all polymorphisms in population A are ascertained (figure 3b). If this reduction in variance for each population is the same, then the decline in prediction accuracy in population B relative to population A over time will be similar to the case when all polymorphisms in population A are ascertained. We find that this is

Difference in polygenic means among populations
In the previous section, we described how stabilizing selection and ascertainment in the GWAS sample can reduce the prediction accuracy of individual genetic values, but what are the consequences for the mean polygenic score of populations?
The mean polygenic score of the population is twice the sum of population allele frequencies weighted by effect sizes. If a trait is neutrally evolving, the loci contributing to its variation are just like other neutrally evolving loci, and so differences among populations in their mean polygenic score just reflect a weighted sum of neutral allele frequency differences. Naively, as trait-increasing alleles underlying a neutral trait are equally likely to drift up or down, one might think that over many loci we expect only a small mean difference between populations. However, the polygenic score is a sum rather than a mean, and so each locus we add into the score is like an additional step in the random walk that two populations take away from each other [99]. We expect the variance among populations, i.e. the average squared difference between population means and the global mean, to be 2V A F ST [11,14]. We first explore the differentiation of mean polygenic scores under a constant optimum and constant environment, and then relax these assumptions.

(a) Stabilizing selection to a constant optimum
Under a constant selective environment, stabilizing selection keeps population mean phenotypes close to their optimum in the face of genetic drift and mutation, such that the difference in mean phenotypes among populations with the same optimum should be minor relative to neutral expectations (even accounting for the lower genetic variance within populations under stabilizing selection; figure 4a). This reduction in the divergence of the mean phenotype between populations reflects the fact that if trait-increasing alleles accidentally drift up in frequency, thus pushing the population mean away from its selective optimum, traitdecreasing alleles are subject to directional selection in their favour (and vice versa).
To explore population differences in mean polygenic scores, we can use the polygenic score equivalent of Q ST / 2F ST [100,101], where Q ST measures the proportion of total variance in additive genetic values attributable to amongpopulation differences. Under neutrality, for strictly additive traits we expect Q ST to equal 2F ST estimated from neutral polymorphisms [14,15], and their ratio should be χ 2 -distributed with degrees of freedom equal to one fewer than the number of populations (an extension of the Lewontin-Krakauer test; [102]). The polygenic score analogue is Q X [60], where additive genetic values are substituted by polygenic scores. For our pair of populations, our Q X statistic is where Z is the mean polygenic score in the population denoted in the subscript. When population mean phenotypes are over-dispersed relative to neutral expectations, Q X will be larger than 1, potentially resulting in a statistically significant p-value under the null distribution. When population means are under-dispersed relative to neutral expectations, as we would expect for traits under stabilizing selection with the same optimum across populations, Q X will be much smaller than 1, and the p-values under the null will be large. Stabilizing selection to the same optimum tightly constrains the difference in mean additive genetic values between populations, i.e. the difference in mean polygenic scores using all of the variation, to be much lower than the  royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20200416 difference for neutral traits, even after standardizing for the lower overall levels of variation (figure 4a). This leads to a distribution of Q X that is skewed toward lower values than the neutral χ 2 -distribution (electronic supplementary material, figure S12). However, when we ascertain the top 5% of variance-contributing polymorphisms in population A, the level of standardized polygenic score differentiation under stabilizing selection becomes more similar to neutral levels (figure 4b), with a comparable level of false positive signals of adaptive differentiation as in the neutral case (electronic supplementary material, figure S12). While this result is more specific to our choice of mutation effect size distribution and strengths of stabilizing selection, we can generally conclude that for cases of stabilizing selection with limited ascertainment, estimates of standardized mean polygenic score differences will approach and perhaps exceed what we observe under neutrality. Note that under the neutral case, the distribution of standardized mean polygenic score differences stays about the same between ascertainment levels, whether ascertaining all polymorphisms from both populations or just the top 5% in population A. This is because, under neutrality, all alleles exhibit the same behaviour and do not evolve in coordination with one another. Thus stabilizing selection, combined with incomplete and asymmetric ascertainment, causes the inflation of estimated mean standardized population differences above that seen for the underlying genetic values.
Since under stabilizing selection to the same optimum we expect only small differences between populations in their mean genetic value, the disparity between the (true) differences in mean genetic values and (estimated) differences in mean polygenic scores represents the mean difference contributed by non-ascertained polymorphisms. Intuitively this occurs because if population B has a larger value of an ascertained polygenic score than population A, then the ascertained trait-increasing alleles have by chance drifted up in population B (compared to A) and this imbalance will have induced directional selection for the rest of the trait-increasing alleles to decrease their frequency to keep the population close to the optimum (under high polygenicity). In line with our expectations, we find that the mean polygenic score differences calculated from ascertained sites and mean polygenic score differences calculated from non-ascertained sites are close to opposite one another (figure 4b). The countervailing effect of the non-ascertained loci is noisier when stabilizing selection is weaker because with weaker selection, population means can drift further from their optimum. These results confirm that mean polygenic scores calculated from all polymorphisms (i.e. additive genetic values) should closely match between populations experiencing stronger and similar selection pressures and highlight how incomplete ascertainment can by chance generate misleading differences between them.

(b) Adapting to a changing optimum or environment
The combination of changes in the stabilizing selection regime and incomplete ascertainment of causal polymorphisms can also generate misleading differences in mean polygenic scores and signals of differential selection among populations. When the optimal phenotype shifts from its ancestral value equally in both descendant populations, this directional selection generates responses in the true mean genetic values in each population that track each other extremely closely. However, because our polygenic score based on ascertained polymorphisms explains less of the variance in population B than in population A, we capture less of its response to selection and thus artificially generate a difference between populations in their mean polygenic scores (figure 5a). This shift in the mean polygenic score between populations, as well as a decrease in the total variance explained, can push the distribution of Q X towards larger values than the neutral case (figure 5b). The chances of getting a p-value below a significance threshold of 0.05 are highest when stabilizing selection is strongest (w = 4; up to 30% chance depending on time since the optimum shift) but tend to be greater than neutrality for the strengths of selection we investigated (electronic supplementary material, figure S6). These signals of polygenic selection are consistent with the idea that Q X informs us about directional selection on polygenic scores, as selection has driven an increase in the polygenic score of A relative to B. However, they also highlight the very incomplete picture that we obtain about selection and genetic differences in the phenotype.

Discussion
Our work builds on population genetics theory of stabilizing selection to investigate why polygenic scores perform poorly in populations not represented in GWAS samples (unrepresented populations) and how they can be misleading about average genetic differences among populations. Specifically, we provide a theoretical foundation to understand how stabilizing selection on complex traits impacts the population differentiation of polygenic scores without a difference in environments or trait optima. Negative selection has been invoked as an explanation for low portability, consistent with large effect alleles discovered by GWAS being rare in the population [30,54]. Here we explore these predictions under a model of stabilizing selection that has the advantage of offering clear-cut connections between an allele's phenotypic and fitness effects. With this approach to modelling allele frequency dynamics, we obtained a detailed understanding of how a lack of population representation in GWAS can reduce the utility of polygenic scores in explaining phenotypic variation within and among populations. We discuss the dynamics contributing to low portability, considerations for those interested in explaining variation beyond the GWAS sample, and necessary discretion when using polygenic scores to understand group differences.

(a) Dynamics contributing to low portability
While stabilizing selection around the same optimum among populations reduces phenotypic differentiation relative to neutrality, somewhat counterintuitively it increases genetic differentiation at trait-influencing loci compared to neutral polymorphisms. Genetic differentiation increases because stabilizing selection drives the turnover of polymorphisms that contribute to trait variation within populations; thus polymorphisms that are ancestrally shared between populations are eventually lost and replaced by private mutations. The consequences of this differentiation at large-effect QTLs and a few loci have been explored by various authors [82,83,103], and here we investigated its implications on GWAS results and downstream analyses. Over time, the sets of polymorphisms that contribute substantial variance in each population diverge, such that stabilizing selection reduces the additive genic variance in one population that can be explained by polymorphisms in another. Thus, polygenic scores constructed from polymorphisms ascertained in a GWAS will have reduced prediction accuracy in unrepresented populations compared to represented ones. The prediction accuracy for unrepresented populations decreases with increasing strengths of stabilizing selection and time since the common ancestor with represented populations. Occasional fluctuations in the fitness optima would negligibly influence our portability results, because such directional selection would cause very minor shifts in frequency over the short timescales we consider [90]. The distribution of mutation effect sizes is critical to understanding the effects of stabilizing selection on the turnover of genetic variation among populations. With increasing weight towards larger mutation effects, stabilizing selection leads to a faster reduction in the shared additive genetic variance, and thus polygenic score portability. In addition, the decline in prediction accuracy for unrepresented populations relative to represented ones will depend on frequency-and power-based ascertainment schemes; since polymorphisms shared between populations will more likely be of small effect and common, the discovery of polymorphisms based on MAF will lead to a weaker decline than discovery based on variance contributed. Alternatively, if the trait was truly infinitesimal, stabilizing selection would not contribute to the loss of heterozygosity, such that the sharing of genetic variance would be well predicted by genetic drift, or neutral F ST . While the loci mapped by GWAS are often of small effect, and the traits highly polygenic, we know that their effects are not infinitesimal, as loci must make a reasonable contribution to the variance to be discovered by a GWAS (see [68] for a detailed population genetic model).
While we do not consider migration between represented and unrepresented populations here, the general decline in portability with increasing genetic differentiation between groups should hold, though the exact prediction will differ. Pleiotropy and GxE complicate the predictions of stabilizing selection for portability, which we discuss in electronic supplementary material, §1.3. Differences in the environment among populations can lead to changes in the effect sizes of alleles (GxE), which reduces the portability because (i) effect sizes will be only partially correlated between populations and (ii) stabilizing selection will purge more of the ancestral variation shared between populations since GxE increases the overall trait variance when the interacting environment changes. Pleiotropy, in our implementation, leads to the opposite effect. Holding the average strength of selection on all alleles constant, when an allele can independently affect multiple traits under stabilizing selection, the correlation between its effect size on the trait of interest and its selection coefficient weakens. Thus stabilizing selection purges less of the shared variation and causes a weaker reduction in portability.

(b) Considerations for applications of GWAS results across groups
Like much of the recent work on portability, our work emphasizes the reduction in prediction accuracy for populations not represented in GWAS [29,31,48,104,105]. A natural conclusion is that polygenic predictions that work well across populations will require GWAS across a range of diverse ancestries, in line with other calls to reduce Eurocentrism in GWAS [23,24,26]. In addition to changes in allele frequency and linkage disequilibrium, the relationship of polygenic scores to phenotypes will vary across populations due to variation in genetic effects, assortative mating, and differences in GxG and GxE. Indeed the prediction accuracy of polygenic scores for some traits was recently shown to be quite variable across different groups within an ancestry, suggesting that GxE and environmental variation are quite prevalent for some traits [45]. In addition, associated variants on the same haplotype were found to have differing effects in European-Americans and admixed African-Americans, suggesting that genetic or environmental interactions modify additive effect sizes across groups [47]. Thus, we caution that a better understanding of the portability of polygenic scores across populations also requires a stronger understanding of the causes of variation in prediction accuracy within populations.
With the rise of ancient DNA sequencing, GWAS from contemporary populations have also been used to construct polygenic scores for ancient individuals (reviewed in [94]). These scores have been used to provide a window into the phenotypic diversity of past populations [106][107][108] and to disentangle genetic and environmental contributors to temporal phenotypic variation (e.g. at the Neolithic royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20200416 transition; [109][110][111]). Such studies are most convincing when there are relevant phenotypic measurements on at least some ancient individuals and polygenic prediction accuracies can be judged. However, investigators will often not have this luxury, leaving unclear the insight these approaches can provide. Some studies using ancient DNA have identified reduced rates of disease alleles in the past compared to present-day populations. While we have focused here on quantitative traits, rather than disease traits, we caution that purifying selection against risk alleles will lead modern day populations to systematically underrepresent the diversity of disease alleles in the past [107,[112][113][114].
(c) Misinterpretations of group differences based on polygenic scores When using polygenic scores constructed from the ascertained set of polymorphisms, we increase the possibility of generating misleading signals of differentiation between populations. Stabilizing selection to a constant optimum alone does not generate more false signals of directional selection than what we expect under the neutral evolution of complex traits. However, the standardized difference between populations will be systematically over-estimated with ascertained polygenic scores. We see this result because with stabilizing selection, the unascertained portion of the variance tends to act exactly counter to the trend seen in the ascertained portion. Misleading signals of adaptive polygenic differentiation can also be generated when the stabilizing selection regime shifts in the same way in each population (such that there is still minor phenotypic differentiation between them) as the ascertained polymorphisms only capture a shift in the ascertainment population. This issue arises because stabilizing selection lowers the proportion of the additive genic variance explained in the unrepresented population and so we capture a lower proportion of that population's response to directional selection. Thus this issue of missing the parallel adaptive response across ancestries can be expected in many situations with imperfect portability. Such signals of polygenic adaptation can be useful as Q X is correctly detecting that directional selection has acted on the genetic variation along the branch leading to the represented population, but the signal is very open to the misinterpretation that the unrepresented population has not also responded to the same selection pressures. Many traits have likely experienced a mixture of stabilizing selection and bursts of directional selection across human history. Even if the populations share the same phenotypic optimum, if an environmental change systematically shifts one population away from this optimum, there would be directional polygenic adaptation to move that population back towards the optimum, resulting in a difference in polygenic scores but no difference in the mean phenotypes between populations [67]. Therefore, under current ascertainment schemes and pervasive stabilizing selection, the difference in mean polygenic scores among populations provides very unreliable information about the potential role of selection in generating phenotypic differences among populations.
A polygenic score is a prediction of how an individual's phenotype is expected to deviate away from the sample mean given their genotype at some (large) number of polymorphisms. Sometimes they are quoted as absolute values, but that is always based on an empirical phenotypic mean. The mean polygenic score in a population cannot inform us of the mean phenotype in a population, even if it was constructed from all polymorphisms and averaged across all individuals. This is because these scores are based on polymorphic genotypes alone whereas the absolute phenotype of an individual represents the end product of the entire genome and environment played out through a vast number of developmental processes. Yet it is easy to fall into the trap of believing that a difference in polygenic scores between groups is a strong statement about the difference in the mean phenotype of those groups, which also differ in a myriad of environmental and cultural factors (a related set of issues are present in epidemology; [115]). Thus while the field of human genetics is increasing its power to predict phenotypic variance among individuals within groups, it remains a poor guide to the causes of phenotypic variance among groups with greater environmental and genetic differentiation [116,117].
Data accessibility. Code used to generate simulations and process output can be found at https://github.com/SivanYair/SLiMsims_Stabili-zingSelection. The data are provided in electronic supplementary material [118].

Appendix A (a) Simulation details
In SLiM version 3.6 [119], we simulated stabilizing selection on additive traits and divergence between a pair of populations with the same optimal phenotype (set at zero). Each simulation consisted of a constant population size of 10 000 diploid individuals and genomes comprising 4000 1bp quantitative trait loci and 1000 1bp neutral loci, with free recombination between loci. Mutations arose at rate 2 × 10 −6 per base pair and for quantitative trait loci their effect size was randomly assigned from a normal distribution or a mixture of normal distributions. We used a Gaussian fitness function setting the fitness of an individual to where r = | |r| | is the phenotypic distance from the optimum of an individual with a vector of phenotypes r and w quantifies the width of the fitness peak. The form of Gaussian selection is consistent with any form of quadratic selection if the population mean is near the optimum (see §Aa(ii)).
royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20200416 We simulated quantitative traits under neutrality (w = ∞) and under varying widths of the fitness peak, w, in which larger values of w indicate weaker stabilizing selection. We chose strengths of stabilizing selection and an effect size distribution that correspond to observations in human populations (see § §Aa(i) and (ii) for details). We calculated an individual's fitness in each generation based on equation (A 1), where their phenotype was the sum of the effects of alleles they carried at both variable and fixed sites. We burned in each simulation for 60 000 (6N) generations before splitting the population into two for 3500 generations. We recorded mutation effect sizes and frequencies in each population every 100 generations for the first 500 generations, and then every 500 generations. To reduce computation times, groups of 10 simulation replicates shared the first 40 000 generations of their burn-in. The remaining 20 000 generations provided more than enough time for a complete turnover in the genetic basis of trait variation, such that simulations within a group were effectively independent. We simulated 100 or 200 replicates of each parameter combination.

(i) Choice of parameter values for strength of selection on the trait
We simulated under four different strengths of stabilizing selection (w), chosen to range from no selection (neutral trait) to the highest strength of stabilizing selection estimated from the UK BioBank [69]. These estimates were determined from regressions of relative fitness (based on estimates of lifetime reproductive success) on squared standard normal phenotypes, r 2 , where half of the coefficient of the quadratic term is the quadratic selection gradient, γ, in which negative values imply stabilizing selection, and the intercept is the mean fitness in the population, w. The authors report estimates of γ from their regressions, and so to get strengths of stabilizing selection (w) under our slightly different fitness model, we make the following approximations to fitness. At equilibrium, the distribution of phenotypes is closely centred around the optimal phenotype relative to the width of the fitness gradient, and therefore we assume that mean fitness w % 1. Additionally assuming that the mean phenotype in the population is the optimal phenotype, we can then approximate that the quadratic fitness function is equivalent to ours when γ = −1/w 2 , because γ/2r 2 + 1 ≈ exp (γr 2 /2). Based on incremental γ-values of 0, −0.02, −0.04 and −0.06 (the maximum estimated), we use the following values of the strength of stabilizing selection on the trait: w = ∞ (neutral trait), w = 7, w = 5 and w = 4.

(ii) Choice of effect size distribution
Effect sizes of new mutations were drawn from a normal distribution or mixture of normal distributions with mean zero. Provided that the strengths of selection we chose correspond to selection on standard normal phenotypes, we used a standard deviation that would result in a genetic variance close to 1 at equilibrium. However, we note that the size of the mutational target provides a simple inflation factor to the variance and cancels out when calculating the reduction in prediction accuracy; the mutation effect size distribution that we use is more important for determining the distribution of selection coefficients. See equation (A 13) for the solution to the equilibrium genic variance. Since the strength of selection helps determine the equilibrium genic variance, we used the weakest strength of selection to determine the variance of this effect size distribution (w = 7). For a Gaussian distribution of effect sizes, this leads to a solution for the standard deviation (σ a ) of 0.1. This distribution generates mutations that are mostly nearly neutral or weakly selected. To explore how different selection regimes lead to different declines in prediction accuracy, we also simulated mutation effect size distributions that involve a higher density of nearly neutral mutations or a higher density of strongly selected mutations. We used a Gaussian distribution with σ a = 0.05 for the former and a mixture of Gaussians for the latter, which we call the 'heavy tail' distribution. The heavy tail distribution generates mutations from a Gaussian distribution with σ a = 0.05 that has mean 0 with probability 0.65 and mean ±0.25 with probability 0.175. We chose these specifications for the heavy tail distribution in order to match the median absolute deviation of selection coefficients for the standard Gaussian with σ a = 0.1.

(iii) Extension: directional selection
We extended our baseline stabilizing selection scenario for directional selection by imposing a positive optimum shift of either one or two standard deviations of the phenotypic distribution. The optimum shifted in the same way 2500 generations after the pair of populations diverged. We started directional selection simulations from the states of the baseline scenario that were recorded at the time of the optimum shift. To calculate the extent of the optimum shift in each simulation, we averaged the phenotypic standard deviation of each population in that simulation. Simulation details and results for other extensions can be found in electronic supplementary material, §1.3.

(b) Relative prediction accuracy of polygenic scores
The prediction accuracy of a polygenic score is the proportion of the phenotypic variance explained by that polygenic score (the squared correlation R 2 ). We write out the correlation between an individual's polygenic score constructed from unlinked polymorphisms discovered by GWAS, S i , and the true additive genetic value, G i , in a particular target population. We assume that a GWAS discovers associations only at causal loci, and that at these loci with significant associations, it estimates their true effects. There are L s loci with significant associations in the study and a remaining L r that contribute to trait variation in a particular target population. Therefore, an individual's predicted and true genetic values are in which their true genetic value is a sum of their predicted genetic value and genetic value contributed by the remainder of sites contributing to trait variation. Because the predicted genetic value is a portion of the true genetic value, the covariance between them is the variance in predicted genetic values meaning that the correlation between the predicted and true additive genetic values is the square root of the proportion of the trait variance explained by polymorphisms ascertained by the GWAS, or that the variance in true genetic values explained by polygenic scores is the proportion of the genetic variance explained by the polymorphisms used to construct the polygenic scores. This value also represents the reduction in the prediction accuracy of polygenic scores compared to additive genetic values, when predicting true additive phenotypes (with contributions from both genetics and environment). We denote the additive environmental contribution to the phenotype in an individual as E i . The prediction accuracy of an individual's full additive phenotype (G i + E i ) when using polygenic scores is and when using additive genetic values is such that the reduction of prediction accuracy when using polygenic scores instead of additive genetic values is Thus r 2 S =r 2 G represents the prediction accuracy of polygenic scores for additive genetic values, as well as the reduction in prediction accuracy for full additive phenotypes due to incomplete ascertainment.

(i) Modification for gene-by-environment interactions
While we assume that the effects of a causal allele are perfectly estimated in the GWAS sample, these effects may differ in populations that experience different environments (GxE). Thus for populations not represented in the GWAS, the correlation between polygenic scores and true additive genetic values will be lower than without GxE. Imagine that we use the set of effect sizes from GWAS population A (a l,A ) to construct polygenic scores for population B where the true effect sizes (a l,B ) differ. The covariance of the polygenic scores with true additive genetic values is then CovðS, GÞ ¼ X l a l,A a l,B Varðg l Þ: ðA 8Þ Therefore, the correlation between polygenic scores and additive genetic values is

(c) Modelling details (i) Background
Assuming Gaussian stabilizing selection and that the population mean stays close to its optimum value, selection at an individual locus is well described by a model of underdominance where the per-generation change in an allele's frequency x can be described by its mean and variance EðDxÞ % sxð1 À xÞ x À 1 2 and VarðDxÞ % xð1 À xÞ 2N , where s = a 2 /w 2 is the selection coefficient, with a being the additive effect of the allele with frequency x relative to the alternate allele, and N is the diploid population size. This approximation further assumes that the phenotypic variance is much smaller than the width of the fitness peak under stabilizing selection (σ 2 ≪ w).
Assuming an infinite sites model, the mean time that a mutation spends in a certain frequency interval before fixation or loss (at a single site) multiplied by its population mutation rate is equivalent to the expected number of alleles segregating at those frequencies at a single time point [120]. The mutation rate of an allele with effect a is 2NUPrðaÞ, where U is the per-generation mutation rate for the entire mutational target and PrðaÞ is the proportion of mutations with effect a (with density f (a|σ a ), where σ a is the standard deviation of the mutation effect size distribution). The mean time τ spent in frequency interval (x, x + dx) for a new mutation with effect a can be solved using the Green's function (see eqn A 18 of [68]), such that tðx j a, wÞ ¼ ffiffiffiffiffiffiffiffi ffi Np 2s r exp (2Ns(x À 1 2 ) 2 )(erf(g) þ erf(g(1 À 2x)))(erf(g) À erf(g(1 À 1 N ))) erf(g)xð1 À xÞ , (ii) Results In our models, we focus on divergence without gene flow between a pair of populations, in which case they only share polymorphisms that arose in their common ancestor and were not lost in either population. To predict the loss of ancestral heterozygosity and probability that the descendant populations share polymorphisms, and thus describe the proportion of additive genic variance in one population explained by polymorphisms identified by a GWAS in the other, we consider the frequency trajectory of ancestrally segregating variants. To fully calculate these quantities, we would need the diffusion transition density with underdominant selection to calculate the distribution of the frequency in the present given the ancestral frequency, but we can build simple approximations of the combined effects of selection and drift. If we assume that the frequency of our allele x is close to 0 such that 1 − x ≈ 1, then we can approximate its expected per-generation change (equation (A 11)) as EðDxÞ % sx x À 1 2 , ðA 14Þ such that its deterministic frequency trajectory follows the logistic function from starting frequency x 0 , Eðx t j x 0 , a, wÞ % 1=2 1 À ( exp (st=2Þ Â ð1 À ð1=2x 0 ))) , where s ¼ a 2 w 2 : ðA 15Þ To consider the effects of genetic drift on the frequency trajectory, we assume that x t is normally distributed around its deterministic frequency trajectory as follows: gðx t j x 0 , a, wÞ N Eðx t j x 0 , a, wÞ, x 0 ð1 À x 0 Þ t 2N : ðA 16Þ Reduction in variance due to drift and selection. We can predict the additive genic variance contributed by all ancestral polymorphisms (anc) to the descendent (desc) after t generations by averaging over their effect sizes, starting frequencies and trajectories, EðV a½anc; desc ðt; w; s a ÞÞ ¼ 2NU The total reduction in the additive genetic variance contributed by ancestral polymorphisms after time t is thus given by E(V a[anc,desc] (t, w, σ a ))/E(V a (w, σ a )).
To obtain the reduction in heterozygosity used in figure 2a, we use the ratio of expected heterozygosities for the midpoints of the effect size bins evaluated. The equations we use have a similar structure to those where we predict the total variance explained by ancestral polymorphisms at time 0 and t, except that we do not integrate over all effect sizes and remove a 2 to calculate the heterozygosity instead of variance H t½anc, desc ðw, aÞ H 0 ðw, aÞ where H t refers to the heterozygosity at time t.

Variance in ancestor explained by polymorphisms in descendant.
To understand the ancestral variance attributable to presentday polymorphisms, we need to think of the properties of present-day alleles backward in time. For alleles with effect sizes large enough for selection to act strongly, selection (usually) constrains them from reaching appreciable frequencies and selection acts approximately like additive selection against the allele (eqn (A 14)). Conditional on the present-day frequency, the distribution of a deleterious, additive allele's trajectory backward in time is the same as the process forward in time, such that g(x t |x 0 , a, w) = g(x 0 |x t , a, w) [121]. Therefore, we can write the expected additive genic variance in the ancestral population explained by variation segregating in the descendant population, in which they have diverged for t generations, as EðV a½desc; anc ðt; w; s a ÞÞ ¼ 2NU royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20200416 which is identical to the additive genic variance contributed by all ancestral polymorphisms to the descendent (with just a flip in the subscripts; see equation (A 17)). Thus, the decay in the additive genic variance segregating in an ancestral population to the present day is the same as loss in prediction accuracy when using present-day polymorphisms to predict phenotypes in the ancestral population. Variance in population B explained by polymorphisms in population A. When the population in which we make predictions is contemporary with the population represented in the GWAS, we again consider the variance explained by all polymorphisms that contribute to the variance in the prediction population. In our model of population divergence, these sites that contribute to the variance in the prediction population are sites that were segregating in the common ancestor that remain segregating in both the GWAS and prediction population. We first account for the variance contributed by ancestrally segregating sites in the descendant population in which we make predictions (B), as explained in equation (A 17)), and then for the possibility that each site is segregating, and thus ascertained, in the population from which the GWAS sample was drawn (A), EðV a½A,B ða,t,w,s a ÞÞ ¼ 2NU where x 0 is the frequency of an allele in the common ancestor of populations A and B, x t,A and x t,B are its frequencies after time t in each of those descendant populations, respectively, and the probability that the ancestral variant is segregating in population A for ascertainment is given by Prð0 , x t,A , 1 j x 0 , a, wÞ ¼ ð 1Àð1=2NÞ 1=2N gðx t,A j x 0 , a, wÞd xt,A : ðA 21Þ