Improving polygenic prediction with genetically inferred ancestry

Summary Genome-wide association studies (GWASs) have demonstrated that most common diseases have a strong genetic component from many genetic variants each with a small effect size. GWAS summary statistics have allowed the construction of polygenic scores (PGSs) estimating part of the individual risk for common diseases. Here, we propose to improve PGS-based risk estimation by incorporating genetic ancestry derived from genome-wide genotyping data. Our method involves three cohorts: a base (or discovery) for association studies, a target for phenotype/risk prediction, and a map for ancestry mapping; successively, (1) it generates for each individual in the base and target cohorts a set of principal components based on the map cohort—called mapped PCs, (2) it associates in the base cohort the phenotype with the mapped-PCs, and (3) it uses the mapped PCs in the target cohort to generate a phenotypic predictor called the ancestry score. We evaluated the ancestry score by comparing a predictive model using a PGS with one combining a PGS and an ancestry score. First, we performed simulations and found that the ancestry score has a greater impact on traits that correlate with ancestry-specific variants. Second, we showed, using UK Biobank data, that the ancestry score improves genetic prediction for our nine phenotypes to very different degrees. Third, we performed simulations and found that the more heterogeneous the base and target cohorts, the more beneficial the ancestry score is. Finally, we validated our approach under realistic conditions with UK Biobank as the base cohort and Swiss individuals from the CoLaus|PsyCoLaus study as the target cohort.


Introduction
Most common diseases of major public-health importance have a complex genetic architecture. [1][2][3][4][5][6][7][8] A polygenic score (PGS) (sometimes called polygenic risk score) is the weighted sum of risk alleles carried by an individual. By predicting a fraction of the risk of developing a disease, the PGS allows individuals to be stratified into different risk categories, with potential clinical value. For example, people who have a PGS in the upper 0.5% range have a 5-fold increased risk of developing coronary heart disease compared with the remainder of the population. 9 Such information could help reduce the risk of developing diseases by encouraging a healthier lifestyle or through preventive pharmacological interventions. 10 PGSs alone are already equal to or better than clinical risk models for predicting prostate cancer, breast cancer, and type 1 diabetes in the general population. 6,11,12 If their clinical utility is demonstrated, PGSs could be integrated into clinical practice in the coming years. Therefore, the practical limitations of their application must be urgently addressed. 13,14 The phenotypic variance of a trait, V P , is defined as V P ¼ V G þ V E , with V G representing the genetic variance, and V E representing the environmental variance. In a multi-ancestry cohort, it is important to differentiate V G;Individual , the fraction of V G coming from variants shared between ancestries, and V G;Ancestry , the fraction of V G com-ing from variants that are ancestry specific. A fraction of V E is also likely to be associated with ancestry V E;ancestry . Thus, the phenotypic variance can be decomposed as follows: Genetic ancestry can be operationally defined as the systematic difference in allelic frequencies between subpopulations. It can be useful for biomedical applications 15 and is preferred to the concept of ethnicity. 16 For example, the current best integrative-risk model for coronary heart disease, ''QRISK2,'' includes a ''self-reported ethnicity'' risk parameter. Replacing it with genetic ancestry would (1) make the medical investigation more reliable by transforming it into a measurable biological variable detached from the notion of ethnicity, (2) improve the quality of the risk parameter by moving from a categorical to a continuous measure, and (3) allow the inclusion of individuals who do not know their ancestry or whose ancestry composition is uncertain.
Genetically, ancestry can be estimated via principalcomponent analysis (PCA) of genome-wide genotyping data to obtain the genetically inferred ancestry. The PCA produces a series of ordered axes, the first ones of which empirically correspond to the genetic ancestry. 17 By definition, because PCA is an unsupervised machinelearning method, it does not require labels, which avoids confusion with ethnicity. Because it produces continuous variables, it allows a precise definition of the different components of an individual's ancestry.
In genome-wide association studies (GWASs), such genetically inferred ancestry can be used to correct for population stratification. Specifically, the coordinates of the relevant principal-component axes can be included as covariates in the association models together with other demographic or clinical risk parameters, such as age or sex. Subpopulation-specific variants will be responsible for the V G;Ancestry component and will covary strongly with the covariate carrying ancestry. Therefore, the PGS calculated from the resulting GWAS summary statistics only accounts for the V G;Individual component.
We here propose a method that improves PGS-based prediction of complex diseases using additional information about genetic ancestry derived from PCA. Through both simulation and real-life testing in large cohorts, we show that the addition of an ancestry score (AS) based on principal components, which takes into account the V G;ancestry component of the phenotypic variance, improves the results of predictive models.

Materials and methods
Workflow for improving phenotypic prediction with genetic ancestry Association studies, such as a GWASs, are performed in what we call a base cohort. The ''individual risk scores'' or ''phenotypic pre-dictions'' are calculated on what we call a target cohort. When we run a PCA on the base cohort, the populations separated with each principal component (PC) is cohort specific. Therefore, if in the base cohort we run an associations study between a set of PCs and a phenotype, it is necessary to have a similar set of PC variables in the target cohort if we want to use it for phenotypic predictions. Unfortunately, the genetic information of the base cohort is often not accessible due to privacy issues, and thus direct transformation between the two PC spaces can be impractical. Here, we propose a method to circumvent this problem with an intermediate map cohort that must be accessible to all stakeholders and have sufficient diversity to separate the populations present in the base and target cohorts.
The different steps of the method can be visualized in Figure 1. The separation between the two investigating entities is represented by the dotted line, only the map-cohort in the middle is accessible to both. We describe the generic workflow in the following section and specify in parentheses the specific tools and parameters we decided to use.
(1) Mapping the base cohort to the map cohort. First, the base cohort must be mapped to the PC space of the map cohort.
To do this, the PC space of the map cohort is based on an optimal set of SNPs selected with respect to the base-and map-cohorts such that (1) only SNPs present in both are retained; (2) on the base-cohort side, a first filtering of SNPs to be kept in the analysis can be done, but we suggest not to discriminate on the basis of minor allele frequency (MAF) and linkage disequilibrium (LD) at this level (we filtered SNPs for imputation quality with INFO >0.9); on the map-cohort side, SNPs are filtered for MAF (we use MAF >0.01) and pruned (we use an LD threshold based on an R coefficient of r2 ¼ 0.5 and a maximum base pair in the sliding window of kb ¼ 250); 18 (3) the intersection of these two subsets produces a set S1 of s1 reference SNPs. In the following section, we will refer in the mathematical notations to the base, target, map, and calibration cohorts with superscript b/t/m/c. A PCA is run on G m , the genotype data matrix of the map cohort with s1 columns corresponding to the SNP subset S1, to produce the map-PC SNP-loadings L, a p times s1 matrix with p representing the number of PCs retained (we decided to retain 40 PCs). These map-PC SNP-loadings are used to produce set of PCs that we call mapped PCs for the base cohort such that where P b is the matrix of size m b times p and G b here is the corresponding subset S1 of reference SNPs from the genotype data matrix of size m b times s1.
(2) Associations analyses on the base cohort. Second, we perform two association studies for the phenotype of interest on the base cohort: a GWAS (Equation 3) and what we call a principal-component association study (PCAS; as Equation 4).
Let GWAS estimating the b b i effect size for each SNP i on the phenotype be a linear model such that where within the base cohort, Y b is the vector of phenotypes, G b is the full genotype data matrix of size m b times s, G b i is the column vector for the variant i, b b i is the corresponding effect size, C b is the matrix of covariates of size m b times c, with c as the number of covariates, and g is the corresponding column vector of effect sizes. The outcome of the GWAS is the ''GWAS summary statistics.'' In our case, after quality control for imputation and MAFs (such that MAF >1e -4 and INFO >0.8), we run the GWAS with BOLT-LMM. 19 It is worth noting that the GWAS method implemented in BOLT-LMM corrects for the population structure by using a mixed-effect model. 19 For the PCAS, we fitted a multiple linear regression model to estimate for each mapped PC i in p the effect size b b i on the phenotype. Specifically, the model takes the form where P b;i is the column vector for the mapped PC i of size m b , and b b i is the corresponding effect size. For convenience, we call the resulting trained model ''PCAS summary statistics.'' In our case, for both association studies, the covariate C includes age at recruitment, sex, and genotyping array. In the GWAS model, as many as the 40 PCs computed by the UK Biobank are used to adjust for population structure. We chose to include 40 PCs, as it was shown in earlier studies that a deep subpopulation structure was found in UKB. 20 (3) Defining PGS design. Third, to design the best PGS formula, a calibration cohort is needed to determine the SNPs to include in the PGS formula. Practically, it would be a small hold-out subset of the base cohort. We define the calibration cohort with the genotype data G c , mapped PC P c , and phenotype Y c .
In our case, the PGS formula is estimated with PRSice on the calibration cohort by a strategy based on variant clumping and p value thresholding to determine the optimal set of SNP S2 to construct the PGS. 21 In short, the clumping step takes advantage of the LD properties of the genome to construct groups of SNPs (or clusters) below a given maximum p value threshold. These LD properties must either be computed from the target cohort or drawn from an external reference panel. We chose to use 1KG as the reference panel, with a maximum window for each clump of 250 kb and an r 2 cutoff of 0.01. From each clump, one SNP is selected for inclusion in the final set used to construct the PGS. PRSice then determines the optimal p value threshold used to retain the set of SNPs to calculate the PGS. The summary of the PGS is described in Table S2.
(4) Calibration of the predictive model. Fourth, the calculation of phenotypic predictions involves a calibrated predictive model in which the coefficients of different parameters such as age, sexm and other relevant covariates, but also here the PGS (b PGS ) and the AS (b AS ), have been estimated.
To calibrate the predictive model, we use for a second time the calibration cohort.
The PGS is computed, such that for individual j is the vector of the estimated SNP effect sizes and where G c is the m c times s2 matrix of genotype values.
The AS is computed, such that for individual j where c b b is the vector of the estimated mapped-PC effect sizes and where P c is the m c times p matrix of mapped-PC values. Finally, we calibrate our predictive model, a multiple linear regression model where b PGS PGS c and b AS AS c are the PGS and AS predictors, respectively, with their corresponding effect sizes that we want to estimate.
(5) Computation of ASs and PGSs on the target. Fifth, similarly, PGS t and AS t are computed for each of the m t target samples with Equations 5 and 4 with P t and G t , the matrix of size m t times p, and m t times s2, respectively.
But for AS to get the mapped PC of the target cohort, Equation 2 is used beforehand on G t to produce P t . In that case, it is likely that some SNPs will be missing. A strategy based on the LD properties of the map cohort can be used to replace the missing SNPs with the best tagging SNPs (we take the SNP with the maximum INFO 3 r 2 ).
(6) Phenotypic predictions. Finally, the calibrated predictive model can be used to calculate the phenotypic predictions on the target samples, such that the phenotypic prediction for individual j is

Evaluation of AS
PGSs and ASs are two composite variables derived from the same genome-wide genotyping data. Therefore, in order to estimate the predictive value of the AS, it is necessary to evaluate it together with the PGS. Two predictive models are calibrated, one with PGS alone and the other with both the PGS and AS (Equation 7). The two predictive models are used on the target cohort to each generate a set of phenotypic predictions, b Y PGS and b Y PGSþAS (Equation 8). The resulting phenotypic-variance explained (PVE) is calculated by taking the R 2 regression score (coefficient of determination) between the prediction and the actual phenotypes.

Simulations
We generated genotyping and phenotypic data for a base cohort, a calibration cohort, and a target cohort, each based on a mixture of individuals from three genetically equidistant populations A, B, and C. The SNPs present in the simulated dataset can be non-specific (evenly distributed in the three populations), population specific (exclusive to one population), or stratified (present at different frequencies between populations). The alternative allele frequency for each population A PA F , A PB F , and A PC F was generated for each SNP with respect to its category. For population-specific SNPs, an allelic frequency is, with equal probability, either drawn from a uniform distribution R f Uð0:4; 1Þ or set to 1. The other two populations are attributed for the corresponding allele a frequency of 0. For stratified SNPs, we followed the model of Balding and Nichols 22 where a reference allele frequency ðR f ) is first drawn from a uniform distribution R f Uð0; 1Þ and is used to derive the alternate allele frequencies from a b distribution for each population according to their F st 23,24 such that with F st varying between 0.02 for two populations and 0.2 for the third. Finally, for a non-specific SNP, an allelic frequency is drawn from a uniform distribution Uð0; 1Þ and is used by all samples. Genotype data were generated based on A f such that G, a matrix of size m times s with m as the number of samples, and s, the number of SNPs, respectively, with genotypes 0 (homozygous for the reference allele), 1 (heterozygous), or 2 (homozygous for the alternative allele) were assigned for a given population P with probabil- Phenotypes Y were generated based on heritability h 2 ¼ 0:6 by adding two vector components of size m, the genetic basis Y g and the environmental basis Y e , such that, Y ¼ Y g þ Y e . Firstly, Y g was generated from a total of s SNPs selected randomly as 10% of SNPs per category and associated with an effect size b G i drawn from a Gaussian distribution b i N ðm ¼ 0; s 2 ¼ 1Þ, such that Y g ¼ G3b G with b G as the vector of the effect size. Secondly, based on the generated genetic component, m Yg and s 2 Yg , its corresponding mean and variance, are calculated to generate Y e . Finally, Y e was drawn from a normal distribution such as Y e N 0; s 2 We computed a PCA on the base-cohort-generated genotyping data (G b ) to produce the corresponding PC P b and PC loadings that we use to project the calibration and target cohorts into the same PC space and produce the mapped PCs P c=t . We performed a GWAS as given by Equation 3 with PC1 and PC2 as covariates C and b b G i as the estimate of the SNP i effect size. On the calibration cohort, we determined the best set of SNP S 0 to build the PGS with a simple implementation of the p value thresholding method of PRSice. Based on this set of SNP S 0 , we constructed the PGS of the calibration and target cohorts as given by Equation 5.
We performed a PCAS on the base cohort for the first two PCssufficient to differentiate 3 populations-following Equation 4 to estimate the effect sizes of the mapped PC b b P . We calculated the AS based on the top 2 associated PCs in the calibration and target cohorts following Equation 6.
Finally, we used the data from the calibration cohort to calibrate the risk model, following Equation 7. With the calibrated risk model, we performed a phenotypic prediction on the target cohort, so that the prediction for a sample j is given by Equation 8.
The code to reproduce the simulations is available on Github (see Web resources).
Genotyping and imputation of participants in the UKB study were fully described by Bycroft et al. 28,29 Briefly, samples were genotyped using the UK BiLEVE Axiom array (Affymetrix) (10.2%) or the UKB Axiom array (Applied Biosystems). Genotypes were phased using SHAPEIT3 with the 1KG phase 3 dataset as a reference and then imputed using the Haplotype Reference Consortium, 1KG phase 3, and UK10K data as reference panels. Participants were removed if their genetic sex did not match their reported sex, if they had a non-XX/XY sex-chromosome karyotype, or if they had excessive (>5%) missing genotyping.
Phenotypes were selected based on their high degree of differentiation between populations as characterized by the Global Distribution of Genetic Traits (GADGET). 30 Where necessary, phenotypes were normalized by rank-based inverse normal transformation and/or residualization by sex. The categorical phenotypes were turned into discrete variables. The details of the phenotypes are given in the Table 1. Summary statistics for the set of GWASs are available in the supplementary materials in Table S1. External target cohort: CoLaus|PsyCoLaus As external target cohort, we used the Cohorte lausannoise (CoLaus| PsyCoLaus), a population-based research study launched in 2003 in Lausanne, Switzerland, as an additional independent target cohort. It includes a total of 4,781 unrelated individuals of European ancestry after filtering out participants whose genetic sex did not match the reported sex or whose missing genotype rate was excessive (>5%). Participants ranged in age from 35 to 75 years at enrollment (mean 5 SD: 51.1 5 10.9), with 52.5% being female. 31 Genotype imputation was performed using two independent reference panels: the HRC reference panel and the merged 1000 Genomes phase 3 and UK10K reference panel. [32][33][34] Phasing and imputation were performed on the Sanger imputation service (https://imputation.sanger.ac.uk).
We used standing height, body mass index (BMI), and diastolic blood pressure as phenotypic outcomes. Phenotypes were normalized on the basis of the parameters used in the UKB phenotype normalization.

Mapped-PC characterization
As a first step, we mapped individuals from both the base and target cohorts to the PC space of the map cohort. We characterized the resulting mapped PCs by assessing the correspondence between the position of the mapped PCs from UKB or CoLaus|PsyCoLaus and the PCs from the 1KG samples and, second, by testing whether the mapped PCs and the ''regular'' PCs explain the phenotypic variance with similar magnitude.

Visualization of the mapped PCs
We first jointly plot the mapped PCs of UKB/CoLaus| PsyCoLaus and the corresponding PCs of the map cohort (1KG). In Figure 2A, PC1 and PC2 from UKB-all and 1KG overlap widely, showing the diversity present in both cohorts. Figure 2B shows the PC5 and PC7 for UKB-WBO and the European 1KG samples. PC5 and PC7 are the axes discriminating the most samples of different European ancestry. As expected, there is a significant overlap of UKB-WBO with the British cluster (Great Britain [GBR]). Similar to CoLaus| PsyCoLaus, in Figure 2C, PC1 and PC2 validate that Co-Laus|PsyCoLaus is exclusively composed of individuals of European ancestry. Specifically, in Figure 2D, PC5 and PC7 show that CoLaus|PsyCoLaus is broadly consistent with a Central European population, as expected for a Swiss cohort.

PVE by mapped PCs versus PCs
We then compared the PVE by the mapped PCs and by the regular PCs, which comes from a PCA done directly on the cohort (in CoLaus|PsyCoLaus) or provided (in UKB). 28,35 The regular PCs derived directly from the cohort can be considered as the upper bound when predicting a trait. We estimate the PVE by a multiple linear model based on the top 40 PCs of UKB (target-UKB-all) and CoLaus| PsyCoLaus with a 10-fold cross-validation. The results in Figures 3A and 3B show similar levels of PVE by the two approaches for the different phenotypes.

Evaluation of the AS
We evaluated the performance of the AS, the composite variable we created from mapped PCs that captures the association between phenotype and ancestry.

For different trans-ancestry genetic architectures
To characterize scenarios where the use of ASs would lead to a gain in predictive power, we simulated the scenarios based on the different genetic architectures as shown in Figure 4A. The equilateral triangle represents a map with one of the populations A, B, or C at each vertex. The circles represent the number of SNPs per category, which can be non-specific (green in the middle, evenly distributed), population specific (red on the vertices, population exclusive), or stratified (blue in between). For all scenarios, the total number of SNPs is kept at a constant total of 500. Based on a scenario, data are generated for three sample setsthe base, calibration, and target cohorts-with 20,000, 7,000, and 3,000 samples from populations A, B, and C, respectively, for a total sample size of 30,000. We repeat the simulations 50 times per scenario.
The results of the different scenarios are shown in Figure 4B. We see in Figure 4B that adding the AS increases the PVE and decreases the mean square error (MSE) exclusively when there are population-specific causal variants (scenarios 1 and 3) but not when the causal SNPs are just stratified (scenario 2). This is due to the fact that because population-specific SNPs are co-linear to PCs, the estimated SNP effect size is nullified. In contrast, causal stratified SNPs have their effect size corrected but are not discarded and end up included in the PGS. We conclude that, in real-world data, the increase in PVS due to the AS use could arise, along with covariant environmental factors, from similar population-specific SNPs. For generalization. To evaluate the transferability of AS, we simulated scenarios based on the different cohort composition as shown in Figure 5A. We run simulations with different population structures for, on the one hand, the base and calibration cohorts and, on the other hand, the target cohort. The simulated genetic data correspond to scenario 3 of the previous simulation section ( Figure 4A).
The results of the different scenarios are presented in Figure 5B. We generated scenarios 1 to 5 with increasing heterogeneity between the base/calibration cohorts and the target cohort. From scenarios 1 to 3, we go from a homogeneous situation in scenario 1 to proportions that are reversed in scenario 3. In all three scenarios, the models with PGSs alone see their mean PVE decrease and their variance as well as their MSE increase. Here, incorporating ASs to the model almost completely restores the mean EVP to its theoretical maximum of 0.6 and drastically reduces the variance. In scenario 4, the heterogeneity becomes extreme with a population C that corresponds to 3% of the base/calibration cohort and 97% of the target cohort. Here, the model with PGSs alone sees the variance of PVE explode while its mean continues to decrease. Even in this case, adding an AS to the model corrects for these effects strongly by decreasing the variance by a factor of 3 and bringing back the mean PVE above 0.4. In scenario 5, population A present in the base/calibration cohort is absent in the target cohort. In this case, the addition of an AS to the model almost completely corrects for cohort heterogeneity. Here, the model with only a PGS will estimate a shifted intercept on the calibration cohort relative to the optimal intercept for the target. This offset will be corrected by adding an AS to discriminate each population. We conclude that our method is effective to correct for cohort heterogeneity between the base/calibration cohorts and the target cohort. Application to real data On UKB. We then evaluated our method on UKB. Two sets of base and target cohorts were generated: one with samples of White British ancestry only-base/target-UKB-WBO-and the other with samples of all ancestriesbase/target-UKB-all. The split between the base and target cohorts was 90/10. The calibration cohort was generated multiple times based on a 10-times cross-validation from the target cohort.
Heritability is estimated by genome-based restricted maximum likelihood (GREML) method on the target cohort. 36,37 The results based on base/target-UKB-all are shown in Figure 6A. The addition of the AS parameter increases the PVE for all phenotypes, with a variable magnitude.
There is a small increase for diastolic blood pressure from 0.027 to 0.028, BMI from 0.075 to 0.079, and baldness from 0.120 to 0.128 and a larger increase for age of first menarche from 0.031 to 0.037, height from 0.219 to 0.271, and age at menopause from 0.024 to 0.039; it more than doubles for heel bone mineral density from 0.040 to 0.090 and education attainment from 0.012 to 0.054; and it is exacerbated for skin color from 0.023 to 0.654, where most of the PVE comes from the AS.
Results based on base/target-UKB-WBO in Figure 6B were used as a control to show that even when there is a population with narrow ancestry, the AS does not affect A B predictions. It was not significantly associated except for skin color, where it showed a slight gain. Note that our exploration on the UKB data does not exploit the gain due to heterogeneity between cohorts as seen in the simulations (we are in a similar case to scenario 1).
On CoLaus|PsyCoLaus. We also evaluated our method using CoLaus|PsyCoLaus as an external target cohort. From UKB, two sets of base and calibration cohorts were generated, base/calibration-UKB-WBO and base/calibration-UKB-all, based on a 90/10 split.
Results based on base/target-UKB-all are shown in Figure 6C. For height, we observe a gain in PVE by adding the AS into the model of 15.6% (from 0.212 to 0.245). For BMI and diastolic blood pressure, as discussed in the previous sections (see Figures 3B and 6A), PCs and thus the AS do not explain the phenotypic variance. As a control, results based on base/target-UKB-WBO are shown in Figure 6B. As expected due to the homogeneous ancestry, we observe no gain or loss when adding the AS.
Note that with base/target-UKB-WBO, the PGS alone for height remains suboptimal, with a PVE of 0.223 compared with the model with the PGS and AS based on base/target-UKB-all.
These results show the generalizability of our method to an independent cohort. Of note, the best results were obtained by including all UKB individuals in the base cohort, even if the CoLaus|PsyCoLaus cohort consists exclusively of individuals of European ancestry.

Discussion
Here, we propose a method allowing for the inclusion of an ancestry parameter derived from genetic data into phenotype prediction scores without having to manually categorize individuals. We show that the inclusion of an AS improves prediction, especially for admixed populations. In addition, because our approach emphasizes the inclusion of all individuals, it tends to increase the statistical power and the production of summary statistics that generalize better to diverse populations. Our method could therefore promote a much-needed increase in population diversity for human genomic research. So far, human genomic research has been disproportionately performed in populations of European ancestry, which might cause genomic-based medicine to exacerbate health disparities. 38 According to the GWAS catalog, although people of European descent make up only 16% of the world's population, they represent 79% of GWAS participants. 38,39 As a consequence, the predictive power of currently developed PGSs is lower in most underrepresented populations than in Europeans. 40 In addition, it has been reported that multi-ethnic GWASs or meta-GWASs increase statistical power for variants widely shared across populations, 41,42 which should lead to greater inclusion in future GWASs. As multi-ethnic GWASs become more common, the need to report the ancestry effect for the development of pan-ASs will increase. 43 As a result of our simulations, we can hypothesize that the traits that show a higher gain in prediction upon inclusion of an AS are the ones that are influenced by more population-specific alleles with a phenotype that is also differentiated between populations. These population-specific alleles-fixed in one ancestry, absent in others-will only be detected as variables in a multi-ethnic GWAS and are perfectly correlated with ancestry. In such a context, they can be thought of as perfectly correlated variants that are distributed across the genome independently of their physical distances. In a GWAS, such variants are usually discarded because of their collinearity with a covariate controlling for ancestry. As a result, the corresponding GWAS summary statistics miss them entirely, which precludes their inclusion in downstream polygenic risk prediction. Here, we correct this bias by adding a separate ancestry term in the predictive risk model.
There are still limitations when doing predictions in a multi-ethnic setting. The PGS-based prediction will remain limited for trans-ethnic cases due to (1) variants in the target that are absent in the base GWAS, (2) different effect sizes for the same causal variant between two populations due to pleiotropic effect, and (3) unmatched tagging SNPs due to different LD structures between populations.
Non-genetic factors influence phenotypes in complex ways and must be carefully considered in genetic studies to avoid confounding and false genetic associations. In particular, behavioral phenotypes are correlated with socio-economic and cultural factors, including racial and ethnic categories 44,45 that may be associated with ancestry. Consequently, the common GWAS assumption that environmental factors affect samples randomly does not hold, and the AS will be influenced by non-genetic factors. When studying such phenotypes, the investigator should not draw conclusions based solely on statistical associations between PCs and phenotype, as the socioeconomic factors that are captured could be misinterpreted as genetic or ancestry-related factors. Furthermore, because the magnitude and direction of associations between socioeconomic determinants and cultural background are society specific, the environmental effect embedded in the AS is less likely to be portable to a distant cohort.
Since genetic ancestry and ethnicity are closely related, it is necessary to draw the line between these two concepts. Ethnicity or race is a social construct that classifies people independently of the genetic component. Its meaning changes over time and between societies. Genetic ancestry-or genetically inferred ancestry-can be operationally defined as the systematic difference in allelic frequencies between subpopulations. Until now, reporting of ancestry has been mostly based on the Self-Identification of Race and Ethnicity (SIRE) method, which has important limitations, 46 such as its categorical rather than continuous nature; its overlap with the notion of race, whose definition fluctuates over time and depends on societies; it does not offer a simple solution for people of mixed ancestry, who are expected to become a larger share of the population in globalized societies; 47 and it does not allow for the classification of people who are unaware of their ancestry. To finely characterize genetic ancestry, we propose the use of mapped PCs, which can be easily derived from a reference map cohort such as the publicly available 1KG, to project any individual on a shared PC space. In addition to association studies, mapped PCs can also be shared to characterize ancestry in a discovery cohort as a new type of shareable metadata. Such data could, for example, be useful for assessing the compatibility between available GWAS summary statistics and a targeted individual for whom one wishes to calculate a PGS.
Here, we have shown that clinical risk models can benefit from a risk parameter, the AS, derived from mapped PCs, which allows each individual to be fitted to its phenotypic baseline value based on ancestry. The use of this fitting parameter makes it possible to directly apply the predictive models to individuals from underrepresented populations and of mixed ancestry.
The ClinGen Complex Disease Working Group has defined a standard method for reporting risk models based on PGSs 48 in collaboration with the Polygenic Score Catalog. The Polygenic Score Catalog is a rapidly growing repository for GWAS summary statistics. 49 This repository could host additional data useful for calculating risk parameters, such as PCAS summary statistics with the corre-sponding mapped-PC loadings. Today, researchers are strongly encouraged to share GWAS summary statistics to enable meta-analyses and speed up research. Similarly, we encourage researchers to share data to enable AS calculations.
We are at a pivotal moment for genomic-based medicine: large-scale personal data can begin to be used effectively to develop more individualized approaches to disease prevention and treatment. Ensuring equitable access to new approaches and technology is a major responsibility for the biomedical research community. We have introduced a method that aims to foster predictive models based on PGS and promotes the inclusion of more diverse populations in GWAS.

Data and code availability
The code that generated the simulated data during this study is the AS simulator and is available at Github (https://github.com/ onaret/AS_simulator). The UKB and CoLaus|PsyCoLaus data supporting the current study have not been deposited in a public repository due to their sensitive nature.