Impact of genetic similarity on imputation accuracy

Genotype imputation is a common technique in genetic research. Genetic similarity between the target population and the reference dataset is crucial for high-quality results. Although several reference panels are available, it is often not clear which one is optimal for a particular target dataset to be imputed. Maximizing genetic similarity between the study sample and the candidate reference panels may be the most straightforward method for selecting the genetically best-matched reference. However, the impact of genetic similarity on imputation accuracy has not yet been studied in detail. We performed a simulation study in 20 ethnic groups obtained from POPRES. High-quality SNPs were masked and re-imputed with MaCH, MaCH-minimac and IMPUTE2 using four different HapMap reference panels (CEU, CHB-JPT, MEX and YRI). Imputation accuracy was assessed by different statistics. Genetic similarity between ethnic groups and reference populations was measured by F-statistics (FST), originally proposed by Wright, and G-statistics (GST), introduced by Nei and others. To assess the predictive power of these measures regarding imputation accuracy, we analysed the relations between them and the corresponding imputation accuracy scores. We found that population genetic distances between homogeneous reference and target populations were strongly linearly correlated with the resulting imputation accuracies, irrespective of the distance measure, imputation accuracy measure, degree of missingness and imputation software used. A possible exception was the African population. Using GST or FST-related measures to predict the optimal reference panel is highly recommended for imputation frameworks that rely on a specific reference. A cut-off of GST < 0.01 is recommended to achieve good imputation results for high-frequency variants and small data sets. 
The linear relationship is less pronounced for low-frequency variants for which we also observed a dependence of imputation accuracy on the number of polymorphic sites in the reference. We also show that the software specific measures MaCH-Rsq and IMPUTE-info must be interpreted with caution if the genetic distance of target and reference population is high.


Introduction
Genotype imputation is a common technique applied in the context of genome wide association (GWA) analysis. Typically, a set of densely genotyped samples is used as references to infer a large set of un-typed or missing markers in the target population. Although one has to deal with the uncertainty of genotypes derived by imputation, this procedure is nowadays standard since it makes large-scale genome-wide investigations feasible and cost effective. Furthermore, it enables meta-analysis by combining datasets genotyped at different platforms (e.g. Illumina versus Affymetrix arrays) [1]. It is also believed that genotype imputation improves the statistical power of genome wide association studies (GWAS) [2].
Moreover, imputation plays an essential role in the analysis of sequencing data [3]. Although a dramatic cost reduction of next-generation sequencing technology has been achieved, whole-genome sequencing of large study samples is still unaffordable. A way out might be to sequence a subset of individuals, which could then serve as an additional reference for imputation [4]. Strategies for selecting the individuals to be sequenced have been suggested recently [5]. These strategies consider genetic similarities between the study population, the subsets to be sequenced and the reference panel.
A number of different approaches have been suggested for building publicly available reference panels that can maximize imputation accuracy. Some imputation software like IMPUTE2 [6] and MaCH-Admix [7] can exploit cosmopolitan references in order to optimize sequence similarity locally. However, other popular imputation frameworks (e.g. MaCH [8] and MaCH-minimac [9]) still rely on pre-selection of reference panels that are most closely matched with the ancestry of the study population. For example, CEU is frequently used as imputation reference panel for European and European American samples, while CHB and JPT were chosen to impute samples from East Asian populations [4].
Genetic distances like different F ST measures [10][11][12][13][14][15][16] or principal component analysis [17] have been proposed to determine the genetic similarity between target and reference datasets. F-statistics were originally proposed by Wright to assess the genetic structure of populations [10,12]. Accordingly, F ST measures were constructed to evaluate the genetic distance between (homogeneous) populations or, in other words, the degree of genetic variance explained by ethnic sub-entities. Since the first introduction of Wright's F ST , a large variety of other F ST -related measures and corresponding estimators have been proposed [11][12][13][14][15][16][17][18]. Nei [13,14,19] introduced the measure G ST , which is also frequently used for this purpose [15]. A few studies revealed that F ST -like measures calculated between target and reference populations correlate with imputation accuracy [20,21].
However, it is still unclear how a reference panel with low genetic similarity affects the imputation accuracy. So far, no exact strategy (e.g. cut-offs for F ST -related measures) has been proposed that could help to select a well-suited reference panel. To the best of our knowledge, there is no research on the relation between Nei's G ST and imputation accuracy. Therefore, in the present paper, we performed a simulation study to investigate the relationship between G ST and other F ST -like measures and the imputation accuracy obtained by three imputation frameworks: MaCH, MaCH-minimac and IMPUTE2. All these frameworks can be run with specific rather than cosmopolitan reference panels. Finally, we investigated the impact of missingness and frequency of variants on this relationship. All analyses were performed on the basis of the publicly available dataset of POPRES [22].

POPRES project
POPRES is a project fostering large Population Reference Samples of different ethnic origins [22]. The original POPRES project contains nearly 5,000 individuals of African-American, East Asian, South Asian, Mexican and European origin. Individuals included in the POPRES study are collected from different study groups all over the world. POPRES performed Genome-wide genotyping of these individuals on the Affymetrix (Mountain View, CA) GeneChip 500 K Array set with the published protocol for 96-well-plate format. Sample collection and methods for POPRES are described elsewhere [22]. The datasets used for the present analyses were obtained from dbGaP [23] through dbGaP accession number phs000145.v4.p2.

Datasets
We considered chromosome 22 from the POPRES dataset for our research. This dataset originally consisted of 5,637 SNPs measured in individuals from 35 different populations. To avoid biases due to different sample sizes and to include as many populations as possible in our analysis, we considered an equal number of individuals for each sub-population (N = 40). If more than 40 individuals were available, a random sample of N = 40 was drawn. Population groups with fewer than 40 members were discarded, resulting in a total of 20 different ethnic subsets, namely 15 populations of Caucasian origin: Australian, Canadian, German, French, Swiss-French, Swiss-German, Swiss, Italian, Spanish, Irish, British, Belgian, Portuguese, former Yugoslavian and a mixed group of east European origin (a mixture of people from the Czech Republic, Hungary and Poland); two populations of South Asian origin: Indians and Punjabis; one East Asian population: Japanese; one Mexican population; and finally, a mixed population of African-Americans (AfAm). Study populations which do not match very closely with the available HapMap references CEU, JPT, CHB, YRI (see below) were included to indicate the impact of imperfect reference panels on the target populations. This might apply to the following populations: Indian, Punjabi, Yugoslavian, East-EU, Portuguese and African-American. In the following, European target populations are abbreviated by EU, East European populations (eastEU and Yugoslavia) by EEU, South Asians (Punjabi and Indian) by SASI, Japanese by Jap, Mexican by MEX, South European (Italian, Portuguese) by SEU and African-Americans by AfAm.
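The equal-size sampling described above can be sketched as follows. This is a minimal illustration and not the code used in the study; the function name, the toy population labels and the fixed random seed are assumptions made for the example.

```python
import random

def subsample_populations(pop_to_ids, n=40, seed=0):
    """Keep populations with at least n individuals and draw n at random from each.

    pop_to_ids: dict mapping a population label to a list of individual IDs.
    Populations with fewer than n individuals are discarded.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    selected = {}
    for pop, ids in pop_to_ids.items():
        if len(ids) < n:
            continue  # too few members: population is dropped
        if len(ids) > n:
            selected[pop] = sorted(rng.sample(ids, n))
        else:
            selected[pop] = sorted(ids)
    return selected

# Toy input (hypothetical IDs, not POPRES data)
toy = {
    "German": [f"G{i}" for i in range(55)],
    "Swiss": [f"S{i}" for i in range(40)],
    "Small": [f"X{i}" for i in range(12)],
}
subsets = subsample_populations(toy)
```

In this toy input, the group with 12 members is dropped and the other two are reduced to 40 individuals each.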

Reference Panel
1000 Genomes datasets were based on low-depth whole-genome sequencing data and are generally considered to have lower accuracy than HapMap data. Thus, we considered HapMap3 [24] reference panels (NCBI Build 36) to impute the above-mentioned populations. Four different pre-formatted reference panels (CEU, YRI, MEX and JPT + CHB) provided by the MaCH developers [25] and IMPUTE developers [26] were considered. In a full-factorial design, we imputed each of our target populations with each of these reference panels.

Strand verification of SNPs
Genomic assembly of the original POPRES data was identical to Affymetrix release 25 NSP25 and STY25 and the corresponding rs-IDs were identified by NCBI build "b36" with UCSC version "hg18". Strand alignment between study sample and reference data was performed using fcGENE [27] and PLINK [28]. SNPs with ambiguous strands and SNPs which could not be found in the HapMap3 reference panel were removed. In total 1,014 SNPs could not be matched to HapMap3 reference panels and were excluded. 4,623 SNPs overlapping with HapMap reference panel remained for further analyses.

Selection of good-quality (GQ) SNPs
Good-quality (GQ) SNPs were selected with stringent filtering criteria of genotype quality. These GQ SNPs were then assumed to express true genotypes for our experimental study. In analogy to our previous research [29], we masked these SNPs and re-imputed them to evaluate the imputation accuracy as explained in the next section. More precisely, we compared the posterior genotype probability distributions produced by the imputation software with the corresponding true genotypes. To select GQ SNPs, we applied the following quality criteria: average call rate (CR averaged over all populations ≥ 95 %), average minor allele frequency (MAF averaged over all populations ≥ 0.1) and p-value of the stratified Hardy-Weinberg equilibrium test (p(HWE) ≥ 10e-2). Since the samples were from multiple ethnic groups, we used the exact stratified test of HWE [30]. A total of 457 SNPs passed these quality criteria.
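The three filtering criteria can be sketched as a simple predicate over per-SNP summary statistics. This is a minimal illustration under stated assumptions: the input layout and the function name are hypothetical, the stratified HWE p-value is assumed to be precomputed elsewhere, and the HWE threshold is exposed as a parameter whose default simply copies the value as written in the text.

```python
from statistics import mean

def select_gq_snps(snp_stats, min_call_rate=0.95, min_maf=0.1, min_hwe_p=10e-2):
    """Return the SNP IDs that pass all three quality criteria.

    snp_stats: dict mapping SNP ID -> dict with
      "call_rate": list of per-population call rates,
      "maf":       list of per-population minor allele frequencies,
      "hwe_p":     p-value of a stratified HWE test (computed elsewhere).
    Call rate and MAF are averaged over populations before thresholding.
    """
    keep = []
    for snp, s in snp_stats.items():
        if (mean(s["call_rate"]) >= min_call_rate
                and mean(s["maf"]) >= min_maf
                and s["hwe_p"] >= min_hwe_p):
            keep.append(snp)
    return keep

# Toy statistics (hypothetical values)
toy_stats = {
    "rs_a": {"call_rate": [0.99, 0.97], "maf": [0.20, 0.30], "hwe_p": 0.60},
    "rs_b": {"call_rate": [0.99, 0.85], "maf": [0.20, 0.30], "hwe_p": 0.60},  # low call rate
    "rs_c": {"call_rate": [0.99, 0.98], "maf": [0.02, 0.04], "hwe_p": 0.60},  # low MAF
    "rs_d": {"call_rate": [0.99, 0.98], "maf": [0.20, 0.30], "hwe_p": 0.001},  # fails HWE
}
gq = select_gq_snps(toy_stats)
```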

Masking Process
We performed the masking process in two phases. First, we masked all good-quality SNPs and imputed them with the imputation frameworks MaCH, MaCH-minimac and IMPUTE2. We also considered additional scenarios where we masked only 70 % and 50 % of the previously selected good-quality SNPs. This type of masking was performed in such a way that all SNPs masked at a lower percentage of missingness were also masked at the higher percentages of missingness. The first type of masking (100 %) was used to investigate the relationships between G ST and F ST -related scores and the corresponding imputation accuracy. The second type of masking was used to study the impact of different degrees of missingness on these relations. For this purpose, we only compared the 50 % of GQ SNPs which were masked in all three missingness scenarios, to avoid bias introduced by SNP selection.
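One simple way to obtain nested masking sets of this kind is to shuffle the GQ SNPs once and take prefixes of increasing length. The sketch below illustrates the idea only; the function name, the seed and the placeholder SNP IDs are assumptions, and the study's actual selection procedure may have differed.

```python
import random

def nested_masks(gq_snps, fractions=(0.5, 0.7, 1.0), seed=0):
    """Return {fraction: set of masked SNPs} with nested sets.

    The SNPs are shuffled once; each mask is a prefix of that order, so every
    SNP masked at a lower fraction is also masked at all higher fractions.
    """
    rng = random.Random(seed)
    order = list(gq_snps)
    rng.shuffle(order)
    return {f: set(order[:round(f * len(order))]) for f in fractions}

# 457 placeholder SNP IDs, matching the number of GQ SNPs reported in the text
snps = [f"snp{i}" for i in range(457)]
masks = nested_masks(snps)

# SNPs masked in all three scenarios: these are the ones compared across scenarios
common = masks[0.5] & masks[0.7] & masks[1.0]
```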

Imputation
Imputations were performed separately for each of the previously mentioned sets of populations combined with each of the four reference panels. As suggested by the MaCH developers, imputation with this software was performed in two steps. In the first step, the imputation error rate and the recombination rate were estimated. These two model parameters were determined by running the "greedy" algorithm for 100 iterations and were used in the second step to determine the transition probabilities of the underlying Hidden Markov Model [8]. In the second step, the most likely genotype probability distributions of each genotype in each individual and the imputation quality measured by the software-specific Rsq score were determined. Commands used for MaCH imputation are provided in Additional file 1. The relative performance of imputation methods differs greatly as a function of sample size, marker density and parameters of the algorithm such as the number of EM iterations. Therefore, the same standard parameter settings were used for each imputation process.
Imputation with MaCH-minimac was also performed in two steps. In the first step, MaCH was used to predict the haplotypes of the study data sets, and then, minimac was used to calculate the posterior probabilities of the genotypes using these haplotypes.
As suggested by the software developer, imputation with IMPUTE2 was performed in a segmented way by defining different genomic intervals of approximately 5 Mb in size. An internal buffer region of 250 kb on both sides of the analysis interval was used to avoid margin effects of chromosome segmentation.
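The segmentation can be illustrated by generating the analysis intervals and the corresponding buffered windows. This is a sketch only: the helper names and the example coordinates are hypothetical, and in practice the buffer is handled by the imputation software itself rather than computed by the user.

```python
def chunk_intervals(start, end, chunk=5_000_000):
    """Split [start, end] (base-pair positions, inclusive) into consecutive intervals."""
    out = []
    lo = start
    while lo <= end:
        hi = min(lo + chunk - 1, end)
        out.append((lo, hi))
        lo = hi + 1
    return out

def with_buffer(interval, buffer=250_000, lower=1):
    """Extend an analysis interval by a buffer on both sides (not below `lower`)."""
    lo, hi = interval
    return (max(lower, lo - buffer), hi + buffer)

# Hypothetical 12 Mb region
intervals = chunk_intervals(1, 12_000_000)
buffered = [with_buffer(iv) for iv in intervals]
```

Only genotypes inside the analysis interval are reported; the buffer merely supplies flanking haplotype information at the interval edges.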
After imputation, we compared the estimated posterior distribution with the measured genotypes as explained below. Considering four reference panels, 20 target populations, two missing scenarios and three software packages, a total of 480 imputations were performed.

Assessment of imputation accuracy
A common strategy for determining imputation performance is to compare true genotypes (genotypes measured by a technique with high confidence or consensus genotypes) with corresponding imputation results. Here, we directly compared the posterior distributions of our re-imputed GQ-SNPs with corresponding measured genotypes applying our recently proposed Hellinger and SEN score [31]. Both measures are platform independent. While Hellinger score measures the distance of imputed and measured genotype probability distribution, SEN score is maximal if their expectations are identical. Thus, SEN essentially compares gene doses. Cut-offs of 0.95 for SEN score and 0.45 for Hellinger score respectively were considered as indicators of good imputation accuracy (for motivation, see also Fig. 2 below). We also analysed imputation accuracy using the software specific measures MaCH-Rsq and IMPUTE-info determined during the imputation process. MaCH-Rsq measure is basically defined as the ratio of the empirically observed variance of the allele dosage to the expected binomial variance under Hardy-Weinberg equilibrium [32]. Similarly, IMPUTE-info score is the relative statistical information about the SNP allele frequency derived from the imputed data [33]. These two software-specific measures are defined at SNP-level and are useful to assess imputation quality of SNPs for which no measurements are available. These scores are widely applied to remove SNPs with low imputation accuracy during post-imputation quality control.
While Hellinger and SEN scores assess agreement of imputed and observed genotypes individually, the software specific measures assess imputation quality for entire SNPs, i.e. cannot be interpreted for single genotypes.
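The difference between genotype-level and SNP-level measures can be made concrete with a small sketch. It rests on two assumptions that are not taken from the text: the Hellinger score is written here as one minus the Hellinger distance between the imputed posterior distribution and a point mass at the measured genotype (the exact definition is given in [31]), and the Rsq-like quantity follows the verbal description above (variance of the allele dosage divided by the binomial variance 2p(1-p)). All function names are hypothetical.

```python
import math

def hellinger_score(posterior, true_genotype):
    """Genotype-level score: 1 - Hellinger distance to the measured genotype.

    posterior: (P(g=0), P(g=1), P(g=2)); true_genotype: 0, 1 or 2.
    Against a point mass, the Bhattacharyya coefficient is sqrt(P(true genotype)).
    """
    bc = math.sqrt(posterior[true_genotype])
    return 1.0 - math.sqrt(max(0.0, 1.0 - bc))

def fraction_well_imputed(posteriors, truths, cutoff=0.45):
    """Share of genotypes whose Hellinger score reaches the cut-off."""
    scores = [hellinger_score(p, g) for p, g in zip(posteriors, truths)]
    return sum(s >= cutoff for s in scores) / len(scores)

def rsq_like(posteriors):
    """SNP-level ratio of dosage variance to the expected variance 2p(1-p)."""
    dosages = [p[1] + 2.0 * p[2] for p in posteriors]
    n = len(dosages)
    mean_d = sum(dosages) / n
    var_d = sum((d - mean_d) ** 2 for d in dosages) / n
    p_hat = mean_d / 2.0
    expected = 2.0 * p_hat * (1.0 - p_hat)
    return var_d / expected if expected > 0 else float("nan")

# Toy posteriors for one SNP in four individuals (hypothetical values)
post = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
true = [0, 1, 2, 1]
```

Under this assumed definition, a posterior probability of 0.5 for the measured genotype gives a score of about 0.46, which is in line with the statement that a cut-off of 0.45 largely separates correct from incorrect best-guess genotypes.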

Estimation of G ST and other F ST -related measures
Nei's $G_{ST}$ is defined as the ratio of the average gene diversity between subpopulations to the gene diversity of the total pooled population:
$$G_{ST} = \frac{D'_{ST}}{H_T}$$
where $H_T$ is the heterozygosity expected under Hardy-Weinberg equilibrium for the total pooled population and $D'_{ST}$ is the average gene diversity between the subpopulations [13,14,19]. For two-allelic markers, Bhatia et al. [16] recommended the estimator of $G_{ST}$ at any particular $k$th marker as:
$$\hat{G}_{ST}^{(k)} = \frac{(p_{1k} - p_{2k})^2}{2\,\bar{p}_k\,(1 - \bar{p}_k)}, \qquad \bar{p}_k = \frac{p_{1k} + p_{2k}}{2}$$
where $p_{1k}$ and $p_{2k}$ are the allele frequencies of the reference allele at the $k$th marker in the two populations. To calculate $G_{ST}$ between two population groups genotyped at $N$ markers, one can use the formula:
$$\hat{G}_{ST} = \frac{\sum_{k=1}^{N} (p_{1k} - p_{2k})^2}{\sum_{k=1}^{N} 2\,\bar{p}_k\,(1 - \bar{p}_k)}$$
Computation of pair-wise $G_{ST}$ between any two populations is implemented in the most recent version of fcGENE [27]. Small values of $G_{ST}$ indicate that allele frequencies in the two populations are similar, i.e. the genetic distance between them is small.
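A pairwise G ST of this kind can be computed in a few lines. The sketch assumes the per-marker form (p1 - p2)^2 / (2 p_avg (1 - p_avg)) with p_avg = (p1 + p2)/2 and combines markers as a ratio of sums; the function name and the toy frequencies are illustrative and this is not the fcGENE implementation.

```python
def gst_pairwise(p1, p2):
    """Pairwise G_ST from reference-allele frequencies of two populations.

    p1, p2: sequences of allele frequencies, one entry per marker.
    Numerator and denominator are summed over markers before taking the ratio.
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        p_avg = (a + b) / 2.0
        num += (a - b) ** 2                 # between-population component
        den += 2.0 * p_avg * (1.0 - p_avg)  # expected heterozygosity of the pooled population
    return num / den if den > 0 else float("nan")

# Toy allele frequencies at two markers (hypothetical values)
g = gst_pairwise([0.2, 0.6], [0.4, 0.6])
```

Identical allele frequencies in both populations give a value of zero, matching the interpretation that small values indicate a small genetic distance.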
Regarding $F_{ST}$-related measures, we considered $F_{ST}^{R}$ described in the work of Reich et al. [17] and implemented in the program EIGENSOFT, in which a block jackknife procedure is used to estimate the standard error of $F_{ST}^{R}$. For any $k$th marker, $F_{ST}^{R}$ is calculated as
$$\hat{F}_{ST}^{R,(k)} = \frac{\hat{N}_k}{\hat{D}_k}, \qquad \hat{N}_k = \left(\frac{a_1}{m_1} - \frac{a_2}{m_2}\right)^2 - \frac{h_1}{m_1} - \frac{h_2}{m_2}, \qquad \hat{D}_k = \hat{N}_k + h_1 + h_2$$
where $a_1$ and $a_2$ are the specific allele counts and $m_1$ and $m_2$ are the total allele counts of the marker in the two populations. Heterozygosities of the markers are $2h_1$ and $2h_2$, with $h_1 = \frac{a_1 (m_1 - a_1)}{m_1 (m_1 - 1)}$ and $h_2 = \frac{a_2 (m_2 - a_2)}{m_2 (m_2 - 1)}$, respectively. Let $n_1$ and $n_2$ be the numbers of individuals genotyped in the two populations at the $k$th marker. The allele counts $a_1$ and $a_2$ and the total allele counts $m_1$ and $m_2$ can be determined as $a_1 = 2u_1 + v_1$, $a_2 = 2u_2 + v_2$, $m_1 = 2n_1$, $m_2 = 2n_2$, where $u_1$ and $v_1$, and $u_2$ and $v_2$, are the counts of homozygotes and heterozygotes in the first and in the second population, respectively. Now if there are $N$ markers genotyped in each population, an unbiased estimator of $F_{ST}$ can be defined as
$$\hat{F}_{ST}^{R} = \frac{\sum_{k=1}^{N} \hat{N}_k}{\sum_{k=1}^{N} \hat{D}_k}$$
In order to compare the relative performance of different $F_{ST}$-related measures in predicting imputation accuracy, we also computed the original and modified estimators of $F_{ST}$ (denoted by $F_{ST}^{WC}$ and $F_{ST}^{mWC}$) proposed by Weir and Cockerham [11]. $F_{ST}^{WC}$ between two populations was calculated as proposed in [11], and the modified estimator $F_{ST}^{mWC}$ of Weir and Cockerham's $F_{ST}$ as defined in [16]. In previous studies [20], $F_{ST}$ was computed for individual SNPs and then averaged across SNPs. However, such $F_{ST}$ estimators do not account for haplotype diversity very well [34]. Therefore, in our formulas all quantities were averaged over all SNPs first, and $F_{ST}$ was calculated afterwards. This estimate is more precise, i.e. results in smaller standard errors, as pointed out in [17,35].
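The allele-count-based estimator can be sketched as below. The sketch assumes the form of the estimator of Reich et al. with h = a(m - a) / (m(m - 1)), numerator (a1/m1 - a2/m2)^2 - h1/m1 - h2/m2 and denominator equal to the numerator plus h1 + h2, combined over markers as a ratio of sums. It is an illustration, not the EIGENSOFT code, and all names are hypothetical.

```python
def allele_counts(n_hom, n_het, n_individuals):
    """Allele count a = 2u + v and total allele count m = 2n for one marker."""
    return 2 * n_hom + n_het, 2 * n_individuals

def fst_reich(a1, m1, a2, m2):
    """Ratio-of-sums F_ST estimate from per-marker allele counts of two populations.

    a1, a2: counts of the specific allele per marker.
    m1, m2: total allele counts per marker (each must exceed 1).
    """
    num = 0.0
    den = 0.0
    for x1, t1, x2, t2 in zip(a1, m1, a2, m2):
        h1 = x1 * (t1 - x1) / (t1 * (t1 - 1))
        h2 = x2 * (t2 - x2) / (t2 * (t2 - 1))
        n_k = (x1 / t1 - x2 / t2) ** 2 - h1 / t1 - h2 / t2
        num += n_k
        den += n_k + h1 + h2
    return num / den

# One marker, 40 individuals per population (hypothetical genotype counts)
a_1, m_1 = allele_counts(n_hom=5, n_het=10, n_individuals=40)   # a = 20, m = 80
a_2, m_2 = allele_counts(n_hom=10, n_het=20, n_individuals=40)  # a = 40, m = 80
fst = fst_reich([a_1], [m_1], [a_2], [m_2])
```

Summing numerators and denominators over markers before dividing corresponds to the averaging order described in the text, as opposed to averaging per-SNP ratios.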

Correlation statistics
Calculation of imputation quality scores is based on GQ SNPs masked prior to imputation. After calculation of G ST and other F ST related measures between POPRES populations and the four reference panels considered, we compared population distances with corresponding imputation accuracy scores. Different scatter plots were generated using the R-package 'ggplot2' [36] allowing to construct smoothed curves of non-linear relationships. To determine the correlation among G ST and other F ST -related measures, we used Kendall's rank correlation coefficient (Kendall's tau coefficient) [37], which measures the similarity of the ordering of the data to be compared.
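For illustration, a rank correlation of this type can be computed as follows. The sketch implements the simple tau-a variant, in which tied pairs contribute neither to concordant nor to discordant counts; which variant was used in the study is not stated in the text, and the toy values are hypothetical.

```python
def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two distance measures that order four population pairs almost identically (toy values)
gst_toy = [0.002, 0.010, 0.050, 0.120]
fst_toy = [0.001, 0.012, 0.040, 0.150]
tau = kendall_tau_a(gst_toy, fst_toy)
```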

Comparison of measures of genetic distance between populations
First, we compared our different measures of population distance derived from all pair-wise comparisons of POPRES data subsets and reference populations. Results can be found in Fig. 1 (Additional file 1: Table S1).

Characterization of measures of imputation accuracy
Next, we analysed the accuracy scores obtained from imputing each of the 20 target populations with any of the four reference panels using the three imputation frameworks. As an example, distribution of imputation accuracy scores of target population from Germany imputed with the four different reference panels applying MaCH are displayed in Fig. 2.
As expected, the target population from Germany is best imputed with the genetically closest reference panel "CEU", followed by "MEX", "JPT + CHB" and "YRI". A cut-off of 0.45 for the Hellinger score almost perfectly separates correctly and incorrectly imputed genotypes. We obtained similar trends for other target populations; for example, "AfAm" was best imputed with the ethnically closest reference panel "YRI" (see Additional file 1: Figure S1), although the overall imputation yield is substantially reduced in this population.
MaCH-Rsq and IMPUTE-info scores are typically used for post-imputation quality control. These scores are essentially based on the ratio of the sample variance of the allele frequency during imputation to its expected variance under Hardy-Weinberg equilibrium. The expected variance depends on the allele frequency of the corresponding SNP in the reference panel considered. Thus, in contrast to the Hellinger or SEN score, imputation accuracy determined by MaCH-Rsq and IMPUTE-info scores depends on the reference sample used. We studied the relationship between the MaCH-Rsq/IMPUTE-info score and the average Hellinger score of GQ SNPs. As an example, results of AfAm imputed with the four different reference panels are shown in Fig. 3. We observed a monotonic relationship between the average Hellinger score and MaCH-Rsq/IMPUTE-info irrespective of the target dataset or reference used. However, for given MaCH-Rsq/IMPUTE-info values, the corresponding average Hellinger scores were higher for genetically matching reference panels than for mismatching reference panels. This behaviour is especially pronounced for the AfAm population, where reference panels other than YRI result in particularly low average Hellinger scores even if the corresponding MaCH-Rsq/IMPUTE-info values are high (Fig. 3). This indicates that MaCH-Rsq/IMPUTE-info values measure imputation quality accurately only if a genetically matching reference is used.

Correlation of Nei's G ST and F ST -related scores with imputation accuracy
We investigated the relationship between G ST and F ST -related scores and imputation accuracy. In view of the good correlation of G ST and F ST -related scores, we focus on G ST in the following. Since good Hellinger scores (≥0.45) represent correctly imputed genotypes in most cases, percentages of genotypes with Hellinger score ≥0.45 in dependence on G ST serve as primary outcome of our analyses. Results can be found in Fig. 4 showing the scatter plot between pair-wise Nei's G ST and the percentage of genotypes with Hellinger score ≥0.45 for all target populations imputed with the four reference panels.
We observed an almost linear relationship between G ST and this measure of imputation quality for all three software packages. Pearson's correlation coefficients between G ST and imputation quality are -0.95, -0.93 and -0.91 for MaCH, MaCH-minimac and IMPUTE2, respectively. We conclude that G ST is a good predictor of imputation accuracy for all types of imputation frameworks used under the best-matching policy for selecting a reference panel. Small values of G ST imply high imputation accuracies and vice versa. Only AfAm is an outlier of this relationship, resulting in particularly low imputation quality even if YRI was used as the best-matching reference panel.
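Such a correlation coefficient is obtained from paired values of genetic distance and imputation yield, one pair per combination of target population and reference panel. The sketch below uses invented toy numbers purely to show the computation; they are not results of the study.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pairs: G_ST versus percentage of genotypes with Hellinger score >= 0.45
gst_values = [0.005, 0.02, 0.05, 0.10, 0.15]
pct_good = [92.0, 85.0, 70.0, 52.0, 35.0]
r = pearson_r(gst_values, pct_good)
```

A value of r close to -1 corresponds to the pattern described above, in which imputation yield falls roughly linearly as G ST increases.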
This outlying behaviour of AfAm was consistently observed for all three software packages considered. To analyse whether the sample size has an impact, we additionally considered the complete AfAm sample of POPRES with N = 252. For the reference sample YRI the results of all software packages were slightly improved, but we also observed a small reduction in G ST . For the other reference panels, we observed no difference to the results of MaCH-minimac and IMPUTE2 obtained for the original sample (N = 40). However, for MaCH we observed a small deterioration of Hellinger score for the larger sample. Results are shown in Additional file 1: Figures S10 and S11. We conclude that sample size alone does not explain the observed outlying behaviour of AfAm.
Correlation between MaCH-Rsq/IMPUTE-info score and G ST showed similar behaviour (Additional file 1: Figure S2). Moreover, G ST was also highly correlated with the percentage of genotypes with SEN ≥0.95, as shown in Additional file 1: Figure S3. Analyzing the relationship between G ST and imputation accuracy in more detail, we recommend that G ST between the target and reference population should be smaller than 0.04 to achieve satisfactory imputation results. AfAm is an exception to this rule. In our data, good imputation results with about 90 % correctly imputed GQ SNPs are obtained if the value of G ST is less than 0.01. Since the largest set of POPRES populations is from Europe, we performed a more detailed analysis of this sub-group (Fig. 5). Interestingly, using CEU as reference, we again obtained a trend towards lower imputation accuracy for larger values of G ST . Notably, the populations from East and South Europe show a somewhat lower yield of well-imputed genotypes than those from Central and Western Europe.
Scatter plots of other measures of population distance (e.g. F ST R ) and imputation accuracy are similar. Results for F ST R can be found in Additional file 1: Figures S4 and S5.
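The cut-offs can be turned into a simple selection rule: pick the panel with the smallest G ST to the target and report which cut-off it meets. This is a sketch of how the recommendation might be applied; the function name, the category labels and the toy G ST values are assumptions for the example.

```python
def choose_reference(gst_by_panel, good=0.01, satisfactory=0.04):
    """Return the panel with the smallest G_ST and a label for the cut-off it meets."""
    panel = min(gst_by_panel, key=gst_by_panel.get)
    g = gst_by_panel[panel]
    if g < good:
        label = "good"
    elif g < satisfactory:
        label = "satisfactory"
    else:
        label = "above both cut-offs"
    return panel, label

# Hypothetical G_ST values of one target population against four panels
toy_gst = {"CEU": 0.004, "MEX": 0.03, "JPT+CHB": 0.11, "YRI": 0.16}
best = choose_reference(toy_gst)
```

Because AfAm was an exception to the linear relationship, such a rule should be read as a guide for panel selection and not as a guarantee of imputation quality.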
We also computed correlation coefficients between G ST and imputation accuracy of the 20 POPRES samples in dependence on reference, software and measure of imputation quality (Table 1). A strong linear trend was observed for all of these scenarios.

Dependence on degree of missingness
In order to study the impact of different degrees of missingness on the relation between imputation accuracy and F ST -related measures, we compared G ST , F ST R , F ST WC and F ST mWC with imputation accuracies at different degrees of missingness. Although the degree of missingness has a clear impact on overall imputation accuracy, it has only a marginal impact on the observed linear relationship between population distance and imputation accuracy. Fig. 6 shows the results for Nei's G ST . Results of other accuracy endpoints and measures of genetic distance are similar and can be found in the supplementary material (Additional file 1: Figures S6, S7 and S8).
Fig. 2 Box plot of Hellinger scores of the German target population obtained from MaCH imputation with four different reference panels. Results for correctly and wrongly imputed SNPs based on best-guess genotypes are presented separately. CEU achieves the highest Hellinger scores for all, correctly and incorrectly imputed genotypes, i.e. performed best among the reference panels. As one can see, applying a threshold of 0.45 for Hellinger scores almost ensures that the best-guess genotype is correct

Impact on low frequency variants
Finally, we analyzed how G ST and other F ST -related scores correlate with imputation accuracies of low-frequency variants (MAF ≤ 5 %). Fig. 7 shows the results for G ST and the software-specific measures of imputation accuracy. As expected, the overall yield of well-imputed low-frequency variants is lower than for the common variants. Moreover, the correlation of G ST and imputation accuracy is also markedly reduced compared to the common variants. Correlation between F ST and the software-specific measures of low-frequency and common variants is displayed in Additional file 1: Figure S9, showing similar results.

Discussion
Imputation of un-typed SNPs or missing genotypes is a common technique in genome-wide analyses. However, accuracy of imputation is difficult to predict as it depends on a variety of factors including pre-imputation quality control, genetic similarity of reference and target population, and their haplotype structure. We recently performed a comprehensive simulation study analyzing the effect of pre-imputation quality control on accuracy of imputation [29]. In the present paper, we studied the impact of reference panels on imputation accuracy. For this purpose, we considered the three software packages MaCH, MaCH-minimac and IMPUTE2, which can be run with a population-specific reference panel. Other approaches relying on mixed reference panels were proposed recently [21,38], circumventing the issue of selecting appropriate references (e.g. IMPUTE2 [39], MaCH-Admix [7]). However, previous research [29,46] and our own results (submitted) showed that such algorithms could reduce imputation quality compared to frameworks relying on specific references. Thus, software packages like MaCH or MaCH-minimac are still frequently in use [4,40-44]. In this case, the reference panel should match the target population ethnically as closely as possible so that it can represent the haplotype structures of the individuals in the target population. Consequently, for these imputation frameworks, it is recommended to choose a reference panel best matched with the ancestry of the target population. This can be achieved, for example, by analysing measures of genetic distance between target and reference populations [20,45]. However, the relation between genetic distance and imputation accuracy is not completely understood and requires further research.
In order to analyse this issue in more detail, we performed a simulation study on the basis of ethnic subsamples of the publicly available POPRES panel [22]. A total of 20 target datasets were considered. However, samples were small regarding both the number of SNPs and the number of individuals. This implies that our results may be valid only for small or medium-sized data sets.
Four ethnic reference data sets derived from HapMap3 (NCBI Build 36) were considered, namely CEU, YRI, MEX and JPT + CHB. Reference data are provided through the home pages of the MaCH and IMPUTE software developers. These four reference data sets allowed us to investigate the dependence of imputation accuracy on genetic similarity between target and reference panels for a higher number of combinations. In our paper, we focused on imputation of high-frequency variants. Although relying on HapMap3 references might be a limitation of our study, we expect that the results for these variants are similar if switching to 1000 Genomes reference panels. This is based on the observation that the yield of well-imputed high-frequency variants is comparable to our experiences (not shown).
We investigated imputation accuracy by comparing genotypes of masked SNPs with their posterior distributions after different imputation scenarios. We only masked SNPs of good quality to ensure that the error of measured genotypes is as small as possible. Several measures for comparing measured and imputed genotypes were considered, namely best-guess genotypes, SEN score and Hellinger score. While the Hellinger score measures the agreement of measured and posterior genotype distribution, the SEN score is maximal if their expectations coincide [29]. We also studied the software-specific quality measures MaCH-Rsq and IMPUTE-info score, which however are only defined for entire SNPs rather than single genotypes. An important result of our study is that these measures critically depend on the reference panel used. As a consequence, these scores can predict the imputation accuracy only if the reference panel is genetically similar to the target population. Otherwise even high MaCH-Rsq/IMPUTE-info scores do not guarantee that the estimated genotypes are correct.
To evaluate genetic similarity between different target and reference populations, we computed pairwise G ST with fcGENE [27] and used SMARTPCA [17], which calculates pairwise F ST R between any two populations. The measure G ST and all of the F ST -related measures were strongly correlated. Relationships are almost linear except for the AfAm population, which is a clear outlier. A detailed analysis revealed that G ST was slightly better correlated with F ST WC and F ST mWC than with F ST R . In previous research [20,45], G ST and F ST -related scores were estimated for SNPs first and then averaged across all SNPs. However, such estimates may not reflect haplotype diversity among populations of different ethnicities [15,34,35]. Therefore, we decided to estimate these measures in a haplotype-wise manner, averaging their components (i.e. numerator and denominator of the formula) over all SNPs first. Then, the measure is calculated as the ratio of these estimates.
Fig. 3 Scatter plot between Rsq-score/Info-score and average Hellinger score of GQ SNPs. In the three sub-figures, imputation results of AfAm obtained with the three software packages MaCH, MaCH-minimac and IMPUTE2 are presented. Color represents results after imputation with one of the four reference panels CEU, CHB_JPT, MEX and YRI. Results are smoothed by loess estimators. For a particular value of MaCH-Rsq/IMPUTE-info, Hellinger scores obtained by using a genetically similar reference panel are higher than those obtained from mismatching reference panels (p-value < 0.001 for all three scenarios, based on a regression analysis considering the software-specific score as covariable)
Independent of the type of measure considered, we observed an almost linear relationship between genetic distance and resulting imputation accuracy. The only exception is AfAm, which shows particularly low imputation accuracy even when YRI is used as reference. Moreover, even though the degree of missingness was shown to be a strong determinant of imputation accuracy [29], the linearity of the above-mentioned relationship is preserved for different degrees of missingness. In view of this linear relationship, one can estimate imputation accuracy for a given pair of target and reference population. Relying on Nei's G ST , we observed satisfactory imputation results for a cut-off of 0.04. Excellent results are achieved if G ST is less than 0.01. We recommend this threshold for selecting a reference panel, at least for medium or small datasets such as those considered in our study. Larger samples of genetically different groups are required to generalize our result.
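As an illustration of how these cut-offs could be applied in practice, the following minimal sketch picks the reference panel with the smallest G ST to the target population and labels the expected imputation quality. The function name, the labels and the G ST values in the example are hypothetical. Treating the 0.04 cut-off as a strict upper bound is an assumption, since the text does not specify the boundary case.

```python
from typing import Dict, Tuple

EXCELLENT_CUTOFF = 0.01     # "excellent" results reported below this G_ST
SATISFACTORY_CUTOFF = 0.04  # "satisfactory" results reported for this cut-off


def select_reference_panel(gst_by_panel: Dict[str, float]) -> Tuple[str, str]:
    """Return the panel with the smallest G_ST and a qualitative label.

    gst_by_panel maps a reference panel name to its G_ST with the target
    population. Assumption: values below 0.04 count as satisfactory.
    """
    if not gst_by_panel:
        raise ValueError("no reference panels supplied")
    best_panel = min(gst_by_panel, key=gst_by_panel.get)
    gst = gst_by_panel[best_panel]
    if gst < EXCELLENT_CUTOFF:
        label = "excellent"
    elif gst < SATISFACTORY_CUTOFF:
        label = "satisfactory"
    else:
        label = "above cut-off"
    return best_panel, label


if __name__ == "__main__":
    # Hypothetical G_ST values for one target population, not study results.
    example = {"CEU": 0.005, "CHB-JPT": 0.11, "MEX": 0.03, "YRI": 0.15}
    print(select_reference_panel(example))
```

Such a rule only ranks panels by genetic distance. As noted for AfAm, a small G ST does not guarantee high accuracy when linkage disequilibrium in the target population is low.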
Finally, we analysed the performance of genotype imputation for low-frequency variants. Although it is known that the imputation of low-frequency variants is particularly difficult [46,47], it has become important in the context of next-generation sequencing. Imputation quality of these variants is much lower than for high-frequency variants. Still, we found a negative trend between genetic distance and imputation quality, which, however, is less pronounced than for high-frequency variants. Interestingly, besides ethnic similarity, the number of polymorphic sites in the reference panels influences imputation accuracy of low-frequency variants.
As mentioned previously, imputation accuracy is not solely determined by the genetic similarity between the reference and target population. An example is the AfAm population, which shows lower accuracy than expected on the basis of its genetic distance. The reason is the more complex haplotype structure and generally reduced level of linkage disequilibrium in African populations, neither of which is captured by the genetic distance [20]. Additional populations of African ancestry are required to analyse this issue and its impact on the relation between genetic similarity and imputation accuracy in more detail.

Conclusion
We conclude that G ST and other measures of genetic similarity of homogeneous target and reference populations are good predictors of imputation accuracy for imputation frameworks relying on best-matched reference panels. An almost linear relationship between G ST and various measures of imputation accuracy was observed, with the exception of the African-American population considered. In our data, excellent imputation results are achieved if G ST is less than 0.01. However, this threshold might not hold for African populations, for which reduced linkage disequilibrium is a stronger determinant of imputation accuracy. For low-frequency variants, the same trend between G ST and imputation quality was observed, but here, panels with a higher number of monomorphic sites (i.e. CHB-JPT) perform below the average. The software-specific measures MaCH-Rsq and IMPUTE-info must be interpreted with caution if the genetic distance between target and reference population is high.
By estimating the association between G ST and imputation accuracy score and by computing a test of the correlation being zero, we got p-value < 9.52e-13 for all three scenarios
Fig. 6 Scatter plot of G ST and percentages of well imputed genotypes (Hellinger score ≥ 0.45) at different degrees of missingness. Performance of the 50 % GQ SNPs missing in all scenarios was analysed. While the degree of missingness has a clear impact on imputation accuracy, the linear trend between G ST and imputation accuracy is essentially preserved. Pearson's correlation coefficients and the p-values obtained from computing a test of the correlation being zero are also given
Fig. 7 Scatter plot of G ST and average Rsq-score/Info-scores of low-frequency variants (left panels) versus common variants (right panels). We present the results of the three imputation frameworks MaCH, MaCH-minimac and IMPUTE2. For low-frequency variants, both the overall yield of well-imputed SNPs and the correlation between G ST and imputation accuracy are reduced. Pearson's correlation coefficients and the p-values obtained from computing a test of the correlation being zero are also given