Determining Ancestry Proportions in Complex Admixture Scenarios in South Africa Using a Novel Proxy Ancestry Selection Method

Admixed populations can make an important contribution to the discovery of disease susceptibility genes if the parental populations exhibit substantial variation in susceptibility. Admixture mapping has been used successfully, but is not designed to cope with populations that have more than two or three ancestral populations. The inference of admixture proportions and local ancestry and the imputation of missing genotypes in admixed populations are crucial in both understanding variation in disease and identifying novel disease loci. These inferences make use of reference populations, and accuracy depends on the choice of ancestral populations. Using an insufficient or inaccurate ancestral panel can result in erroneously inferred ancestry and affect the detection power of GWAS and meta-analysis when using imputation. Current algorithms are inadequate for multi-way admixed populations. To address these challenges we developed PROXYANC, an approach to select the best proxy ancestral populations. From the simulation of a multi-way admixed population we demonstrate the capability and accuracy of PROXYANC and illustrate the importance of the choice of ancestry in both estimating admixture proportions and imputing missing genotypes. We applied this approach to a complex, uniquely admixed South African population. Using genome-wide SNP data from over 764 individuals, we accurately estimate the genetic contributions from the best ancestral populations: isiXhosa , ‡Khomani SAN , European , Indian , and Chinese . We also demonstrate that the ancestral allele frequency differences correlate with increased linkage disequilibrium in the South African population, which originates from admixture events rather than population bottlenecks. Nomenclature The collective term for people of mixed ancestry in southern Africa is “Coloured,” and this is officially recognized in South Africa as a census term, and for self-classification. Whilst we acknowledge that some cultures may use this term in a derogatory manner, these connotations are not present in South Africa, and are certainly not intended here.


Introduction
The field of population genetics has experienced a resurgence in the past few years due to access to extensive single nucleotide polymorphism data. The availability of genome-wide multi-locus genotype profiles has fueled long-standing interest in analyzing patterns of genetic variations to trace the ancestry components of recently admixed human populations, to identify genes underlying ethnic difference in disease risk and shed light on both the evolutionary history and migrations of recently admixed human populations [1][2][3][4]. In order to understand the genetic variation which could be observed at genetic marker locations within and among populations, the inference of both local ancestry and population structure from the genotypes of single nucleotide polymorphisms is crucial. These inferences, including the imputation of missing genotypes in genome-wide association studies (GWAS) utilize panels of reference ancestral populations based on place-of-origin, ethnic or continent affiliation [5][6][7][8][9][10][11][12][13]. Fortunately, the availability of high-throughput genotype data from various populations may facilitate the choice of best proxy ancestry of a recently admixed population from a pool of reference populations. This choice is critical in both the study of population genetics and in identifying genes underlying ethnic difference in genetic diseases risk [1][2][3][4]. Furthermore, the accuracy of these inferences is in part related to the choice of reference populations. An insufficient or inaccurate ancestral proxy can weaken these inferences, resulting in erroneous inferred ancestry, and errors and uncertainty in the imputed genotypes. These issues may consequently affect the inference of ancestry and the detection power of GWAS and metaanalysis when using imputation, particularly in multi-way admixed populations.
Because distinct populations exhibit substantial variation in genetic disease risk, the choice of reference populations for a multiway admixed population may be sensitive and critical in biomedical research. Current algorithms for identifying the best proxy ancestral populations are inadequate for multi-way admixed populations, including HAPMIX [14], LAMPLD [5], MULTI-MIX [15] and PCADMIX [16]. Furthermore, Patterson et al.(2010) utilized a regression-style technique to compute the degree of admixture given samples from an admixed population, and samples from the populations believed to be contributing. Their method was able to report on the continental admixture underlying the genetic origin of the SAC, however given an ethnic group within different populations, their method cannot tell which population is the best proxy representing the ancestral genetic donor to the gene pool of a multi-way admixed population, as was the case of the SAC in their study. In addition, the indigenous Khoesan ethnic group in southern Africa, which is well known to have historically contributed to the gene pool of the SAC, was under-represented in their study.
To address these challenges and the uncertainty in ancestral populations we developed PROXYANC, an approach to select proxy ancestry for recently admixed populations. We implemented two novel algorithms in PROXYANC, based on population genetic differentiation and optimal quadratic programming, respectively. We demonstrated through simulation of a complex multi-way admixed population that these two algorithms can select best proxy ancestry for an admixed population, given a pool of related/unrelated or admixed reference populations. Our simulation demonstrated that our complementary algorithms perform better in selecting the best proxy-ancestry for a multi-way admixed population, compared to the f3 statistic [17]. In addition we demonstrated the impact of choosing the best proxy ancestral populations in both estimating admixture proportions (global and local ancestry) and imputing missing genotypes in a multi-way admixed population, which may reduce the computational cost of the imputation processes in selecting the best haplotypes among several reference populations.
The South African Coloured population (SAC) has a high level of intercontinental admixture and therefore diverse ancestry [18][19][20][21]. Historical sources and a few genetic studies have reported that this population is the result of unions between African (Bantu and Khoesan), Europeans, and various other population groups of Indian or Indonesian descent [18][19][20][21]. A South African government source (http://www.statssa.gov.za), describes the present population of South Africa to be characterized by diversity, including groups originating from African (79%), European (9.6%) and Asian (2.5%) populations. A study conducted by Tishkoff and colleagues [18] on the characterization of the microsatellite genetic variation and the relationships among populations across the African continent, revealed that the ancestral components of the SAC include nearly equally high levels of southern African San, Niger-Kordofanian (West/Central Africa Bantu), Indian, European, and lower levels of East Asian ancestry. However, their study used 39 samples of the SAC, possibly including Cape Malays [20]. Based on 20 samples from the SAC population, a study by Patterson et al. [20] showed that there is substantial genetic contribution from at least four distinct population groups in the SAC, including Europeans, South Asians, Indonesians and a population genetically close to the isiXhosa sub-Saharan Bantu. Quintana-Murci and co-authors [21] examined the genderspecific ancestry contributions in the SAC, using mitochondrial DNA (n~563) and Y-chromosome (n~228) variation analysis. They inferred at least five different ancestral populations (Khoesan, Bantu, Europeans, Indians, and South-East Asians). An in-depth investigation by De Wit et al. [19] had the advantage of using a very large cohort of the SAC (959 samples) and 75,000 autosomal single nucleotide polymorphisms (SNPs) common to HapMap and Human Genome Diversity Project (HGDP) data sources. The study exploited both subsets of selected random SNPs and ancestry informative markers (AIMs) from 75,000 autosomal SNPs, to address the question of ancestry contribution in the SAC. This early investigation used a small sample of Khoesan (5 samples of San obtained from HGDP), and no suitable ancestral population samples from local southern African populations, and inferred four major contributions to the SAC with the greatest from San Africans, followed by non-San Africans, Europeans and a smaller Asian contribution [19]. However, the low San sample size may have biased the estimate of the ancestry contributions. Overall, the above investigations have documented the genomewide continental average admixture proportions in the SAC to be in the range of 23% to 65% for African, 19% to 40% for European, and 7% to 10% for Asian, with some regional variation, and also with substantial variation among individuals. While different authors have focused on the global admixture (continental admixture) underlying the genetic origin of the SAC, attention has not yet been paid to which specific sub-continental populations or ethnic groups contributed to the admixture. In addition, recent studies demonstrated the extensive divergence between different Khoesan populations (estimated to be 30,000 years or more) and between Khoesan and Bantu-speaking groups [21][22][23]. The complexity of African population history and high differentiation between populations makes identifying the best African ancestral reference populations for the SAC crucial for local genetic ancestry inference. The greater the accuracy of the choice of ancestral population, the greater the utility in admixture mapping methods, in the imputation of missing genotypes, and in estimating global and local ancestry in multi-way admixed populations.
Here we developed PROXYANC, a novel method to select the best ancestral populations in multi-way admixed populations. We characterized the African, European, South Asian and East Asian origins of the SAC by applying PROXYANC to a cohort of the SAC (764 unrelated individuals) and report the considerable refinement of the contributions of genetic ancestry components. We establish that the SAC has had substantial admixture from isiXhosa, {Khomani, European, Indian (Gujarati) and Chinese populations. Using the best proxy ancestral populations found by PROXYANC, we demonstrated that the ancestral allele frequency differences correlated with increased linkage disequilibrium (LD) in the SAC, indicating that increased admixture LD is present in this population, and the observed LD has its origin from admixture events. This result supports the rejection of the hypothesis of founder effects or of population bottlenecks that could have been due to the racial segregation of the past, formalized during the recent apartheid regime in South Africa. the genetic distance between the admixed population and all possible synthetic populations, consisting of a linear combination from reference populations (see Materials and Methods). For genetic distance, the F ST was used as an objective function of ancestral proportions as variables through an optimal quadratic cone programming algorithm. In the same vein, PROXYANC also computes a proxy-ancestry score by regressing a statistic for LD (at short distance ,0.25 Morgan) between a pair of SNPs in the admixed population against a weighted ancestral allele frequency differentiation (see Materials and Methods). To evaluate PROXYANC, we mimic a 5{way admixture scenario by simulating (see Materials and Methods) the genomes of 750 individuals of mixed ancestry through the haplotype samples from Europeans (CEU), {Khomani, isiXhosa, Chinese (CHD) and Indian Gujarati with probability related to our prior estimate on the ancestral proportion from each putative ancestral population (20%, 32%, 29%, 8% and 11%, respectively). We applied both approaches implemented in PROXYANC to select the best ancestral proxies for the above simulated data using 5 distinct pools from 20 reference populations that includes the African Bantu (isiXhosa, Bantu South Africa, Yoruba, Kongo, Herero), South Asian (Gujarati, Pathan, Druze), East Asian (CHD, Dai, Daur, Japanese), European (CEU, Russian, Italian, French) and Khoesan ({Khomani, Ju|'hoan, Bushmen, SAN) populations. From each pool, our algorithms select the best ancestral populations for our simulated data. The result from the simulation demonstrates that the highest proxy-ancestry scores (Table 1) are from the five reference populations that contributed to the admixture in the simulated data ( Figure S1). The higher the proxy score, the more likely it is that the related reference population is a good proxy ancestral population. In addition, among different linear combinations of five reference populations, the linear combination formed from the five populations used in our simulation (CEU, {Khomani, isiXhosa, Chinese and Gujarati) minimizes the genetic distance (F ST ) within the simulated data ( Table 2). This result demonstrates that the selected proxy ancestries are in agreement and consistent with the ancestral populations used to generate these 750 admixed individuals (simulation data). To compare our algorithms to the f 3 statistic [17], which is a 3-population test for admixture given two reference populations and the admixed population (target), we applied the f 3 statistic to the same simulated data above, within each pair of populations from the 5 pools of reference populations described above. The results in Table 3 and Table S2 demonstrate that in many cases the f 3 statistic fails to provide clear evidence/ non-evidence of admixture in our simulated data which mimicked a multi-way admixed population. Given different pools of reference populations for a multi-way admixed population, the f 3 statistic clearly does not enable an accurate selection of the best proxy ancestry from each pool. Although the reference populations within a given pool may be closely related, the simulation shows that both approaches developed in PROXYANC are complementary and can select the best proxy populations that separated the closest in time from the true ancestor.

PROXYANC: Estimating Admixture Proportion and Imputing Missing Genotype in Admixed Populations
To evaluate the impact of selecting the best proxy ancestral population for an admixed population in estimating admixture proportion, we ran the ADMIXTURE software on three data sets  Figure S2). We ran ADMIXTURE again on the simulated data within a panel that included reference populations that are geographically close to the selected proxy ancestral populations, including Russian, Japanese, Palestine, Yoruba and Ju|'hoan. Compared to the true ancestral proportions used in our simulation, we obtained biased admixture proportions: 2:9%+0:2540 from both Russian and Palestinian, 2:6%+0:0229 from Japanese, 2:6%+0:023 from both Yoruba and JuD'hoan and 40%+0:2074 and 50%+0:2056 from two unknown populations ( Figure S2). An example of an African ancestry case (isiXhosa versus Yoruba contribution in the simulated data) is displayed in Figure 1. In Figure 1, we compared the true individual admixture proportions versus those estimated from the best proxy ancestry (isiXhosa) and an inappropriate proxy ancestry (Yoruba), respectively. The estimated individual admixture proportions from isiXhosa are closer to the true individual ancestral proportions compared to proportions from Yoruba ( Figure 1). This result shows the importance of selecting the best proxy ancestry in estimating admixture proportions. Including all available reference populations in imputing has recently been discussed to be useful in accurately imputing genotypes. However, it becomes computationally expensive to the imputation engine to choose the best haplotype among several available reference populations [24,25]. To address this, we assess whether a panel of selected best proxy ancestral populations can achieve similar accuracy as using all available reference populations, in imputing missing genotypes of an admixed population. We removed 2044 SNPs out of 39064 SNPs on chromosome 1 from the simulated data, and we imputed them using 4 different sets of reference populations. These four sets  Figure 2 indicates a high call rate when imputing missing genotypes of the simulated data using the correct proxy ancestries. The imputation using the first panel of populations used directly in our simulation, converged to perfectly imputed genotypes. The imputation using the second and third panels also yielded accurately imputed genotypes. Due to the small sample size used in the second panel (Material and Methods), the imputation based on this panel (consisting of five proxy ancestral populations (see Table 1) with their original sample size) has a lower genotype call rate compared to the first panel. Using the last panel of populations, which does not include proxy ancestors, we obtained poor imputation ( Figure 2).

PROXYANC: Selecting Proxy Ancestry in the SAC
To select the best proxy ancestral populations using the real SAC data, we applied PROXYANC to 5 pools of reference populations implicated by both PCA and admixture analysis. We first constructed African, European, South Asian and East Asian population data sets using populations described in Table S1, each including 764 unrelated SAC samples. The data analyzed was from four sources: The African population panels [26] (n~169 samples from 11 African populations genotyped on an Illumina Beadchip 550K custom v2 chip and Affymetrix 500K), the Human Genome Diversity Cell Line Panel [27] (n~732 samples from 53 populations genotyped on an Illumina 650K array), the International Haplotype Map (HapMap) Phase 3 [28] (n~856 samples from 10 populations genotyped on an Illumina 1M array), and samples from 6 southern African populations obtained from Pickrell et al. [22], Henn et al. [16,26] and Schlebusch et al. [23] ( Table S1). We performed admixture analysis using the ADMIX-TURE software [11] and Principal Component Analysis (n~47863 autosomal SNPs) on each data set described above. We were able to identify the candidate reference populations for the proxy ancestry analysis ( Figures S3-S8). We performed both proxy ancestry score and F ST -optimal quadratic cone programming on 5 distinct pools of reference populations. The results from both proxy ancestry score (Table 4 and Figure S9) and F SToptimal quadratic cone programming (Table 5) were in agreement and reveal that the combination of CEU, isiXhosa, Gujarati, CHD, and {Khomani formed the best proxy ancestry for the SAC (Table 4 and Table 5). This result suggests a Southern Bantu population (isiXhosa), and South African San ({Khomani) to be the best Bantu and San proxy ancestral population for the SAC, compared to the more frequently used Yoruba and the Namibian San of previous studies [18,19,21].

Refinement of Admixture Proportions in the SAC
Using the result from PROXYANC on the SAC data, we combined the top proxy ancestral populations (CEU, CHD, Gujarati, isiXhosa, {Khomani) (Tables 4 and 5), including the SAC, in one data set. We repeated both the PCA and the ancestral population clustering analysis. From these analyses, our inferred five major ancestral contributions (Table 6 and Figures 3 (A-B)) to the SAC population have a roughly equivalent African ancestral proportion from isiXhosa (33%) and {Khomani (31%), followed by European (CEU) (16%), Gujarati Indian (12%) and a smaller admixture proportion from the Chinese (8%). It is also clear from the PCA plots in Figure 3(D), that the SAC lie on a direct line with these five groups of proxy ancestors. In addition, both isiXhosa and {Khomani groups were related to the SAC, indicating their close ancestral affiliations with this population, and this reflects the role of both Southern Bantu and indigenous Khoesan south of the Kalahari in the early establishment of the SAC population [29].
The other putative groups of proxy ancestral populations, CEU, Gujarati Indian and Chinese, are separated from each other, and the SAC is in the convex hull of the three. These findings agree well with the result obtained from the admixture analysis on K~5 in Figure 3(A-B). As we expected, the PCA in Figure 3 (D) revealed the greatest genetic differentiation between these five proxy ancestries of the SAC, which clearly reflects the admixture of the SAC from these five proxy ancestors. In addition, we compared our estimated admixture proportions with previous estimates in Patterson et al. [20], and we redid the admixture analysis using the ancestral populations used in De Wit et al. [19] that included the Yoruba, CEU, San, Gujarati, and Chinese (CHB). Figure 3 (B) indicates a large difference in African ancestry of the SAC between the two analyses (using the proxy ancestries panel and the panel from De wit et al. 2010), suggesting that the choice of African ancestry for the SAC is critical when conducting ancestry inferences and admixture mapping studies. This may be due to the diversity and close relatedness of most African populations. Table 6 displays the estimated admixture proportions from our selected best proxy ancestries and from those two previous studies. Our result highlights the importance of selecting the best proxy ancestral populations for multi-way admixed populations, and we demonstrate that inaccurate proxy ancestries can result in inaccurate inferred ancestry which is fundamental to admixture association or admixture mapping studies, and can therefore lead to erroneous interpretation of results identifying genomic location underlying genetic ancestry difference in complex diseases risk. Taken together, the results above provide confidence that our inferred five ancestral components with balanced African contributions from isiXhosa and {Khomani populations, followed by Northern and Western European, Gujarati Indian and a smaller Chinese contribution, is closer to the true picture of ancestral contributions according to the SAC's history. We believe that our result also has the advantage of handling sample size differences and using accurate proxy ancestral populations, and believe that both the number of SNPs (n~47863 ) and target population sample size used can provide sufficient resolution to support our inferred ancestral contribution.

Linkage Disequilibrium and Genetic Diversity
Understanding the extent of admixture LD is useful in designing disease mapping tests in admixed populations [2,30]. To assess the pattern of admixture LD in the SAC as a result of ancestral admixture, we first compared LD between the SAC and its putative proxy ancestors (see Materials and Methods). We calculated the LD (r2w0:2) across the whole genome of each population and found that the LD is consistently higher at very short distances in the SAC ( Figure S10). The LD in the SAC decays from regions .0.2 Morgan (Figure S10), suggesting that this LD is primarily a result of admixture rather than founder effects. This finding is consistent with prior studies that established that the admixture LD decays within a few generations at long distances (.20 cM) but decays slowly at short distances (,10 cM) [31,32]. Recent admixture between genetically differentiated populations gives rise to an increase in admixture LD proportions [2,30]. To test for the admixture LD due to these five proxy ancestral population admixtures in the SAC, we computed the LD between all pairs of markers in the SAC, weighted by their frequency difference (see Materials and Methods) between each pair of these five proxy ancestral populations, isiXhosa and {Khomani, European (CEU), Gujarati Indian and Chinese(CHD). Through linear regression of the allele frequency differences of each pair of proxy ancestral groups with LD in the SAC, we obtained a correlation (R~0:74,OR~0:38,slope~0:41) with a significant p-value~0:000018, indicating an association of allele frequency differences with increased LD in the SAC. We finally estimated the maximum expected admixture LD (see Materials and Methods) from each pair of proxy ancestral populations. We compared this with the observed LD in the SAC. Table 7 shows that the correlation between the expected admixture LD from each pair of proxy ancestral groups and the observed LD in the SAC is significant, except for CEU-GIH, and CEU-CHD, which may be due to the fact that the GIH and CHD proportions are small ( Figure S11). Through an additive linear model, we obtained a lower p-value~2:2e {16 under the null hypothesis of no correlation between LD in the SAC and these maximum expected admixture LD, indicating that the LD in the SAC correlated with the expected admixture LD, and mainly has its origin in admixture between the populations related to these five proxy ancestries. This result confirms that admixture between populations related to these five proxy ancestral groups (isiXhosa and {Khomani, European (CEU), Gujarati Indian and Chinese (CHD)) largely contributed to the admixture LD observed in the present SAC population.
We additionally compared the genome-wide haplotype diversity and the percentage haplotype sharing by IBD (see Materials and Methods), and the result in Table S3 indicates that the SAC has a higher haplotype diversity than any of its five proxy ancestral groups. The result suggests that both the higher diversity and higher LD at short distances observed in the SAC are the result of admixture events, and not founder effects or an extreme bottleneck. The observed higher level of genetic diversity in the admixed SAC is likely to be the result of the geographic location of South Africa with respect to major trade routes in the past (from the 15th to the 19th centuries) and its history of multi-faceted colonization [29].

Discussion
We introduce PROXYANC, an approach to select proxy ancestry for complex multi-way admixed populations. We assessed its accuracy through a simulation of a multi-way admixed population and demonstrated the impact and sensitivity of the choice of reference panel in estimating global and local ancestry and in imputing missing genotypes. Because of increased urban development and migration, the proportion of individuals with significant recent genetic admixture is on the increase in many modern societies. However, increased population admixture influences genome heterozygosity, which in turn will affect phenotypes relevant to health. Therefore, the choice of the best proxy ancestral populations for an admixed population is critical in both the study of population genetics and in identifying genes underlying ethnic differences in genetic disease risk. Our simulation results demonstrated the usefulness of the choice of proxy ancestry for admixed populations, in contributing to the accuracy of the inferences of both local and global ancestry. Selecting an accurate proxy ancestral population for an admixed population is required for improving the power of GWAS for admixed populations. Furthermore, our simulation demonstrated that the proxy ancestral panel achieved a similar accuracy to that including all available populations in imputing missing genotypes of an admixed population. This indicates that the choice of accurate ancestral panel can help in reducing computational costs of the imputation engine for finding the best haplotype among all available populations during imputation processes.
To the best of our knowledge, PROXYANC is the first approach to select the best reference ancestral panel given pools of reference ancestral panels. Our methods to select proxy ancestral populations in a multi-way admixed population have enabled us to characterize the genetic ancestry component of the uniquely admixed Coloured population of South Africa that accounts for 49% of the population of the Western Cape Province of South Africa (Statistics South Africa, Census 2011). Previous studies of this historically complex population were hampered by the relatively small sample size and few putative ancestral populations publicly available, and especially the very low number of San individuals. In the present study we have utilized the increased number of reference populations available, and the best proxy ancestries of the SAC obtained from PROXYANC allowed us to document a contribution of the isiXhosa, {Khomani, European, Gujarati Indian, and Chinese genetic material to the SAC (33%, 31%, 16%, 12% and 7%, respectively). We expected a southern Bantu-speaking group such as isiXhosa instead of a West African group such as the Yoruba to be a better proxy ancestor of the SAC. The isiXhosa as best proxy ancestor of the SAC reflects the early mixing of mainly indigenous San females with the southern    Bantu groups and potentially the distinct genetic profile of southern Bantu-speaking populations after their expansion through Africa. Subsequently male settlers, mainly from the Netherlands, Britain, Germany and France, and male slaves from South Asia [29,33,34] also contributed to the SAC. The substantial number of {Khomani (southern Kalahari San) individuals available for this study greatly increases our confidence in the accuracy of the ancestry estimates presented here. Our results also emphasize the point that San clans are often very different from one another, and grouping San individuals from different areas together as generic ''San'' may result in a loss of discrimination at the genetic level [21,22]. This was also illustrated by the deep genetic differences between individual San (Bushmen) genomes [35]. In the case of the SAC in the Western Cape, it is perhaps to be expected that San groups from the southern Kalahari including {Khomani, which is geographically closer to the place of origin of the SAC, to be a better proxy ancestors of this group than Ju|'hoan from Namibia, and this is what we have shown (Table 4). This also gives credence to an earlier suggestion that only some of the San populations contributed to the SAC population [21]. A higher degree of LD is expected in admixed populations, and this could at certain points of its history be influenced by population bottlenecks, or still be a result of the admixture itself. We demonstrated that the allele frequency differences between each pair of proxy ancestral populations correlated with the degree of LD in the SAC, suggesting that the admixture increased genetic diversity and that the observed LD in the SAC has its origin mainly in the admixture. This study observed a weak level of founder haplotypes identical-by-descent along the genome of the SAC, which strengthens the evidence against the argument that past legislated separation of ethnic groups in South Africa, including the SAC, caused population bottlenecks. However in spite of this isolation, the original admixed population was large and a population bottleneck is therefore unlikely.
The obtained best proxy ancestry for the SAC provides opportunities to examine an accurate, unbiased estimation of the ancestry at each genetic locus in this multi-way admixed population, to potentially provide crucial insights into identifying disease genes based on ethnic difference. As existing methods that infer local ancestry assume that non-admixed ancestral populations are the most suitable, it may not be advisable to use the isiXhosa, which have some Khoe-San ancestry [36] as an ancestral population for admixture mapping. Until such time as these methods are updated, the highest ranking putative non-admixed African Bantu populations listed in Table 4, such as the Yoruba, can be used as proxy ancestral population(s) instead of the isiXhosa.
In conclusion, this study has highlighted the importance of selecting the best proxy ancestry for potential downstream analysis in a multi-way admixed population. The investigation of admixture LD and the identification of source populations for the SAC has not only deepened our understanding of its evolutionary history, but also provide opportunities for designing a method to account for a combined genome-wide SNP casecontrol and admixture mapping in a multi-way admixed population such as the SAC. PROXYANC will also provide a useful tool for the investigation of other multi-way admixed populations.

Ethics Statement
Approval from the Ethics Committee of the Faculty of Health Sciences, Stellenbosch University (project registration numbers 95/072 and NO6=07=132) was obtained before blood samples for DNA were collected with written informed consent. This research was conducted according to the principles expressed in the Declaration of Helsinki.

Genotype Data and Genotype Quality Control
The DNA samples of 764 unrelated individuals who selfidentified as South African Coloured (SAC) from two suburbs of Cape Town were collected and genotyped using the Affymetrix 500K genotyping platform, as described in De Wit et al [19]. A total of 159 samples from southern African populations obtained from Pickrell et al. [22], Henn et al. [26], Schlebusch et al. [23] and (HGDP-CEPH) [27] were used in this study. Additionally, we incorporated genome-wide SNP data from two public data sources, including the Human Genome Diversity Cell Line Panel (HGDP-CEPH) [27], the International Haplotype Map (HapMap) Phase 3 [28]. Detailed information about the number of  Table 7. Correlation between maximum expected admixture LD and the observed LD in the SAC. individuals included in our analysis is provided in Table S1. We performed quality-control filters on each population separately and removed SNPs that failed the Hardy-Weinberg exact test P,0.000001 and had a call rate ,95% across all samples per population using PLINK [37]. Population outliers and unknown relatedness were assessed using the smart program implemented in EIGENSOFT [10,38] and related samples were excluded. After applying the quality-control filters to each population separately, the SNPs genotyped in this study were reduced to a subset (n~47863) shared between the SAC, the three public data sources and the local southern Bantu from South Africa (Table S1). Grouping each population per continent, we were able to construct African, European, South Asian, East Asian, Middle East, American and Oceania data sets, each including the SAC.

Mathematical Details of PROXYANC
The question we want to address is, given a pool of available continental affiliated populations, for example European or African, which population is the best European or African proxy ancestry of the admixed population under study. We assume prior knowledge of geographical potential ancestral populations.

PROXYANC: F ST -optimal Quadratic Cone Programming
To limit the effect of background linkage disequilibrium, let us assume adjacent SNPs in each population that are spaced 10 Kb from each other. Let Z denote a set of pools of distinct reference ancestral populations. Let N j and p j be the total variant allele count and observed population allele-frequency in the admixed population (A), N jk and p jk be the total variant allele count and the population observed allele-frequency in a particular reference population k~1,2, . . . ,K of unrelated individuals at SNP j. Given different combinations C of L~DZD reference populations of unrelated individuals from each pool S i [Z~N L , (i~1, . . . ,L), each combination C of L reference populations can be obtained from the Cartesian product T~P L i S i ,C[Z. Thus, from each C[Z we construct synthetic populations consisting of L populations as the following linear combination, where a l is the ancestral proportion. A particular combination of L populations consists of best proxy ancestries of A if their linear combination (in equation 1) minimizes a constructed (equation 2) objective functionF F j &F ST (A,p ja ).F F j is approximated from a classical F ST function in order to render the optimization problem convex. The model described in here is related to optimal quadratic cone programming, where the objective functionF F j is given by, at SNP j, subject to X L l~1 a l~1 and a l ƒ0, Vl[f1, . . . ,Lg: is a generalized objective function such as that described in Price et al. (2009a) [6] and is a quadratic convex function with respect to a l (ancestry proportion, i~1, . . . ,L), therefore a global minimum can be found. Expanding and rearranging equation 2 (see File S1), we obtain a matrix representation of the optimal Cone Programming of the form, where a is a vector of L-dimensions of unknown ancestry proportions, G is an identity vector of L-dimensions, A is a vector of allele frequencies of L-dimensions, P is a positive semi definite matrice, and its diagonal elements are all coefficients of the quadratic term, and the mixture coefficients a l a n consist of its symmetric elements, and are given by: and the linear coefficients a l are the elements of vector q in equation 3, and are represented by: For the optimization of equations (3) or (2) with respect to a l (ancestry proportions, l~1, . . . ,L), the matrix form in equation (3) is constructed by summing equations (2), (4), (5) and (6) respectively across all SNPs. The details of the above model can be found in the File S1.

PROXYANC: Proxy-Ancestry Score
When admixture occurs between two or more previously isolated populations with differences in allele frequency, admixture creates linkage disequilibrium (LD) between genetic loci. Accounting for this assumption, we can compute the proxy ancestry score from the data of the admixed population and pair-wise reference populations. Computing the correlation between the LD in the admixed population and allele frequency differentiation in each pair of ancestral populations, the Proxy-Ancestry Score algorithm is as follows: (1) Given N samples from the data of the admixed population and the data of K groups of reference populations without missing genotypes data, we compute the expected squared correlation r 2 between each pair of SNPs S i and s j , (i=j) in the data of the admixed population.
Taking the Fisher's transformation on r 2 , thus, we compute the LD for each pair of SNPs located at distances (,0.2 Morgans), In this way, a smaller p-value corresponds to a larger p n k .
(1) For each reference population k~1, . . . ,K, we compute the proxy ancestry score, p score k~X p n k ffiffiffiffi K p p n , k~1, . . . , K: ð10Þ (6) To determine whether the proxy ancestry score in equation 14 is higher than expected, we normalized it. To address this we consider a vector of all proxy ancestry scores.
V~(p score 1 , . . . , p score k{1 , p score kz1 , . . . , p score K ) excluding p score k , and we compute the normalization of it as follows, All the methods above have been implemented in a python module called PROXYANC (http://www.cbio.uct.ac.za/ proxyanc).

Simulation Framework to Test PROXYANC
To start our simulation, we independently phased each putative ancestral population, including European (CEU), isiXhosa, {Khomani, East Asia (CHD) and Gujarati Indian using BEAGLE [39]. To generate n diploid admixed individuals, our simulation framework uses 2n ancestral haplotypes, where n should be the minimum sample size among the parental populations. Therefore, we independently expanded each putative ancestral population following Rogers and Harpendings (1992) model of exponential population growth. We implemented this model using three where an initial population of effective size N 0 , is assumed to grow exponentially to a new size of N 1 at a time t generations back from the present. The mutation rate m, is the per-generation probability that a mutation strikes a random nucleotide along the genome. Therefore, each ancestral population was expanded to a total size of 1500 plus its original size. We split the resulting samples in two separate groups. 1500 samples from each of these expanded reference populations were used to simulate admixed individuals and the remaining samples were set aside. The original population samples were used to test PROXYANC.
To simulate the genome of an admixed individual that can mimics the genetic make-up of a multi-way admixed population such as the SAC, we sample haplotypes from European (CEU), isiXhosa, {Khomani, East Asia (CHD) and Gujarati Indian with probability related to our prior estimate of the ancestral proportion from each putative ancestral population (20%, 32%, 29%, 8% and 11%, respectively). Considering a continuous gene flow model [14] in 100 generations and accounting for the Wright-Fisher model with random mating, from the beginning to the end of each chromosome, the ancestry is re-sampled using related ancestral proportion above, at each SNP in order to identify the occurrence of the admixture event. Following this process, the chromosomal segment of ancestral population is copied to the genome of the admixed individual, and records the locus-specific ancestry (the true ancestry) which will serve to assess the estimated ancestry. Using this procedure, we simulated the genomes of 750 individuals of mixed ancestry from Europeans (CEU), isiXhosa, {Khomani, East Asia (CHD) and Gujarati Indian.
To evaluate PROXYANC, we applied both approaches implemented in PROXYANC (F ST -optimal quadratic cone programming and proxy-ancestry score) to select the best ancestral proxy for the above simulated data. Since the true number of ancestral populations is known, one can choose closely related or geographically close populations to the true ancestral populations. In real data, where the number of ancestral populations is unknown, it is necessary to do a pre-population structure using ADMIXTURE [11], for example. Here, we use a pool of 20 reference populations geographically close to the true ancestral populations, including CEU, Italian, French, Russian, Gujarati, Pathan, Druze, isiXhosa, Bantu South Africa, Herero, Kongo, Yoruba, {Khomani, Ju|'hoan, SAN, Bushmen, dai, Chinese(CHD) Japanese (JPT) and Daur. Particularly, for these five putative ancestral populations (CEU, isiXhosa, {Khomani, East CHD and Gujarati), we used the initial samples that were not used in either expansion or the simulation of the admixed population in order to avoid overestimate.
To assess the impact of selecting the best reference ancestral populations in estimating the admixture proportions, we separately ran the ADMIXTURE software [11] on the simulated data together with the expanded and initial samples from ancestral populations (CEU, isiXhosa, {Khomani, CHD and Gujarati Indian), respectively (as described above). We also ran ADMIX-TURE on the simulated data together with a panel that included reference populations that are geographically close to the selected proxy ancestral populations, including Russian, Japanese, Palestine, Yoruba and Ju|'hoan. This allowed us to assess the estimated admixture proportions with a mis-identified source population versus the true proportions.
To investigate if a restricted panel of only the selected best proxy ancestral populations of an admixed population is useful in imputing accurate genotypes (one could use all available reference populations), we removed 2,044 out of 39,064 SNPs on chromosome 1 from the simulated data, and we imputed them using 4 different sets of reference populations, including a panel of populations (CEU, CHD, GIH, isiXhosa, {Khomani) used directly in the simulation, a panel formed of initial samples from ancestral populations (CEU, CHD, GIH, isiXhosa, {Khomani) used to test PROXYANC, a panel of all 20 populations listed above and a panel formed by inappropriate proxy ancestral populations, including Russian, Japanese, Palestinian, Yoruba and Ju|'hoan populations. his allowed us to assess the imputation accuracy rate using different reference panels.

Admixture Estimation and Principle Component Analysis of the SAC
We applied the clustering algorithm implemented in ADMIX-TURE [11] to determine the ancestral population clustering on each continental data set (African, European, South Asian, East Asia, Middle East, American and Oceania, see populations in Table S1) merged separately with the SAC data. Subsequently, the best proxy ancestral populations were merged with the SAC data for supervised clustering. Averaging the SAC individual admixture proportions, we obtained the population admixture proportions (ancestry contributions). The DISTRUCT program [40] was applied on Q-matrices. In order to perform principal component analysis (PCA) to evaluate the extent of substructure of the South African Coloured population, the smartpca programme in the EIGENSOFT package [10,38] was applied on the data sets described above.

Admixture Linkage Disequilibrium
Increased LD in a population relative to its ancestral population(s) can be due to founder events or population bottlenecks [41]. To determine whether the SAC has undergone an extreme bottleneck, we compared the significance level of increase in LD at short distances (,0.1 cM) and long distances (.0.2 cM), within and between the SAC and its proxy ancestors.
To account for the sample size effect in computing the LD, we first scaled each population sample, including the SAC sample, to roughly equal size. The LD-r 2 values were computed for all SNPpairs along the genome. Thus, we directly compared the LD-r 2 for each SNP-pair by ranking the number of pairs that had higher LD-r 2 (.0.5) in the SAC to that in each proxy ancestral population. Furthermore, we computed the correlation of interproxy ancestral allele-frequency differences and LD-r 2 in the SAC. The allele-frequency differences were calculated on the first (d 1 ) and second (d 2 ) SNP based on the pair of SNPs having LD-r 2 w0:5 in the SAC. The correlations were then computed between d 1 Ã d 2 and LD-r 2 in the SAC. We reported on the average p-values and the correlations. To see whether the level of the observed admixture in the SAC can account for the increased LD, we also estimated the maximum expected admixture LD from each pair of reference ancestral populations and compared them with the observed LD in the SAC. Given the LD and allele-frequency from a pair of unrelated ancestral populations X and Y of the admixed population Z, the admixture LD metric D Z is related to the LD D X and D Y from X and Y [41,42], and is modelled as, at SNPS, s 1 and s 2 , where m is the ancestral proportion. This equation is a quadratic equation of the second order of the form m 2 zbmzc, where a~{d s 1 |d s 2 , b~D X {D Y zd s 1 |d s 2 and C~D Y . We denoted d s 1 and d s 2 as the difference in allele frequency at genetic marker s 1 and s 2 from X and Y populations.
To obtain the admixture proportion m at which admixture LD reaches its maximum, we differentiate D Z with respect to m and obtain the maximum expected admixture LD as To assess the admixture LD, we compute the expected square correlation between the observed LD in a recently admixed population and D exp from each pair of candidate proxy ancestral populations. The above described method is also implemented in PROXYANC (http://www.cbio.uct.ac.za/proxyanc).

Genetic Diversity and Haplotypes Shared IBD in the SAC
In order to compare the level of admixture in the SAC, we computed the proportion of IBD and the pairwise population concordance (PPC) test. For the pairwise identity-by-state (IBS) test, we ran PLINK with 10,000 permutations between populations in the same data set (SAC versus each proxy ancestral population). We coded the SAC as cases and its proxy ancestries as controls. We calculated the empirical p-values to determine whether case/case-pairs were less similar to each other compared to control/control-pairs [37,43]. To compare the haplotypes shared IBD within and between the SAC and its proxy ancestral populations, the PLINK software package was used for this purpose. In addition, we computed the haplotype frequency from PLINK, for each population, we estimated the haplotype diversity as where h i is the haplotype frequency and N is the haplotype sample size. The mean haplotype diversity was reported. The haplotype frequency was computed for each population using PLINK.  (Table S1).

Supporting Information
(TIFF) Figure S7 Principal Component Analysis (PCA) of the SAC and the World-wide populations (Table S1). The first and the second eigenvectors in the PCA of the combined SAC and worldwide populations are shown. (TIFF) Figure S8 Principal Component Analysis of the SAC and both American and Middle-east populations, respectively. (Table S1)  File S1 PROXYANC: F ST -optimal Quadratic Cone Programming. (PDF)