Local ancestry prediction with PyLAE

Summary We developed PyLAE, a new tool for determining local ancestry along a genome using whole-genome sequencing data or high-density genotyping experiments. PyLAE can process an arbitrarily large number of ancestral populations (with or without an informative prior). Since PyLAE does not involve estimating many parameters, it can process thousands of genomes within a day. PyLAE can run on phased or unphased genomic data. We have shown how PyLAE can be applied to the identification of differentially enriched pathways between populations. The local ancestry approach results in higher enrichment scores compared to whole-genome approaches. We benchmarked PyLAE using the 1000 Genomes dataset, comparing the aggregated predictions with the global admixture results and the current gold standard program RFMix. Computational efficiency, minimal requirements for data pre-processing, straightforward presentation of results, and ease of installation make PyLAE a valuable tool to study admixed populations. Availability and implementation The source code and installation manual are available at https://github.com/smetam/pylae.


INTRODUCTION
In association studies, researchers combine samples with different (usually opposing) phenotypes and compare Single Nucleotide Polymorphisms (SNPs) frequencies in two groups. When enough samples are available (typically, thousands for complex traits), a significant difference in frequencies between the groups suggests an association between the position on the genome and the studied phenotype. However, there is a possibility that the association is due to inhomogeneity of the study group in terms of provenance/ origin (for example, all people with the disease are of French origin, and the healthy cohort is Bulgarian). In this case, two populations may have different frequencies of ancestry informative markers (AIM) that are not causal to the phenotype.
An intuitive approach to solving this problem is determining the population structure first and then adjusting for it. However, this strategy is complicated for admixed populations. Due to meiotic recombination during transmission of genetic material to the offspring, individual segments of the genome may have different origins. Recombination makes it hard to determine the source of such individuals (or plant populations), especially locus-specific local origin.
Several solutions to this problem exist, such as LAMP, LAMP-ANC, and RFMIX. However, there is always room to develop a user-friendly, fast, accurate approach. The LAMP (Sankararaman et al., 2008) algorithm determines local origin in mixed populations. The LAMP input includes the recombination rate, global mixing ratio, and the number of generations that have passed since the mixing started. The recombination rate can be considered known due to previous studies (Nachman & Crowell, 2000), while the global proportion can be estimated using, for example, the ADMIXTURE tool (Alexander, Novembre & Lange, 2009). The idea of the LAMP method is as follows. The iterated conditional modes clustering algorithm (ICM) determines the likelihood that a genome segment within a selected window has a specific origin. An individual SNP is assigned the source by the "popular vote" approach using the inferred origins of all windows covering this position. ICM is a modification of the Expectation-Maximization (EM) algorithm, which receives a point estimate at the E-step under the assumption that the a priori estimates are sufficiently accurate. Therefore, ICM is faster than the EM approach. To construct an accurate a priori estimate, LAMP uses the MAXVAR algorithm, which works only for two populations. The size of the genomic window is selected to minimize classification errors. LAMP works fast, but its accuracy decreases if the ethnic admixture ratio is close to 1:1.
LAMP-ANC (Pasaniuc et al., 2009) is a modification of the LAMP tool, showing a higher accuracy than LAMP. This modification also allows triple mixing to be estimated, while LAMP cannot determine frequencies for more than two ancestral populations.
The Machine Learning algorithm RFMIX (Maples et al., 2013;Uren, Hoal & Möller, 2020) treats origin as a hidden parameter in its statistical model. RFMIX employs conditional random fields and decision trees. The RFMIX model does not impose restrictions on the number of ancestral populations and types of mixing.
We developed PyLAE-a new tool for determining local origin along a genome using whole-genome sequencing data or high-density genotyping experiments. PyLAE can process an arbitrarily large number of ancestral populations (with or without an informative prior). PyLAE does not involve the estimation of many parameters. PyLAE is a valuable tool to study admixed populations (Hester et al., 2011;N'Diaye et al., 2011;Kantor et al., 2013). We have tested this approach using the 1000 Genomes database.

Data source
Whole-genome sequencing data in the VCF format were obtained from the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/, genome version GRCh37). We used the dataset of 2,504 worldwide individuals with the reported ethnic origin (Sudmant et al., 2015) and phased genome sequence data.

Admixture
To confirm the reported origin, we have extracted 130,000 ancestry informative markers identified by a worldwide study contacted by the National Genographic consortium, as described in (Elhaik et al., 2014). The supervised admixture analysis was performed using the K = 9 component division of ancestral populations into the following categories: North-East Asian, Mediterranean, South African, South-West Asian, Native American, Oceanian, Southeast Asian, Northern European, and Sub-Saharan African components. We used the putative ancestral populations from our earlier study (Elhaik et al., 2014). The admixture components represent proportions of an individual's genotype attributed to each of the nine putative ancestral genomes. The obtained nine-dimensional vectors were clustered based on the L2 norm (Euclidean distance). The optimal number of clusters was determined using the weighted Kullback-Leibler distance approach (Tatarinova, Bouck & Schumitzky, 2008;Tatarinova & Schumitzky, 2014). Within each group, the admixture profiles of individuals are similar. This dimension-reduction step allows the identification of potentially admixed individuals. This assignment was validated using haplogroup information and reAdmix analysis in group mode (Kozlov et al., 2015). reAdmix algorithm represents an individual as a weighted sum of present-day populations (e.g., 50% British, 25% Russian, 25% Han Chinese) based on K admixture components. Instead of solving an "exact admixture" problem, we aim to find the smallest subset of populations whose combined admixture components are close to those of the individuals within a small tolerance margin. Due to the range of natural variation, the admixture proportions can be considered exact neither for the reference populations nor for the tested individuals. The admixture proportions we use are merely maximum likelihood estimates and may fail to be exactly equal to the actual shares of ancestral genomes. Therefore, determining ethnic proportions is mathematically and computationally more challenging than finding a single most fitting bio-origin. From our experience with reAdmix, this tool accurately identifies all significant components; however, small proportions are frequently misassigned. When similar individuals are analyzed as a group, it is possible to estimate even the small proportions reliably.

Phasing
The phased data were obtained for the 1000 Genomes Project from the open repository https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/. Sudmant et al. (2015) used short-read Illumina DNA sequencing data to infer haplotype blocks in 26 human populations.

Two modes
We have developed two modes: "diploid" and "haploid". For the "diploid" mode, we used the samples without pre-processing. For the "haploid" mode, we used the phasing information provided by 1000 Genomes and assumed that the parents were homozygous and simulated two homozygous "parents" for each sample. The admixture vector, used as an informative prior for "diploid" and "haploid" modes, is generated by applying the Admixture tool (Alexander, Novembre & Lange, 2009) to the original "diploid" sample in supervised mode.

Local ancestry bayesian approach (PyLAE)
Input: Tested Individual: VCF file and (optional) admixture vector obtained by Admixture in supervised mode, as described above.
Reference: multi VCF file with putative ancestral populations corresponding to K component admixture.
N is the number of positions overlapping between the reference and the individual.
The number of genomic positions should be significantly large to ensure dense SNP coverage, while not being too large since the markers should be uncorrelated. Therefore, we recommend removing positions based on high levels of pairwise linkage disequilibrium (LD-pruning). It is a common practice to exclude positions with pairwise genotypic r 2 > 0.8 within sliding windows of 50 SNPs (Alexander, Novembre & Lange, 2009).
Stage 1. Bayesian posterior probability Posterior probability PðpopulationjgenotypeÞ is calculated using the Bayes formula: The prior probability of a population PðpopulationÞ is equal to the analyzed individual's admixture vector. We assume that adjacent positions have the same origin and the origin of all positions within a window of length L. Since the distance between two consecutive ancestry informative markers is sufficiently large, positions within the same window can be considered independent.
Pð genotype 1 …genotype L f g j populationÞ ¼ Y L i¼1 Pðgenotype i jpopulationÞ Therefore Pðgenotype i jpopulationÞ is estimated from observations, using genotypes of putative ancestral populations in position i P genotype ð Þ¼ X K k¼1 PðgenotypejPopulation k ÞP Population k ð Þ Therefore, we have K posterior probabilities for every genotype, and the matrix of posterior probabilities T Â K, where T ¼ N L .
We are interested in determining the optimal sequence of the source populations along the genome. This problem is solved using the Viterbi algorithm (Viterbi, 1967). For computational efficiency, all calculations are performed in the log-space.
Stage 2. Viterbi algorithm Our state-space S consists of K ancestral populations. Population labels (i = 1..K) are the hidden states in our model. The initial probabilities p i of being in the i th hidden state can either be assumed to be uniform with 1/K probability of each population or set to be equal to global admixture components. Transition probabilities a i;j of transitioning from the state i to j are inversely proportional to the TreeMix distances between corresponding ancestral populations. Transition probability from state i to j a ij can be calculated from distances between ancestral populations; alternatively, the same value can be used for all pairs of populations. The emission probabilities b j O ð Þ is calculated by the Bayes formula (above). At the first step of the algorithm d j 1 For efficiency, the computation is performed in the log-space. The algorithm is implemented as a python script and can be adapted to analyze any organism using user-provided ancestral components, prior probabilities, and transition matrix.

Application of local ancestry
Most of the currently living individuals are admixed to various degrees. To compare genotypes between populations and analyze selection signals, it is essential to identify and exclude introgressed regions containing non-representative genotypes. We have determined local ancestry profiles for all 2,504 analyzed individuals from the 1000 Genomes project. With our approach, even the admixed individuals that are typically excluded from the analysis (therefore, reducing the statistical power of the study) can be retained after masking the introgressed regions.
It is reasonable to assume that pathways accumulate nonsynonymous SNPs at the same rate during neutral evolution; therefore, the pathways' enrichment scores can be approximated by a normal distribution. The pathways under selection will appear as outliers.
To analyze differences in numbers of SNPs per pathway between the two groups of populations (groups A and B), we need to calculate: (1) DSSE scores (synonymous) and (2) DNSE scores (nonsynonymous) (Chekalin et al., 2019). Therefore, we perform the following steps: (1) Find the number of (non)synonymous SNPs in groups A and B.
(a) Let I represent the total number of studied pathways, and i = 1,…, I, be the number of the (non) synonymous SNP per i th pathway are nS(i) and nA(i). The expected fraction of (non) synonymous SNPs in individuals from group A is given by p ¼ nA nBþnA , where nA is the amount of (non) synonymous SNPs in all KEGG pathways found in group A, nB is the amount of (non) synonymous SNPs in all KEGG pathways found in group B. The fraction p i of A (non)synonymous SNPs in the i th KEGG pathway is (2) The enrichment D(N)SE scores are computed for every pathway with continuity correction: (3) P-values are calculated using Bonferroni and Benjamini-Hochberg corrections and used to identify differentially enriched pathways. A pathway is considered to be differentially enriched if the adjusted P-value < 0.005 (Benjamin et al., 2018).
(4) To consider the excess of synonymous SNPs over nonsynonymous SNPs, we calculate enrichment scores for synonymous SNPs, DSSE. To be considered significant, the P-value of the nonsynonymous test is required to be below the corresponding P-value of the synonymous test for each pathway.
Using PyLAE with different genomes and/or sets of markers A different set of putative ancestral populations or a different set of markers require additional processing. First, we need to collect a database of putatively un-admixed individuals. If there is an existing validated set of ancestry informative features, these markers should run the admixture in supervised mode. For each self-reported ancestry, samples should be clustered based on their admixture profiles to identify subgroups within each self-reported ancestry. These subgroups are then examined using information about the studied population's history, and the most representative subset is retained. Then, putative ancestral populations (from 15 to 20 individuals per group) are generated for every ancestry. The validity and stability of the ancestral populations are evaluated using (1) PCA, (2) leave-one-out supervised admixture, and (3) by application of supervised admixture to the original dataset.

Investigation of the reference dataset
An intrinsic difficulty for benchmarking an ancestry prediction pipeline is the lack of the gold standard. We used the 1000 Genomes dataset for the benchmark and evaluated the quality of the dataset. At first, we have investigated the 1000 Genomes dataset to determine intrinsic limitations to accuracy. Our first step is a comparison between phased and unphased data. PyLAE can process both phased and unphased data, which is both a blessing and a curse. It is a blessing since some genomes may not have phased data, and it is a curse since the accuracy may suffer. Therefore, we first evaluate the constraints that are imposed by different components of the pipeline. The first component is the calculation of the Admixture (Alexander, Novembre & Lange, 2009) profiles.

Admixture profiles in Diploid vs. Haploid modes
Nine-dimensional admixture profiles were calculated for 2,504 individuals (130 K markers each) using supervised admixture. Every sample was analyzed first in diploid and then in haploid mode (Fig. 1). Diploid and haploid admixture profiles were clustered within each reported ethnic origin (Fig. 2, Tables 1 and 2). African Ancestry SW, African Caribbean, Columbian, Gambian Mandinka, Mexican Ancestry, Peruvian, and Puerto-Rican showed 2-3 subgroups within each reported ethnicity, consistent with the complex population history of these populations.
We have investigated the difference between inferred haploid parental admixture profiles and diploid admixture profiles of analyzed individuals. We have computed an average of parental haploid admixture vectors for every individual and calculated the Еuclidean distance between the vectors. The average distance for all individuals was 0.092. On average, the largest differences were observed for Toscani, Iberian, Columbian, Mexican Ancestry, and Puerto Rican individuals. In diploid mode, the average value of the Mediterranean component in Toscani individuals was 60%, South West Asian −23%, and Northern European −15%. In haploid mode, the Mediterranean component was 82%, and South-West Asian −18%. The same situation happened with the Iberian samples. In the diploid mode, the average value of the Mediterranean component was 55%, South West Asian −19%, and Northern European −24%. In the double haploid mode, the Mediterranean component was 72%, South West Asian −16%, and Northern European −12%. Similar inflation in the Mediterranean component was accompanied by a reduction in the Northern European components.
To investigate the source of this difference, we have identified the most significant admixture component for every sample and calculated the difference between the average parental value of this component and its diploid value. The difference ranged from −0.0057 to 0.3063; the mean difference was 0.0684, indicating that haploid admixture tends to increase the value of the largest component. The magnitude of the increment grows approximately linearly for components <0.87 and then reduces linearly for components above 0.87. Worldwide populations show different trends (Fig. 3). For individuals of European and East Asian descent, the trend is positive linear, and the larger the value, the more it is inflated. For South Asian individuals, there is no clear trend. For African and American individuals, there are multiple linear trends. The slope of the line is defined by the relative values of the admixture components. For example, among Columbians, a larger slope corresponds to the individuals where the average ratio of   Mediterranean to the sum of Sub-Saharan African and Native American components is 2.07. In comparison, a smaller slope corresponds to a 0.56 ratio.

Population-specific accuracy limitations of the admixture-based approach
We have investigated the accuracy limitations of the dataset and admixture-based approaches. First, for each individual's admixture vector, we have calculated the nearest population, based on Euclidean distance between admixture vectors, using the leave-oneout approach; then we have computed fractions of self-hits (the nearest population agrees with the population label of the individual), and fractions of super population  self-hits (the agreement is at the level of super-populations). Admixed populations like African American, African Caribbean, Colombian, Mexican in the USA, Peruvian, and Puerto Rican have the lowest assignment accuracy at both population and subpopulation levels. We also noticed that West African populations, such as Yoruba, Esan, Gambian Mandinka, and Mende, are frequently misclassified. Although they speak languages from the same Niger-Congo group, their genomes are not identical (Skoglund et al., 2017;Fan et al., 2019). Simplified representation using nine admixture components is artificially "collapsing" those groups to a single point in 9-dimensional space. GPS and reAdmix (Table 3, Fig. 4) analyses suggest that population labels were not 100% accurate but were correct at the superpopulation level. For example, about 30% of Yoruba individuals were mapped to Yoruba reference, but all were mapped to various African reference populations. Several factors can explain this: first, there is true Results of GPS and reAdmix analyses. For each population, we report factions of self-hits (when the tool correctly identifies the population label) and superpopulation hits (when a superpopulation is correctlyidentified).
heterogeneity of individuals and fuzzy boundaries between populations; second, there may be a human error factor on behalf of the sampled individual or a researcher collecting data; the third factor is the unreported admixture from different populations. Native and African American individuals appear to be the most affected by these factors; this observation agrees with known historical events. Therefore, we expect to see admixed regions along the genomes of affected individuals.

Performance of the PyLAE algorithm
We have tested the PyLAE algorithm on 2,504 individuals of the 1000 Genomes dataset using: diploid and haploid versions, with informative and noninformative priors. Informative prior is assumed to be equal to the global admixture vector. The noninformative prior is uniform, assuming equal probabilities for a region to be from any of the K reference populations. Figure 4 ReAdmix assignment matrix heatmap. reAdmix analysis shows that population labels may not be 100% accurate; however, they are correct at the superpopulation level. This may be explained by the presence of admixed individuals in the dataset, and by shared histories by various groups. The assignment matrix indicates the percentage assignment. The color intensity of the heatmap shows the probability of an individual from a reference population (columns) being assigned to a given population (rows). The darkest shade indicates the highest probability and the lightest shade indicates the lowest. Full-size  DOI: 10.7717/peerj.12502/ fig-4 Since we do not know the origins of regions of the genome, we chose the agreement with the global admixture vector as a measure of performance. For each analyzed individual, we have computed a total fraction of a genome attributed to each of the K components and compared it to the global admixture vector. We have calculated Euclidean distance and Pearson correlation of aggregated results and admixture vectors. The results are shown in Fig. 5 and Table 4.
We compared PyLAE with the current local ancestry gold-standard tool RFMix (Maples et al., 2013;Uren, Hoal & Möller, 2020). We used the 1000 Genomes dataset and have selected 150K ancestry-informative markers. PyLAE and RFMix use different inputs and require separate data preparation steps. For RFMix, the following actions were performed: (1) phase ancestral samples (2) obtain genetic maps. The phasing was performed with Beagle (Browning, Zhou & Browning, 2018). PyLAE window size was set to 15, 30, 50, and 100 SNPs; no genetic map information was used. PyLAE was tested with the prior (results of admixture in supervised mode) and without prior, only the 'phased' case. We have also tried various values of the transition penalty. The probability of changing the population assignment (transition probability) was modeled as a i6 ¼j ¼ 1 nþx , where n is the number of populations and x is the penalty, that was ranged from 0 (uniform prior), to 10,000.
We aggregated the results according to the following rules: for PyLAE-we have computed a fraction of windows assigned to each ancestral population; for RFMix, we have calculated a fraction of markers assigned to each ancestral population. Then the results were compared with the global ancestry assignments calculated by Admixture (Alexander, Novembre & Lange, 2009). We simulated admixed individuals with the known origin of each genomic fragment and computed the fraction of correctly assigned positions. The choice of the optimal window size (measured in the number of SNPs per window) is determined by the ancestral composition of the analyzed sample (Fig. 6). Overall, an increase in the transition penalty suppresses switches between predicted populations and results in smoother prediction. Therefore, we recommend using the penalty of 10,000 or higher unless there are reasons to believe that the sample is highly mosaic. Table 5 shows that for individuals with East Asian and European ancestry, PyLAE has a higher correlation with corresponding global ancestral composition compared to RFMix. For African, American, and South Asian individuals, RFMix had a higher correlation. It is important to notice that global and aggregated local ancestries were in  Therefore, we propose that PyLAE and RFMix complement each other and be used as a part of the same local ancestry pipeline to improve prediction accuracy.

Application of local ancestry
Using the ANNOVAR (Wang, Li & Hakonarson, 2010;Yang & Wang, 2015) and snpEFF (Cingolani et al., 2012), we have performed variation annotations and calculated the counts of synonymous and nonsynonymous SNPs. We have conducted this analysis in several modes: (1)  Since we use a nine-component representation, all Africans in our datasets were assigned to the same Sub-Saharan African component. At the same time, African Americans have various admixture levels from other components. Therefore, local ancestry mode only partitions genomes of European and African American individuals.
On average, PyLAE has higher enrichment scores compared to the whole-genome approach. Comparing the enrichment scores between differentially enriched pathways (P-value < 0.01), in 91 cases, PyLAE resulted in higher enrichment scores and 41-in lower scores for African Americans. If the significance cut-off is lowered, PyLAE results in a higher number of differentially enriched pathways (Fig. 7).
Application of PyLAE to African American samples allowed the detection of eleven differentially enriched pathways that were not detected when entire African Americans' genomes were used but appeared when Europeans were compared with Africans from Africa. According to PyLAE, the following pathways were enriched in nonsynonymous SNPs: Africans: Lipoic acid metabolism, Vitamin B6 metabolism, Fatty acid biosynthesis, Notch signaling pathway, Type I diabetes mellitus, Allograft rejection, Cell adhesion molecules (CAMs), Phagosome, Malaria, Viral myocarditis, HTLV-I infection; Europeans: Nitrogen metabolism, Mismatch repair, Cell cycle, Glyoxylate and dicarboxylate metabolism, Porphyrin and chlorophyll metabolism, Insulin resistance, Tight junction, Selenocompound metabolism, Oocyte meiosis, Amyotrophic lateral sclerosis (ALS), Protein digestion and absorption.
Dependence on ethnicity efficiency has been reported in processes and conditions related to the above-listed pathways: vitamin B6 metabolism (Gong et al., 2014), fatty acid biosynthesis (Sergeant et al., 2012;Ameur et al., 2012;Kothapalli, Park & Brenna, 2020), resistance to malaria (Singh & Dhar, 2014;Gelabert et al., 2017), the prevalence of viral myocarditis (Shaboodien et al., 2013) and amyotrophic lateral sclerosis (Rechtman et al., 2015;Henning et al., 2021). According to recent studies (Nédélec et al., 2016;Singh et al., 2020), individuals with African ancestry have a stronger inflammatory response and reduced intracellular bacterial growth. Maintaining a strong immune response can have adverse side effects, such as autoimmune disease. Therefore, phagosome efficiency declined after migration out of Africa since Europeans were exposed to fewer pathogens, reducing immune response over generations.
In summary, PyLAE is an easy-to-use, accurate and fast tool for local ancestry analysis that can be used with minimal prior information about the genome or samples. Figure 7 Fraction of significantly enriched pathways as a function of a significance cut-off. PyLAE has higher enrichment scores compared to the whole-genome approach. Comparing the enrichment scores between differentially enriched pathways (P-value < 0.01), in 91 cases, PyLAE resulted in higher enrichment scores and 41 in lower scores for African Americans. If the significance cut-off is lowered, PyLAE results in a higher number of differentially enriched pathways.