Comparing local ancestry inference models in populations of two- and three-way admixture

Local ancestry estimation infers the regional ancestral origin of chromosomal segments in admixed populations using reference populations and a variety of statistical models. Integrating local ancestry into complex trait genetics has the potential to increase detection of genetic associations and improve genetic prediction models in understudied admixed populations, including African Americans and Hispanics. Five methods for local ancestry estimation that have been used in human complex trait genetics are LAMP-LD (2012), RFMix (2013), ELAI (2014), Loter (2018), and MOSAIC (2019). As users rather than developers, we sought to perform direct comparisons of accuracy, runtime, memory usage, and usability of these software tools to determine which is best for incorporation into association study pipelines. We find that in the majority of cases RFMix has the highest median accuracy, with the ranking of the remaining software dependent on the ancestral architecture of the population tested. Additionally, we estimate how the runtime and memory usage of each software scale with sample size (n), and find that both increase linearly for most software. The only exception is RFMix, whose runtime increases quadratically while its memory usage increases linearly. Effective local ancestry estimation tools are necessary to increase diversity and prevent population disparities in human genetics studies. RFMix performs the best across methods; however, depending on the application, other methods perform just as well with the benefit of shorter runtimes. Scripts used to format data, run software, and estimate accuracy can be found at https://github.com/WheelerLab/LAI_benchmarking.


Background
Humans are a chromosomal mosaic of their ancestors. Through sexual reproduction and recombination, chromosomes resemble a subset of their ancestors' chromosomes in varying sizes and locations across the genome [1]. Large scale studies of the genetics underlying human disease have been limited to predominantly European populations and thus lack global diversity, which exacerbates health disparities [2,3]. It is well documented that prediction accuracy with polygenic risk scores decreases with increasing genetic distance [4,5]. In addition, many underrepresented populations in human genetics include recently admixed individuals, meaning their ancestors were previously isolated from each other on different continents until the last few centuries. This leads to chromosomal tracts originating from different continental populations in modern populations like African Americans and Hispanics.
Population structure is a potential confounding factor in all genetic association studies. Global ancestry is the proportion of different ancestral populations represented across the entire genome. Genotypic principal components are used to adjust for these average genomic background effects in genetic association studies [6]. However, correcting only for global ancestry does not precisely account for ancestry at any specific locus. Local ancestry is the number of alleles derived from distinct ancestral populations at a given locus and may improve power to detect genetic associations in admixed populations [7-10]. For example, a recent expression quantitative trait locus (eQTL) mapping study in African Americans found a greater replication rate of eQTLs discovered via models that adjust for local ancestry, rather than models that adjust for global ancestry [11].
Several models have been developed to estimate local ancestry in admixed populations [1,12-17]. By leveraging population- or continent-specific SNPs, chromosomal tracts can be differentiated into their ancestral segments. Chromosomal regions are compared to reference populations of non-admixed ancestry to find which sections of the chromosomes descend from which continental region [1,18]. These estimates depend largely on the reference populations used, the genetic distance between the reference samples, the quality of the input genotypes, and, most importantly, the statistical models. LAMP-LD demonstrates strong ancestry estimation in recently admixed cohorts of African and Hispanic descent [12]. ELAI and Loter both report stable performance in instances of ancient admixture (number of generations ≥ 100), outcompeting methods that prioritize recent admixture [14,16]. Additionally, Loter reports high performance in nonhuman species [16]. Similar to LAMP-LD, RFMix and MOSAIC each specialize in multi-way admixture. Unlike LAMP-LD, neither is constrained in the number of ancestral populations. Both RFMix and MOSAIC are reported to have robust performance even when reference panels are not closely related to the study population. MOSAIC reports the added benefit of elucidating the relationship between all provided references and the study population and selecting the optimal references, thus circumventing the need to clarify the relationship between study and available reference populations [13,17].
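To make the distinction between local and global ancestry concrete, the following minimal Python sketch may help. It is not part of the original analysis: the function name and the example values are illustrative assumptions. Local ancestry is encoded as the number of alleles (0, 1, or 2) of one ancestry at each locus, and a global ancestry proportion is obtained as the genome-wide average.

```python
import numpy as np

def global_ancestry(local_dosage):
    """Average per-locus ancestry dosage (0, 1, or 2 copies) into a proportion.

    `local_dosage` is a 1-D array for one individual, or a 2-D array of
    shape (individuals, loci). The name and layout are assumptions made
    for this illustration only.
    """
    local_dosage = np.asarray(local_dosage, dtype=float)
    # Each locus carries two alleles, so divide the mean dosage by 2.
    return local_dosage.mean(axis=-1) / 2.0

# Hypothetical individual: African-ancestry allele counts at 8 loci.
local = np.array([2, 2, 1, 1, 0, 0, 2, 1])
print(global_ancestry(local))  # 9 of 16 alleles -> 0.5625
```

Two individuals with the same global proportion can differ at any particular locus, which is why adjusting for the average alone does not account for ancestry at a specific variant.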
To satisfy the growing call for increased diversity in genome-wide association studies [2,3], local ancestry estimation methods will become increasingly important in human genetics. While a recent review compared the underlying models of several local ancestry estimation software tools [19], accuracy and run time were not directly compared. A study from 2017 compared run time and memory usage of four older tools [20], but did not include the widely used RFMix [13] and newer tools MOSAIC [17] and Loter [16]. Here, we independently compare five local ancestry estimation methods for accuracy and feasibility by simulating admixed chromosomes from both two and three ancestral continental populations.

Results
We prioritize benchmarking each software in the context of recently admixed populations to assess accuracy and estimate previously unreported time and memory complexity. We selected five software for a combination of their novelty and relative popularity. LAMP-LD [12], ELAI [14], and RFMix [13] are each established local ancestry software that have been cited numerous times in the field of population genetics. Conversely, MOSAIC [17] and Loter [16] are fairly new, having been published in the last two years at the time of writing. A brief summary of their differences can be found in Table 1.

Simulating admixed individuals
We simulated admixed populations with ancestry proportions similar to those observed in previous studies [18]. These include two-way admixture between YRI and CEU representing a common pattern of descent for African American individuals (AFA); two-way admixture between PEL and CEU representing one common pattern of descent for some Hispanic individuals (HIS); and three-way admixture between PEL, YRI, and CEU, representing another common pattern of descent among some Hispanic individuals (3WAY) [18]. For each admixture group, we simulated 1000 individuals and selected 100 that had European ancestry within 10% of the admixture proportions listed in Table 2. We summarize our workflow in Fig. 1; see Methods for details.

Runtime and memory usage

Runtime increases with number of individuals
We simulated an additional 2000 individuals based on the AFA admixture proportions at 7 generations since admixture. We randomly subset these individuals into sets of 2000, 1500, 1000, 500, 100, 50, and 20 to test how each software scales with increasing sample size (Fig. 2). We find that the runtimes of four of the five software tools scale linearly with the number of samples; the exception is RFMix, which scales quadratically (Table 3). We also note that MOSAIC runtime decreases when n=2000. MOSAIC exits the iterations of its expectation-likelihood algorithm early when the log-likelihood decreases, resulting in cases where it finishes faster than would be expected under a standard linear model [17].

Memory increases linearly with number of individuals
We simultaneously measured the memory burden at each sample size (Fig. 3). We found that the memory usage of all software grows linearly or near linearly with sample size (Table 4). Loter had the steepest slope and ELAI the smallest, so ELAI has the most stable memory requirement across sample sizes. At high sample sizes ELAI had the lowest memory overhead, but at low sample sizes (n ≤ 100) its memory requirement was the third highest.
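The scaling comparison described above can be sketched as follows. This is a minimal illustration in Python rather than the base R regression used in the study, and the runtimes are made-up values, not measurements from this paper. It fits linear and quadratic models of resource usage against sample size and reports the R² of each fit.

```python
import numpy as np

def fit_polynomial(n, y, degree):
    """Least-squares polynomial fit of y on n; returns (coefficients, R^2)."""
    n = np.asarray(n, dtype=float)
    y = np.asarray(y, dtype=float)
    coefs = np.polyfit(n, y, degree)          # highest-order term first
    fitted = np.polyval(coefs, n)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coefs, 1.0 - ss_res / ss_tot

sample_sizes = [20, 50, 100, 500, 1000, 1500, 2000]
# Hypothetical runtimes in seconds with a quadratic component (assumed data).
runtime = [0.002 * n**2 + 3.0 * n for n in sample_sizes]

lin_coefs, lin_r2 = fit_polynomial(sample_sizes, runtime, degree=1)
quad_coefs, quad_r2 = fit_polynomial(sample_sizes, runtime, degree=2)
print(f"linear R^2 = {lin_r2:.4f}, quadratic R^2 = {quad_r2:.4f}")
```

Because R² can never decrease when a higher-order term is added, a formal comparison would also test whether the added term is significant, for example with the ANOVA p-values reported in Tables 3 and 4.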

Figure 1 Process for simulating admixed individuals and estimating ancestry. 1) From non-admixed populations from 1000G we randomly select 10% of all individuals to use as founders for admixture simulation. The remaining 90% are used as reference panels for ancestry estimation. 2) We generate admixed individuals that are chromosomal mosaics of the founder group using the admixture simulation tool created by the authors of RFMix [13]. 3) Using the remaining 1000G individuals as reference panels, we estimate ancestry on our simulated population with all five software. Afterwards we compare estimation accuracy across software.

Table 4 Linear maximum memory usage estimated O(n). We fit a linear model between the maximum memory usage and sample size for each software. We report the estimated β1, model R², and ANOVA p-value for each combination of software and model.

Figure 3 Software memory usage versus sample size. We tested the maximum memory usage of each software on one core at sample sizes of 20, 50, 100, 500, 1000, 1500, and 2000. Points represent the sample sizes tested versus memory and are connected by line segments colored by software. We simulated n African American individuals from CEU and YRI "founder" populations with average admixture proportions of 20% and 80%, respectively. We find that maximum memory usage for all software increases linearly with the number of samples.

Increasing number of ancestries can increase runtime and memory burden
We tested if increasing the number of ancestral populations increases the computational burden of each software. We found that in all software, increasing the number of ancestral populations resulted in a significant increase in memory usage (Fig. 4). However, increasing number of ancestries did not impact the runtime for two software: Loter and MOSAIC. For the three other software, increasing the number of ancestries did significantly increase the runtime (Fig. 5).

Accuracy varies by cohort composition
For each admixture group (Table 2), we simulated 100 individuals and ran local ancestry estimation and accuracy benchmarking. Each software performs with high fidelity on two-way admixture, but we note a considerable difference between our simulated two-way AFA and HIS cohorts. We attribute this to the introduction of the PEL population as both founders and reference, as PEL individuals are themselves substantially admixed. Because their admixture overlaps with the other two reference populations, it is expected that they will introduce noise into our local ancestry estimation. For two-way admixture, RFMix and ELAI had the highest median accuracy for AFA and HIS, respectively, though all software performed competitively. For three-way admixture, RFMix had the highest median accuracy (Fig. 6). After assessing the accuracy of each software, we performed a Tukey's test to determine which pairs of software performed significantly differently. In our simulated AFA cohort, both RFMix and ELAI performed significantly better than both LAMP-LD and Loter. All other pairs were not significantly different (Additional File 1). In our HIS cohort, RFMix performed significantly better than LAMP-LD, and all other pairs were not significantly different (Additional File 2).

Software is highly correlated on real data
We ran each software as described on real admixed individuals from the ASW population of the 1000 Genomes Project with the YRI and CEU populations as reference panels. Local ancestry estimates were highly correlated between each software (Table 5). Additionally, to show the robustness of these estimates, we plot the mean local African ancestry estimated by each software against the first principal component of the genotypes, which is known to be an estimate of global African ancestry (Fig. 7) [11]. The local ancestry estimate was highly correlated with PC1 for all software tools (R² > 0.960, Table 5), with no significant difference between tools (p > 0.621).

Table 5 Between-software Pearson correlation using real data. We ran all five software on 61 real admixed individuals from the 1000 Genomes ASW population. Here we report the squared pairwise Pearson correlations of local ancestry estimates. Additionally, in the last column, we report the squared correlation of each software's estimated mean African ancestry with genotypic principal component 1.

Discussion
Local ancestry estimation is a key step in adjusting for potential population stratification in admixed populations and in elucidating the effect of ancestry-specific loci on complex traits. Given the wide variety of tools available to perform local ancestry estimation, it is necessary to explore how each performs in a particular context. Here, we focused on recent human admixture within African American and Hispanic populations, and performed complexity and accuracy analyses of five different software tools using simulated and real data.

Figure 7 We plot the mean local ancestry proportion of African ancestry estimated by each software against the first principal component of genotypes, a known estimate of global African ancestry, to validate the robustness of our local ancestry estimates. The local ancestry estimate was highly correlated with PC1 for all software tools (R² > 0.96), with no significant difference between tools (p > 0.62).

We did not consider instances of ancient admixture despite ELAI and Loter reporting robust performance in such instances [14,16], which could be one reason they underperformed in our 3-way simulation (Fig 6). In addition, Loter was designed to be compatible with many different species and both Loter and ELAI may require more fine-tuning of software parameters beyond the default settings than the other methods, especially in cases of 3-way admixture.
Here we report on how memory and time usage scale with number of individuals and not SNPs, as it is simpler and more common to scale studies by population size than by genome size. However, it is expected that most if not all software will increase in both time and memory usage given an increased number of SNPs. We find that all software perform with high accuracy in cases of two-way admixture, with RFMix and ELAI performing the best. In cases of three-way admixture, RFMix had the highest median accuracy and RFMix, MOSAIC, and LAMP-LD all performed significantly better than ELAI and Loter. While RFMix has a relatively low memory overhead, its runtime scales quadratically, severely limiting its scalability at standard GWAS sample sizes.
An important consideration in all cases is the availability of high quality reference data. Currently, Native American genetic data is not widely available due to cultural and historical incidents that have raised barriers between the tribal communities and the genetic community [21,22]. Here we use the PEL population as a proxy for non-admixed individuals of Native American descent as PEL has the highest portion of NAT ancestry among 1000G populations. However, PEL introduces noise as it contains significant admixture. This noise likely causes our HIS and 3WAY simulated populations to underperform. Still, our simulations show robust performance of several software.

Conclusion
We find that in cases of two-way simulated admixture, each software performs similarly well, with RFMix and ELAI having the highest median performance depending on the population structure. In our three-way simulated admixed population, we see a marked difference in performance, with RFMix performing best overall, followed by LAMP-LD and MOSAIC. While RFMix performs the best across methods, its quadratic runtime scaling may favor other software at large sample sizes. Robust, scalable local ancestry estimation software is crucial for equitable implementation of genetics and genomics in medicine.

Simulating Genotypes
Our workflow is summarized in Fig 1. We chose three 1000 Genomes (1000G) populations [23] to serve as non-admixed ancestral populations. From each of these populations we randomly selected 10% of individuals to use as founders for simulation of admixed individuals, and the remaining individuals made up the non-admixed reference populations. The three 1000G populations from which we drew samples are: Utah residents with Northern and Western European ancestry (CEU) for use as our European ancestral group; Yoruba in Ibadan, Nigeria (YRI) for use as our African ancestral group; and Peruvians from Lima, Peru (PEL) for use as our Native American ancestral group. We note that individuals in the PEL population have Native American, European, and African admixture; however, the PEL have more Native American ancestry than all of the other American populations in 1000G (µ = 0.77, 95% CI = [0.75-0.80]) [4]. The PEL population thus serves as a reasonable proxy for our Native American ancestral population. Simulated admixed populations fall into one of three categories: two-way admixture between YRI and CEU representing a common pattern of descent for African American individuals (AFA); two-way admixture between PEL and CEU representing one common pattern of descent for some Hispanic individuals (HIS); and three-way admixture between PEL, YRI, and CEU, representing another common pattern of descent among some Hispanic individuals (3WAY), as observed in [18]. For each admixture group, we simulated 1000 individuals and selected 100 that had European ancestry within 10% of the admixture proportions described in Table 2. Global ancestry percentages across individuals are shown in Additional Files 4-6.
We used the admixture simulation tool developed by the creators of RFMix to generate simulated admixed chromosomes [13]. We limited our simulation to SNPs on chromosome 22, for a total of 158,159 SNPs. LAMP-LD v 1.0 has a computational limit of 50,000 SNPs. In keeping with this, after simulating the entirety of chromosome 22, we independently selected 50,000 random SNPs from each cohort using the --thin-count 50000 option in PLINK [24] and subset each cohort accordingly. The code used to run the simulation can be found at https://github.com/WheelerLab/LAI_benchmarking.
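The founder/reference split and the selection of simulated individuals can be sketched as below. This is an illustrative Python sketch, not the authors' scripts (those are in the linked repository). The function names are hypothetical, and "within 10%" is interpreted here as an absolute difference of 0.10 in European ancestry proportion, which is an assumption.

```python
import numpy as np

def split_founders(sample_ids, founder_fraction=0.10, seed=0):
    """Randomly assign a fraction of samples as founders; the rest are reference."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(sample_ids))
    n_founders = max(1, round(founder_fraction * len(shuffled)))
    return list(shuffled[:n_founders]), list(shuffled[n_founders:])

def select_by_ancestry(european_fraction, target, tolerance=0.10, n_keep=100, seed=0):
    """Return indices of up to n_keep individuals whose European ancestry
    lies within `tolerance` (absolute, assumed) of the target proportion."""
    rng = np.random.default_rng(seed)
    frac = np.asarray(european_fraction, dtype=float)
    eligible = np.flatnonzero(np.abs(frac - target) <= tolerance)
    if len(eligible) <= n_keep:
        return eligible
    return np.sort(rng.choice(eligible, size=n_keep, replace=False))

# Hypothetical population of 100 sample IDs: 10 founders, 90 reference.
founders, reference = split_founders([f"sample{i}" for i in range(100)])
print(len(founders), len(reference))
```

Keeping founders and reference individuals disjoint ensures that the haplotypes used to build the simulated genomes are not also present in the reference panel.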
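The thinning step amounts to drawing a fixed-size random subset of variants. The study used PLINK's --thin-count option for this. The Python sketch below only illustrates the idea, and its function name and seed are assumptions.

```python
import numpy as np

def thin_snps(n_snps, keep=50_000, seed=0):
    """Return sorted indices of `keep` SNPs drawn at random without replacement."""
    rng = np.random.default_rng(seed)
    if keep >= n_snps:
        return np.arange(n_snps)
    # Sorting preserves the original chromosomal order of the retained SNPs.
    return np.sort(rng.choice(n_snps, size=keep, replace=False))

# Chromosome 22 in this study had 158,159 SNPs before thinning.
kept_idx = thin_snps(158_159)
print(len(kept_idx))  # 50000
```

Because each cohort was thinned independently, the retained SNP sets can differ between cohorts.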

Running Each Software
We used individuals remaining within the non-admixed ancestral group after founder selection as the required reference group for running each of the five software. We ran each software using default parameters or using the minimum number of settings necessary, as this is representative of how most new users will interact with each software. We ran each software as follows:

MOSAIC
Rscript mosaic.R <admixed population name> <folder containing required input> -c <chr range> -a <number of ancestries to infer> -m <maximum number of cores> --gens <number of generations>

Loter
loter_cli -r <reference panel genotype/haplotype> -a <admixed genotype/haplotype> -f <genotype file format> -o <output name> -n <number of cores> -v

ELAI v1.01
elai-lin -g <ancestral haplotypes 1> -p 10 -g <ancestral haplotypes 2> -p 11 -g <ancestral haplotypes 3> -p 12 -g <admixed haplotypes> -p 1 -pos <snp position file> -C 3 -o <output name>
elai-lin -g <ancestral haplotypes 1> -p 10 -g <ancestral haplotypes 2> -p 11 -g <admixed haplotypes> -p 1 -pos <snp position file> -C 2 -o <output name>

In all cases we ran software on one core. In cases with three ancestries, 11 was used for number of generations. In cases with two ancestries, 8 was used for number of generations. In most cases each software requires a genetic map file or SNP position file, the number of generations since admixture, and reference/admixed genotypes in a software-specific format. As our genotype data were already phased, we do not consider phasing in this paper, though it could be considered a necessary step 0 of this process. As each software has different formatting requirements, we have constructed a brief pipeline for formatting and running each software. All scripts used to run each software can be found at https://github.com/WheelerLab/LAI_benchmarking.

Benchmarking Each Software
We used the time -v command (GNU time, run from bash) to benchmark the runtime and memory usage of each software run. To benchmark time and memory usage with increasing sample size, we used the methods described above and simulated an additional 2000 two-way admixed AFA individuals to test time and memory burden at each level of 20, 50, 100, 500, 1000, 1500, and 2000 individuals. We performed regression analysis of time and memory complexity in base R for each software.
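For readers who prefer to collect the same measurements programmatically, the sketch below shows one possible analogue of time -v in Python. It is an assumption-laden illustration, not the method used in the study: it requires a Unix system and Python 3.9 or later, the function name is hypothetical, and the unit of the peak resident set size is platform dependent (kilobytes on Linux, bytes on macOS).

```python
import os
import sys
import time

def benchmark(command):
    """Run `command` (a list of arguments) as a child process and report
    wall-clock time, CPU time, and peak resident set size for that child."""
    start = time.perf_counter()
    pid = os.posix_spawnp(command[0], command, os.environ)
    # wait4 returns resource usage for this specific child process.
    _, status, usage = os.wait4(pid, 0)
    wall = time.perf_counter() - start
    return {
        "wall_seconds": wall,
        "cpu_seconds": usage.ru_utime + usage.ru_stime,
        "max_rss": usage.ru_maxrss,  # kilobytes on Linux (platform dependent)
        "exit_code": os.waitstatus_to_exitcode(status),
    }

# Placeholder command; in practice this would be one of the software calls.
result = benchmark([sys.executable, "-c", "pass"])
print(result)
```

Repeating such a measurement at each sample size yields the runtime and memory values to which the scaling models are fit.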
We defined accuracy as the Pearson correlation for each individual in a simulated population. For each individual, we calculated Pearson correlation of all SNPs tested between the known ancestry output by the ancestry simulation tool and the ancestry inferred by a given software.
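The accuracy metric can be sketched as follows. This is a minimal Python illustration under stated assumptions: ancestry is encoded as a per-SNP dosage (0, 1, or 2 copies) of a single ancestry, the arrays have shape (individuals, SNPs), and the function name is hypothetical. How multiple ancestries are combined in the three-way case is not specified here and is left out.

```python
import numpy as np

def per_individual_accuracy(true_anc, inferred_anc):
    """Pearson correlation across SNPs between simulated (true) and inferred
    ancestry, computed separately for each individual (row)."""
    true_anc = np.asarray(true_anc, dtype=float)
    inferred_anc = np.asarray(inferred_anc, dtype=float)
    acc = np.full(true_anc.shape[0], np.nan)
    for i, (t, e) in enumerate(zip(true_anc, inferred_anc)):
        # Correlation is undefined when either vector is constant.
        if t.std() > 0 and e.std() > 0:
            acc[i] = np.corrcoef(t, e)[0, 1]
    return acc

# Hypothetical data: 2 individuals, 6 SNPs each.
truth = [[0, 0, 1, 1, 2, 2], [2, 2, 2, 1, 1, 0]]
inferred = [[0, 0, 1, 1, 2, 2], [2, 2, 1, 1, 1, 0]]
acc = per_individual_accuracy(truth, inferred)
print(acc, np.nanmedian(acc))
```

The median of these per-individual values is the summary compared across software, and individuals with no variation in ancestry across the tested SNPs are returned as NaN in this sketch.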