Higher native Peruvian genetic ancestry proportion is associated with tuberculosis progression risk

Summary We investigated whether ancestry-specific genetic factors affect tuberculosis (TB) progression risk in a cohort of admixed Peruvians. We genotyped 2,105 patients with TB and 1,320 household contacts (HHCs) who were infected with Mycobacterium tuberculosis (M. tb) but did not develop TB and inferred each individual’s proportion of native Peruvian genetic ancestry. Our HHC study design and our data on potential confounders allowed us to demonstrate increased risk independent of socioeconomic factors. A 10% increase in individual-level native Peruvian genetic ancestry proportion corresponded to a 25% increased TB progression risk. This corresponds to a 3-fold increased risk for individuals in the highest decile of native Peruvian genetic ancestry versus the lowest decile, making native Peruvian genetic ancestry comparable in effect to clinical factors such as diabetes. Our results suggest that genetic ancestry is a major contributor to TB progression risk and highlight the value of including diverse populations in host genetic studies.


In brief
Our understanding of how genetic differences among human populations may affect susceptibility to infectious diseases is very limited. Asgari et al. show that the proportion of native genetic ancestry in contemporary Peruvians affects the risk of progression from latent to active tuberculosis even after accounting for differences in sociodemographic factors.
INTRODUCTION
Tuberculosis (TB), caused by Mycobacterium tuberculosis (M. tb), is the leading cause of death from an infectious disease worldwide. 1,2 Similar to other infectious diseases, the development of TB after M. tb infection is determined in part by human genetic factors. 3 Previous twin studies have shown that under comparable environmental and social conditions, TB concordance is higher in monozygotic twins than in dizygotic twins. 3 Similarly, human genomics studies of TB have identified a number of variants that are associated with TB risk. [4][5][6][7] However, there is little concordance between known TB susceptibility loci in different populations, 3 suggesting that the risk alleles driving TB risk in different populations may be heterogeneous.
Because pathogens are a major selective force in shaping our genome, 8 it is reasonable to think that the high historical prevalence of TB in Europe over the past 2,000 years may have led to reduced frequencies of risk alleles in the European population. Indeed, a recent study showed that the negative selection exerted by the high burden of TB is likely to explain the sharp drop in the frequency of rs34536443, a missense variant in TYK2 that confers TB risk, after the Bronze age (2,500 years ago). 9 Similarly, a previous study of TB risk among admixed South Africans showed that European genetic ancestry protects against TB disease. 10 It is thus plausible that genetic ancestry contributes to differences in the incidence of TB across populations. However, quantifying the contribution of ancestry-specific genetic factors to TB risk can be challenging because genetic ancestry can track with non-genetic sociodemographic TB risk factors, such as smoking and under-nutrition. 2
Here, we aim to understand the role of ancestry-specific genetic factors that affect TB progression risk independently of socioeconomic and environmental factors in a cohort of Peruvian individuals. Peru has one of the highest TB incidences in South America. 11 The genetic makeup of contemporary Peruvians is shaped by extensive admixture between native residents of Peru and the Europeans, Africans, and Asians who have arrived in Peru since the 16th century. 12 We recruited patients with TB and M. tb-infected household contacts (HHCs) in whom we ascertained infection status by tuberculin skin testing (TST). We specifically picked controls in this way to make sure they were exposed and infected and to focus specifically on TB progression risk. We also ascertained sociodemographic and known TB clinical risk factors in all participants. 
We then used genotype data to quantify the genetic diversity in our cohort and to estimate the proportion of native Peruvian genetic ancestry (i.e., the indigenous genetic ancestry component of the genome of contemporary Peruvians) in each individual. Finally, we tested the association between genetic ancestry and TB progression risk after accounting for potential confounding effects.

Study design and case-control definition
We conducted a longitudinal, HHC genetic study of pulmonary TB in Lima, Peru (STAR Methods; Figures S1 and S2). All cases (n = 2,105) had confirmed active TB. Within 14 days of enrollment of index TB cases (i.e., the first TB patient in each household), we screened their HHCs for signs and symptoms of active TB as well as for latent TB as measured by a TST. These tests were repeated at 2, 6, and 12 months (STAR Methods; Figure 1). We refer to HHCs who were identified as having TB within 14 days of enrollment of index TB cases as "baseline cases" and to HHCs who were diagnosed with TB after this period and before the end of the 12-month follow-up as "secondary" or "secondary clustered cases" (see STAR Methods for details). Controls (n = 1,320) are HHCs of index cases who were TST positive but who did not develop TB during 12 months of active follow-up (STAR Methods). In addition to individuals' TB status, we also collected extensive information on sociodemographic risk factors for TB (STAR Methods; Table 1).

Global ancestry inference
We quantified the global genetic ancestry for each individual in our cohort, assuming four ancestral populations (K = 4) based on Peru's population history 12 (Figure S3). These four populations corresponded to native Peruvian, European, West African, and East Asian genetic ancestry with average proportions 0.80 (standard deviation [SD] = 0.15), 0.16 (0.11), 0.03 (0.07), and 0.01 (0.03), respectively (Table 1; Figure 2A; Table S1; Figure S4). These proportions were consistent with previous genetic studies of Peruvians. 12 Increasing the number of clusters revealed finer substructures within each of the four main ancestral clusters (Figure S5; Table S2).
Figure 1. We recruited patients in a large catchment area that included 20 urban districts and ~3.3 million residents. Within 14 days of enrollment of index cases, we contacted their household contacts (HHCs). HHCs with pulmonary TB were recruited as cases (baseline cases). HHCs that were TST positive but did not have active TB were recruited as controls. All individuals were followed for 1 year, and all HHCs were evaluated for signs and symptoms of pulmonary and extra-pulmonary TB disease at 2, 6, and 12 months after enrollment and were recruited as cases if they developed active TB during follow-up (secondary cases). HHCs that remained or became TST positive but did not develop active TB were recruited as controls. The final cohort included 2,105 TB cases and 1,320 TST-positive HHCs.

Correlation between self-reported race and genetic ancestry
Self-reported race or ethnicity is frequently used in epidemiological or medical studies to account for an individual's background. However, self-reported race/ethnicity can be a poor proxy for genetic ancestry in admixed populations. [13][14][15] In our cohort, the majority of participants self-identify their race as "American Indian + White" and their ethnicity as "Latino" (74% and 99% respectively; Table 2). Genetic ancestry proportions differ significantly between self-reported race categories (ANOVA p < 10^-30 for all four tested ancestries; Tables S3 and S4). In all tertiles of native Peruvian ancestry, the majority of individuals self-reported as "American Indian + White" followed by "American Indian" (Table S5). Altogether, these results suggest that self-reported race and genetic ancestry are correlated; however, individuals who self-report the same race can have drastically different genetic ancestry proportions. We then tested whether self-reported race is associated with TB progression risk. 
No category of self-reported race was significantly associated with TB status, suggesting that in our cohort, self-reported race is not a risk factor for TB progression (Table S6).

Association between genetic ancestry and TB progression risk
To examine the relationship between genetic ancestry and TB progression risk in Peruvians, we applied logistic regression to test the effect of the estimated fraction of native Peruvian, European, West African, and East Asian genetic ancestries on case-control status after adjusting for age, sex, and socioeconomic status. Additionally, we included a random household effect to account for environmental factors and a genetic relatedness matrix to account for cryptic relatedness between individuals. We observed a significant association between increased native Peruvian genetic ancestry and TB progression risk (odds ratio per 0.1 increase in native Peruvian genetic ancestry proportion [OR_NAT0.1] = 1.25, 95% confidence interval [CI] = 1.18-1.33, p = 1.1 × 10^-13; Table 3), and European, West African, and East Asian genetic ancestries were associated with reduced TB progression risk (Table 3). Adjusting for self-reported race (Table S7) or removing 430 related individuals (kinship coefficient ≥ 0.125) did not change these results (Figure S6; Table S8). Similarly, stratifying by sex did not change our results (Figure S7; Table S9).
Next, to test whether these associations were independent of each other, we performed conditional analyses between ancestries. Native Peruvian genetic ancestry remained significantly associated with increased TB progression risk conditioned on the other ancestries, but the other ancestries showed no association with TB progression risk after conditioning on native Peruvian genetic ancestry (Table 3).
In our cohort, native Peruvian is the main genetic ancestry and the only one that is associated with an increased TB progression risk relative to other ancestry components. We observed a significantly higher level of native Peruvian genetic ancestry in cases compared with the infected HHCs (0.82 [SD = 0.13] and 0.78 [0.17], t test p = 8.8 × 10^-19; Figure 2B) and a higher probability of being a case with an increasing proportion of native Peruvian genetic ancestry. Individuals with the highest level of native Peruvian genetic ancestry (top decile, average native Peruvian genetic ancestry proportion = 0.97 [0.01], n = 232 cases, 110 controls) were approximately three times more likely to progress to active TB (OR = 2.90, 95% CI = 1.99-4.26, p = 2.8 × 10^-8; Figure 2C) compared with the individuals with the lowest level of native Peruvian genetic ancestry (bottom decile, average native Peruvian genetic ancestry proportion = 0.48 [0.13], n = 149 cases, 194 controls). Assuming a larger number of ancestral clusters did not substantively change the association between native Peruvian genetic ancestry and TB progression risk (Table S10).
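As a rough consistency check on the figures quoted above, the per-0.1 odds ratio can be scaled to the gap between the decile means, and a crude odds ratio can be computed from the decile counts. This is only a back-of-the-envelope sketch: plugging decile means into the per-0.1 estimate is a simplifying assumption, and the reported OR of 2.90 comes from the adjusted model, not from this arithmetic.

```python
import math

# Values quoted in the text above.
or_per_10pct = 1.25        # odds ratio per 0.1 increase in ancestry proportion
top_decile_mean = 0.97     # mean native Peruvian ancestry, top decile
bottom_decile_mean = 0.48  # mean native Peruvian ancestry, bottom decile

# The model is linear on the log-odds scale, so a difference d in ancestry
# proportion corresponds to an odds ratio of or_per_10pct ** (d / 0.1).
steps = (top_decile_mean - bottom_decile_mean) / 0.1
implied_or = math.exp(steps * math.log(or_per_10pct))

# Crude (unadjusted) odds ratio from the decile counts given in the text.
cases_top, controls_top = 232, 110
cases_bottom, controls_bottom = 149, 194
crude_or = (cases_top / controls_top) / (cases_bottom / controls_bottom)

print(f"implied OR, top vs bottom decile: {implied_or:.2f}")
print(f"crude OR from decile counts:      {crude_or:.2f}")
```

Both values (about 2.98 and 2.75) fall inside the reported 95% CI of 1.99-4.26, which is what one would expect if the decile comparison and the per-0.1 estimate describe the same underlying trend.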
As a sensitivity analysis and to rule out the effect of individual-level non-genetic confounders, we added West African and East Asian genetic ancestry proportions, BMI, education level, BCG vaccination, smoking, alcohol use, and previous TB status to our model. Including these additional covariates did not change our results (Figure 2D; Table S11). Collectively, these results suggest that native Peruvian genetic ancestry is associated with increased TB progression risk independently of other genetic ancestries or non-genetic factors that can track with genetic ancestry such as sociodemographic or known clinical TB risk factors. However, these results do not rule out the possibility of this association being the result of other non-genetic confounders related to phenotypic heterogeneity, exposure, or transmission. We thus performed a series of statistical analyses to account for these potential confounders.

Accounting for phenotypic heterogeneity
To test if phenotypic heterogeneity in our cohort could explain our results, we restricted the analysis to microbiologically confirmed TB cases (n = 2,043) and HHCs who were TST positive at baseline and did not develop active TB during the 1-year follow-up (n = 950). These analyses resulted in an OR similar to the larger cohort (OR_NAT0.1 = 1.28 [1.20-1.37], p = 5.6 × 10^-14; Table S12). We also considered that cases and controls who were from the same household may share a more similar ancestry profile compared with average, which may bias our results. For this, we tested the association of native Peruvian genetic ancestry with TB progression risk using half of the cases (n = 791) and the same number of controls who were not from the same household as cases, after correction for age, sex, socioeconomic status, and genetic relatedness. This analysis had a similar result to the analysis performed using the whole cohort (OR_NAT0.1 = 1.19 [1.10-1.28], p = 1.0 × 10^-12).
Accounting for transmission and exposure
While index cases might have acquired TB in the community, secondary cases are more likely to result from within-household transmission. To account for potential differences in exposure and transmission between index cases and HHCs, we tested the association between native Peruvian genetic ancestry and TB progression in secondary cases (n = 213) and controls from the same households (n = 214) after adjusting for age, sex, socioeconomic status, household, and genetic relatedness, as we did in our primary analysis. We observed an OR similar to the one observed for the whole cohort (OR_NAT0.1 = 1.30 [1.12-1.51], p = 4.4 × 10^-3; Figure 2D; Table S12).
We further restricted the cohort to secondary clustered TB cases to ensure that their TB disease was the result of within-household transmission or infection from circulating M. tb strains rather than reactivation of an old infection. While this analysis had a much smaller sample size, the results were consistent with our previous analyses (n = 58 TB cases and 48 HHCs, OR_NAT0.1 = 1.67 [1.13-2.47], p = 1.0 × 10^-2; Figure 2D; Table S12). Native Peruvian genetic ancestry proportion was similar between baseline and secondary cases from the same households, which is consistent with the conclusion that differences in native Peruvian genetic ancestry proportion are not associated with differences in exposure (n = 135 baseline and 213 secondary cases, OR_NAT0.1 = 1.02 [0.86-1.22], p = 0.28; Table S11).

Admixture mapping
We performed local ancestry inference followed by admixture mapping to look for specific genomic regions that might explain the association between native Peruvian genetic ancestry and TB progression risk (n = 889,203 markers following imputation, quality control [QC], and pruning). No locus passed the genome-wide significance threshold (p < 4.3 × 10^-6). However, we observed suggestive evidence of association at 5q23.2 (OR = 1.34 [1.17-1.53], p = 2.9 × 10^-5; Figure S8). When we restricted the analysis to cases with microbiologically confirmed TB (n = 2,043) and HHCs who were TST positive at baseline and did not progress to active TB over the 1-year follow-up (n = 950), the signal on 5q23.2 got stronger (OR = 1.39 [1.20-1.61], p = 1.5 × 10^-5; Figure S8) and closer to the genome-wide significance threshold set using permutation for this analysis (p < 1.0 × 10^-5). This locus overlaps a ~100 kb region on chromosome 5 (125855350-125963352), which includes the coding sequence of ALDH7A1 and 69 variants that were nominally associated with TB progression risk in our cohort (Table S13). [4][5][6][7] However, understanding whether any of these variants or other variants in this locus might explain the observed admixture mapping signal requires further investigation.

DISCUSSION
Our results suggest that relative to other tested genetic ancestries, native Peruvian genetic ancestry is associated with TB progression risk independently of population structure, the sociodemographic factors that we tested here, and factors related to exposure or transmission. In our cohort, individuals with the highest proportion of native Peruvian genetic ancestry are three times more likely to progress to active TB compared with individuals with the lowest level of native Peruvian genetic ancestry. To compare, this effect is similar to the reported effect of diabetes on TB risk based on previous cohort studies. 16 Compared with native Peruvian genetic ancestry, European and West African genetic ancestries were associated with reduced TB progression risk. This protective effect can be the result of the long, shared history of these populations with M. tb, 17 which could have led to selective pressures that have mitigated TB genetic risk if such pressures were not present in pre-colonization Peru. 18 However, we want to emphasize that the effect of native Peruvian ancestry on TB progression risk is relative to other ancestries that we tested here and cannot be causally detangled from the effect of other genetic ancestries.
While our results show a strong genome-wide signal for the effect of native Peruvian genetic ancestry on TB risk, we did not identify any single locus that can explain this effect, suggesting that it is driven by a polygenic architecture with many variants exerting modest impact. This conclusion is in line with our previous genome-wide association study (GWAS) of TB progression in the current cohort, where we showed SNP heritability (h^2_g) of TB progression in Peruvians to be 21.2% yet found only one genome-wide significant locus (3q23) associated with TB progression risk. 7 We did not identify any association at the 3q23 locus in our admixture mapping analysis; however, GWAS and admixture mapping results can be complementary and do not necessarily always point to the same risk loci. 19,20 In addition to the polygenic structure of TB progression, our power to detect specific TB progression risk loci through admixture mapping may have also been affected by the lower accuracy of local ancestry inference in multi-way admixture scenarios compared with two-way admixtures. 10 Altogether, these results suggest that conclusively identifying TB progression risk loci requires larger studies with greater statistical power.
To our knowledge, this study is the first large-scale genetic study to look at the effect of indigenous ancestry on TB or TB progression risk in South or Latin American populations. However, the role of indigenous ancestry in the apparently increased burden of TB among native populations of America has been extensively debated for over 200 years. 21,22 These debates are rooted in epidemiological studies showing a high TB burden and mortality rates in post-contact indigenous populations. However, these debates remain inconclusive mainly due to the challenges associated with separating population-specific genetic risk factors from non-genetic risk factors that track with genetic ancestry. 21,22 Our study is different from these previous genetic studies in three ways. First, our study uses genetic data to quantitatively assign genetic ancestry, whereas previous studies used self-reported ancestry, which is often a poor proxy for genetic ancestry in admixed populations 13,15 and thus can lead to misclassification of participants. Second, we carefully phenotyped all individuals and ascertained infection status using TST to ensure that all individuals were exposed to M. tb. This is an important distinction as different genetic factors might underlie different stages of the disease (e.g., infection versus progression upon infection). 7 Third, our HHC, longitudinal study design allowed us to ensure that all controls were exposed to M. tb and to rigorously account for potential non-genetic factors that track with genetic ancestry.
In addition to its relevance for better understanding the genetic architecture of TB progression, our study provides a framework for similar future studies where it is important to account for environmental and socioeconomic factors to identify genetic factors that affect disease outcomes. Our results also highlight that differences in infectious disease burden among different populations cannot be solely attributed to variations in sociodemographic factors and can be partially due to genetic differences. Currently, the majority of human genomics studies of complex traits are done in populations of European ancestry. 25 However, with the increasing clinical applications of complex trait genomics data, 26 this European bias can lead to increased health disparities. 27 Our results underline the importance of conducting large-scale human genomics studies in diverse populations in order to get a better understanding of population-specific genetic risk for TB and other complex diseases, to get a comprehensive picture of the genotype-phenotype relationship, and to enable all human populations to benefit from the results of human genomics research.

Limitations of the study
One caveat of our study is that we have not tested for all possible non-genetic TB risk factors. For example, while we tried to account for factors related to exposure by including only participants with a documented household exposure to an index TB patient, we do not have information on possible community or workplace exposures. We also could not correct for potential biases and unmeasured social discriminations and inequalities that might track with both genetic ancestry and TB progression risk. While these may be potential confounders, we consider them unlikely to explain the entirety of our signal: such biases are likely to track with households, and correction for household in our analyses did not alter our results. We note that the distribution of demographic variables such as sex, age, and education level in our cohort may differ from that in the general population. However, accounting for these covariates does not change our results, suggesting that our findings are unlikely to be driven by demographics. Finally, we emphasize that while our study provides evidence that TB risk can vary across populations with different genetic ancestries, our results cannot be generalized to all indigenous populations, as different populations have different histories, and the sociodemographic factors, the primary determinants of TB risk, vary widely across different indigenous populations. 2,28

STAR★METHODS
Detailed methods are provided in the online version of this paper.

ACKNOWLEDGMENTS
We thank all participants enrolled in this study. The study was supported by the National Institutes of Health (NIH) TB Research Unit Network, grants U19-AI111224-01 and U01-HG009088, and NIH grants U01-HG009379 and 1R01AR063759. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. S.A. was supported by the Swiss National Science Foundation postdoctoral mobility fellowships P2ELP3_172101 and P400PB_183823 and NIH T32 grant T32HG010464.

Table 3. Native Peruvian genetic ancestry (NAT) is associated with increased TB progression risk, while European (EUR), West African (AFR), and East Asian (ASI) genetic ancestries are associated with reduced TB progression risk. NAT remained significantly associated with increased TB progression risk conditioned on non-NAT ancestries, but none of the non-NAT ancestries showed association with TB progression risk after conditioning on NAT. Odds ratios and 95% confidence intervals (CIs) correspond to a 10% increase in genetic ancestry (OR_NAT0.1). p, two-sided Wald test p value; SES, socioeconomic status; HH, household.

AUTHOR CONTRIBUTIONS
Cell Genomics 2, 100151, July 13, 2022

[...] 12 months after enrollment. While cases were not excluded based on TB history, controls with a history of active TB or previous positive TST were excluded. We chose this study design because HHCs of individuals with TB are highly exposed to M. tb and are at a high risk of developing TB 31,32 ; hence our strategy allowed us to focus on TB progression by including controls that were exposed to the pathogen and were infected. To adjust for any residual confounding that might be missed by our household recruitment study design, we also collected extensive sociodemographic and clinical variables at baseline including self-reported race and ethnicity, age, sex, body mass index (BMI), smoking, alcohol use, previous TB, socioeconomic status, education level, and BCG vaccination. We refer to HHCs who received a TB diagnosis within 14 days of enrollment of index cases as "baseline" cases. We refer to HHCs that developed active TB (i.e., became TB cases) 14 days or more after index case enrollment as "secondary cases" and to secondary cases whose M. tb strain shared an identical MIRU genotype with another TB case as "secondary clustered cases". In analyses focused on secondary or secondary clustered cases, controls were restricted to HHCs of these cases (i.e., if a household did not have any secondary or secondary clustered cases, HHCs from that household were not included as controls).

METHOD DETAILS
Categorizing smoking, drinking, body mass index (BMI), and socioeconomic status
We categorized participants according to their alcohol intake as follows: nondrinkers if they reported having consumed no alcoholic drinks per day, light drinkers if they reported drinking <40 g or <3 alcoholic drinks per day, and heavy drinkers if they reported drinking 40 g of alcohol or more or 3 or more drinks per day. 33 For smoking, we classified people as nonsmokers if they reported no cigarette smoking, as light smokers if they reported smoking one cigarette per day, and as heavy smokers if they reported smoking more than one cigarette per day. 33 We categorized people with BMI z-scores of less than -2 as underweight and those greater than 2 as overweight. For children, we defined the nutritional status based on the World Health Organization BMI z-score tables. 34 We calculated household-level composite socioeconomic scores (SES) using principal component analysis (PCA) as described before 30 by summarizing the following household-level factors: type of housing, the total number of rooms in the house, exterior wall material, primary floor material, primary roof material, type of water supply, type of sanitation facility, and type of lighting in the house. We categorized the continuous SES scores into tertiles corresponding to low, middle, and upper socioeconomic status groups.
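The categorization rules above can be sketched as a few small functions. This is an illustrative sketch only: the function names, the "normal" BMI label, the treatment of fractional cigarette counts, and the rank-based tertile split are assumptions made here, not details given in the text.

```python
from typing import List


def alcohol_category(grams_per_day: float, drinks_per_day: float) -> str:
    """Nondrinker / light / heavy, following the thresholds in the text."""
    if grams_per_day == 0 and drinks_per_day == 0:
        return "nondrinker"
    if grams_per_day >= 40 or drinks_per_day >= 3:
        return "heavy"
    return "light"


def smoking_category(cigarettes_per_day: float) -> str:
    """Nonsmoker / light (one per day) / heavy (more than one per day)."""
    if cigarettes_per_day == 0:
        return "nonsmoker"
    # Assumption: anything above zero and up to one cigarette counts as light.
    if cigarettes_per_day <= 1:
        return "light"
    return "heavy"


def bmi_category(bmi_z: float) -> str:
    """Underweight below -2, overweight above 2; the middle label is assumed."""
    if bmi_z < -2:
        return "underweight"
    if bmi_z > 2:
        return "overweight"
    return "normal"


def ses_tertiles(scores: List[float]) -> List[str]:
    """Assign low / middle / upper by rank (one possible tertile definition)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [""] * n
    for rank, i in enumerate(order):
        labels[i] = ("low", "middle", "upper")[min(rank * 3 // n, 2)]
    return labels
```

For example, `alcohol_category(45, 2)` falls in the heavy group because either threshold (40 g or 3 drinks) is sufficient.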

Genotyping and global ancestry inference
We extracted genomic DNA from participants' whole blood. To optimally capture the genetic diversity of Peruvians, we designed a customized array (LIMAArray) with 712,200 markers. In addition to the general genome-wide markers from the Affymetrix Axiom® my-Design custom genotyping array, we supplemented our array using coding markers from exome sequencing data of 116 Peruvian TB cases from the same population as our study population in order to optimally capture Peruvians' genetic variation, particularly rare and protein-coding variation 7 (Method S1). Our array included 1.6% coding and 98.4% non-coding SNPs. Following QC and filtering, we kept 677,232 genotyped variants to use for downstream PCA and genetic ancestry inference. We merged genotyping data from our cohort with previously published data from the 1000 Genomes Project phase 3 (2,504 individuals from 26 populations) 35 and Siberian and Native American populations from Reich et al. 36 (493 individuals from 57 native American populations and 245 individuals from 17 Siberian populations), by matching on the chromosome, position, reference, and alternate alleles using PLINK (version 1.90b3w). 37 After merging the datasets, we excluded variants with an overall minor allele frequency (MAF) < 1%. We then pruned the data for linkage disequilibrium (LD) by removing the markers with r^2 > 0.1 with any other marker within a sliding window of 50 markers per window and an offset of 10 using PLINK. The final merged dataset included 22,198 variants (Methods S2, S3, and S4). We used the Genome-wide Complex Trait Analysis tool (GCTA, 38 version 1.26.0) to perform PCA and ADMIXTURE 39 (version 1.3), an unsupervised clustering method, with K = 4-7 clusters to perform global ancestry inference on this dataset (Figure S3). We used reference populations in ADMIXTURE analysis to determine what genetic ancestry each cluster represents. 
For example, if a cluster was the dominant cluster in the European individuals from the 1000 Genomes Project 35 we concluded that this cluster represents European genetic ancestry in admixed Peruvians from our cohort. All genetic analyses were done using GRCh37.
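The cluster-labeling rule described above (assign each ADMIXTURE cluster the ancestry of the reference group in which it dominates) can be sketched as follows. The function name, the input layout, and the toy population codes are hypothetical; ADMIXTURE itself only outputs unlabeled per-individual cluster proportions.

```python
def label_clusters(q_by_individual, population_of, reference_label):
    """Map cluster index -> ancestry label using reference individuals.

    q_by_individual: dict of individual id -> list of K cluster proportions
    population_of:   dict of reference individual id -> population code
    reference_label: dict of population code -> ancestry label
    """
    sums, counts = {}, {}
    for ind, pop in population_of.items():
        label = reference_label.get(pop)
        if label is None:
            continue  # population not used as an ancestry reference
        q = q_by_individual[ind]
        acc = sums.setdefault(label, [0.0] * len(q))
        for k, value in enumerate(q):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1

    cluster_label = {}
    for label, acc in sums.items():
        means = [value / counts[label] for value in acc]
        # The cluster with the highest mean proportion is the dominant one.
        dominant = max(range(len(means)), key=means.__getitem__)
        cluster_label[dominant] = label
    return cluster_label


# Toy example with K = 2 and made-up proportions.
q = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.05, 0.95]}
pops = {"a": "CEU", "b": "CEU", "c": "YRI"}
labels = label_clusters(q, pops, {"CEU": "European", "YRI": "West African"})
print(labels)
```

A real analysis would also want to check that no two reference groups map to the same cluster, which this sketch does not guard against.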

Kinship estimation and genetic relatedness matrix (GRM)
We used PC-Relate 40 implemented in the GENESIS R package (version 2.6.1) to estimate the kinship coefficients between individuals and to generate a genetic relatedness matrix (GRM). We removed rare variants (MAF < 1%), regions with known long-range linkage disequilibrium (LD), 41 and variants in high LD (r^2 > 0.2 in a window of 50 kb and an offset of 5) using PLINK (version 1.90b3w). In total, 551 pairs had kinship coefficients ≥ 0.125, corresponding to second-degree relatives or closer. Of these, 36 pairs were parent-child, 72 were sib pairs, and 453 were second-degree relatives. Of related pairs, we randomly removed one individual. In total, 430 individuals were removed. The remaining cohort included 1,929 TB cases and 1,066 HHCs. All remaining pairwise relatedness estimates were <0.125 (Figure S6).
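One simple way to implement "randomly remove one individual from each related pair" is sketched below. The function name, the greedy pass over pairs, and the fixed seed are assumptions for illustration; the text does not describe the exact procedure used. Because one individual can appear in several related pairs, the number of removed individuals can be smaller than the number of pairs.

```python
import random


def prune_related(pairs, threshold=0.125, seed=0):
    """Return the set of individuals to drop so that no related pair remains.

    pairs: iterable of (id_a, id_b, kinship_coefficient)
    """
    rng = random.Random(seed)
    removed = set()
    for a, b, kinship in pairs:
        if kinship < threshold:
            continue  # not second-degree or closer
        if a in removed or b in removed:
            continue  # this pair is already broken by an earlier removal
        removed.add(rng.choice((a, b)))
    return removed


# Toy example: i2 is related to both i1 and i3; i4 and i5 are unrelated.
toy_pairs = [("i1", "i2", 0.25), ("i2", "i3", 0.25), ("i4", "i5", 0.05)]
dropped = prune_related(toy_pairs)
print(sorted(dropped))
```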

Testing the association between global genetic ancestry proportions and TB progression risk
To test the association of genetic ancestry proportions with TB progression risk, we used a logistic mixed model framework implemented in lme4qtl 42 (Equation 1). We also performed the following sensitivity analyses to test the effect of additional covariates, phenotypic heterogeneity, or factors related to exposure or transmission on our results. First, in addition to native Peruvian genetic ancestry, age, sex, socioeconomic status, household, and GRM, we also included the following individual-level covariates in the above model: West African and East Asian genetic ancestries, BMI, smoking, alcohol use, previous TB, education level, and BCG vaccination. Second, we restricted the analysis to 2,043 microbiologically confirmed TB cases and 950 HHCs who were TST positive at baseline and did not develop active TB during the one-year follow-up and tested the association of native Peruvian genetic ancestry and TB progression risk using Equation 1. Third, we restricted the analysis to secondary cases (N = 213) and their HHCs (N = 214) and tested the association of native Peruvian genetic ancestry and TB progression risk using Equation 1. Finally, we performed a sensitivity analysis using 58 secondary clustered TB cases and their 48 HHCs and tested the association of native Peruvian genetic ancestry and TB progression risk using Equation 1.
For all of the above analyses, we removed individuals with any missing values for the included covariates. For all analyses, we calculated z-scores for the given covariates and used a Wald test to calculate two-sided p values.
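Based on the covariates and random effects described in this section and in the Results, the logistic mixed model may have had roughly the following form. This is a sketch under stated assumptions, not the published Equation 1: the symbol names, the use of theta for the intercept, and the variance-component notation are all assumed here.

```latex
\operatorname{logit}(Y_i) = \theta
  + \beta_{\text{ancestry}}\,\text{ancestry}_i
  + \beta_{\text{age}}\,\text{age}_i
  + \beta_{\text{sex}}\,\text{sex}_i
  + \beta_{\text{SES}}\,\text{SES}_i
  + u_{h(i)} + g_i,
\qquad
u_h \sim \mathcal{N}(0, \sigma_h^2),\quad
g \sim \mathcal{N}(0, \sigma_g^2 K)
```

Here $Y_i$ is the probability that individual $i$ is a case, $u_{h(i)}$ is a random effect for the household of individual $i$, and $K$ is the genetic relatedness matrix. Under this form, the odds ratio per 0.1 increase in ancestry proportion would be $\exp(0.1\,\beta_{\text{ancestry}})$.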
Testing the association between self-reported race and TB progression risk
To test the association between self-reported race and TB progression, we used a logistic regression model implemented in R's glm function, where Y_i is the probability of individual i being a case, θ is the intercept, and β_self-reported race is a vector of effect estimates for self-reported race (categorical variable with eight categories).
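Using only the terms defined above, the model can plausibly be written as follows. This is a reconstruction under assumptions: the indicator-vector notation $x_i$ is introduced here, and whether additional covariates were included is not stated in the surviving text.

```latex
\operatorname{logit}(Y_i) = \theta + \beta_{\text{self-reported race}}^{\top}\, x_i
```

Here $x_i$ is an indicator (dummy-coded) vector for the self-reported race category of individual $i$, with one category serving as the reference level.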
Local ancestry inference
"Local ancestry" is defined as the genetic ancestry of an individual at a particular locus, where an individual can have 0, 1, or 2 copies of an allele derived from each ancestral population. 43 We performed local ancestry inference using PCAdmix, 44 using imputed data to increase the number of shared markers between our data and reference data. Following phasing and imputation as described previously, 7 we excluded SNPs with imputation quality score r^2 < 0.4, HWE p value < 10^-5 in controls, or a missing rate per SNP greater than 5%, which left 7,756,401 markers. To increase the number of overlapping variants between our cohort and the reference panel, we chose reference individuals with whole-genome sequencing data available including 25 native American individuals from the Simons Genome Diversity Project 45 plus 5 individuals from the 1000 Genomes Project 35 PEL that had inferred native Peruvian genetic ancestry >0.95 based on ADMIXTURE analysis at K = 4 clusters as proxy for native Peruvian genetic ancestry, 30 randomly selected individuals from the 1000 Genomes study CEU as proxy for European genetic ancestry, 30 randomly selected individuals from the 1000 Genomes study YRI as proxy for West African genetic ancestry, and 30 randomly selected individuals from the CHS population as proxy for East Asian genetic ancestry. We then merged our data with the reference panel data and restricted the merged dataset to variants with MAF >5% in each of the reference populations. The post-QC dataset included 2,910,169 variants. We phased the merged data using SHAPEIT2. 46 After phasing and following the PCAdmix developers' recommendation, we used PLINK (version 1.90b3w) to remove the markers with r^2 > 0.8 with any other marker within a sliding window of 20 markers per window and an offset of 10. After pruning, 889,203 variants remained for use in local ancestry inference. 
We then used SHAPEIT2 46 (version v2.r837) to generate VCF files, followed by Beagle (version 4.1) to generate input files for PCAdmix. All files were generated per chromosome. Finally, we performed local ancestry inference on each chromosome using PCAdmix (version 3) with the options -bed and -ld 0 and recombination maps from the 1000 Genomes Project. 47 Local ancestry inference was done in windows of 20 SNPs, and in total local ancestry was inferred for 44,470 intervals. Genomic regions with long-range LD, 41 including the major histocompatibility complex, were excluded from admixture mapping. To avoid noise introduced by potential phenotypic heterogeneity, we repeated our admixture mapping analysis using cases with microbiologically confirmed TB (N = 2,043) and HHCs who were TST positive at baseline and did not progress to active TB over the one-year follow-up (N = 950) (Figure S8). For SNPs within the windows with the lowest association p value, we report variant-level summary statistics (Table S13).

Admixture mapping
We used admixture mapping, a method to associate the inferred ancestry of a locus with a trait in an admixed population, 48 to search for genomic loci that could explain some of the observed association between native Peruvian genetic ancestry and TB progression. We used a generalized linear mixed model framework 49 to test the association between the inferred local native Peruvian genetic ancestry {coded as a number between 0 and 2 based on the local native ancestry posterior probability, where 0 means no native Peruvian allele and 2 means two native Peruvian alleles} and TB progression risk {case, control}. We included standardized age {0-1}, sex {male, female}, global EUR, AFR, and ASI genetic ancestry proportions {0-1}, and a matrix of pairwise genetic relatedness {0-0.5} as covariates in the model. Although the total number of loci tested was 44,470, adjacent markers were highly correlated, so a Bonferroni correction would be too stringent. Hence, to define the significance threshold for admixture mapping, we permuted the case-control status and repeated the association analysis 1,000 times. We then used the lowest p value from each permutation to generate an empirical null distribution. The fifth percentile of this distribution was used as the cutoff for genome-wide significance.
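The permutation procedure for the significance threshold can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the per-locus test here is a simple two-sample z-test on mean local ancestry dosage, standing in for the covariate-adjusted generalized linear mixed model, and the function names, the toy data, and the percentile convention (the ceil(alpha × n)-th smallest value) are all hypothetical choices.

```python
import math
import random


def sample_variance(xs):
    """Unbiased sample variance; requires at least two values."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def two_sided_p(dosages, labels):
    """Stand-in association test: z-test on case vs. control mean dosage."""
    cases = [d for d, y in zip(dosages, labels) if y == 1]
    controls = [d for d, y in zip(dosages, labels) if y == 0]
    diff = sum(cases) / len(cases) - sum(controls) / len(controls)
    se = math.sqrt(
        sample_variance(cases) / len(cases)
        + sample_variance(controls) / len(controls)
    )
    if se == 0.0:
        return 1.0
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def permutation_threshold(loci, labels, n_perm=1000, alpha=0.05, seed=0):
    """Empirical genome-wide threshold from the minimum p value per permutation."""
    rng = random.Random(seed)
    shuffled = list(labels)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case-control status
        min_ps.append(min(two_sided_p(d, shuffled) for d in loci))
    min_ps.sort()
    # alpha-th percentile of the null distribution of minimum p values.
    return min_ps[max(0, math.ceil(alpha * n_perm) - 1)]


# Hypothetical toy data: 5 independent loci, 20 cases and 20 controls.
data_rng = random.Random(1)
toy_labels = [1] * 20 + [0] * 20
toy_loci = [[data_rng.uniform(0.0, 2.0) for _ in toy_labels] for _ in range(5)]
threshold = permutation_threshold(toy_loci, toy_labels, n_perm=200)
```

Shuffling the labels once per permutation and applying the same shuffled labels to every locus preserves the correlation between adjacent loci, which is why the resulting threshold is less stringent than a Bonferroni correction over all tested intervals.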

QUANTIFICATION AND STATISTICAL ANALYSIS
All statistical methods and software used in this study are listed in the corresponding sections of the method details. Statistical significance was determined after accounting for multiple testing, as described in the method details. All p values are two-sided.