Introduction

The study of genetically diverse populations has become a priority in public health research. Several major initiatives started in the past few years, including the National Institute of Health (NIH) Trans-Omics in Precision Medicine Program, which aims to sequence over a hundred of thousands of whole genomes from a variety of ancestries. These initiatives compensate for the lack of participants of non-European ancestries in genetic studies [1,2,3]. Besides addressing health disparities among ethnic groups, studies of multi-ethnic cohorts and admixed populations can provide important information about the biology of complex diseases and help to identify associated genes [4]. Recently, admixed populations, such as African Americans, represent a special case of multi-ethnic cohorts with mosaic chromosomes derived from several ancestral populations. Admixture mapping, often applied to recently admixed populations, searches for genomic loci of unusual local ancestry at a putative disease risk locus compared with the genome-wide average [5]. Findings for respiratory disease, chronic renal disease, prostate cancer, and systemic lupus erythematosus have been reported as results of admixture mapping [6,7,8,9,10,11].

The data analysis in admixture mapping consists of two main steps: inferring local ancestry and testing for association between every local ancestry segment and an observed phenotype. Given unobservable ancestry information, current methods on ancestry inference probabilistically define the location of every ancestral switch using genotyping array data, reference haplotypes, and algorithms, based on hidden Markov models (HMM) [12, 13]. These methods were empirically shown to produce reliable results on African Americans [14], as they represent a relatively simple two-way admixture and are well-modeled by available reference panels [15]. The standard approach for association testing is similar to genome-wide association studies (GWAS) on single-nucleotide polymorphisms (SNPs) and runs linear regression to estimate the correlation between local ancestry and phenotype. To avoid confounding due to population structure that is inherently present in admixed individuals, most studies also included global ancestry components (i.e., the genome-wide proportions of ancestry derived from each ancestral population) as covariates. Going beyond linear regression, the framework of linear mixed models (LMMs) was applied to quantify individual similarities by ancestry, showing how the phenotypic variance is explained by local ancestry [16] and linking it to the heritability of complex traits estimated from SNP data [17]. Notably, several admixture mapping studies applied LMMs to control for family structure (kinship), household groups, and census blocks [18, 19], but modeling individual relationships through local ancestry genome-wide has not be explored yet, despite potential mapping improvements [13].

Recently, we found that the correlation between local ancestry and untyped causal variants can be leveraged to detect distant gene–gene interactions in admixed populations through local ancestry–local ancestry screening [20]. That work also demonstrated that the power of such admixture mapping increases with the number of causal variants within local ancestry tested and with the degree of differentiation of variants between the ancestral populations. Here we suggest that the same principle can be applied to search for gene–environment interactions. Regarding previous studies of gene–environment interactions, most works focused on interactions between the global ancestry and environmental factors using linear regression [17, 21,22,23]. As here we sought to screen for local ancestry–environment interactions, a type of admixture mapping seldom explored, we argue that such application might face multiple methodological challenges due to both population stratification and outcome heterogeneity among individual groups stratified by environmental exposure. Hence, the use of LMMs will be particularly appropriate to account for complex genetic and environmental relationships in admixed individuals.

Application of admixture mapping to lung function phenotypes in African Americans is especially relevant [24]. European Americans and African Americans are well known to show differences in spirometric measures of lung function such as forced expiratory volume in one second (FEV1) and forced vital capacity (FVC) with higher values for these two traits for European Americans. Factors partially responsible for these differences include body habitus, early-life development conditions, socioeconomic status, and other environmental factors [25, 26]. Previous studies showed strong evidence that the proportion of African global ancestry is associated with lower lung function for a given ranges of height and age [21, 24]. Also, the higher proportion of African ancestry in African Americans was associated with an additional decrease in lung function for smokers [27].

To address the aforementioned methodological challenges, we used real data and proceed in a stepwise assessment of multi-component LMM, to define a robust interaction test of association. More precisely, we conducted a genome-wide scan of local ancestry–smoking interactions for five spirometric lung function phenotypes available in 3300 African Americans from the Genetic Epidemiology of COPD (COPDGene) study [28]. The search for gene–environment interactions related to pulmonary function phenotypes in the COPDGene study is markedly relevant, as it is one of the largest studies of African American smokers. In result, our application of the proposed LMM identified two genome-wide significant and five suggestive loci that would have been missed in standard single SNP-based approaches.

Materials and methods

The COPDGene dataset

In the analysis of COPDGene study (dbGaP phs000179.v5.p2), we focused on five correlated quantitative pulmonary phenotypes: FEV1; FEV1 as a percent of predicted (FEV1 % predicted); FVC; FVC as a percent of predicted (FVC % predicted); and the ratio of FEV1 to FVC (FEV1/FVC). We derived two binary smoking exposures from the number of cigarettes smoked per day for gene–environment interactions: current smoker exposure (current smokers vs. former smokers) and heavy smoker exposure (current heavy smokers vs. current moderate smokers). We defined moderate current smokers with 1–14 cigarettes per day on average, whereas heavy current smokers were defined as with >14 cigarettes per day on average. When using heavy smoker exposure in the analysis, we excluded all subjects who were former smokers.

Locus-specific ancestry or local ancestry was inferred from 684,187 autosomal genotyped SNPs as previously described [28]. Briefly, we used the LAMP-LD program [29] to estimates local ancestry per individual using a HMM algorithm comparing observed genotypes and haplotypes from reference ancestral populations. We parameterized the algorithm with 15 HMM states, a window size of 50 SNPs, and used 99 CEU and 108 YRI unrelated individuals from the 1000 Genomes Project (Phase III, hg19) [30] as reference panels. Per SNP estimates of African ancestry were further used in two post-processing steps. First, we averaged the local ancestry per individual to derive the genome-wide proportion of African ancestry (i.e., the global ancestry). Second, we estimated local ancestry segments by merging neighboring SNPs with identical values. We next filtered out short ancestry segments of length <10,000 bases to mitigate possible artifacts of the inference procedure that might affect further admixture mapping analysis (LAMP-LD authors, personal communication; 2017).

Step-wise model selection for admixture mapping

Consider a quantitative trait stored in a n-dimensional vector y and a n × m data matrix Z of local ancestry segments, where n is the number of individuals and m is the number of local ancestry segments. Let zl be a column of matrix Z corresponding to a single local ancestry segment and xe being a n-dimensional column vector of a binary environmental exposure. We aim at testing the statistical interaction between the local ancestry segments zl and the exposure xe on the phenotype y. As discussed in the Supplementary Material, we are interested to assess a fixed effect of interaction zl × xe using the standard Wald’s test, while controlling for variance related to local ancestry, global ancestry, and exposure that might confound the estimate of effect. However, in regards to potentially non-trivial structure in the data, due to both ancestry admixture and potential heterogeneity of outcome across exposure groups, we defined our association model using a stepwise approach, where the model complexity was incremental until reaching the desired criteria of validity. Following standard practices [31], these criteria were designed (1) to reach a genomic inflation parameter (λ close to 1) and (2) to achieve an overall shape of the standard quantile–quantile plot (QQ plot) of the −log10(P) matching the expected uniform distribution of p-values for the majority of the tests.

In practice, before evaluating interaction effects, we first assessed the robustness of a standard LMM when testing for the marginal effect of local ancestry zl only:

$$y = C\beta _{\mathrm{{C}}} + \beta _{\mathrm{{e}}}x_{\mathrm{{e}}} + \beta _{\mathrm{{g}}}z_{\mathrm{{g}}} + \beta _{\mathrm{{l}}}z_{\mathrm{{l}}} + u_{\mathrm{{c}}} + e$$
(1)

where C is a matrix of trait-specific covariates and βC is a vector of their fixed effects, zg is a vector of global, and βe, βg, β1 are fixed effects of exposure, global ancestry, and local ancestry, respectively. The random effects include a vector of the residual errors e and a vector of random-effect uc encoding whether two given individuals belong to the same medical center. We further added additional random-effect components (described below) in the LMM, for which the importance was assessed incrementally (Supplementary Table S1). We next included our parameter of interest, the local ancestry–exposure interaction effect, on top of random-effect components selected at the previous step and continued our assessment of additional components until reaching the desired characteristics. Our full and final LMM was defined as follows:

$$y = \; C\beta _{\mathrm{{C}}} + \beta _{\mathrm{{e}}}x_{\mathrm{{e}}} + \left[ {\beta _{\mathrm{{g}}}z_{\mathrm{{g}}} + \delta _{\mathrm{{g}}}z_{\mathrm{{g}}}x_{\mathrm{{e}}}} \right] \\ + \left[ {\beta _{\mathrm{{l}}}z_{\mathrm{{l}}} + \delta _{\mathrm{{l}}}z_{\mathrm{{l}}}x_{\mathrm{{e}}}} \right] + u_{\mathrm{{m}}} + u_{\mathrm{{i}}} + u_{\mathrm{{h}}} + u_{\mathrm{{c}}} + e$$
(2)

where, in addition to notation in Eq. (1), zg × xe and zl × xe represent interactions between global and local ancestries and exposure, respectively; δg and δl are fixed effects of interactions between global ancestry and local ancestry and exposure, respectively. The first vector of additional random effects um captures the variance of local ancestry remaining after taking into account the global ancestry as a fixed effect (zg) [16, 32]. The variance–covariance matrix of um is the ancestral relationship matrix (ARM) derived by the cross-product operation on column-wise centered and scaled Z matrix [32]. The second vector ui complements the previous vector um and comes out due to testing the interaction effect (δl) rather than the marginal effect (βl) [33]. The variance–covariance matrix of ui is derived from ARM based on stratification by binary environmental exposure status (xe) [33]. We refer to this matrix as environmental ARM or EARM. The third vector uh models the heterogeneity of phenotypic variance across the three smoking groups. The variance–covariance matrix of uh is a diagonal matrix, where entries are the same if they correspond to the same group.

To make the genome-wide screenings computationally efficient, we followed the standard two-step approach in GWAS [34,35,36,37]. First, we estimated the variance–covariance matrix of a trait by a LMM only once in the absence of local-ancestry fixed effects. Second, we applied generalized least squares to derive association test statistic for each local ancestry segment [34]. The test of association was the Wald’s test, which statistic is computed as ratio of estimated effect size (the marginal effect size βl in Eq. 1 or the interaction effect size δl in Eq. 2) to its SE.

Multi-phenotype analysis

We first conducted single-trait admixture mapping of gene–environment interactions for each of the five pulmonary traits considered and each of the two exposures (current smoker and current heavy smoker). However, these five spirometric lung function measurements are all highly correlated and likely share genetic association signals. To reduce the penalty for multiple testing and potentially increase power, we focused our primary screening on a multivariate association tests combining single-trait signals separately for each exposure without assuming any prior on the direction of the single-trait effects [36, 38,39,40]. In practice, given five Z-scores for a local ancestry segment, stored in a vector S5×1, we estimated the multivariate Z-score Sjoint in two steps. We first estimated \(\Sigma _{5 \times 5}\), the variance–covariance matrix among Z-scores under the null hypothesis of no association and then, for each local ancestry segment, we derived the multivariate statistics using the Mahalanobis distance, defined as:

$$S_{\mathrm{{joint}}} = S^{\mathrm{{T}}}\hat{\Sigma}^{-1}S$$
(3)

which squared statistics \(T = S_{\mathrm{{joint}}}^2\) follows a χ2 distribution with 5 degrees of freedom under the null composite hypothesis of no interaction effect on any of these five traits. For the estimation of \(\Sigma\), previous works suggested using the complete Z-score data from the genome-wide scan, i.e., \(\hat{\Sigma} = S^{\mathrm{{T}}}S\), assuming that the vast majority of Z-scores are distributed under the null [38, 40]. Here we made the same assumption and additionally discarded large single-trait Z-scores above a given threshold to reduce the risk of bias. Further, we approximated \(\hat{\Sigma}\) from the resulting truncated multivariate normal distribution by the maximum likelihood estimation. We empirically found the best threshold value equals 3 for our admixture mapping Z-scores (Supplementary Material).

Follow-up analysis

We performed association and further fine-mapping analysis using genotyped data, following up regions identified by admixture mapping. Considering xg is a vector of genotypes for a single SNP, we extended the interaction model (Eq. 2) by adding two terms for marginal genetic (xg) and interaction (xg × xe) effects. It is noteworthy that the exposure term (xe) was already included in this model. We performed a univariate Wald’s test with one degree of freedom to derive the p-value for interaction effect between genotype and exposure. By including a local ancestry term when testing for the interaction effect of genotype, we accounted for possible different Linkage Disequilibrium (LD) patterns for European and African ancestral backgrounds [41]. Such conditional analysis can reduce power but assures that the interaction effect of genotype is driven by a biological mechanism rather than a better SNP tagging in a particular ancestral population.

As discussed in our previous work [20], we expected SNPs in regions of local ancestry–smoking interactions to show multiple-SNP effects on the trait as well as high allelic frequency differentiation at SNPs between ancestral populations. Hence, we performed a comparative study of allelic frequencies between the two ancestral populations and fine-mapping analysis to assess the potential presence of multiple causal variants. First, we computed allele frequency differences (ΔDAF) [42] at all SNPs to measure the allelic heterogeneity between European and African population groups from the 1000 Genomes Project (Phase III) [30]. The ΔDAF measures were previously found to be highly correlated with Weir and Cockerham’s FST in the 1000 Genomes sample [43]. We assessed cases of extreme ΔDAF in the regions of interest by comparing the observed value against the threshold proposed by Colona et al. [44] In brief, that study grouped 36.8 million variants in African and European populations from the 1000 Genomes Project [30] into bins of non-overlapping sets of 5000 variants and derived the distribution of the maximum ΔDAF for further comparisons. Second, we applied a Bayesian method implemented in the software package FINEMAP [45] to estimate the posterior probability of each single SNP interaction to be causal, conditionally on in-sample linkage disequilibrium pattern. FINEMAP implements a shotgun stochastic search algorithm to efficiently explore the most likely causal configurations and we ran FINEMAP with the default parameters on the maximum number of causal SNPs (5), prior probabilities on the number of causal SNPs, and the prior probabilities of a single SNP to be causal.

Results

Participants characteristics and description of traits

We used a dataset of 3300 African American research volunteers from the COPDGene study [46]. Self-identified non-Hispanic African Americans and non-Hispanic European Americans between 45 and 80 years of age with a history of at least 10 pack-years of smoking were enrolled from 21 medical centers across the United States. Details on phenotyping, genome-wide genotyping of the cohort, and inference of local ancestry are provided in Methods and Supplementary Material.

Individual characteristics of the COPDGene dataset are described in Table 1 for the whole sample and stratified by current smoking status: non-smokers, moderate smokers (1–14 cigarettes per day), and heavy smokers (>14 cigarettes per day). The three smoker groups have similar overall characteristics, although there are some differences due to the COPDGene study enrollment protocol: the current smokers are younger and have higher proportions of males; non-smokers have higher numbers of accumulated pack-years. After a stringent quality-control procedure (Supplementary Figs. S1 and S2, and Supplementary Table S2), we observed a total of 30,043 local ancestry segments. The distribution of proportions of local African ancestry (averaged across individuals) is roughly uniform along the genome (Fig. 1a), confirming that the local ancestry data were free from artifacts. The proportion of global African ancestry ranged between 26.3% and 99.8% across individuals with an average of 80.3% (Fig. 1b). We confirmed that individuals with higher proportions of African ancestry tend to have lower pulmonary function at a highly significant level (P < 0.001; Supplementary Table S3) for all the traits, except FEV1/FVC. We also assessed the variance heterogeneity among the smoking groups (not current smokers, moderate current smokers, and heavy current smokers) by its explicit modeling as random effects (Eq. 2). Both exploratory data analysis (Supplementary Fig. S3) and formal statistical tests (P < 0.0001 for all traits; Supplementary Table S4) highlighted differences among these three groups not only by mean, but also by variance (the group of non-smokers has the largest variance).

Table 1 Characteristics of African American participants in the COPDGene project.
Fig. 1: The African ancestry of African American participants of the COPDGene project.
figure 1

a The distribution of local ancestry is plotted by physical position in the genome on the X axis. For each local ancestry segment, the proportions of individuals with two African chromosomes (light green color), one African chromosome (green color), and no African chromosomes (dark green color) are presented on the Y axis. b The distribution of the global African ancestry among 3300 African American individuals in the COPDGene study is shown. The vertical red dashed line depicts the mean value, 0.803.

Mixed model accounts for population structure and variance heterogeneity

To address the methodological question of performing a test of local ancestry–exposure interaction in admixed population, we chose a data-driven approach where the robustness of candidate models was assessed using the genomic control parameter (λ) and the overall shape of the standard QQ plot of the −log10(p-value). We started with the simplest model testing the marginal effect of local ancestry and incrementally added terms necessary for robust testing of local ancestry–exposure interaction, our main parameter of interest.

We proposed to use a LMM including up to four random effects: um, ui, uh, and uc, capturing structure due to shared local ancestry, local ancestry–smoking interaction, smoking status, and recruitment medical center, respectively. We added each of these components into the model through a stepwise procedure, assessing their relative contribution in various combinations. When all components are present, the model has the form given in Eq. 2. All variance–covariance matrices for the random effects observed in real data are illustrated in Supplementary Fig. S4, whereas additional details of the model selection are provided in Supplementary Material and are summarized in Supplementary Table S1.

We used the FEV1% predicted phenotype as an illustrative scenario and started with a marginal association model for single-trait admixture mapping (Fig. 2a) (i.e., without ancestry–environment interaction) including only uc, as there was strong heterogeneity in the distribution of traits by medical center (Supplementary Table S4). This initial model is given in Eq. 1. Marginal local ancestry test statistics in the initial model showed substantial inflation (λ = 1.19). We further accounted for correlation by genome-wide local ancestry across individuals, which was done by modeling the ARM (the um random effect as in Eq. 2). The addition of the ARM component substantially mitigated the inflation (λ = 1.04). We next considered adding the heterogeneity component (the uh random effect as in Eq. 2) that captured substantial portion of variance of modeled traits (Supplementary Table S4). The last additional component did not impact the overall distribution of p-values (λ = 1.03), but resulted in an improved power of admixture mapping because of reduced amount of the residual variance after modeling the heterogeneity. When comparing two models with and without the heterogeneity component, the average test statistic at associated ancestry segments (P < 3.06 × 10−5) increased from 11.17 to 15.01, which is equivalent to boosting the effective sample size by 34% [37].

Fig. 2: Robustness of mixed-model admixture mapping assessed by quantile–quantile (QQ) plots.
figure 2

The linear mixed model is examined through different combinations of random genetic and heterogeneity effects denoted in equations as um, ui, and uh, while labeled on the plots as ARM, EARM, and Het. (see Eqs. 12 and Supplementary Table S1). Admixture mapping is conducted for FEV1 % predicted phenotype and current smoker status is used in the evaluation of ancestry–smoking interactions. a When testing marginal effects of ancestry, the distribution of the test statistic is not inflated only when the genetic random effect (ARM) is presented in the model (λ = 1.04 or 1.02). b When testing ancestry–smoking interaction effects, all three random effects (ARM, EARM, and Het.) are necessary to mitigate the inflation (λ = 1.06).

We then expanded the LMM to address the model components related to testing local ancestry–smoking interactions (Eq. 2 and Supplementary Table S1). Following the exploration of the marginal model, we examined the role of four random effects by conducting admixture mapping for a single trait, the FEV1% predicted phenotype, and the current smoker exposure in different model configurations (Fig. 2b). Overall, the association test statistic was substantially inflated (λ = 1.61) for the initial model with only the uc random effect and without any of the other three components um, ui, and uh in Eq. 2. Including the components separately in the model, either the heterogeneity component (uh), one genetic component with ARM (um), or two genetic components (um and ui), was not sufficient to fix the inflation (λ = 1.16, 1.6, and 1.39, respectively). The association test statistic was well-behaved only when all three components were added (λ = 1.06). Hence, our final model for admixture mapping of local ancestry–smoking (gene–environment) interaction (Eq. 2) has three additional random effects um, ui, and uh, which capture structures due to shared local ancestry, local ancestry–smoking interaction, and outcome heterogeneity across smoking groups.

Local ancestry–smoking interactions detected by admixture mapping

We conducted admixture mapping of local ancestry–environment interaction for all five spirometric traits considering the two binary exposures independently (Supplementary Figs. S5 and S6). Single-trait statistics showed limited inflation for all analyses (λ = 0.93–1.09), except for FEV1/FVC and current smoker exposure (λ = 1.12; Supplementary Figs. S7 and S8). For each exposure, single trait results were combined to form a multi-trait test (Fig. 3). The two multi-trait analyses show a well-controlled type I error rate (λ = 0.98, 1.02; Supplementary Fig. S9). Following the eigenMT approach [20], we estimated the number of effective tests to be 1635 (Supplementary Material) and thus reduced the multiple-testing burden from the Bonferroni threshold (0.05/30,043 = 1.51 × 10−6) to the effective Bonferroni threshold (0.05/1635 = 3.06 × 10−5). We also considered the suggestive Bonferroni threshold (0.5/1635 = 3.06 × 10−4) to select additional candidate loci of interest.

Fig. 3: Admixture mapping identifies two genome-wide significant and five suggestive loci of local ancestry–smoking interactions.
figure 3

Manhattan plots show results of two admixture mappings of ancestry–smoking interactions, where smoking is one of two binary variables: a current smokers vs. non-smokers and b current moderate smokers vs. current heavy smokers. The multivariate test joins the single-trait test statistics from five traits under the composite null hypothesis of no association and provides the multi-trait p-values. Horizontal lines depict the effective Bonferroni threshold (0.05/1635 = 3.06 × 10−5) and the effective suggestive Bonferroni threshold (0.5/1635 = 3.06 × 10−4), where 1635 is the effective number of tests estimated by the eigenMT method [20].

We identified two genome-wide significant and five suggestive interaction association signals (Table 2). The first genome-wide significant locus in the region chromosome 11p15.2-3 spanned 11 ancestry segments of average length 37 Kb and 17 SNPs per segment. The top ancestry segment Chr11: 12,341,061–12,373,680 of length 32,619 kb and 21 SNPs had a multi-trait positive interaction effect with current smoker exposure with multi-trait Z = 5.34 and P = 2.79 × 10−5. The signal was driven by all single-trait associations, where all p-values passed the suggestive Bonferroni threshold. The top single-trait association for FEV1 % predicted showed Z = 4.53 and P = 5.86 × 10−6. That top single-trait ancestry segment was 47 kb away from the top multi-trait one, but the two segments were highly correlated after adjusting for the global ancestry (the Pearson’s correlation coefficient, 0.98). The second genome-wide significant locus in chromosome 2q37.3 included 11 ancestry segments of length 32 kb and 9 SNPs on average. The strongest multi-trait and top single-trait (FEV1) association signals were both localized in the same ancestry segment Chr2: 238,430,224–238,486,767 of length 56,543 kb and 21 SNPs. The positive interaction effects with the current heavy smoker exposure showed Z = 5.34 and P = 2.90 × 10−6, and Z = 4.53 and P = 2.50 × 10−6 for multi-trait and top single-trait associations, respectively.

Table 2 Top local ancestry segments-smoking interactions.

The other five suggestive loci (Table 2) were mostly detected in the interaction admixture mapping with current heavy smoker exposure (13q12.3-13.1, 11q21, 7p15.2-3, and 8q21.13), except one locus 1q44 from the mapping with current smoking exposure. In contrast to the genome-wide significant loci, multi-trait association signals were much stronger than top single-trait signals for all suggestive loci (the difference in p-values was several orders of magnitude for some tests).

Genotype-smoking interactions reveal differentiated genetic variants

For each region showing at least suggestive significance in the multi-trait admixture mapping analysis, we assessed potential interactions of single SNPs available in the region around the top admixture signal. We conducted association analysis for a total of 888 SNPs available across the seven regions lying within ancestry segments (the average number of SNPs per region was 126.9 and the average number of SNPs per segment was 13.7). Here we focused on the single trait showing the largest association signal in admixture mapping. Overall, none of these SNPs passed a stringent Bonferroni correction threshold accounting for all SNPs tested in each region. However, the top SNPs especially in the genome-wide significant loci helped to localize the association signal (Table 3). In the first genome-wide significant region 11p15.2-3 (Fig. 4), the top SNP rs933920 (hg19 chr11:g.12481110C> T) (P = 0.0036) is an intronic variant in the PARVA gene (MIM 608120). In the second genome-wide significant region 2q37.3 (Fig. 4), the first top SNP rs7569427 (hg19 chr2:g.238413338A> G) (P = 0.02) was an intronic variant in the MLPH gene (MIM 606526) and the second top SNP rs2280289 (hg19 chr2:g.238483729A> G) (P = 0.036) was a missense variant in the RAB17 gene (MIM 602206).

Table 3 Top SNP-smoking interactions in regions identified by admixture mapping.
Fig. 4: Detailed regional association plots for three selected loci 11p15.2-3, 2q37.3, and 11q21.
figure 4

From top to bottom, each panel shows (i) the regional association plot, (ii) linkage disequilibrium pattern, (iii) annotated protein-coding genes, (iv) trans-continental differences in allele frequency. (i) The Y axis represents −log10(P) of SNP association tests using a single phenotype most strongly associated with the ancestry segment in the locus. The shaded area represents the strength of local ancestry association in the multi-trait admixture mapping with stronger associations painted by darker shades of grey. The blue diamonds represent the Bayes factors for assessing the evidence that a SNP is causal estimated by FINEMAP; the bigger the diamond the higher the Bayes factor. (ii) The r2-based LD heatmap is built using genotypes of the COPDGene study and the gradient of red is proportional to the r2. (iii) Protein-coding genes are queried from grch37.ensembl.org (iv) Allele frequencies are estimated in the 1000 Genomes Project for European and African populations. For each SNP in an ancestry segment, the cyan indicates the frequency of the European minor allele variant, while a vertical segment connects the European and African frequencies of the allele. The segments are colored according to the direction of the difference: red when the African frequency is higher than the European frequency, or green for lower African frequency.

To assess the level of allelic heterogeneity of SNPs in these identified regions, we computed the allele frequency differences (defined as ΔDAF in Materials and Methods). The SNPs exhibited high levels of heterogeneity for all regions (ΔDAF in 0.62 and 0.78). Overall, three out of our seven loci (1q44, 2q37.3, and 8q21.13) matched the 0.7 threshold proposed by Colonna et al. [44], defining the 1% of the genome displaying the most extreme differentiation across populations. These differences in minimum allele frequency between European and African ancestries can also be visually assessed on Fig. 4 and Supplementary Figs. S1013. Finally, we also evaluated the hypothesis of multiple causal SNPS per region using the FINEMAP software [23], but we were not able to find any strong evidence for multiple causal SNPs (Supplementary Table S5).

Replication of association signals in European GWASs

We performed replication analyses of association signals detected at individual SNP level and region level, for the seven loci reported in Tables 2 and 3. We considered two large studies of pulmonary phenotypes and COPD conducted in individuals of European ancestry: the CHARGE consortium (N = 50,047), which had genome-wide summary results for SNP-by-smoking interaction [47], and the most recent and largest meta-analysis of UK Biobank and SpiroMeta cohorts (COPD cases = 35,735, COPD controls = 222,076; N = 400,102 for pulmonary function phenotypes), which provides an up-to-date list of variants with genome-wide significant marginal genetic effect [48, 49].

Matching our nine top SNPs with the CHARGE consortium results [47], we found that three were missing, being rare or monomorphic in European population. Interaction effects of three out of the six remaining SNPs were replicated at the nominal significance level (P < 0.05) (Supplementary Table S6) in SNP–smoking interaction screening of FEV1/FVC, where packs-years was used as a proxy for heavy smoking.

We next assessed the presence of marginal genetic effect at our seven loci using the aforementioned meta-analysis of pulmonary phenotypes [48] and COPD [49]. Although our marginal signals of top local ancestry segments and SNPs were weak (Supplementary Tables S7 and S8), we observed that nine genome-wide significant SNPs from GWASs [48, 49] were located <1 Mb away from the three loci 2q37.3, 11p15.2-3, and 7p15.2-3 (Supplementary Table S9). In particular, two SNPs rs80145403 (hg19 chr11:g.12493292G> A) and rs7114698 (hg19 chr11:g.12707876C> T) were within the same PARVA and TEAD1 genes in Chromosome 11 as in our SNP–smoking interaction analysis (Table 3). Notably, five SNPs come from two loci that each has three distinct signals estimated by the conditional analysis [49]. Such a scenario with multiple SNPs driving either marginal or interaction association is beneficial for our ancestry-based approach to detect gene–environment interactions, and thus may explain the signal overlap at the region level for two 11p15.2-3 and 2q37.3 loci.

Discussion

Broadening the diversity of ethnicities in genetic analysis can provide important information for disease pathogenesis. Leveraging local ancestry through admixture mapping could improve power to discover marginal genetic and gene–environment effects, although the technical and statistical challenges still remain. To address these challenges, we introduced a multi-component LMM and empirically demonstrated its robustness in admixture mapping on real data in 3300 African American participants in the COPDGene study. We detected two genome-wide significant and five suggestive loci showing smoking-dependent effects of local ancestry on spirometric lung function phenotypes.

Although the functional effects of variants in the identified genomic regions is unknown, these regions harbor genes previously known for traits related to smoking. The top SNP rs933920 in the first genome-wide significant locus 11p15.2-3 (P = 2.79 × 10−5) is located within the PARVA gene, which produces a focal adhesion protein [50]. Two previous studies reported that this gene was differentially methylated in small airway epithelium [51] and buccal mucosa [52] when stratified by current smoking status. The second top SNP rs2280289 for the second genome-wide significant region on 2q37.3 (P = 2.90 × 10−5) is a missense variant in RAB17, which was previously associated with a smoking cessation genotype success score [53, 54]. As for the most relevant result found among the five suggestive loci, the top SNP rs11020968 (hg19 chr11:g.94602414C> T) in the locus 11q21 (P = 5.0 × 10−5) is a missense variant for angiomotin-like protein 1 (AMOTL1; MIM 614657), a tight junction protein hypothesized to play a role in COPD through endothelial tight junction permeability and whose expression is affected by cigarette smoking [55]. Additional genes around other loci may warrant further investigation.

We further attempted to evaluate whether SNPs within these identified regions show multiple-SNP effects and exhibit high allelic differentiation, as our previous work on gene–gene interaction admixture mapping suggested this kind of genetic architecture [20]. Overall, allele frequency heterogeneity between European and African ancestries was very strong and persistent in the identified regions. Although our fine-mapping analysis did not show evidence for multiple causal variants, SNP–smoking interaction analysis is known to have limited power [56]; thus, we cannot rule out the possibility of multiple causal variants. Indeed, the conditioning on the primary signal of local ancestry is likely to decrease the statistical power to detect interactions at the SNP level even in larger samples [56]. Alternative methods, such as jointly modeling ancestry and genotype association signals, might help to overcome this limitation [57].

Our methodological contributions to admixture mapping are multiple. First, we extended the concept of genetic relationship matrix originally proposed to control population structure in GWAS [34]: the ARM was similarly computed on local ancestry data and further used in association tests. Second, we adopted the population stratification approach recently designed specifically for GWAS of gene–environment interactions [33]: two matrices, the standard ARM, but also a second environmental ARM or EARM, were essential to control for spurious association results when testing local ancestry–environment interactions. Finally, we modeled the outcome heterogeneity among groups stratified by environmental exposure. Modeling this heterogeneity increased the power because of reduced residual phenotypic variance (up to 65%) and substantially decreased the inflation of interaction test statistic. Admixture mapping has a lower computational complexity compared with GWAS and our complete single-trait analysis took up to 4 min on the standard desktop computer.

Our study also has limitations. COPDGene is one of the largest studies of African American smokers, with a high proportion of subjects with COPD, which makes suitable replication cohorts challenging. Nevertheless, we were able to reproduce some of the top SNP–smoking interactions in the CHARGE consortium [47] at nominal significance level. More importantly, our study identified three loci 2q37.3, 11p15.2-3, and 7p15.2-3 using a dataset of only 3300 African American individuals, whereas the same loci only passed the genome-wide significance threshold of standard univariate association in an independent replication cohort including up to 400,000 individuals [48, 49]. We also note that our screenings for local ancestry–smoking interactions were limited to binary smoking exposures. We explored a three-level exposure model; however, we found that the parameters increase made the estimation computationally unstable [32]. We believe that such complex gene–environment LMMs would require datasets with larger sample sizes [58]. Further, the method for admixture mapping can also be optimized. When conducting association analysis, excluding the local ancestry segment under testing from the ARM construction will be able improve power [13], but is computationally more burdensome. We attempted a more efficient out-of-chromosome strategy commonly applied in GWAS [35], but we observed fairly inflated test statistics (data not shown).

In conclusion, our study reports a powerful approach for gene–environment interaction association studies, leveraging the unique genetic architecture of complex traits measured in recently admixed populations. The proposed statistical model has shown to be robust to population structure and outcome variance heterogeneity. In our application to the COPDGene study, we have found two genome-wide significant local ancestry–smoking interactions of lung function phenotypes that would have been missed in standard single SNP interaction analyses. Overall, our findings provide additional evidence of the importance of ethnic diversity in genetic clinical studies.