Best practices for multi-ancestry, meta-analytic transcriptome-wide association studies: Lessons from the Global Biobank Meta-analysis Initiative

Summary The Global Biobank Meta-analysis Initiative (GBMI), through its diversity, provides a valuable opportunity to study population-wide and ancestry-specific genetic associations. However, with multiple ascertainment strategies and multi-ancestry study populations across biobanks, GBMI presents unique challenges in implementing statistical genetics methods. Transcriptome-wide association studies (TWASs) boost detection power for and provide biological context to genetic associations by integrating genetic variant-to-trait associations from genome-wide association studies (GWASs) with predictive models of gene expression. TWASs present unique challenges beyond GWASs, especially in a multi-biobank, meta-analytic setting. Here, we present the GBMI TWAS pipeline, outlining practical considerations for ancestry and tissue specificity, meta-analytic strategies, and open challenges at every step of the framework. We advise conducting ancestry-stratified TWASs using ancestry-specific expression models and meta-analyzing results using inverse-variance weighting, showing the least test statistic inflation. Our work provides a foundation for adding transcriptomic context to biobank-linked GWASs, allowing for ancestry-aware discovery to accelerate genomic medicine.


In brief
Bhattacharya and Hirbo et al. describe a framework to identify gene-level genetic associations with complex traits through meta-analysis of multi-ancestry datasets from biobanks using transcriptome-wide association studies (TWASs). They demonstrate pitfalls in TWASs and review open methodological challenges, including lack of portability of expression prediction and tissue specificity of results.

INTRODUCTION
Population-based or clinical case-based biobanks are key to precision medicine efforts and provide opportunities for genomic research, 1 offering context to deploy genome-wide association studies (GWASs) at scale. Multi-biobank collaborations, like the Global Biobank Meta-analysis Initiative (GBMI), facilitate well-powered, multi-ancestry genetic research and accelerate the understanding of biological mechanisms underlying diseases by in silico longitudinal genetic studies and examination of pleiotropy. 2,3 A key challenge in GWASs is interpreting trait-associated loci with a biological mechanism, 4,5 using methods like colocalization, [6][7][8][9] Mendelian randomization (MR), [10][11][12] and transcriptomewide association studies (TWASs). TWASs integrate GWASs with expression quantitative trait loci (eQTL) to prioritize genetrait associations (GTAs) using mediation analysis 13,14 or MR. 15 TWASs involve three steps. First, genetic predictive models of gene expression are trained in the eQTL dataset. Then, genetically regulated expression (GReX) is imputed in the GWAS cohort with individual-level genotypes. Lastly, statistical associations between GReX and trait are estimated. [13][14][15][16] TWASs are viable with GWAS summary statistics by estimating the test statistic of the TWAS association using a proper linkage disequilibrium (LD) reference panel. 14 Traditional TWAS methods predict expression using SNPs within 1 Mb of the gene body. [13][14][15][16] Recently, methods including strong distal-eQTL signals have shown improved prediction and power to detect GTAs. 17,18 Nonetheless, practical and statistical considerations to accurately prioritize GTAs through TWASs still require methodological improvement.
Along with GWASs, TWASs introduce new challenges by incorporating gene expression 19 (Figure 1). On the genetic level, as in GWASs, disentangling signals from complex LD structure, relatedness, and ancestry requires careful modeling considerations. 20,21 Selection of LD references is specifically important in multi-ancestry settings, like GBMI, as LD structure across ancestry groups differs greatly. 22 Mismatched LD may lead to gene expression models with reduced predictive power, reduced power to detect GTAs, and increased false positives. [23][24][25] In addition, phenotype acquisition and aggregation are challenging across multiple biobanks with different healthcare, electronic health record, and case-control definitions. However, a challenge specific to TWASs is the integration of gene expression with GWAS signal. Not only is it unclear how to choose an optimal set of genes and tissues best explaining the SNP-trait association, the role of context-specific expression is still being evaluated. For example, dynamic differences in bulk tissue expression due to varied cell types and cell states can pro-vide granularity to GTAs. The impact of these challenges in a meta-analytic framework has not been explored.
Here, we outline a TWAS framework for analyzing multiancestry, meta-analytic GWASs across multiple biobanks. We explore practical considerations for all three steps of the TWAS framework (Figures 1 and S1): ancestry specificity of expression models and LD reference panels, meta-analytic techniques for GTA detection, and follow-up analyses for biological context. Given current TWAS frameworks and for analyses using GWAS summary statistics, we recommend training ancestry-specific genetic predictive models of gene expression and conducting TWAS disease mapping by meta-analyzing effect sizes using inversevariance weighted meta-analysis. Our framework can be applied broadly to study population-wide and ancestry-specific genetic associations with underlying transcriptomic mechanisms.

RESULTS
The GBMI ancestry-wide, meta-analytic TWAS pipeline We outline the TWAS framework employed in analyses using GWAS summary statistics from the GBMI ( Figure S1; details in STAR Methods). 3 First, using joint-tissue imputation (JTI) 26 and multi-omic strategies for TWASs (MOSTWAS), 17 we train ancestry-specific predictive models of gene expression. Next, we identify GTAs using these models and GWAS summary statistics by either multiple-instrumental-variable causal inference or the weighted burden test with an ancestry-matched LD Each level of data in a TWAS introduces a set of challenges: (1) genetics data include confounding from genetic ancestry, population structure and relatedness, and complex linkage disequilibrium patterns, (2) gene expression data introduces context-specific factors, such as tissue-, cell-type-, or cell-state-specific expression, and (3) phenotypic data involve challenges in acquiring and aggregating phenotypes, properly defining controls for phenotypes, and ascertainment and selection bias from non-random sampling.
2 Cell Genomics 2, 100180, October 12, 2022 Short Article ll OPEN ACCESS reference panel for JTI and MOSTWAS models, respectively. This association testing is done for each ancestry group and each biobank. Finally, we perform meta-analysis on the effect sizes across each biobank and each ancestry group using inverse-variance weighting. Lastly, we contextualize GTAs with multiple follow-up analyses.
Expression models are not portable across ancestry groups GBMI's diversity enables uniquely well-powered studies to detect trait associations in non-European populations. However, optimal TWASs require well-powered training datasets of genetic and tissue-specific gene expression data, which are still lacking for non-European populations. Power to detect GTAs in TWASs is dependent on expression heritability and the predictive power of the expression model. 27 Hence, accurate expression prediction across diverse populations is necessary to ensure that TWAS associations are not restricted to European populations. For the first GBMI TWAS, we restrict analysis to European ancestry populations because of small numbers of non-European ancestry individuals in eQTL reference panels. 28 As sample sizes for eQTL datasets in non-European populations increase, the TWAS pipeline will include expression models for these populations (STAR Methods). We illustrate some challenges in building these expression models across ancestry groups.
In five tissues in GTEx with >70 samples from both European (EUR) and African (AFR) ancestry, we trained EUR-and AFRspecific models using elastic net regularized regression and imputed expression into the aligned (i.e., training and imputation samples have similar ancestries) and misaligned (i.e., training and imputation samples have different ancestries) groups. 29,30 For context, we also built ancestry-unaware models, where EUR and AFR samples were pooled together. Predictive performance was calculated with adjusted 5-fold cross-validation R 2 to account for sample size (STAR Methods). For this analysis, we do not use the JTI or MOSTWAS models, as they either require multi-tissue samples or larger sample sizes than what is available for AFR ancestry individuals, respectively. Both methods borrow from elastic net regression; results from this analysis are applicable to both methods.
Across these tissues, models trained in EUR samples performed, on average, 3-4 times worse (0.02-0.04 difference in median R 2 ) in AFR samples compared with models trained in AFR samples (Figure 2A; Tables S1-S3). More than 80% of gene models have stronger performance if trained in AFR samples. Distributions of ancestry-aligned and -misaligned adjusted R 2 and percent differences ( Figure S2) emphasize that a 0.02-0.04 increase in prediction R 2 is stark. Similar trends hold for ancestry-specific models imputed into down-sampled EUR imputation samples (Figures S3 and S4; Tables S1-S3), consistent with previous simulation and real-world studies 23,25 ; here, we considered a randomly selected EUR imputation sample with equal sample size to that of the AFR sample. In fact, we observed that ancestry-specific models imputed into a sample with aligned ancestry showed larger predictive R 2 than ancestry-unaware (individuals of EUR and AFR ancestry in the training sample) models imputed into the same sample (Fig-ure 2B; Table S4), despite increased sample sizes. This observation also holds if we further increase the training sample size by including individuals of other ancestries (Asian, American Indian, and so-called Unknown ancestries) ( Figure S5). Not only is this observation in line with recent results from Patel et al. that show differences in causal effect sizes for gene expression across ancestry groups, 31 it also emphasizes the need for ancestry matching in expression prediction and greater recruitment of non-European ancestry patients in eQTL studies.
One reason ancestry-unaware models may perform poorly in AFR samples is because of differences in minor allele frequency (MAF) of predictive SNPs between EUR and AFR ancestry populations. Importantly, this discrepancy is not generally specific to any one ancestry. Rather, ancestry imbalance in training or reference datasets may lead to poor portability of genetic models because of MAF differences. To incorporate common SNPs in both AFR and EUR ancestry populations, we trained ancestryunaware and ancestry-specific models using SNPs with MAF exceeding various thresholds in both AFR and EUR samples. Excluding SNPs with MAF <0.01 improved predictive performance of ancestry-unaware models across all tissues ( Figure S6; Table S5). However, the gap in predictive performance between ancestry-specific and ancestry-unaware models did not decrease when the MAF cutoff was increased ( Figure 2B; Table S4). This observation may reflect that dropping ancestryspecific rarer SNPs ignores variants with large ancestry-specific effects on gene expression. Additionally, excluding rare ancestry-specific SNPs does not address the differences in LD across the EUR and AFR samples that lead to different regularization paths and, hence, SNP-gene weights. Addressing the portability of expression models remains an open study direction; methodology that borrows information from functional annotations or across different cell-type-or cell-state-specific contexts may bridge this gap in predictive performance, similar to recent developments in polygenic scores (PGSs) 32 or polygenic transcriptomic risk scores, an analog to PGSs that is constructed using multi-SNP predictors of gene expression. 33,34 Meta-analytic strategies must be ancestry-aware Another critical consideration for GBMI involves meta-analysis with GWAS summary statistics. TWASs estimate a GTA by weighting the standardized SNP-trait effect sizes from GWAS summary statistics by SNP-gene weights from the expression models. To account for the correlation between SNPs, an external LD reference panel, like the 1000Genomes Project, 35 is used to estimate the standard error of the TWAS association. Thus, the differences in the reference and in-sample LD in the GWAS cohorts influence the differences in the summary statistics-based TWAS association and TWAS association from direct imputation into individual-level genotypes. Ideally, in-sample LD will give the best estimate of the TWAS standard error, but several biobanks cannot provide this information under specific genetic data sharing and privacy policies. Even departures in LD across subgroups of EUR ancestry populations may influence standard error estimates. In addition, as the estimates of SNPgene weights are influenced by the LD in the eQTL panel, differences in LD between the eQTL and GWAS panel will also affect the TWAS effect size estimation.
Cell Genomics 2, 100180, October 12, 2022 3 Short Article ll OPEN ACCESS As LD patterns differ across ancestry groups, 22 pooling ancestry groups in TWASs may lead to reduced power. We conducted TWASs for asthma risk using ancestry-unaware and EUR-and AFR-specific models of whole blood expression (4,782 genes; see STAR Methods). Ancestry-specific TWAS Z scores across EUR and AFR ancestry groups were not strongly correlated (r = 0:11Þ, potentially because of differences in sample size and eQTL and GWAS architecture ( Figures 3A, S7, and S8). 36,37 For genes with p < 2:5 3 10 À 6 in either EUR or AFR ancestry groups, both SNP-gene effects and corresponding standardized effect sizes for these SNPs from ancestry-specific meta-analyzed GWASs show very low correlation ( Figure S9). These results reinforce the lessons from the low cross-ancestry group correlations of TWAS Z scores and suggest that model training and association testing should be conducted within ancestry groups.
In fact, we detected only two genes that had a significant association with asthma in both EUR and AFR ancestry groups with p < 2:5 3 10 À 6 . One of these genes, DFFA, has been implicated with asthma risk through GWASs and colocalization in EURs. 38 However, the TWAS associations across EUR and AFR ancestry groups went in opposite directions using blood tissue. In the other four tissues explored, DFFA TWAS associations did not reach transcriptome-wide significance, but effect directions were generally concordant ( Figure S10). In blood, lead local-eQTLs (within 1 Mb) of DFFA are in opposite directions, though only nominally significant at p < 0.05 ( Figure S11). Although they are within 60 kb, the lead eQTLs for DFFA across AFR (rs263526) and EUR (rs903916) ancestry groups are not in LD (R 2 = 3 3 10 À4 in AFR, 0.0072 in EUR). The GWAS effect sizes of SNPs local to DFFA do not show large deviations in effect direction and are only nominally significant as well ( Figure S11). These differences in TWAS associations across ancestry motivate careful consideration of meta-analytic strategy to avoid biasing cross-ancestry associations toward cohorts with larger sample sizes, which still tend to be predominantly of EUR ancestry.
We investigated five meta-analytic strategies: meta-analyzing across ancestry-specific, per-biobank GWAS summary statistics  Figure 2. Comparison of predictive performance of expression prediction models across ancestry (A) Adjusted R 2 difference (y axis) when predicting expression in the AFR imputation sample between models trained in EUR and AFR training samples across tissue (x axis). Proportion of models with improved R 2 using ancestry-aligned models versus ancestry-mismatched models is labeled. (B) Adjusted R 2 difference between ancestry-specific and ancestry-unaware models imputing into EUR (left) and AFR (right) samples.
4 Cell Genomics 2, 100180, October 12, 2022 Short Article ll OPEN ACCESS using (1) inverse-variance weighting (IVW) and (2) effective sample size weighting (SSW), meta-analyzing across ancestry-specific meta-analyzed GWAS summary statistics using (3) IVW and (4) SSW, and (5) performing a TWAS using ancestry-unaware models and meta-analyzed GWAS summary statistics across EUR and AFR ancestry groups (STAR Methods). QQ-plots in Figure 3B show earlier departure of Z scores from the QQ-line for SSW meta-analyzed Z scores and the ancestry-unaware strategy, suggesting type I error inflation. This observation is supported with estimates of test statistic bias and inflation, with the largest esti-mated bias and inflation for these SSW and ancestry-unaware methods. 39 IVW strategies show similar levels of inflation, with IVW meta-analysis across ancestry-specific meta-analyzed GWAS summary statistics showing minimal bias ( Figure S12). More naive SSW meta-analysis and ancestry-unaware methods bias toward the larger EUR cohorts, whereas Z scores from the IVW methods showed positive correlations with Z scores from AFR cohorts ( Figure S8). However, it is unclear whether ancestry-specific IVW metaanalysis to the per-biobank level is necessary. As shown in B C A Figure 3. Comparison of meta-analytic strategies for multi-biobank, multi-ancestry TWASs (A) Per-ancestry meta-analyzed TWAS scores across EUR (x axis) versus AFR ancestry (y axis). The dotted lines indicate p < 2.5310 À6 with a 45-degree line for reference. Points are colored by which ancestry population the TWAS association meets p < 2.5310 À6 . (B) QQ-plot of TWAS Z scores , colored by meta-analytic strategies. Per ancestry refers to TWAS meta-analysis across meta-analyzed ancestry-specific GWAS summary statistics. Per bank/per ancestry refers to TWAS meta-analysis using all biobank-and ancestry-specific GWAS summary statistics. (C) Effect sizes and Bonferroni-corrected confidence intervals (CIs) for TWAS associations across 17 individual biobanks (EUR in teal, AFR in red) and two IVW meta-analysis strategies (in yellow) for five representative genes. The Higgins-Thompson I 2 statistic is provided. Figure S13, Z scores from these two IVW methods are moderately positively correlated (r = 0:51 across 4,152 tests), with this correlation increasing when we consider genes with nominally significant Z scores for both strategies (r = 0:70, 564 tests). We observed that top associations across these IVW meta-analyses often had high degrees of heterogeneity in effect size across biobanks, measured by the Higgins-Thompson I 2 statistic 40 ( Figures 3C and S14). When using the weighted burden test, the standard error and confidence interval widths are functions of only SNP-gene weights and the reference LD matrix; GWAS sample size does not have an effect on the standard error. One gene, A1BG, that showed directionally concordant associations across both IVW strategies had large heterogeneity across the participating cohorts (I 2 = 0.77). In fact, the cross-biobank heterogeneity is often larger than the crossancestry heterogeneity for TWAS associations of A1BG. ZNF665, another gene with concordant associations across IVW strategies, showed a low heterogeneity in the per-biobank effect sizes (I 2 = 0.26). However, genes with discordant associations across IVW strategies showed varied patterns. Two illustrative examples are MLNR and MYOZ3, both with large degrees of test statistic heterogeneity (I 2 = 0.91 and 0.82, respectively). Across the two IVW strategies, effect sizes are in opposite directions, possibly due to large standard error differences across the ancestry-specific per-biobank associations. A thorough investigation of the power and false discovery rates of these meta-analysis strategies through simulations is necessary. Methods that incorporate the per-biobank uncertainty must be explored to increase power and properly leverage the large sample sizes of GBMI. [41][42][43] A recent method called METRO is a promising route for multi-ancestry TWASs because it combines multiple ancestry-specific expression models in a joint likelihood-based inference framework and accounts for uncertainty in the prediction models across the various ancestry groups. 44 In addition to multi-ancestry populations, analyzing genetic data from individuals of admixed ancestry is also an open area of study. In this analysis, we have used the 1000Genomes AFR LD reference panel as an estimate of the LD for AFR ancestry samples from each biobank. However, most of GBMI's AFR ancestry populations are of admixed ancestry (e.g., African Americans or African British). A single LD reference panel of AFR ancestry may not reflect the genetic diversity in these admixed populations of AFR and EUR ancestries. 45 In admixed populations, using local ancestry estimates aids in better characterization of heritability of complex traits and more accurate mapping of genetic associations, especially eQTLs. 46 Incorporating ancestry-specific allelic effects on gene expression, estimated from phased genotypes and local ancestry inference, into TWASs may lead to increased power and should be explored. However, errors in ancestry inference and heterogeneity in ancestry-specific effect may pose challenges for these methodological extensions.
Follow-up tests provide biological and clinical context to TWAS GTAs TWAS GTAs identified using GWAS summary statistics are subject to several factors that may lead to false positives. We implement several follow-up tests to provide context to TWAS-identi-fied GTAs. First, a TWAS GTA could attain significance due only to strong SNP-trait associations from the underlying GWAS. To quantify the significance of the GTA conditional on the SNP-trait effects at the locus, we perform a permutation test by permuting the SNP-gene weights from the expression model to generate a null distribution (STAR Methods). Comparing the original TWAS Z score to this null distribution assesses how much signal is added by the expression given the specific GWAS architecture of the locus. This permutation test is highly conservative and prioritizes only associations already significant in the standard TWAS GTA detection. 14 Next, gene expression models for genes in adjacent genomic windows may be built from overlapping SNPs or SNPs in strong LD. When TWASs detect GTAs in overlapping genomic regions, we apply probabilistic fine-mapping using FOCUS 47 to estimate a 90% credible set of genes to explain the observed association signal in a given tissue (STAR Methods). However, the current iteration of FOCUS has limitations. Priors for the correlation matrix between GReX of overlapping genes are dependent on SNP LD reference panels. Thus, fine-mapping in trans-ancestry settings is difficult, though recent additions to the FOCUS framework, called multi-ancestry FOCUS (MA-FOCUS), account for differences in genetic architecture across the study sample. 48 Another challenge for gene-level fine-mapping in multi-tissue TWASs is distinguishing between overlapping signals across tissues. Primarily because of cross-cell-type variation in expression levels and eQTL architecture, TWASs may prioritize genes in multiple tissues that are overrepresented by the same underlying causal cell types. 19 This multi-tissue gene prioritization extends to fine-mapping overlapping TWAS signals across tissue, as priors for FOCUS are not tissue dependent; extracting posterior signal that is biologically consistent and meaningful without allowing the prior to dominate is challenging.
The GBMI TWAS pipeline incorporates gene expression models using MOSTWAS, which prioritizes distal-eQTLs by testing their mediation effect through local molecular features (STAR Methods). For genes with models trained with MOSTWAS and associated with the trait at transcriptomewide significance, we test the additional association signal from the distal-SNPs using an added-last test, analogous to a group-added-last test in linear regression. 17 This test prioritizes sets of genomic or epigenomic features that mediate the predicted distal-eQTLs for subsequent study of upstream, tissue-specific regulation of GTAs. In one application of MOSTWAS, one prioritized functional hypothesis was experimentally validated in vitro. 49 As distal-eQTLs are more likely to be tissue or cell type specific, 50 the association signal from these distal-eQTLs could also be leveraged in cross-tissue fine-mapping strategies.
Lastly, TWASs suffer from severely reduced power and inflated false positives in the presence of SNP horizontal pleiotropy. 51 We encourage estimating the degree of and accounting for SNP pleiotropy using LDA-MR-Egger 52 or PMR-Egger, 53 especially in settings with individual-level GWAS genotypes. Applications for these methods using GWAS summary statistics reveal some inflation of standard errors, 54 suggesting the need for further evaluation and development of summary statisticsbased methods.
Biobanks enable GReX-PheWAS for biological context Biobanks aggregated in GBMI provide a rich catalog of phenotypes for analysis, with phenotype codes (phecodes) aggregated from ICD codes classified into clinically relevant categories. 55 This phenotype catalog enables phenome-wide association studies (PheWASs) as a complement to GWASs by both replicating GWAS associations and providing a larger set of trait associations with GWAS variants. To follow up on novel TWASprioritized genes, the PheWAS framework can be expanded to the GReX level in a similar analysis: GReX-level phenome-wide association study (GReX-PheWAS), similar to the PredixVU database. [56][57][58][59] Not only do these analyses replicate and detect new TWAS associations, they can also point to groups of phenotypes that show enrichments for trait associations for the gene of interest.
We briefly illustrate an example of a GReX-PheWAS using three genes (Figures 4, S15, and S16; Table S6): TAF7, a novel gene in our TWAS, and ILRAP18 and TMEM258, two genes previously implicated through GWASs. [60][61][62][63][64] These genes were prioritized from European-specific TWASs for asthma risk from the flagship GBMI project using lung tissue expression (101,311 cases and 1,118,682 controls): TAF7 (MOSTWAS model), IL18RAP (JTI model), and TMEM258 (JTI model). European-specific TWAS meta-analysis for asthma detected a negative association with TAF7 cis-GReX, a gene that did not intersect a GWAS-significant locus. In TWAS follow-up tests, TAF7 passed permutation testing and was estimated in the 90% credible set at the genomic locus via FOCUS with posterior inclusion probability 1. As the clinically relevant associations for TAF7 lung GReX are not characterized, we employed a GReX-PheWAS in UKBB European ancestry GWAS summary statistics across 731 traits and diseases with sample sizes greater than 100,000, grouped into nine categories (Figure 4 and STAR Methods). We see enrichments for phenotypes of the hematopoietic and musculoskeletal groups ( Figure 4A) with hypothyroidism and chronic laryngitis as the top phenotype associations ( Figure 4B; Table S6). The curved nature of the Miami plots in Figures 4 and S14 is only because we plot Z scores in decreasing absolute value within phecode groups. These phenotypes include multiple inflammations of organs (e.g., laryngitis, osteitis, meningitis, etc.). We also detected several associations with related respiratory diseases and traits. Similarly, for the two previously implicated genes, we find enrichments for respiratory and hematopoietic GTAs for IL-RAP18 and across multiple categories for TMEM258, consistent with the categorized functions and associations of these genes (Figures S15 and S16; Table S6). GBMI's wide roster of phenotypes enables the GReX-PheWAS to add biological and clinical context to novel TWAS associations.
GReX-PheWAS, despite its utility, shares the challenges of the PheWAS. Phenotypes within and across groups may be correlated, leading to a series of dependent tests. Thus, simple adjustments of multiple testing burden may not be appropriate, and methods that account for correlation between phenotypes, like permutation tests, may be more applicable. 55,[65][66][67] In addition, covariate adjustments in expression models built for disease-specific analyses may not be generalizable for multiple phenotypes. Most population-based clinical biobanks lack comprehensive clinical and lifestyle information of the individuals because of gaps in electronic health records. Phenotype groupings may also be deceptive: as most biobanks follow ICD coding that groups traits and diseases by body systems, GReX-PheWAS enrichments for a given group may not reflect shared genetic pathways across body systems. In addition, case-control selection may not be optimal because of differences in exclusion criteria. 68 Despite these limitations in phenotype acquisition, recent methods focusing on identifying shared genetic architecture among multiple phenotypes in a phenomewide approach highlight the advantages of GReX-PheWASs. 68

DISCUSSION
Here, we provide a framework for TWASs in a multi-biobank setting across many ancestry groups. In general, for TWASs using GWAS summary statistics, we recommend a meta-analytic approach across both ancestry groups and individual cohorts. As multi-tissue transcriptomic reference panels and eQTL resources increase in sample size for populations not of European ancestry (e.g., African, Hispanic/Latin American, and East and South Asian ancestries that are represented broadly by GBMI), we will extend our TWAS analyses to interrogate underlying transcriptomic mechanisms underlying the complex traits studied with GBMI. We outline several methodological gaps to be addressed in the future: (1) training expression models that are portable across ancestry groups, (2) limiting false discovery in TWASs by properly modeling differences in LD across ancestry groups, (3) incorporating uncertainty within and heterogeneity across biobanks to boost TWAS meta-analytic power, and (4) contextualizing TWAS GTAs through follow-up testing, probabilistic fine-mapping across ancestry groups and expression contexts, and GReX-PheWASs.
Along with the discussed issues with current TWAS methodology, tissue-specific expression may not provide sufficient granularity needed to discover trait-relevant biological mechanisms. Recent methods that study the mediation of the SNP-trait relationship by cell-type heterogeneity show that cell types and cell states are influenced by genetics and predict complex traits, and modeling these directly may lead to improved power in detecting trait associations. 26,69-72 Single-cell eQTL datasets can be integrated with GWASs to identify context-specific expression pathways that are disease related. Incorporating singlecell expression data into a predictive model will require improved methodology that models cell identity as a spectrum. 73,74 Multiomic approaches incorporating functional data with TWASs may better model the flow of biological information in a biologically interpretable fashion. 17, 75,76 Despite the limitations of this suite of methods, TWASs continue to be a useful tool for interpreting GWAS associations and independently discovering genetic associations mediated by gene expression. More sophisticated integrative computational and experimental tools to complement improved TWASs and GWASs to understand the biology underlying health and disease need to be developed. Most importantly, reference eQTL data from individuals of non-European ancestry needs to be collected at parity with those of European ancestry individuals.

DECLARATION OF INTERESTS
The authors declare no competing interests.

INCLUSION AND DIVERSITY
We worked to ensure gender balance in the recruitment of human subjects. We worked to ensure ethnic or other types of diversity in the recruitment of human subjects. One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in science. One or more of the authors of this paper self-identifies as a member of the LGBTQ+ community. One or more of the authors of this paper received support from a program designed to increase minority representation in science. While citing references scientifically relevant for this work, we also actively worked to promote gender balance in our reference list. The author list of this paper includes contributors from the R 2 R 0:01 with P < 0:05 are considered for TWAS. Ancestry-specific models are trained, excluding SNPs with MAF <0.01 and deviated from Hardy-Weinberg at P < 10 À 5 across all 838 GTEx samples. These filtering steps led to a total of 6,106,016 SNPs. We acknowledge that removing SNPs that deviate from Hardy-Weinberg may lead to removal of some causal eQTL variants, but often times these SNPs are prone to genotyping errors. In particular, we assessed a less conservative p-value cutoff for deviation from Hardy-Weinberg (p < 10 À6 ) as in previous TWAS analyses 13,14,23 ; this cutoff only included an additional 47,265 SNPs. A comprehensive analysis of the effects of estimated deviation from Hardy-Weinberg on eQTL estimation and TWAS predictive model training will be illuminating. The first iteration of the GBMI TWAS pipeline focuses on EUR-ancestry models due to larger sample sizes. However, as sample sizes for reference eQTL data for other ancestry groups increase, this pipeline will include gene expression models that are specific for these currently underrepresented ancestry groups. In addition, models from other data sources using other methods can be incorporated in subsequent steps.

Hypothesis tests for TWAS
To test for an association between tissue-specific GReX of a gene and a trait of interest, GWAS summary statistics are integrated with these expression models. For the EUR-specific TWAS, we use EUR-specific meta-analyzed GWAS summary statistics across all biobanks. JTI and MOSTWAS use two different approaches to test for a GTA. For MR-JTI, the posterior predictive distribution of GReX is estimated, and multiple-instrumental-variable causal inference is used to estimate the GTA, controlling for overall heterogeneity. 26 For MOSTWAS, a weighted burden test is constructed, as in FUSION. 14,17 Both of these methods require a LD reference panel; the GTEx LD matrix is used as a reference. Taken together, these methods provide effect sizes, standard errors, Z-scores (effect sizes standardized by standard error), and p-values for GTAs. A GTA is transcriptome-wide significant using a Bonferroni correction across all tests run. The number of tests run is equal to the sum of the number of significant gene models across all tissues. Follow-up tests and analyses are then run to provide context to the TWAS GTAs. A permutation test is run by shuffling the SNPgene weights 1,000 times and determining the TWAS Z score at each permutation, generating a null distribution. The original TWAS Z-score is compared to this null distribution to generate a permutation p-value; Benjamini-Hochberg FDR correction is used to account for multiple testing burden here. This test examines whether the SNP-gene relationship provides more information than just the SNP-trait association. Next, for MOSTWAS, the distal-SNPs added-last test is run to measure the association from distal-SNPs in the expression models, conditional on the association from local-SNPs. 17 This test prioritizes sets of mediating molecular features for the SNP-gene relationship with significant effects on the trait. Lastly, for genes whose models are built using SNPs from overlapping genomic regions, probabilistic fine-mapping via FOCUS (default parameters and priors) is employed to determine a 90% credible set of genes that explain the gene-level association signal at the locus. 47 FOCUS also outputs posterior inclusion probabilities for each gene in the 90% credible set.

Analysis of ancestry-specific and -unaware models
To show the utility of ancestry-specific models, we train EUR-and AFR-specific models using elastic net regression for 5 tissues with more than 70 samples from AFR ancestry patients. To balance sample sizes in the imputation sample, we down-sampled the EUR ancestry imputation sample to match the AFR imputation sample. We consider only genes with positive expression heritability in both EUR and AFR training samples. We also build ancestry-unaware models, where genotypes for EUR and AFR samples are pooled together in the training sample. We calculate predictive performance in aligned and misaligned imputation samples based on ancestry; the aligned imputation sample is one with ancestry that predominantly matches the ancestry of the training sample. Predictive performance is measured with adjusted R 2 to account for sample size, using an appropriate linear model between predicted and observed expression. For imputation samples that are used in training (aligned imputation panel), we use 5-fold CV when measuring predictive performance. We also conducted a down-sampling analysis, where we randomly selected a subset of the EUR samples to match the AFR sample size; in this down-sample, we performed the same comparison of predictive performance. Lastly, when imputing into AFR and EUR samples using the ancestry-unaware models, we only cross-validate over the AFR or EUR samples, respectively. For example, the training set in a fold when imputed into AFR samples includes all EUR samples and 80% of the AFR samples.

QUANTIFICATION AND STATISTICAL ANALYSIS
Strategies for meta-analysis We compared 5 different meta-analytic strategies empirically: meta-analyzing across ancestry-specific, per-biobank GWAS summary statistics using (1) inverse-variance weighting (IVW) and (2) sample-size weighting (SSW), meta-analyzing across ancestry-specific meta-analyzed GWAS summary statistics using (3) IVW and (4) SSW, and (5) TWAS using ancestry-unaware models into metaanalyzed GWAS summary statistics across EUR and AFR ancestry groups. First, we consider three different sets of GWAS summary statistics: biobank-and ancestry-specific summary statistics, ancestry-specific summary statistics meta-analyzed across all biobanks, and summary statistics meta-analyzed across biobanks and ancestry groups. In two former settings, for biobank i and a given gene, we generate b TWAS;i , the TWAS effect size, and SE TWAS;i , the corresponding standard error. Given B different biobanks, the IVW TWAS Z-score, Z TWAS; IVW , is calculated as: With Z TWAS;i = b TWAS;i =SE TWAS;i and N i as the effective sample size of the i th biobank (or pooled effective sample size across all ancestry-specific biobank summary statistics), the SSW TWAS Z-score, Z TWAS;SSW , is calculated as: Here, we define the effective sample size as N i = 4= . For the ancestry-unaware TWAS, we use ancestry-unaware elastic net regression models and integrate with GWAS summary statistics meta-analyzed across all ancestry groups and biobanks.
GReX-level phenome-wide association studies (GReX-PheWAS) Transcriptome-wide significant genes are further prioritized by performing GReX-PheWAS to categorize associations across a broad spectrum of phenotypes. Using UKBB summary statistics from European ancestry patients, 78 we tested for GTAs for 731 phenotypes grouped into 9 categories: dermatologic, digestive, endocrine/metabolic, genitourinary, hematopoietic, musculoskeletal, neoplasms, neurological, and respiratory. Here, we illustrate GReX-PheWAS using three genes from the European-specific TWAS for asthma risk using lung tissue expression: TAF7 (MOSTWAS model), IL18RAP (JTI model), and TMEM258 (JTI model). A phenome-wide significant association was defined via Bonferroni correction (P < 0:05 33731 = 2:28 3 10 À 5 ).                  Table S1: Difference in adjusted R2 between models trained in aligned and misaligned ancestry samples as the ancestry of the imputation sample. Related to Figure 2.