The power of TOPMed imputation for the discovery of Latino-enriched rare variants associated with type 2 diabetes

Aims/hypothesis The Latino population has been systematically underrepresented in large-scale genetic analyses, and previous studies have relied on the imputation of ungenotyped variants based on the 1000 Genomes (1000G) imputation panel, which results in suboptimal capture of low-frequency or Latino-enriched variants. The National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) released the largest multi-ancestry genotype reference panel representing a unique opportunity to analyse rare genetic variations in the Latino population. We hypothesise that a more comprehensive analysis of low/rare variation using the TOPMed panel would improve our knowledge of the genetics of type 2 diabetes in the Latino population. Methods We evaluated the TOPMed imputation performance using genotyping array and whole-exome sequence data in six Latino cohorts. To evaluate the ability of TOPMed imputation to increase the number of identified loci, we performed a Latino type 2 diabetes genome-wide association study (GWAS) meta-analysis in 8150 individuals with type 2 diabetes and 10,735 control individuals and replicated the results in six additional cohorts including whole-genome sequence data from the All of Us cohort. Results Compared with imputation with 1000G, the TOPMed panel improved the identification of rare and low-frequency variants. We identified 26 genome-wide significant signals including a novel variant (minor allele frequency 1.7%; OR 1.37, p=3.4 × 10−9). A Latino-tailored polygenic score constructed from our data and GWAS data from East Asian and European populations improved the prediction accuracy in a Latino target dataset, explaining up to 7.6% of the type 2 diabetes risk variance. Conclusions/interpretation Our results demonstrate the utility of TOPMed imputation for identifying low-frequency variants in understudied populations, leading to the discovery of novel disease associations and the improvement of polygenic scores. Data availability Full summary statistics are available through the Common Metabolic Diseases Knowledge Portal (https://t2d.hugeamp.org/downloads.html) and through the GWAS catalog (https://www.ebi.ac.uk/gwas/, accession ID: GCST90255648). Polygenic score (PS) weights for each ancestry are available via the PGS catalog (https://www.pgscatalog.org, publication ID: PGP000445, scores IDs: PGS003443, PGS003444 and PGS003445). Graphical abstract Supplementary Information The online version of this article (10.1007/s00125-023-05912-9) contains peer-reviewed but unedited supplementary material.

For comparison purposes, we imputed the phased haplotypes using both 1000G Phase 3 version 5[10] and TOPMed reference panels freeze 8 [11] in each cohort, separately.
We used Michigan imputation and TOPMed imputation servers to impute genotypes. [12][13][14] For chromosome X, we imputed non-pseudoautosomal regions from females and males separately.

Imputation performance evaluation
We used the r 2 quality measure, as reported by the Michigan and TOPMed imputation servers. It calculates the ratio of the empirically observed variance of the allele dosage to the expected binomial variance. [15] We evaluated the performance of TOPMed and 1000G imputations by summarizing the chromosome-wise r 2 quality measure and the number of well-imputed (r 2 0.8) variants at different allele frequency thresholds. To further test if the quality of imputation was well-powered to detect low-frequency and rare variation without relying on the imputation server's quality measures, we leveraged available WES data from SIGMA3 cohort and estimated the proportion of the sequenced variants, for chromosome 22 only, that were well-imputed with TOPMed and 1000G panels at different WES allele frequency thresholds. Loss-of-Function (LoF) variants are of clinical interest as they disrupt protein-coding genes, potentially being disease-causal. Therefore, we used SnpEff [16] to annotate the WES variants for chromosome 22 only and estimated the percentage of well-imputed variants identified with TOPMed and 1000G imputations. We used CADD score [17] to predict the deleteriousness of variants.
We classified the variants using the score cut-offs of 10 and 20, below which the variants were considered as benign.
We also calculated the effective sample size (Neff) needed to reach 80% statistical power to detect genome-wide significant associated signals (=510 -8 ) at different effect sizes and allele frequencies covered by the imputations. Neff was calculated as derived by Grotzinger et al. [18]:

Type 2 diabetes association and meta-analysis
Type 2 diabetes association analyses were performed in each cohort with SNPTEST using the expectation maximization (em) method [19] for doing maximum likelihood estimation in a generalized linear model framework using genotype probabilities under the additive model. Models were adjusted for sex, age, body mass index (BMI) and 10 PCs to account for population structure. We ran additional models without adjusting for BMI. Only well-imputed variants (r 2 0.5) were meta-analyzed using the inverse of the corresponding squared standard errors in METAL. [20] The statistical significance threshold was set to P<510 -8 .
We performed LD-based clumping on the genome-wide significant variants to keep one representative variant per region of LD. We set an LD r 2 =0.5 and a distance between variants of 250 kb. If the variant was located within a previously reported type 2 diabetesrelated locus, we used a conditioning strategy to test for distinct signals. We conditioned on each of the previously reported SNPs within the locus. We performed conditional analyses in each cohort separately, and then meta-analyzed using the inverse of the variance of the effect estimates. If our lead variant showed evidence of residual type 2 diabetes association (p-value<510 -5 ) after conditioning on any previously reported variant within the locus, we considered our novel signal as distinct.
Variants with sub-genome-wide significance (p<110 -6 ) that were only imputed with the TOPMed reference panel, showed increased frequency in the Latino population and were > 250 kb from other reported genome-wide significant variants from large consortia analyzing either European or East Asian populations [21,22] were considered for further investigation.

Replication sample
Variants associated with type 2 diabetes risk at genome-wide and sub genome-wide significance were tested for replication in six independent cohorts described below (Table   S2).

Progress in Diabetes Genetics in Youth (PRODIGY)
PRODIGY is the largest cohort for youth-onset type 2 diabetes with available GWAS data, comprising the Treatment Options for Type 2 Diabetes in Adolescents and Youth (TODAY) [23] and the SEARCH for Diabetes in Youth studies [24]. TODAY includes participants diagnosed with type 2 diabetes prior to 18 years of age and have documented BMI85 th percentile at the time of diagnosis. Of note, the American Indian tribal nations that partnered on the TODAY Study elected not to participate in the genomics collection.
SEARCH is a population-based prospective registry study launched in 2000 that ascertained diabetes in youth diagnosed at <20 years of age in the U.S. Overall, PRODIGY has shown consistent direction and size of effects for most genetic variants associated with type 2 diabetes in adults. [25] The TODAY and SEARCH protocols were approved by the institutional review boards of each participating institution. Participants provided written informed parental consent and child assent, including consent and assent specifically for genetic testing.
We identified 1,198 youth type 2 diabetes cases of Latino descent. As the control group, we used 1,805 diabetes-free adult Latino samples from the T2D-GENES (Type 2 Diabetes Genetics Exploration by Next-generation sequencing in multi-Ethnic Samples) and from the METS (Mexican Metabolic Syndrome) Cohort. [26] Genotypes from both datasets were merged and the pre-imputation quality control additionally included the exclusion of variants with genotyping array missingness difference (p-value<0.00005).
Phasing was done as described above and genotypes were imputed to the TOPMed reference panel to ensure high-quality imputation. Type 2 diabetes association was tested under an additive model using SNPTEST and em method [19]. Models were adjusted for sex and 10 PCs to account for population structure. Given the case-control design of this replication cohort, age and BMI were not considered as covariates.

Cameron County Hispanic Cohort (CCHC)
CCHC is a population-based cohort of Mexican American individuals living in the U.S. [27] We selected 971 type 2 diabetes cases and 857 controls. We used high-quality TOPMed imputed genotypes (r 2 >0.9) and tested for type 2 diabetes association under additive models adjusted for sex, age, BMI and 10 PCs.

Urban American Indians and Arizona Pima Indians cohorts
We also used 851 type 2 diabetes cases and 2,191 controls from four groups of urbandwelling American Indians living in or near Phoenix, Arizona, as well as 2,571 type 2 diabetes cases and 5,088 controls from a community of Pima Indians in Arizona, who participated in a longitudinal study of type 2 diabetes. [28] For both cohorts, imputation was done using Pima whole-genome sequences from 266 individuals. Variant rs1016378028, was directly genotyped using Taqman probes. type 2 diabetes association was tested by fitting additive models adjusted for sex, age, BMI and 5 PCs. In the Pima cohort, we additionally adjusted for birth year since exams took place over many years.
In the Pima cohort, we used linear mixed models to account for estimated relatedness among individuals, whereas in the Urban American Indians cohort, we used a genomic control procedure. Human research was approved by the relevant Institutional Review Boards. All participants provided written informed consent.

Population Architecture using Genomics and Epidemiology (PAGE) study
The PAGE study aims to conduct genetic epidemiological research in ancestrally diverse populations within the U.S. [29] It includes four population-based cohorts with significant numbers of Latino participants: the Hispanic-Community Health Study/Study of Latinos Genotyped individuals self-identified as Hispanic/Latino, African American, Asian, Native Hawaiian, Native American or other. For this study, Latino samples were selected based on their genetically estimated ancestry following the same two-step filtering approach that we implemented with the discovery cohorts. We analyzed up to 6,761 type 2 diabetes sample size and statistical power, we leveraged 4 additional datasets of African ancestry for which he had available individual-level data. The 4 cohorts are: the UKBB (Neff=2,516), the MGB[5] (Neff=1,078), the GERA [6] (N=1,563) and the All of Us [31] (N=6,395). Except for the All of Us cohort that has whole exome sequencing data, we performed quality control and imputed each cohort to the TOPMed panel, as described for the Latino analyses. We used high quality imputed variants to perform a type 2 diabetes GWAS in each cohort, separately. Then, we meta-analyzed the results using the inverse of the corresponding squared standard errors with METAL. The Neff of the T2D GWAS African meta-analysis was 66,769.
To aggregate the rs2891691 allelic effects across Latino, African, and East Asian ancestries, we performed a fixed-effects meta-analysis. Additionally, to allow for heterogeneity in allelic effects correlated with ancestry, which is not accommodated with a fixed-effects meta-analysis, we used MR-MEGA software [33]. It implements a multiancestry meta-regression approach to model the allelic effects as a function of axes of the genetic variation, which are derived from a matrix of mean pairwise allele frequency differences between GWAS. Specifically for this analysis, we did not include the All of Us cohort, as we did not have available GWAS data (ESM Fig. 3).

Association with type 2 diabetes-related phenotypes
Given the lack of large-scale publicly available biobanks with Latino samples that may allow for better characterization of our novel signals, including those occurring at a low frequency, we assembled a collection of cohorts to perform QTL analyses focused on 46 glycemic, anthropometric and lipid traits. In addition to 5 of the Latino cohorts analyzed in the type 2 diabetes meta-analysis (i.e. SIGMA1, SIGMA2, SIGMA3, MXBB and MGB Biobank), we included three extra cohorts, which we also imputed to the TOPMed panel as QTL analyses were conducted using high-quality TOPMed imputed genotypes and cleaned phenotypes from non-pregnant Latino adults. We used inverse rank normal transformation when the normality assumption was not met. Association analyses were done with a maximum of 26,400 adult Latino individuals depending on the trait, of whom 19,459 were diabetes-free. We used SNPTEST and em method to run linear regressions assuming additive genetic models in each cohort, separately. Models were adjusted for sex, age, BMI, and 10 PCs. If the outcome was available in >1 cohort, we meta-analyzed the results using the fixed-effects inverse variance method.

Credible sets
For each novel variant, we identified the set of variants with 99% probability of containing the causal variant. We used a Bayesian method [35], considering variants in LD with the lead variant (r 2 >0.1). We calculated LD using genetic data from 1,996 Hispanic/Latino samples from TOPMed Freeze 5b. For each region, an approximate Bayes factor (ABF) was calculated for each variant as follows:

Genomic annotation of the 99% credible set variants
We used the 99% credible sets for each novel signal to annotate their genomic effect using the VEP [36] (GRCh38.p7) and SNPNEXUS [37] applications. This allowed us to gather information about gene and protein proximity, the effect of non-synonymous coding variants on protein function (SIFT, PolyPhen), non-coding variants scoring (CADD, FunSeq2), and the occurrence of regulatory elements.
We used GTEx V8 [38] to assess the influence of the variants in gene-level expression, as well as the TIGER Portal [39] for evaluating the gene-level expression in pancreatic islets and the Islet Gene View [40] for assessing the gene co-expression in human islets.
We also assessed individual variant associations with a variety of common phenotypes individuals with exome sequence data from the UKBB. [42] To obtain more evidence implicating the variants or their closest genes, with any disease or biological process, we used the Open Targets Platform, which aggregates a variety of resources and scores the collected information to contextualize and weigh the underlying evidence. [43]

Expression of genes near novel variants
We also assessed the expression levels of the genes  500 kb around the novel signals in human islets under different conditions pertaining to type 1 diabetes and type 2 diabetes. We examined the following datasets: Type 1 diabetes dataset (FACS-purified β cells obtained from donors with type 1 diabetes) downloaded from GSE121863 (n=4 type 1 diabetes versus 12 controls) [44]; type 2 diabetes samples gathered and integrated from three cohorts: one generated by T2DSystems (TIGER Portal), and the other two downloaded from the Gene Expression Omnibus (GEO) database under accession numbers GSE159984 and GSE50244 (a total of n= 47 type 2 diabetes cases versus 228 controls) [39,45,46]; brefeldin A-exposed human islets (0.1 μg/mL for 24h) downloaded from GEO under accession number GSE152615 (n=4) [47]; cytokine-exposed human islets (exposed to interferon-γ (1,000 U/ml) and interleukin-1β (50 U/ml) for 48 h or to interferon-α (2000 U/mL) for 2h, 8h, or 18h, n=5-6 per condition) from GSE108413 and GSE133221 [48,49]; human islets exposed to palmitate (0.5 mM), high glucose (22.2 mM) and palmitate plus high glucose for 48h (n=3-5 per condition) from GSE159984 [45]. Gene

Polygenic scores
Polygenic scoring using single ancestry summary statistics and LD reference panels was calculated via Bayesian Regression and Continuous Shrinkage priors as implemented in PRS-CS. [52] We used the UKBB LD reference panel and GWAS summary statistics from European [22], East Asian [21] and Latin American populations. GWAS Latino summary statistics were calculated using a meta-analysis with five of the discovery cohorts (i.e. SIGMA1, SIGMA2, SIGMA3, MGB, and GERA). Five phi shrinkage priors () were used (i.e. 1, 10 -2 , 10 -4 , 10 -6 and the one automatically estimated from the data). Then, we used the estimated posterior SNP effect sizes for each ancestry to calculate the PSs in a training cohort (i.e. MXBB).
To evaluate the performance of the PSs, we first fitted a simple model that included sex, age and 10 PCs to account for population stratification. We then fitted models that additionally included the PS standardized scores. We calculated the variance explained in type 2 diabetes status for each model using the Nagelkerke pseudo r 2 . We repeated the same strategy by comparing the individuals above different percentile cutoffs with those in the interquartile of the PS. The best shrinkage prior was selected based on the larger incremental r 2 of type 2 diabetes status in the MXBB training cohort. Then, the selected model was tested in a target cohort (i.e. the METS Cohort).
Given that the ancestry-specific PSs were not highly correlated (r 2 <0.3), we also used PRS-CSx [53], a novel method that improves cross-population polygenic prediction by  For low-frequency or rare variants, we observed deflated QQ plots, which is expected given the large sample sizes needed to reach statistical power. conditional on two nearby T2D-associated signals. Two type 2 diabetes-associated signals near the rs2891691 variant have been reported in Europeans. One is located within the upstream RELN gene (a) (rs39328, b38 chr7:103,804,531) [22], while the other is located within the downstream LHFPL3 gene (b) (rs73184014, b38 chr7:104,875,827) [32]. rs2891691 was not in LD with either of these variants (r 2 <0.0006), and neither of them was associated in our Latino meta-analysis (p=0.30 and p=0.07, respectively). After conditioning on each of them, rs2891691 remained significant (OR