Whole genome sequence analyses of eGFR in 23,732 people representing multiple ancestries in the NHLBI trans-omics for precision medicine (TOPMed) consortium

Background: Genetic factors that influence kidney traits have been understudied for low frequency and ancestry-specific variants.
Methods: We combined whole genome sequencing (WGS) data from 23,732 participants from 10 NHLBI Trans-Omics for Precision Medicine (TOPMed) Program multi-ethnic studies to identify novel loci for estimated glomerular filtration rate (eGFR). Participants included European, African, East Asian, and Hispanic ancestries. We applied linear mixed models using a genetic relationship matrix estimated from the WGS data and adjusted for age, sex, study, and ethnicity.
Findings: When testing single variants, we identified three novel loci driven by low frequency variants more commonly observed in non-European ancestry (PRKAA2, rs180996919, minor allele frequency [MAF] 0.04%, P = 6.1 × 10 ; METTL8, rs116951054, MAF 0.09%, P = 4.5 × 10 ; and MATK, rs539182790, MAF 0.05%, P = 3.4 × 10 ). We also replicated two known loci for common variants (rs2461702, MAF=0.49, P = 1.2 × 10 , nearest gene GATM, and rs71147340, MAF=0.34, P = 3.3 × 10 , CDK12). Testing aggregated variants within a gene identified the MAF gene. A statistical approach based on local ancestry helped to identify replication samples for ancestry-specific variants.
Interpretation: This study highlights challenges in studying variants influencing kidney traits that are low frequency in populations and more common in non-European ancestry.
© 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)


Introduction
Reduced kidney function, assessed with the estimated glomerular filtration rate (eGFR), defines chronic kidney disease (CKD). Low eGFR is associated with cardiovascular disease morbidity [1], mortality [2,3], poor quality of life and high health care costs for its treatment [4]. CKD has a high burden among non-European ethnic groups [5]. In the U.S., the burden of CKD in African Americans is attributed in part to the presence of ancestry-specific genetic variants, i.e., APOL1 high-risk genotypes [6]. Genetic factors and underlying pathways influencing eGFR in populations can provide insights into CKD occurrence, mechanistic pathways, and downstream complications.
Genetic studies that included ethnically diverse populations have yielded important gains in gene discovery and have advanced fine mapping by leveraging differences in allele frequencies and in coinheritance of genetic variants across ancestral groups [7,8]. In addition, substantial evidence indicates that ancestry-specific genetic variants contribute to CKD [6,9]. The current study expands on prior genetic studies of kidney loci through interrogation of rare and low frequency variants from whole genome sequencing (WGS) in the National Heart Lung and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. We aimed to understand the role of rare and low frequency variants that individually or in aggregate influence eGFR, and to identify ancestry-specific genomic regions associated with eGFR in African Americans and Hispanics/Latinos through admixture mapping. Using a newly described statistical approach based on local ancestry, we estimated ancestry-specific allele frequencies for rare sequencing variants and showed their utility for identifying ancestry-related replication samples.

Ethics statement
All human research was approved by the relevant institutional review boards and conducted according to the Declaration of Helsinki. All participants provided written informed consent.

Study design and participants
The study included 23,732 participants from ten studies with phenotype data and WGS from the TOPMed Freeze 5b, for five racial/ethnic groups: European Americans, African Americans, East Asians, Hispanic/Latinos, and Native Americans. We followed TOPMed guidelines when reporting race/ethnicity and ancestry (Web Resources). Admixture mapping analyses included a subset of 9,479 admixed African Americans and Hispanics/Latinos. The following studies contributed data: Old Order Amish, Atherosclerosis Risk in Communities (ARIC), Framingham Heart Study (FHS), Genetic Epidemiology Network of Arteriopathy (GENOA), Genetic Epidemiology Network of Salt Sensitivity (GenSalt), Genetic Study of Atherosclerosis Risk (GeneSTAR), Hypertension Genetic Epidemiology Network (HyperGEN), Jackson Heart Study (JHS), Multi-Ethnic Study of Atherosclerosis (MESA) and Women's Health Initiative (WHI). Demographic, clinical data and kidney phenotypes were obtained from study clinical visits.

Phenotyping procedures
We performed centralised harmonization of the phenotype, and eGFR was calculated using the serum creatinine-based Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation [10]. For studies with serum creatinine assayed before 2009 using a Jaffe assay, we multiplied the serum creatinine value by 0.95. The CKD-EPI eGFR estimation uses a race term for black race to account for biological variations in non-GFR determinants. This equation has been widely used in both research and clinical care, although some concerns have been raised related to the race component of the equation [11]. To account for differences in trait distribution by study and among ethnic groups, eGFR was inverse normalised within study and racial/ethnic groups and then rescaled to recover original trait variance [12]. Therefore, results are reported using units of ml/min/1.73 m².
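The phenotype steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study's pipeline: it assumes that reference [10] is the 2009 CKD-EPI creatinine equation (the version with a race coefficient), that creatinine is in mg/dl, and that the inverse normal transformation uses the rank offset (rank − 0.5)/n with ties ignored. The function names and the offset are hypothetical choices made for this example.

```python
from statistics import NormalDist, pstdev


def ckd_epi_2009(scr_mg_dl, age, female, black, jaffe_pre_2009=False):
    """eGFR in ml/min/1.73 m^2 from the 2009 CKD-EPI creatinine equation (assumed)."""
    if jaffe_pre_2009:
        scr_mg_dl *= 0.95  # calibration factor stated in the text
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


def inverse_normal_rescaled(values):
    """Rank-based inverse normal transform of one study/ethnic group,
    multiplied by the group's original standard deviation (ties ignored)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = NormalDist().inv_cdf((rank - 0.5) / n)  # assumed rank offset
    sd = pstdev(values)
    return [zi * sd for zi in z]
```

In this sketch the transform would be applied separately to each study by race/ethnicity stratum, so that the rescaled values stay on the ml/min/1.73 m² scale.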

Whole genome sequencing data generation and quality control
Contributing studies had WGS from TOPMed Freeze 5b. WGS was performed at an average depth of 38x using DNA from blood as previously reported 4 . Processing of whole genome sequences was harmonised across genomic centres using a standard pipeline (see URL in the Web Resources section). Briefly, participants were sequenced at the Broad Institute, the Northwest Genomic Center at the University of Washington, and the New York Genome Center. GeneSTAR samples were sequenced at Macrogen and Illumina. Central quality control and variant calling was performed jointly at the University of Michigan Informatics Resource Center. Further quality control that focused on sample identity was performed at the University of Washington Data Coordinating Center. All methods are described on the dbGaP website at: https://goo.gl/ntuJbR. After site level filtering, TOPMed freeze 5b consisted of ~438 million single nucleotide variants (SNVs) and ~33 million short insertion-deletion (indel) variants. Most indels were singleton or rare, with only 1–2% with allele frequency > 1%. Read mapping was done using the 1000 Genomes Project reference sequence versions for human genome build GRCh38. Functional annotations were performed using the WGSA Annotator [13], and WGSAParsr was used to generate simplified WGSA annotation files and variant grouping files for gene-based aggregate tests (see URL in the Web Resources section). Principal components (PC) of ancestry were estimated among all samples using PC-Relate and PC-AiR [14,15]. We used the Omics Analysis, Search and Information System (OASIS) based on TOPMed data for the included individuals to calculate linkage disequilibrium among SNVs and to plot genomic regions (see URL in the Web Resources section).

RESEARCH IN CONTEXT
Evidence before this study
Several loci have been identified for estimated glomerular filtration rate (eGFR) in genome-wide association studies. Genetic factors that influence kidney traits have been understudied for low frequency and ancestry-specific variants.

Added value of this study
The main findings of this study are the identification of ancestry-specific rare variants associated with eGFR either individually or in aggregate units within a gene. We also showed the utility of estimating ancestry-specific allele frequencies for rare sequencing variants using local ancestry to identify ancestry-related replication samples when using multi-ethnic studies. This study also highlights challenges for the study of rare/low frequency variants in multi-ethnic studies, including finding suitable samples for replication of ancestry-specific variants.

Implications of all the available evidence
Rare and low frequency variants are more likely to be population-specific and their genetic contribution to eGFR variation is mostly unknown. Our study provides important information for future WGS studies of rare SNVs for kidney traits, with implications for study design of variant discovery and replication, particularly when studying diverse ancestry populations.

Single variants association analysis and gene-based collapsing analysis
Association analyses were performed in a cloud computing environment under DNAnexus [16] (see URL in the Web Resources section).
Single variant association test. We fitted a linear mixed model using covariates of age, sex, and categories of study, race/ethnicity, and case-control status as needed. To account for genetic similarity among subjects, we used the genetic relationship matrix estimated from the WGS data from PC-Relate to specify the random effects covariance structure. We allowed for heterogeneous residual variance components, and grouped subjects by study, race/ethnicity, and case-control status. We used the Wald test for single variant association analyses of 43,622,178 autosomal variants filtered for a minor allele count > 10. The significance threshold was p < 5.0 × 10⁻⁹, which has been determined to be the appropriate genome-wide threshold for sequencing studies [17,18]. We estimated the phenotypic variance explained (PVE) by each variant and their joint PVE using methods described in Supplemental material. Although this study is focused on rare and low frequency variants, we also examined the association of previously reported common variants at eGFR loci (Genome Catalogue, see URL in the Web Resources section) and the presence of secondary associations at the loci that were genome-wide significant in our single variant analyses using conditional analyses. The conditional analyses used the most significant SNV in our data as a covariate and examined if there were additional SNVs with a p-value lower than the index SNV within a window of 1 Mbase of the index SNV.
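As an illustration of the decision rule described above, the following Python sketch applies the minor allele count filter and a two-sided Wald test to a single variant. It does not fit the linear mixed model: the effect estimate `beta` and its standard error `se` are assumed to come from an already fitted model, and the function names are hypothetical.

```python
import math

GENOME_WIDE_P = 5.0e-9  # threshold stated in the text
MIN_MAC = 10            # variants are kept when minor allele count > 10


def minor_allele_count(alt_allele_count, n_samples):
    """Minor allele count for a biallelic autosomal variant in diploid samples."""
    total_alleles = 2 * n_samples
    return min(alt_allele_count, total_alleles - alt_allele_count)


def wald_p_value(beta, se):
    """Two-sided p-value for z = beta / se under a standard normal null.
    erfc keeps precision for the very small p-values relevant here."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_genome_wide_significant(beta, se, alt_allele_count, n_samples):
    """Apply the count filter first, then compare the Wald p-value to the threshold."""
    if minor_allele_count(alt_allele_count, n_samples) <= MIN_MAC:
        return False
    return wald_p_value(beta, se) < GENOME_WIDE_P
```

Using `math.erfc` rather than `1 - cdf` avoids the loss of precision that occurs when the normal tail probability is far below machine epsilon.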
Gene-based association tests. For gene-based tests, variants were aggregated by GENCODE genes (v24). Variants within a gene were filtered to retain a set of rare variants (minor allele frequency [MAF] < 1%) that were predicted as loss-of-function variants (LoF), protein altering small deletions/insertions (indels) or synonymous SNVs which have a deleterious functional annotation (FATHMM-MKL [Functional Analysis Through Hidden Markov Models] score > 0.5 or MetaSVM score > 0 for missense SNVs). Variants in a 5 kb window promoter region (upstream of transcription start site [19] and in a FANTOM5 [Functional Annotation of the Mammalian Genome 5] peak) [18] and variants at the first intron of genes were also included. Genes with at least 10 individuals with at least one copy of any alternative allele were included. We performed both burden and SKAT tests and used a conservative significance threshold of p < 1.6 × 10⁻⁶ based on Bonferroni correction for two tests on each of 16,054 genes included in analyses. To identify the contribution of one or more variants within genes with a gene-based significant association, we tested the association of each single variant within the aggregate gene unit. We performed leave-one-variant-out analyses with variants aggregated within a gene for gene-based tests.
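The gene-based threshold can be checked with a short calculation. The sketch assumes a family-wise error rate of 0.05, which the text does not state explicitly; with two tests per gene and 16,054 genes it gives a value that rounds to the reported 1.6 × 10⁻⁶.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


n_genes = 16_054
tests_per_gene = 2  # burden and SKAT
alpha = 0.05        # assumed family-wise error rate

threshold = bonferroni_threshold(alpha, n_genes * tests_per_gene)
print(f"{threshold:.2e}")  # prints 1.56e-06
```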

Admixture mapping
These analyses included only self-identified African American, African Caribbean, or Hispanic/Latino TOPMed participants (n = 20,048) of which 9,479 had eGFR data. The reference panel for local ancestry inference included 37 African, 35 European, and 20 Native American individuals with phased sequence data available from the Simons Genome Diversity Project (SGDP) [20]. After removing very low frequency variants (minor allele count < 2 in SGDP or < 5 in TOPMed), 9,137,968 autosomal SNVs remained for analysis. We used the HapMap genetic map [21], lifted over to build 38, to estimate genetic positions for each variant, which was needed for inferring local ancestry and to estimate the significance threshold using Significance Threshold Estimation for Admixture Mapping (STEAM). The various maps are highly correlated at the scale that is relevant for admixture mapping (Mbase) [22] and our prior studies have shown no differences when comparing two different choices of genetic maps for inferring local ancestry [23]. We inferred the number of alleles inherited from African, European, and Native American ancestral populations for each admixed individual using RFMix (version 1.5.4) with a window size of 0.1 cM. Generations since admixture (6 for African American samples and 10 for Hispanic/Latino samples) were chosen to reflect estimates from previous studies [24,25]. To estimate admixture proportions for each individual, we calculated the genome-wide average local ancestry. We used an iterative procedure to estimate kinship coefficients adjusted for population structure and admixture, which were used in our linear mixed model to adjust for relatedness in admixture mapping. In the final step of this iterative procedure, we used our estimated admixture proportions in place of principal components.
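The admixture proportion step can be sketched as follows. The example assumes that local ancestry has already been inferred and is available as one pair of haplotype labels per locus; the labels, the data layout, and the equal weighting of loci are assumptions for illustration, not the output format of RFMix.

```python
from collections import Counter

ANCESTRIES = ("AFR", "EUR", "NAM")  # hypothetical labels for the three reference groups


def admixture_proportions(local_ancestry):
    """Genome-wide average local ancestry for one individual.

    local_ancestry: list with one (haplotype 1 label, haplotype 2 label)
    tuple per locus. Every locus is weighted equally in this sketch.
    """
    counts = Counter()
    for hap1, hap2 in local_ancestry:
        counts[hap1] += 1
        counts[hap2] += 1
    total = 2 * len(local_ancestry)
    return {ancestry: counts[ancestry] / total for ancestry in ANCESTRIES}


calls = [("AFR", "AFR"), ("AFR", "EUR"), ("EUR", "NAM"), ("AFR", "AFR")]
proportions = admixture_proportions(calls)
```

For the four toy loci above, five of the eight haplotype calls are African, two are European and one is Native American, so the proportions are 0.625, 0.25 and 0.125.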
We performed admixture mapping using a linear mixed model (GENESIS) on each ancestral group (African, European, and Native American) separately [26]. eGFR was the outcome variable. Models were adjusted for sex, age, study and race/ethnic group (African American or Hispanic/Latino) and admixture proportions as fixed effects, and ancestry-adjusted kinship as a random effect. We allowed for heterogeneous variance within groups defined by study and race. To account for multiple testing, we used the genome-wide threshold of p < 5.4 × 10⁻⁶, estimated using STEAM [23]. As secondary analyses, admixture mapping was conducted separately in the African American and Hispanic/Latino subjects; significance thresholds were p < 1.6 × 10⁻⁵ (testing just the African ancestral component) and p < 3.5 × 10⁻⁶ in the African American and Hispanic/Latino subsets, respectively.

Estimating ancestry-specific allele frequencies
We used our local ancestry calls to estimate ancestry-specific allele frequencies for loci of interest using the recently developed method Ancestry-Specific Allele Frequencies Estimation (ASAFE). RFMix only infers local ancestry at loci that are present in both the admixed sequence data and the reference panel, so inferred local ancestry will not be available at any loci that are not present in SGDP. However, because local ancestry segments extend over multiple loci, we can fill in the missing local ancestry calls at a locus of interest with reasonable confidence by looking at the inferred local ancestry at neighbouring loci. We can then use the local ancestry calls to estimate the ancestry-specific allele frequency for each ancestral population at a locus by calculating the frequency of the allele across haplotypes in our sample with local ancestry assigned to each ancestral population (African, European, and Native American). To account for uncertainty in the phase of genotypes relative to the local ancestry calls (particularly at loci where local ancestry was not inferred directly by RFMix), we used the EM algorithm approach implemented in the ASAFE program [27]. We ran ASAFE using local ancestry calls for the 9,479 subjects included in our admixture mapping analysis.
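The counting idea behind ancestry-specific allele frequencies can be illustrated with a simplified Python sketch. It is not ASAFE: it assumes the phase of each allele relative to its local ancestry call is known, whereas ASAFE uses an EM algorithm to handle unknown phase. The nearest-locus rule for filling a missing call is likewise an assumption made for illustration, and all function names are hypothetical.

```python
def fill_missing_ancestry(positions, calls, target_position):
    """Borrow the local ancestry call from the nearest locus that has one.

    positions: base-pair positions of neighbouring loci
    calls: ancestry label per locus, or None where no call was inferred
    """
    known = [(abs(pos - target_position), call)
             for pos, call in zip(positions, calls) if call is not None]
    if not known:
        return None
    return min(known, key=lambda item: item[0])[1]


def ancestry_specific_frequencies(haplotypes):
    """Allele frequency per ancestry, assuming known phase.

    haplotypes: iterable of (ancestry label, allele) pairs, where allele
    is 1 if the haplotype carries the allele of interest and 0 otherwise.
    """
    carriers = {}
    totals = {}
    for ancestry, allele in haplotypes:
        totals[ancestry] = totals.get(ancestry, 0) + 1
        carriers[ancestry] = carriers.get(ancestry, 0) + allele
    return {ancestry: carriers[ancestry] / totals[ancestry] for ancestry in totals}


toy = [("NAM", 1), ("NAM", 0), ("AFR", 0), ("EUR", 0), ("NAM", 0), ("NAM", 0)]
frequencies = ancestry_specific_frequencies(toy)
```

In the toy data the allele is seen on one of four Native American haplotypes and on no African or European haplotype, which is the pattern that would point replication efforts towards samples with Native American ancestry.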

Replication
Replication was performed using ancestry-specific allele frequency information. For East Asian SNVs, we used data from the Rare Variants for Hypertension in Taiwan Chinese (THRV) study, which was sequenced in the TOPMed freeze 8 using methods described above. Additional replication for East Asians was obtained from WGS of 1,524 participants (32% women, mean age 49.5 years, mean eGFR 102.5 ml/min/1.73 m²) from the BioBank Japan Project (BBJ). BBJ is a multi-institutional hospital-based study that collaboratively collected DNA and serum samples from the participants, mainly of Japanese ancestry, with a diagnosis of any of 47 diseases [28,29]. Participants on dialysis and those with a serum creatinine level more than three times the interquartile range beyond the upper or lower quartile were excluded. WGS had an average depth of 25.9x as described elsewhere [30]. Rank-based inverse transformation of the eGFR residuals after an adjustment for age, sex, and 47 disease affection status was used as phenotype in single variant and gene-based analyses with an adjustment for 20 top principal components as covariates. For the gene-based analysis for MAF gene in the BBJ, 34 variants comprising 4 nonsynonymous SNVs with FATHMM-MKL score > 0.5 or MetaSVM score > 0, and 30 variants within the first intron were tested. The burden test was conducted using rvtest software, and SKAT and SKAT-O using EPACTS software.
For replication of the Amerindian ancestry variant at chromosome 19, we used data from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), which was genotyped using a custom Illumina array and imputed to the TOPMed Freeze 5b multi-ethnic reference panel. HCHS/SOL analyses were performed among individuals with higher Amerindian ancestry proportion (Mainland sample, n = 6,767 individuals) [9]. In addition, we attempted to replicate our findings in two cohorts of southwest American Indians collected by the Intramural National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) program; namely, participants in a community-based longitudinal study who are predominately of Pima Indian heritage, and participants from the Family Investigation of Nephropathy and Diabetes (FIND), a study of diabetes and diabetic nephropathy in adults. The SNV rs539182790 was directly genotyped using Taqman on Demand in these southwestern American Indian studies and statistical analyses followed the same protocols as in TOPMed.

Bioinformatics
We performed a look-up of genomic coordinate overlap (hg38) in the Roadmap Epigenomics and the Encyclopedia of DNA Elements (ENCODE) consortium data [31,32] across different tissues and samples. All datasets included had been released and passed quality control by respective consortia.

Role of funders
The funders had no role in the study design, execution or interpretation of findings.

Results
Study design is shown in Fig. 1 and the characteristics of participants in Supp Table 1. The study comprised up to 23,732 TOPMed participants from 10 multi-ethnic studies and five racial/ethnic groups (36% African Americans, 50% European Americans, 9% East Asians, 5% Hispanics/Latinos and 0.2% American Indians).

Gene-based test results
We next tested variants aggregated within genes (n = 16,054 genes) using a burden test that combined all variants in a score test, and the SKAT test, which allows for differences in the direction of effects for rare variants. Q-Q plots are shown in Supp Fig 1b-c, respectively. Although the burden tests were not significant, the SKAT analyses identified a significant association for the MAF gene, reflecting differences in these approaches for testing rare variants (Supp Table 2). Sixty-one variants including 32 singletons contributed to the MAF gene-based association in SKAT analysis. To investigate the contribution of single variants to the association at this gene, we performed single variant association analyses for each variant in the gene and identified a missense variant (rs1230233783, p.His191Tyr, AA 191, MAF=0.008) contributing to most of the association with eGFR (P = 1.27 × 10⁻⁶). The SKAT analyses using a leave-one-variant-out strategy supported the strong contribution of rs1230233783 to the gene-based association (Figure 4, a-c). This variant also overlaps epigenomic annotations (H3K27ac ENCODE data). Additional genes identified in gene-based analyses are shown in Supp Table 2.
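The leave-one-variant-out strategy can be illustrated with a toy Python sketch. Note the substitution: the study applied it to SKAT, whereas this example uses a much simpler unadjusted burden-style statistic (sample size times the squared correlation between the per-person allele count and the phenotype) so that it stays self-contained. Covariates, relatedness and variant weights are ignored, and the function names and data are hypothetical.

```python
def burden_statistic(burden, phenotype):
    """n * r^2 between burden scores and phenotype (approximately chi-square, 1 df)."""
    n = len(phenotype)
    mean_b = sum(burden) / n
    mean_y = sum(phenotype) / n
    s_bb = sum((b - mean_b) ** 2 for b in burden)
    s_yy = sum((y - mean_y) ** 2 for y in phenotype)
    if s_bb == 0 or s_yy == 0:
        return 0.0  # no variation left, so no evidence of association
    s_by = sum((b - mean_b) * (y - mean_y) for b, y in zip(burden, phenotype))
    return n * s_by * s_by / (s_bb * s_yy)


def leave_one_variant_out(genotypes, phenotype):
    """Recompute the statistic with each variant removed in turn.

    genotypes: one row per person, one allele count per variant.
    Returns a dict keyed by the dropped variant index; key None holds
    the statistic for the full variant set.
    """
    n_variants = len(genotypes[0])
    results = {}
    for drop in [None] + list(range(n_variants)):
        burden = [sum(g for j, g in enumerate(row) if j != drop) for row in genotypes]
        results[drop] = burden_statistic(burden, phenotype)
    return results


# Toy data: carriers of variant 0 have a much higher phenotype value.
genotypes = [[1, 0], [1, 0], [0, 1], [0, 0], [0, 0], [0, 0]]
phenotype = [10.0, 11.0, 0.0, 1.0, 0.0, -1.0]
results = leave_one_variant_out(genotypes, phenotype)
```

In the toy data the statistic falls sharply when variant 0 is dropped and rises when variant 1 is dropped, which is the pattern expected when a single variant drives a gene-based signal.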

Admixture mapping and estimating ancestry-specific allele frequencies
Local ancestry determination and complete phenotype information were available for 9,479 admixed individuals (8,303 African Americans and 1,176 Hispanic/Latinos) of the 23,732 TOPMed participants with eGFR (Supp Table 1). The inferred global ancestry proportions for each African American and Hispanic/Latino individual and the averages (and ranges) of African, Native American and European ancestries are shown in Supp Fig 3. These results showed a large variation in ancestry proportions in Hispanics/Latinos in our data. There were no genome-wide significant associations identified using the overall sample (Supp Fig 4) or separately by African American (Supp Fig 5) or Hispanic/Latino ancestry (Supp Fig 6).
Local ancestry calls were used to estimate ancestry-specific allele frequencies; the rs539182790 allele at MATK was exclusively present in the Native American ancestral population (MAF = 0.02) and not in African (MAF = 3.1 × 10⁻¹⁰) or European (MAF = 7.9 × 10⁻⁴) ancestral populations (Table 2).

Replication
Chromosome 1 and 2 variants had higher allele frequencies in East Asians in the 1000 Genomes Project, so we attempted replication in two East Asian studies, the THRV study, a cohort of participants from Taiwan (n = 1,132) and the BBJ, a hospital-based study of 1,524 Japanese (Table 3). Although the p-values were not significant, there was consistent direction of effects between our data and the BBJ results for rs180996919 and rs116951054. We additionally attempted replication of the gene-based MAF findings in BBJ, which included 34 SNVs, but the most influential missense SNV rs1230233783 was not available. The gene-based associations were not significant for the burden test (P = 0.99) or SKAT test (P = 0.66).
We also attempted to replicate the Amerindian indel variant on chromosome 19 (rs539182790) in HCHS/SOL Hispanics/Latinos for individuals selected for high Amerindian ancestry (Mexican, Central American, and South American) (n = 6,578, MAF=0.01, P = 0.30), and among American Indians using several samples: participants of a community-based study, who are full heritage Pima Indians (n = 1,438, MAF=0.02, P = 0.86), non-full Pima Indians (n = 757, MAF=0.01, P = 0.30) and American Indian participants of the FIND study (about 1/3 Pima Indians and 2/3 other tribes, n = 836, MAF=0.03, P = 0.74). Although the p-values were not significant, the direction of effects was consistent with TOPMed for full heritage Pima Indians and HCHS/SOL Hispanic participants (Table 3).
At the UMOD locus, the most significant associated SNV was rs77924615, an intronic variant of PDILT that has been identified in prior trans-ethnic GWAS meta-analyses of eGFR [7] (Supp Fig 7a). Additional SNVs at the promoter of UMOD have been identified in GWAS meta-analyses of eGFR in European ancestry (e.g. rs12917707). To investigate why SNVs did not achieve genome-wide significance at this widely replicated locus, we compared association estimates and p-values among European and non-European ancestry samples, noting that rs12917707 was not in linkage disequilibrium with rs77924615 in our data (Supp Fig 7b). These SNVs showed larger variance in estimates in non-European compared to European ancestry samples and had lower MAF in non-European compared to European ancestry data (Supp Table 4).

Discussion
This is the largest genetic study addressing the role of low frequency and rare variants in eGFR. Our study used deep coverage WGS (~38x) from five ancestral groups for a comprehensive assessment of SNVs across diverse populations. It also employed approaches suitable for analyses of rare variants among populations with ancestral admixture. By combining multi-ethnic groups, we optimised the power to detect low frequency alleles shared among ethnic groups with admixture, who may carry ancestry-specific rare variants. We accounted for recent and ancestral relatedness in these analyses, and genetic effects that are heterogeneous across ancestral groups, which are usually not addressed in GWAS. Lastly, we allowed for heterogeneity in eGFR distribution observed among ethnic groups, as African Americans showed larger eGFR trait variance than other groups. Using this strategy, we identified ancestry-specific low frequency variants influencing eGFR. We also confirmed associations for common variants at known loci, although this was not a main goal of this study. Importantly, our study uncovered several challenges for the study of rare ancestry-specific variants including finding suitable replication samples for validation of associations.
The main findings are related to three rare variants identified in single variant analyses, which showed a large effect on decreasing eGFR (chromosome 1) or increasing eGFR (chromosomes 2 and 19) and estimates ranging from 10 to 14 ml/min/1.73 m². However, the PVE was small for each variant and for their joint effect. Two identified SNVs more commonly observed in East Asians were located at PRKAA2 (chromosome 1, rs180996919) and METTL8 (chromosome 2, rs116951054, intronic). The PRKAA2 SNV in intron 2 of the canonical mapped transcript (RefSeq NM_006252) is <500 bp 5′ to exon 3 of the gene. PRKAA2 codes for the alpha2 isoform of the AMP-activated protein kinase (AMPK) subunit and knockdown of AMPKα2 has been shown to enhance the epithelial-mesenchymal transition, secretion of inflammatory factors, and concomitant fibrosis in proximal tubule cells in a mouse unilateral ureteral obstruction model, through upregulation of beta-catenin and Smad3 [34]. Based on this empirical evidence, we hypothesize that the rare allele is leading to down-regulation of the total expression of the gene, or a differential regulation of a splice form involving the proximal exon. This SNV overlaps DHS sites in muscle, and it may affect creatinine production instead of kidney function. Little is known about METTL8 but very recent work suggests that this gene is involved in mRNA editing through m³C epitranscriptomic processes, a potentially new mechanism of renal gene regulation [35]. Indeed, our functional annotation of the SNV supported a regulatory function in kidney (H3K4me1 broadPeak, an enhancer-associated mark in the Roadmap Epigenomics Consortium data). We were unable to replicate these associations given the paucity of ancestry-specific WGS samples for East Asians and the low frequency of these variants, with the variants not present in publicly available GWAS studies.
The chromosome 19 indel rs539182790 is a rare intronic variant of the MATK gene. The MATK gene encodes the megakaryocyte-associated tyrosine kinase, which plays a role in the signal transduction of hematopoietic cells. The variant was identified as Amerindian based on ancestry-specific allele frequency analyses. It showed consistent direction of effect in replication samples of additional Hispanic individuals, and in one of two samples of southwest American Indians (full Pima Indians), although the replication p-values were not significant. Given the less well characterised genetic architecture of American Indians and little relevant reference data for this population, we cannot rule out that the lack of replication of the Amerindian variant is due to differences in ancestral backgrounds of Hispanic/Latino participants and the southwest American Indians who were used for replication.
Table footnote: SNV, single nucleotide variant. MMP20, SPATS1, GSP1 genes had suggestive association in gene-based SKAT analyses. rs1230233783 alternative allele is more common in East Asians (see text).
Our findings also underscore the challenges of studying low frequency variants in multi-ethnic studies and admixed populations when identified variants are specific to an ancestral group. Our gene-based analyses identified associations with eGFR at the MAF gene. MAF encodes a leucine zipper transcription factor, which has a role in embryological development of kidney cells [36]. The gene is highly expressed in adult kidney, with transcripts mapping to proximal tubule cells in healthy human kidney cells [37]. A common variant near MAF was associated with uric acid levels in East Asians [33] and it was associated with accelerated eGFR decline among East Asian diabetic subjects [38]. The most influential SNV at MAF gene-based analysis, rs1230233783, is a missense variant with predicted deleterious effects based on multiple annotation algorithms including a Combined Annotation Dependent Depletion (CADD) score of 13.4. This SNV has both regulatory and coding-effect annotations (H3K27ac peak in the ENCODE data), and further functional studies are needed to assess its relevance in the context of human disease genomics. However, additional studies are needed to confirm our findings as we were not able to identify a suitable large sample for replication of the SNV.
We also identified differences in allele frequency and a larger variance in effect estimates for two well-replicated UMOD common variants when comparing non-European to European ancestry samples, which may explain the lack of genome-wide association significance at this locus in our multi-ethnic sample. Allele heterogeneity across ancestries may also contribute to these findings.
To our knowledge, this is the first multi-ethnic WGS of eGFR. A prior study from Iceland performed WGS in 2,230 individuals and imputed sequenced variants into 81,656 chip genotyped individuals [19]. This study identified low frequency missense and LoF variants associated with serum creatinine at the SLC6A19, SLC25A45, SLC47A1, RNF186 and RNF128 genes. The study was restricted to individuals of European ancestry. A gene-based analysis of eGFR of variants in the exome array identified associations at the SOS2 gene in individuals mostly of European ancestry [39]. These genes were not significant in our gene-based analyses.
Rare and low frequency variants are more likely to be population-specific and their genetic contribution to eGFR variation is mostly unknown. Our study provides important information for future WGS studies of rare SNVs for kidney traits, with implications for study design of SNV discovery and replication, particularly when studying diverse populations. An important contribution of this study is the application of a recently developed method to identify suitable replication populations for ancestry-specific variants identified in WGS, for SNVs that may not be available in public repositories and/or have unknown frequency in populations. Using ASAFE, we used local ancestry to estimate the allele frequency of our significant variants across populations and determined that the chromosome 19 variant is Amerindian, while the most influential variant at the MAF gene is more common in East Asians. We expect that this approach will help to guide replication efforts for WGS studies of complex traits in multi-ethnic studies or admixed populations. Local ancestry in admixture mapping approaches could provide additional discovery when large WGS samples are available in multi-ethnic populations.
In summary, we performed a comprehensive genome-wide discovery study of eGFR in multi-ethnic studies using WGS of over 23,000 individuals that included association and admixture mapping approaches. We identified ancestry-specific low frequency variants associated with eGFR in both single variant test and gene-based analyses and used estimated local ancestry to guide replication of findings. Our study exemplifies the challenges of studying diverse populations including finding suitable replication samples for ancestry-specific low frequency variants identified in multi-ethnic studies and admixed populations. In addition, resources for functional characterization of these identified WGS rare variants are currently not available.

Funding sources
Additional funding sources are shown in Supplemental Data