The ethnogeographic variability of genetic factors underlying G6PD deficiency

Glucose-6-phosphate dehydrogenase (G6PD) deficiency caused by genetic variants in the G6PD gene, constitutes the most common enzymopathy worldwide, affecting approximately 5% of the global population. While carriers are mostly asymptomatic, they are at substantial risk of acute hemolytic anemia upon certain infections or exposure to various medications. As such, information about G6PD activity status in a given patient can constitute an important parameter to guide clinical decision-making. Here, we leveraged whole genome sequencing data from 142,069 unrelated individuals across seven human populations to provide a global comprehensive map of G6PD variability. By integrating established functional classifications with stringent computational predictions using 13 partly orthogonal algorithms for uncharacterized and novel variants, we reveal the large extent of ethnogeographic variability in G6PD deficiency and highlight its population-specific genetic composition. Overall, estimated disease prevalence in males ranged between 12.2% in Africans, 2.7–3.5% across Asia and 2.1% in Middle Easterners to < 0.3% in Europeans, Finnish and Amish. In Africans, the major deficient alleles were A-202A/376 G (minor allele frequency 11.6%) and A-968 C/376 G (0.5%). In contrast, G6PD deficiency in Middle Easterners was primarily due to the Mediterranean allele (1.3%) and the populationspecific Cairo variant (0.4%). In South Asia, the most prevalent deficient alleles were Mediterranean (1.7%), Kerala (1.1%), Gond (0.9%) and Orissa (0.2%), whereas in East Asian populations the Canton (1.1%), Kaiping (0.7%) and Viangchan (0.3%) variants were predominant. Combined, our analyses provide a large dataset of G6PD variability across major ethnogeographic groups and can instruct population-specific genotyping strategies to optimize genetically guided therapeutic interventions.


Introduction
The glucose-6-phosphate dehydrogenase (G6PD) gene encodes the central enzyme of the pentose phosphate pathway that provides NADPH and contributes to glutathione homeostasis and redox balance. G6PD deficiency is a hereditary monogenic defect affecting over 400 million people worldwide that caused by mutations or genetic variants in G6PD, leading to reduced enzyme activity and a variety of biochemical and clinical phenotypes [1,2]. In addition to an increased risk of neonatal hyperbilirubinemia, G6PD deficient individuals are at risk for developing severe acute haemolysis after the intake of certain drugs and 37 drugs contain warnings about their use in G6PD deficient individuals in their labels (https://www.fda.gov/drugs/science-and-research-drug s/table-pharmacogenomic-biomarkers-drug-labeling).
G6PD variants are categorized by the World Health Organization (WHO) into four deficiency classes based on their level of enzyme activity. Variants with < 10% activity of normal are classified as class I or II depending on whether they also cause severe enzyme deficiency with chronic non-spherocytic hemolytic anemia (class I) or not (class II). Variants resulting in 10-60% G6PD activity are classified as class III and variants that do not have major impacts on G6PD function are considered as class IV (60-200% of normal). Notably however, this classification included variants that were characterized using different methodologies and accumulating evidence suggests little functional differences that do not allow for a precise categorization, particularly between class II and class III variants [3]. As G6PD is localized on the X-chromosome, the pattern of inheritance differs between the sexes, mostly affecting hemizygous males and homozygous females. In contrast, only around 8-20% of heterozygous females exhibit drug-induced acute hemolysis [4][5][6][7].
To date, the vast majority of studies into the genetic basis of G6PD deficiency have utilized genotyping approaches in which individual candidate single nucleotide polymorphism (SNPs) were interrogated. In African populations, population-scale studies are generally restricted to the common A-alleles A-202A/376 G and A-968 C/376 G [8][9][10]. Similarly, studies across Asia focus only on the few variants that are predominant in the respective regions [11][12][13]. Such approaches however are selective and do not consider novel variants or known variants that are regarded as too rare in the given population. The increasing availability of Next-Generation Sequencing (NGS) data generated by large scale population sequencing projects enables, for the first time, the comprehensive and systematic analysis of G6PD variability and its ethnogeographic profiles beyond commonly interrogated SNPs.
Here, we analyzed whole exome and whole genome sequencing (WES and WGS, respectively) data from 142,069 unrelated individuals and provide consolidated population-specific overviews of clinically important G6PD allele frequencies on an unprecedented scale. Furthermore, by integrating variants with clinically established G6PD deficiency with stringent functionality predictions for variants with unknown pathogenicity, we derive comprehensive maps of G6PD genotypes and phenotypes across eight major human populations. We anticipate that the presented data will provide a powerful resource for clinical geneticists, drug developers and public health experts to inform population-specific genotyping strategies and facilitate the implementation of precision public health.

Data collection and definitions
Single-nucleotide variant (SNV) and indel frequency data across G6PD were collected from WES and WGS data from 141,456 unrelated individuals (including 12,487 Africans, 17,720 Latinos, 5185 Ashkenazi Jews, 9977 East Asians, 15,308 South Asians, 12,562 Finns, 64,603 non-Finnish Europeans, and 3614 with unclear or other ethnicity) from the Genome Aggregation Database (gnomAD) v2 [14]. These data were furthermore complemented with 455 Amish and 158 Middle Easterners from gnomAD v3, resulting in a total sample size of 142,069 individuals. An overview of the respective cohorts that are integrated in gnomAD is available on the project website (https://gnomad.broadinstitute. org/about) and further ethnogeographic annotations of the cohorts are provided in the Supplementary Methods. The Novel variants were defined as relative to dbSNP version 154. Genomic coordinates are indicated according to GRCh37. For each variant, the transcript that yielded the most severe consequence as defined by Sequence Ontology [15] was retained. PharmGKB and ClinVar were used to extract information about the pathogenicity of G6PD variants and haplotypes. Variants with minor allele frequencies (MAFs) above 1% were considered as common, while variants below 1% or below 0.1% were considered as rare and very rare, respectively.

Evaluation of variant pathogenicity
Variant functionality was assessed using a tiered model that considered different levels of evidence (Fig. 1). Variants were first matched against WHO classification data. Specifically, variants classified per WHO as class I, II or III were considered as deficient, while class IV variants were assigned normal activity. Only those variants for which no WHO classification was established, were in a second step matched against ClinVar annotations. Variants annotated as pathogenic or likely pathogenic by ClinVar were classified as G6PD deficient, whereas benign or likely benign variants were considered as neutral. Only for those variants for which no classification data was available, computational algorithms were used. To establish the optimal classification threshold, we evaluated the performance of different combinations of algorithms (Supplementary Figure 1). Based on these results, variants were classified as G6PD deficient if they were flagged as deleterious by at least 10 out of 13 algorithms (for missense variants), a threshold which achieved a sensitivity of 56% and a specificity of 100% on a set of 43 and 21 variants with known pathogenic and benign effects, respectively. Furthermore, variants were considered deficient if were annotated as loss-of-function by LOFTEE (for frameshift and stop-gain variants). Otherwise, variants were annotated as functionally neutral.
To estimate the distribution of G6PD activity across populations, we calculated the aggregated frequency of G6PD deficient alleles in a given population (q) based on the independence probability theory as previously reported [28] as (1 − f i )] 2 , with f i being the MAF of every pathogenic allele i = {1,…,n} in the given population. Males hemizygous for a reduced function variant (class I to III) were assumed to be G6PD deficient. Females were considered G6PD deficient if they were homozygous or compound heterozygous for a reduced function variant (class I to III). Furthermore, we considered in our calculations a fraction of heterozygous females as G6PD deficient based on previous genotype-phenotype association studies. As estimates differ, we provide ranges using the two scenarios of assuming 8% [5][6][7] and 20% [4] G6PD deficiency among heterozygous females.

The genetic landscape of the human G6PD locus
Across 142,069 individuals, we identified a total of 715 variants within the G6PD locus. A high percentage of those (337 variants; 47.1%) affected introns, or non-coding regions upstream or downstream of transcriptional start and stop sites, respectively ( Fig. 2A). While these can have potentially relevant regulatory functions, current algorithms do not allow to accurately predict their functional impacts and these non-coding variants were thus omitted from further analyses. Of the remaining 378 coding variants, the majority caused amino acid exchanges (205 variants), followed by synonymous variants (129 variants) and variations in the untranslated regions (UTRs) of the transcript (35 variants; Fig. 2A, inset). Furthermore, few variants were identified that introduced stop codons (3 variants), caused frameshifts (2 variants) or affected splice sites (1 variant).
Notably, only 8 variants were common with a MAF > 1%, whereas approximately 40% were singletons only identified in a single allele (Fig. 2B). Moreover, we observed a large number of population-specific G6PD variants. Specifically, 225 variants were exclusively found in Europeans, while Africans and Asian populations harbored 72 and 124 population-specific variants, respectively (Fig. 2C). Latinos harbored 56 population-specific variants, while only 5, 2 and 1 unique variants were identified in Ashkenazi Jews, Middle Easterners and Amish, respectively.

Global frequencies of functionally relevant G6PD polymorphisms
To understand the distribution of genetic risk factors for G6PD deficiency, we first analyzed the frequencies of G6PD alleles with established functional consequences (Table 1). Within the analyzed dataset, rs1050829 was the most common variant that, if occurring in isolation, encodes the "A" allele, which does not impact G6PD functionality (class IV). Frequency of "A" was highest in Africans (MAF=19.5%) and Latino populations (MAF=1%), whereas it was absent in individuals of European descent and East Asians. Among the reduced function variants, A-202A/376 G and A-968 C/376 G were most common, particularly in Africans (MAFs of 11.6% and 0.5%, respectively) and, to a lesser extent, in admixed Latinos (MAFs of 0.4% and 0.08%, respectively). In Europeans, the highest frequencies were observed for the Seattle variant (MAF=0.11%) and the Mediterranean allele (MAF=0.07%). As expected, in genetically more homogeneous populations, the pattern of variants that form the molecular basis for G6PD deficiency was less complex. In Finnish and Ashkenazim, the only alleles with known deficiency were Seattle (MAF=0.09%) and Mediterranean (MAF=0.6%), respectively, whereas no deficient alleles were identified in Amish.
In Asia, the molecular composition of G6PD deficiency is diverse and a multitude of alleles have been identified. Here, we found that the Canton (MAF=1.1%) and Kaiping (MAF=0.7%) variants were the most common causes of G6PD deficiency in East Asia, whereas the Mediterranean (MAF=1.74%), Kerala (MAF=1.14%) and Gond (MAF=0.87%) alleles were most frequent in South Asia. G6PD deficiency in the Middle Eastern population was attributed to both European (Mediterranean; MAF=1.26%) and African (A-202A/376 G ; MAF=0.42%) alleles, as well as unique variants exclusively found in this ethnogeographic group (Cairo; MAF=0.42%). Combined, these results consolidate information about the prevalence of well-described G6PD risk alleles and illustrate the remarkable diversity of G6PD molecular genetics.

Ethnogeographic distribution of genetically encoded G6PD deficiency
Next, we integrated information about the previously known deficient G6PD alleles described above, with frequency data of variants that were identified as putatively deficient using 13 partly orthogonal in silico tools. Notably, computational predictions identified 29 variants (23 missense, 3 stop-gain, 2 frameshift and 1 splice variant) with putative functional impacts that had not been characterized previously.
Aggregation of all deficient alleles and variants revealed that genetically encoded G6PD deficiency was highest among African males with a prevalence of 12.4%, followed by African females with 3.3-5.9% ( Fig. 3A; Table 2). Notably, the vast majority of G6PD deficiency in Africa (93.5%) was attributed to the A-202A/376 G allele (Fig. 3B). Besides African populations, G6PD deficiency was common across Asia with prevalence between 4.4% in South Asian males and 0.5-1.2% in East Asian females. The genetic composition differed drastically between East and South Asian populations and was considerably more complex than in Africa. In South Asia, the three most common G6PD deficient alleles (Mediterranean, Kerala and Gond) combined explained around 85% of G6PD deficiencies. Similarly, interrogation of five variant alleles (Canton, Kaiping, Viangchan, Gaohe and Chinese-5) was required in East Asia to capture 85% of G6PD deficiencies. Furthermore, there was only minimal overlap between the common G6PD deficient alleles in South and East Asia.
In the Middle East, overall G6PD deficiency pivoted between 2.1% in males and 0.4-0.9% in females and was allotted to the Mediterranean (60.1%), Cairo (20%) and A-202A/376 G (19.9%) alleles. While the prevalence of G6PD deficient alleles in Latinos was overall considerably lower than in African populations (0.9% in males and 0.2-0.4% in

Discussion
G6PD deficiency is considered the most common genetic disorder, affecting approximately 5% of the world's population [29] and its enzymatic basis and X-chromosomal mode of inheritance are known since almost 70 years [30]. Importantly, G6PD deficiency exhibits conspicuous variability across ethnic groups. Frequencies are highest in Sub-Saharan Africa and Southeast Asia with reported prevalence rates ranging from 1% to 30%, whereas frequencies in Europe and Japan are considerably lower (0-10%). Besides various histological and enzymatic assessments, genetic testing of risk variants constitutes a common strategy to predict G6PD deficiency. Over the last decades, a multitude of studies have evaluated the prevalence of G6PD variants across many different regions and Table 1 Variant frequencies of well characterized G6PD alleles across ethnogeographic groups. Functionally relevant alleles with frequencies above 0.1% in a given population are indicated in bold. Genomic coordinates provided according to GRCh38. N indicates the number of analyzed alleles for a given population. AMR = Latino; AFR = African; ASJ = Ashkenazi Jewish; EAS = East Asian; FIN = Finnish; NFE = Non-Finnish Europeans = ME = Middle Eastern; SAS = South Asian.  ethnogeographic groups, naturally with a focus on populations in which deficiency is most prevalent, such as Sub Saharan Africa [31,32], West Africa [33], Kenya [34], Sudan [35], Sierra Leone [36], Gabon [37], Mauritania [38], Botswana [39], Cameroon [40], Saudi Arabia [41], Mexico [42], Colombia [43], the Dominican Republic [44], Iraqi Kurds [45], Pashtun in Afghanistan [46], India [47,48], China [49][50][51][52], Laos [53] and Thailand [54]. Yet, the molecular genetics of G6PD is complex, and most studies evaluate only one or few candidate variants that are most common in the population under study. To consolidate the available data, various literature surveys and meta-analyses have been conducted [55][56][57]. However, these are limited to variants for which a sufficient number of analyses are available and study integration can be problematic due to differences in genotyping strategies and assay panels. Furthermore, current studies focused on specific populations and often analyzed cohorts with differing severity of malaria infections. Given that G6PD deficiency has been associated with protective effects against malaria [58,59], analyses of Plasmodium infected cohorts might not accurately represent the frequency of G6PD variability in the population at large.
The increased amount of available population-scale sequencing data allows for the first time to systematically derive frequency estimates of all G6PD variants. Using fully consistent and compatible data from 142,069 unrelated individuals across seven ethnogeographic groups, we map the prevalence of WHO-classified G6PD risk alleles and pinpoint variants for population-adjusted genotyping strategies. Furthermore, by using stringent computational assessments based on 13 partly orthogonal algorithms, we provide estimates for the relative contributions of previously uncharacterized or novel G6PD variants. The resulting dataset provides, to the best of our knowledge, the most comprehensive overview of G6PD variability presented to date with potentially important implications for the optimization of precision public health strategies. Of note, while very rare, class I deficiencies that are characterized by chronic transfusion-dependent anemia, are likely underrepresented in the analyzed data set, as patients with clinically manifest signs of genetic disease were excluded during recruitment.
To generate representative allelic frequency data for different ethnogeographic groups multiple strategies can be pursued. Firstly, the composition of the analyzed cohort can be balanced based on the  Fig. 3. The distribution and molecular genetics of G6PD deficiency differs drastically between populations. A, The overall prevalence of G6PD deficiency is shown for males and females for each population analyzed based on both well-characterized deficient variants and the variants predicted as reduced function using stringent computational assessments (see methods section for details). The confidence intervals for the respective estimates are provided in Table 2. Frequencies of G6PD deficiency range between 12.4% in African males to 0% in Amish. B, Pie charts showing the genetic composition of population-specific G6PD deficiency. Note the drastic differences in genetic complexity between populations. While two reduced function alleles explain >97% of G6PD deficiencies in Africa, disease risk is distributed across a multitude of different variants in Asian populations and Latinos. The total number of analyzed alleles (N) is shown for each population.
absolute numbers of individuals belonging to its subpopulations. This approach attempts to approximate the likelihood of identifying a carrier of a given variant if a random individual from the superpopulation is blindly sampled. However, the disadvantage of this approach is that very populous populations dominate the overall result whereas minority populations have little impact. Alternatively, data can be averaged across populations irrespective of the size of the respective group. This strategy is very sensitive to population definitions and requires a substantial number of groups to not skew frequencies in the direction of outlier populations. Furthermore, both strategies have problems in accounting for individuals with admixed ancestry, which account for a major fraction of many global populations [60]. Based on these limitations, we here followed a third pragmatic approach that utilizes all available data of sufficient quality. This strategy allows for the incorporation of admixed individuals and individuals for which the geographic sampling location has not been reported, thus making maximal use of the data; however, its downsides are that frequencies are sensitive to the underlying sampling strategies and that ethnogeographic resolution is mostly limited to aggregated superpopulations. Based on these considerations, we think that large population-scale data, as presented here, should complement rather than replace smaller studies into more specific populations to provide increasing information about variant frequencies in different populations across multiple levels of granularity. Genotype-to-phenotype concordance of G6PD deficiency is limited, particularly for heterozygous females. While G6PD activity shows a bimodal distribution in males that clearly separates hemizygously deficient and enzymatically normal individuals, these differences are blurred in females in which the activity pattern is continuous and normally distributed [61]. The cause of this phenomenon is random X-inactivation as the mechanism for dosage compensation in females. Consequently, heterozygous females do not have approximately 50% of G6PD activity in each cell but each erythrocyte is rather enzymatically normal or G6PD deficient, resulting in two distinct cell populations. This entails that upon exposure to a trigger of acute hemolytic anemia, only the deficient population will be broken down. The proportion of normal vs. deficient erythrocytes shows substantial inter-individual variability, giving rise to variable severity of hemolytic anemia in heterozygous females. Here, we accounted for these events by considering 8-20% of heterozygous women to be clinically G6PD deficient based on previous genotype-phenotype correlations [4][5][6][7].
The presented findings have important implications for pharmacogenomics and personalized medicine. Due to the geographic overlap between G6PD deficiency and malaria endemic regions, low G6PD activity is particularly relevant for the treatment with antimalarials of the 8-aminoquinolines class, such as primaquine, that can induce hemolytic anemia. In contrast, there is overwhelming evidence that 4-aminoquinolines, such as chloroquine and hydroxychloroquine, are safe in G6PD deficient individuals [62]. Examples of other drugs for which G6PD deficiency is clinically relevant include the rasburicase, used for the treatment of acute hyperuricemia, the nonsteroidal antiandrogen flutamide, as well as sulfonamide antibiotics. Current quantitative assessment methods for G6PD deficiency are inaccurate, particularly for heterozygous females with intermediate phenotypes, or are difficult to conduct in a low-resource setting [63]. Thus, further improvements in diagnostic tests are required to improve the predictive accuracy. From a public health perspective, it is furthermore important to establish whether newborn screening for G6PD deficiency constitutes a cost-effective allocation of health care resources and, if so, to reevaluate whether screening should be conducted irrespective of ethnicity and which genotyping strategy should be selected. Currently, the WHO recommends screening of all newborns in populations with a prevalence of more than 3% in males and the presented data might provide updated guidance for the selection of such ethnogeographic groups.
In conclusion, our analyses comprehensively mapped the genetic variability of the G6PD locus across major aggregated populations. These data provide a valuable resource for the ethnogeographic distribution of G6PD deficient alleles, promote the integration of comprehensive NGS-based genotyping and highlight key variants for population-adjusted G6PD genotyping strategies.

Conflict of interest
V.M.L is CEO and shareholder of HepaPredict AB, co-founders and shareholder of PersoMedix AB, and discloses consultancy work for Enginzyme AB. G.P.P is Full Member and National Representative at the European Medicines Agency, Committee for Human Medicinal Products (CHMP) -Pharmacogenomics Working Party, Amsterdam, the Netherlands. The other authors declare no conflict of interest.  Table 2 Estimated frequencies of population-specific G6PD deficiency. Two-sided 90% confidence intervals were calculated based on the binomial proportion. Asterisks indicate one-sided confidence intervals due to a measured frequency of zero. AMR = Latino; AFR = African; ASJ = Ashkenazi Jewish; EAS = East Asian; FIN = Finnish; NFE = Non-Finnish Europeans = ME = Middle Eastern; SAS = South Asian.