Large-scale exome array summary statistics resources for glycemic traits to aid effector gene prioritization

Background Genome-wide association studies for glycemic traits have identified hundreds of loci associated with these biomarkers of glucose homeostasis. Despite this success, the challenge remains to link variant associations to genes, and underlying biological pathways. Methods To identify coding variant associations which may pinpoint effector genes at both novel and previously established genome-wide association loci, we performed meta-analyses of exome-array studies for four glycemic traits: glycated hemoglobin (HbA1c, up to 144,060 participants), fasting glucose (FG, up to 129,665 participants), fasting insulin (FI, up to 104,140) and 2hr glucose post-oral glucose challenge (2hGlu, up to 57,878). In addition, we performed network and pathway analyses. Results Single-variant and gene-based association analyses identified coding variant associations at more than 60 genes, which when combined with other datasets may be useful to nominate effector genes. Network and pathway analyses identified pathways related to insulin secretion, zinc transport and fatty acid metabolism. HbA1c associations were strongly enriched in pathways related to blood cell biology. Conclusions Our results provided novel glycemic trait associations and highlighted pathways implicated in glycemic regulation. Exome-array summary statistic results are being made available to the scientific community to enable further discoveries.


Methods
To identify coding variant associations which may pinpoint effector genes at both novel and previously established genome-wide association loci, we performed meta-analyses of exome-array studies for four glycemic traits: glycated hemoglobin (HbA1c, up to 144,060 Eiji Kutoh, Gyoda General Hospital, Saitama, Japan Higashitotsuka Memorial Hospital, Yokohama, Japan 2.

Open Peer Review
Any reports and responses or comments on the article can be found at the end of the article.

Results
Single-variant and gene-based association analyses identified coding variant associations at more than 60 genes, which when combined with other datasets may be useful to nominate effector genes.Network and pathway analyses identified pathways related to insulin secretion, zinc transport and fatty acid metabolism.HbA1c associations were strongly enriched in pathways related to blood cell biology.

Conclusions
Our results provided novel glycemic trait associations and highlighted pathways implicated in glycemic regulation.Exome-array summary statistic results are being made available to the scientific community to enable further discoveries.

Introduction
Genome-wide association studies (GWAS) have identified hundreds of loci associated with glycemic traits and type 2 diabetes (T2D) risk [1][2][3] .Despite this tremendous success, the challenge remains to link the often lead non-coding variants with effector genes and mechanism of action.To complement these approaches, exome array studies 4,5 and more recently, whole-exome sequencing approaches have focused on coding variant associations [6][7][8][9] .These can be helpful to pinpoint potential effector genes for downstream functional studies.
Here, we provide exome-array GWAS meta-analysis results for glycated hemoglobin (HbA1c, up to 144,060 participants), fasting glucose (FG, up to 129,665 participants), fasting insulin (FI, up to 104,140) and 2hr glucose post-oral glucose challenge (2hGlu, up to 57,878).Most of the data are from self-reported and genetically clustered European ancestry individuals (85%), with the remaining participants being of African American (6%), South Asian (5%), East Asian (2%) and Hispanic ancestry (2%).We identify single coding variant and gene-based associations to prioritize likely effector genes, and additionally perform pathway analyses to highlight relevant gene sets regulating each glycemic trait.Summary statistics from these analyses are publicly available through our website (www.magicinvestigators.org), as well as through the GWAS catalog (https://www.ebi.ac.uk/gwas/summary-statistics, study accessions GCST90256400 -GCST90256420) 10 .

Methods
Study design, cohorts, phenotypes and genotypes MAGIC (Meta-Analysis of Glucose and Insulin-related traits Consortium) was established to focus on the genetic analysis of glycemic traits in individuals without diabetes.In this MAGIC effort, individuals without diabetes of self-reported and genetically clustered European (85%), African American (6%), South Asian (5%), East Asian (2%) and Hispanic (2%) ancestry from up to 64 cohorts participated.Sample sizes were up to 144,060 for HbA1c, 129,665 for FG, 104,140 for FI and 57,878 for 2hGlu.Participating cohorts and their characteristics are detailed in Supplementary Table S1 11 .Each cohort obtained ethical approval and written informed consent.

Genotyping and QC
The Illumina HumanExome BeadChip is a genotyping array containing variants that have been observed in sequencing data of ~12,000 individuals.Non-synonymous variants seen at least three times across at least two datasets were included on the exome chip.More lenient criteria were used for splice and nonsense variants.Besides the core content of proteinaltering variants, the exome chip contains additional variants including common variants identified in GWAS, ancestry informative markers, mitochondrial variants, randomly selected synonymous variants, HLA tag variants and Y chromosome variants.In this study we analyzed association with glycemic traits of 247,470 autosomal and X chromosome variants present on the exome chip.Genotype calling and quality control were performed following protocols developed by the UK Exome Chip or CHARGE consortium 12 .The exact genotyping array, calling algorithm and QC procedure used by each cohort are depicted in Supplementary Table S1 11 .

Annotation and functional prediction of variants
Annotation of the exome chip variants was performed using the Ensembl Variant Effect Predictor v78 with plugin dbNSFP v2.9 to add in silico functional prediction from Polyphen HumDiv, Polyphen HumVar, LRT, Mutation Taster and SIFT (ensembl66 version) 13,14 .

Statistical analyses
Single variant analyses.Individual cohorts ran linear mixed models using the raremetalworker (v 4.13.2) or rvtests (v20140723) software (Supplementary Table S1 11 ).For each glycemic outcome, analyses were performed using an additive model for the raw and the inverse normal transformed trait.
In the manuscript and in all tables and figures effect estimates and standard errors are for the raw trait, while the p-values are from the inverse normal transformed trait analyses.Analyses were adjusted for age, sex, BMI, study-specific number of PCs and other study-specific covariates (Supplementary Table S1 11 ).Raremetal (v4.13.7 or higher) was used to combine results within and across ancestries by fixed-effect meta-analyses.
Variants with P <10 -4 for deviation from Hardy-Weinberg equilibrium or with call rate <0.99 in individual cohorts were excluded from meta-analyses.In single variant analyses, the threshold for significance was P <2.2×10 -7 for coding variants (stop-gained, stop lost, frameshift, splice donor, splice acceptor, initiator codon, missense, in-frame indel and splice region variants).This P-value threshold was based on a Bonferroni correction weighted by the enrichment for complex trait associations among the functional annotation categories 15,16 .We performed so called distance-based clumping; significant association signals located more than 500 kb apart were considered to represent distinct loci.Significantly associated variants located more than 500 kb from any variant already found to be associated in published large-scale glycemic trait and T2D GWAS analyses 1,3,17,18 were considered novel glycemic trait associations.Gene-based and single-variant analyses results presented in the paper are for the meta-analyses of all ancestries combined, unless mentioned otherwise.
Gene-based analyses.Raremetal (v4.13.7 or higher) was used to perform gene-based burden and sequence kernel association (SKAT) tests.For both burden and SKAT tests, two in silico masks for inclusion of variants in the test were used: NSstrict and NSbroad.The NSstrict mask includes predicted protein truncating variants (PTVs, splice donor, splice acceptor, stop gained, frameshift, stop lost or initiator codon variant) OR variants that are missense and predicted to be damaging by five prediction algorithms (SIFT, Polyphen HumDiv, Polyphen HumVar, LRT, MutationTaster).The NSbroad mask additionally includes missense variants predicted to be damaging by at least one of the five prediction algorithms AND that have a MAF <1% in each ancestry group.These MAFs were derived from our single variant HbA1c meta-analyses results (N up to 144,060).Gene-based analyses were performed on genes containing at least two variants fulfilling the mask criteria.The P-value threshold for significance in gene-based analyses was 2.5 x 10 -6 (Bonferroni correction for 20,000 genes).

GeneMANIA network analysis
For network analyses, we used GeneMANIA (v3.5.1), a network approach that searches many large, publicly available biological datasets to find related genes.These include proteinprotein, protein-DNA and genetic interactions, pathways, reactions, gene and protein expression data, protein domains and phenotypic screening profiles.GeneMANIA uses a label propagation algorithm for predicting gene function given the composite functional association network (calculated from the databases selected).The weights needed for the label propagation method to work are selected at the beginning of the process.In our case, and according to the defaults, we weighted the network using linear regression, to make genes in the input list interact as much as possible with each other.We analyzed all loci that had at least one non-synonymous variant with P <1 x 10 -5 with any trait, and then mapped the most significant non-synonymous variant at each locus to the gene (input genes).We performed four network analyses: (1) HbA1c-associated variants only, (2) FI-associated variants only, (3) FG-associated variants only, and (4) 2hGlu-associated variants only (Figure 1, Supplementary Figure S1 11 ).We selected the 50 In (C), these are genes with variant associations of P <1 × 10 -5 for FG, FI, and/or 2hGlu, and in (D) these are genes with variant associations of P <1 × 10 -5 for FG.Rows in the heatmap represent significant meta-gene sets (FDR <0.05).The color of each square indicates DEPICT's z-score for membership of that gene in that gene set, where dark red means "very likely a member" and dark blue means "very unlikely a member."The gene set annotations indicate whether that meta-gene set was significant at FDR <0.05 or not significant (n.s.) for each of the other EC-DEPICT analyses.For heatmap intensity and EC-DEPICT P-values, the meta-gene set values are taken from the most significantly enriched member gene set.The gene variant annotations are as follows: (1) the European minor allele frequency (MAF) of the input variant, where rare is MAF <1%, low-frequency is MAF 1-5%, and common is MAF >5%, 2) whether the gene has an Online Mendelian Inheritance in Man (OMIM) annotation as causal for a diabetes/glycemic-relevant syndrome or blood disorder, 3) to 6) whether each variant was significant (P <2 × 10 -7 ), suggestively significant (P <1 × 10 -5 ), or not significant in Europeans for each of the four traits, and 7) whether each variant was included in the analysis or excluded by filters (see Methods).AWS: array-wide significant.
default databases to create the composite network, and we allowed the method to find at most 50 genes that are related to our query input list.The resultant networks were investigated to find enriched Gene Ontology (GO) terms and Reactome Pathways.Gene Set Enrichment (GSE) of networks and sub-networks were assessed with ClueGO 19 using GO terms and Reactome gene sets 20 .The enrichment results were grouped using a Cohen's Kappa score of 0.4, and terms were considered significant with a Bonferroni-adjusted p-value <0.05, provided that there was an overlap of at least three network genes in the relevant GO gene set when calculating GO enrichment.For the pathway selection (Reactome), we set a threshold that the network genes should represent at least 4% of the pathway.These values were applied given the recommended defaults when running ClueGO 19 .Cohen's Kappa statistic was used to measure the gene-set similarity of GO terms and Reactome pathways and allowed us to group enriched terms into functional groups to improve visualization of enriched pathways.We used all genes with GO annotations and at least one interaction in our network database as the background set.

Gene set enrichment analysis (GSEA)
An extension of the GWAS GSEA method DEPICT 21 , EC-DEPICT 22,23 , was used for GSEA.The key feature of EC-DEPICT is the use of "reconstituted" gene sets, which are gene sets collected from many different databases (e.g.canonical pathways, protein-protein interaction networks, and mouse phenotypes) that have been extended based on large-scale microarray co-expression data 21,24 .
Six groups of variants were analyzed: (1) HbA1c-associated variants only, (2) FI-associated variants only, (3) FG-associated variants only, (4) 2hGlu-associated variants only, (5) all trait-associated variants, and (6) all trait-associated variants except for HbA1c.For each trait, the associated variants based on the European summary statistics were identified and clumped using a +/-500 kb window.Then, the most significant nonsynonymous variant for each locus was included in the analysis, with a cut-off of P <10 -5 .Annotations from the CHARGE consortium were used to assign variants to genes (see URL).After GSEA, highly correlated gene sets were grouped by affinity propagation clustering of all 14,462 gene sets 25 into "meta-gene sets" using SciKitLearn.clustering.AffinityPropagation version 0.17 26 .For all visualizations, the gene set within a meta-gene set with the best enrichment P-value was used; heat maps were created with the ComplexHeatmap package in R 27 .

URL: CHARGE Consortium ExomeChip annotation file (v6).
Method and choice of data for permutations: We performed the EC-DEPICT analysis as described elsewhere 22,23 .All analyses are based on a group of 14,462 "reconstituted" gene sets, which contains a z-score for probability of gene set membership for each gene (for details, see 21,24 ).
The basic EC-DEPICT method is as follows.We first obtain a list of significant input variants (the most significant nonsynonymous variant per locus) and then map variants to genes based on annotations from the CHARGE consortium (see URL).
For each gene set, we obtain the gene set membership z-scores for all trait-associated input genes and sum them to generate a test statistic.We then take 2,000 permuted ExomeChip association studies (described in more detail below) and calculate the average permuted test statistic for that gene set, as well as the permuted standard deviation.For each permutation, the number of top genes we take as "input genes" is matched to the actual observed number of input genes.We then calculate (observed test statistic -average permuted test statistic)/ (permuted standard deviation) to generate a z-score, which is converted to a p-value via the normal distribution.False discovery rates were calculated by comparing the observed p-values to a permuted P-value distribution generated with an additional set of 50 permuted association studies.
The permuted ExomeChip association studies are conducted by (1) generating 2,200 sets of normally distributed phenotypes and (2) using these randomly generated phenotypes to conduct 2,200 association studies with real ExomeChip data.Using these permutations to adjust the observed test statistics corrects for any inherent structure in the data (e.g. that pathways made up of longer genes may be more likely to come up as significant by chance).
For these analyses, we first generated permutations based on ExomeChip data we had used previously for this purpose: 11,899 samples drawn from three cohorts (Malmö Diet and Cancer [MDC], All New Diabetics in Scania [ANDIS], and Scania Diabetes Registry [SDR]).For simplicity, we refer to these cohorts as the "Swedish permutations." As part of our GSEA pipeline, we remove input trait-associated variants that are not present in the permuted data to ensure that all variants are appropriately modeled.When using the Swedish permutations, this generally results in removing a substantial fraction of the variants, especially of the very rarest variants (due to the smaller sample size of the Swedish data relative to the data being analyzed).We have previously observed that this filtering can actually improve the GSEA signal, possibly due to more heterogeneous biology or a higher false-positive rate in these very rare variants 23 .However, in this case, we observed that in performing this filtering, we excluded variants in several known monogenic disease genes, such as HNF1A and SLC2A2.Therefore, we wished to repeat the analysis with a set of permutations which would allow us to retain these variants.We thus repeated the analysis with a second set of permutations consisting of 152,249 samples from the UK Biobank (referred to as the "UKBB permutations").
The larger sample size in the UKBB permutations means more variants are present and can therefore be included in the analysis.

Concordance of results from two different sets of permuted distributions across phenotypes:
For completeness, we report the results from the use of both sets of permutations.We note that the results are strongly concordant.The larger number of significant gene sets reported based on the UK Biobank permutations is generally a combination of 1) overall improved power (i.e. more variants are included) and 2) the inclusion of variants in key driver genes absent in the Swedish permutations, encompassing both the monogenic genes mentioned above (e.g.SLC2A2) and additional genes with clearly relevant biology (e.g.SLC30A8).The results from both sets of permutations are summarized below.For all analyses, "significance" refers to a false discovery rate of <0.05.
All-trait analysis: After filtering, 78 input genes were included for the analysis with the UKBB permutations and 60 for the analysis with the Swedish permutations.(Note that the difference in the number of input genes is due to the presence of a larger number of input variants in the UKBB permutations -see above).We found 234 significant gene sets in 86 meta-gene sets based on the UKBB permutations (Supplementary Figure S2 11 ) and 133 gene sets in 51 meta-gene sets based on the Swedish permutations (Supplementary Figure S3 11 ).
The correlation between the UKBB and Swedish analyses was r = 0.902, P <10 -300 .
All-traits-except-HbA1c analysis: After filtering, 45 input genes were included for the analysis with the UKBB permutations and 33 for the analysis with the Swedish permutations.We found 128 significant gene sets in 53 meta-gene sets based on the UKBB permutations (Supplementary Figure S2 11 ) and 45 significant gene sets in 18 meta-gene sets based on the Swedish permutations (Supplementary Figure S3 11 ).The correlation between the UKBB and Swedish analyses was r = 0.882, P <10 -300 .
HbA1c-only analysis: After filtering, 41 input genes were included for the analysis with the UKBB permutations and 33 for the analysis with the Swedish permutations.We found 191 significant gene sets in 73 meta-gene sets based on the UKBB permutations (Supplementary Figure S2 11 ) and 120 gene sets in 41 meta-gene sets based on the Swedish permutations.
FG-only analysis: After filtering, 26 input genes were included for the analysis with the UKBB permutations and 22 for the analysis with the Swedish permutations.We found 106 significant gene sets in 39 meta-gene sets based on the UKBB permutations (Supplementary Figure S2 11 ) and 48 significant gene sets in 15 meta-gene sets based on the Swedish permutations (Supplementary Figure S3 11 ).The correlation between the UKBB and Swedish analyses was r = 0.939, P <10 -300 .
2hGlu-only analysis: After filtering, 12 input genes were included for the analysis with the UKBB permutations and seven for the analysis based on the Swedish permutations.We found 56 significant gene sets in 17 meta-gene sets based on the UKBB permutations (Supplementary Figure S2 11 ), with no significant gene sets based on the Swedish permutations.The correlation between the UKBB and Swedish analyses was r = 0.787, P <10 -300 .
FI-only analysis: After filtering, 11 input genes were included for the analysis with the UKBB permutations and eight for the analysis with the Swedish permutations.There were no significant gene sets from either analysis.The correlation between the UKBB and Swedish analyses was r = 0.860, P <10 -300 .
Visualization: As in previous work 22,23 , we have included all trait-associated variants in the heat maps, even if they were excluded from the analysis (e.g. because they were absent in the permutations or did not have a nonsynonymous annotation in the CHARGE annotation file).This is because we assume that if the genes harboring those variants have strong predicted membership in significantly trait-associated gene sets, they are still good candidates for prioritization.In fact, this may be even stronger evidence in favor of these genes because they did not contribute to the enrichment analysis and therefore their prioritization is independently derived (and provides even more support to the implicated biology).

Study design overview
We performed single-variant and gene-based association analyses with FG, FI, HbA1c, and 2hGlu levels on exome-array coding variants in up to 144,060 individuals without diabetes (to exclude any consequence of diabetes treatments or related interventions on these quantitative traits) of European (85%), African-American (6%), South Asian (5%), East Asian (2%), and Hispanic (2%) ancestry from up to 64 cohorts (Supplementary Table S1 11 , Methods).We used a linear mixed model to test single-variant associations in each individual cohort and combined results by fixed-effect meta-analyses within and across ancestries.As body mass index (BMI) is a major risk factor for T2D and is correlated with glycemic traits, all analyses were adjusted for BMI to identify loci influencing glycemia independently from their effects on overall adiposity.We have previously demonstrated that collider bias did not significantly affect results with BMI adjustment 1 .We used distance-based clumping to define distinct loci and considered signals to be novel if they were located more than 500 kb from a variant with an established association with any of the glycemic traits or T2D in large published GWAS (Methods).We considered a coding variant to meet exomewide significance for association if P <2.2 × 10 -715,16 (Table 1, Methods).To increase power to detect rare variant associations, we additionally performed gene-burden and sequence kernel association (SKAT) tests for gene-level analyses to identify genes with significant evidence of association (P <2.5 × 10 -6 ) (Table 2, Methods).Finally, to identify relevant biological pathways enriched in associations with glycemic traits we conducted pathway and network analyses.

Identification of single-variant associations
Our single variant analyses identified 62 distinct coding variant associations at 58 genes associated with at least one of the glycemic traits at exome-wide significance (P <2.2 × 10 -7 ) (Table 1).Of these, four variants at three genes represented novel associations.These included a missense (rs1983210,  p.E1365D) and a splice region variant (rs3183099) in OBSL1 associated with FI, another missense variant (rs1886686, p.G12A) in WDR78 associated with FG, and a missense variant (rs31244, p.D543N) in SV2C associated with HbA1c (Table 1).In addition, the missense variant (rs146886108, p.R187Q) in ANKH which was previously associated with T2D was associated for the first time with FG.

Identification of gene-based associations
Our gene-based analyses identified six genes associated with glycemic traits, including G6PC and TF that had not been associated with glycemic traits before (Table 2 and Supplementary Table S2 11 ).These findings provide new hypotheses for downstream follow-up studies in the context of glycemic trait biology.G6PC, encoding glucose-6-phosphatase, is associated with FG and FI and is a homolog of G6PC2.G6PC2 is an established effector gene at a GWAS locus which contains multiple coding variants known to influence FG and HbA1c but not FI levels 4,5,[28][29][30] .Loss-of-function variants at SLC30A8 have been previously associated with reduced risk of T2D [31][32][33] , while VPS13C maps to the VPS13C/C2CD4A/C2CD4B T2D risk locus.Follow-up studies at this locus have with varying levels of evidence suggested C2CD4A, encoding a calcium-dependent nuclear protein, as the causal gene for T2D through its potential role in the pancreatic islets [34][35][36][37] .We found evidence of association at MAP3K15 with reduced levels of FG and HbA1c (Table 2 and Supplementary Table S2 11 ), which is consistent with recent reports of the gene's association with reduced levels of HbA1c and glucose, and reduced T2D risk 6,38 .Our analyses also detected TF (encoding transferrin) as a novel gene-based association signal associated with HbA1c but not any of the other glycemic traits, consistent with the role of the protein as the main iron carrier in the blood (Table 2 and Supplementary Table S2 11 ).
Pathway analyses identify relevant gene sets regulating glycemia Next, we used our coding variant association results to identify pathways enriched for glycemic trait associations, and to subsequently determine the extent to which different associations within the same trait implicate the same or similar pathways (as indicated by the functional connectivity of the network).To do this we used GeneMANIA network analysis 39 , which takes a query list of genes and finds functionally similar genes based on large, publicly available biological datasets, that include protein-protein, protein-DNA and genetic interactions, pathways, protein domains, protein and gene expression data.GeneMANIA taps on updated versions of these databases for its core and network analyses, to identify related genes of known functions based on our input list of genes.To increase power to connect genes in a network, we considered all genes harboring non-synonymous variants that reached P <1 × 10 -5 (Supplementary Table S3 11 ) for any of the four glycemic traits in our study and mapped the most significant nonsynonymous variant at each locus to the respective gene (totaling 121 associations across all traits) (Methods).A high degree of connectivity was observed within the HbA1c network, with enrichment of processes related to blood cell biology such as porphyrin metabolism, erythrocyte homeostasis and iron transport (Figure 1A and Supplementary Table S4 11 ).In comparison, the network generated from FG-associated genes captured several processes known to contribute to glucose regulation and islet function, including insulin secretion, zinc transport and fatty acid metabolism (Figure 1B and Supplementary Table S4 11 ).Given that there were fewer genes associated with FI and 2hGlu, we were less powered to draw meaningful insights from the enriched pathways in those traits (Supplementary Figure S1 and Supplementary Table S4 11 ).
We also performed gene set enrichment analysis (GSEA) using EC-DEPICT 22,23 (Methods).The primary innovation of EC-DEPICT is the use of 14,462 gene sets extended based on largescale co-expression data 21,24 .These gene sets take the form of z-scores, where higher z-scores indicate a stronger prediction that a given gene is a member of a gene set.To reduce some of the redundancy in the gene sets (many of which are strongly correlated with one another), we clustered them into 1,396 "meta-gene sets" using affinity propagation clustering 25 .These meta-gene sets are used to simplify visualizations and aid interpretation of results.As before, we considered all loci with variants that reached P <1 × 10 -5 (Supplementary Table S3 11 ) for any of the four glycemic traits for defining input genes (Methods).When looking across all traits combined, we found 234 significant gene sets in 86 meta-gene sets with false discovery rate (FDR) of <0.05 (Supplementary Table S5A, Supplementary Figure S2A 11 ).As expected, we observed a strong enrichment of insulin-and glucose-related gene sets, as well as hormone secretion and cytoplasmic vesicle gene sets (in keeping with pancreatic beta cell insulin vesicle release).In agreement with the GeneMANIA network analyses, we also noted a particularly strong enrichment for blood-related pathways represented by gene sets such as erythrocyte differentiation and heme metabolic process, which was primarily driven by HbA1c-associated variants.This was likely because HbA1c levels are influenced not only by glycation but also by blood cell turnover rate 1,40,41 .To disentangle blood cell turnover from effects due to glycation, we repeated the analysis excluding variants that were significantly associated with HbA1c only and found 128 significant gene sets in 53 meta-gene sets (FDR <0.05) (Figure 1C, Supplementary Table S5B, Supplementary Figure S2B 11 ).Indeed, we noted that majority of the gene sets now implicated pathways relevant to the pancreatic islets and metabolic tissues, such as "abnormal glucose homeostasis", "peptide hormone secretion", "Maturity Onset Diabetes of the Young", and multiple pathways involved in the regulation of glycogen, incretins, and carbohydrate metabolism, that were also seen in the FG only analysis (Figure 1D, Supplementary Table S5D, Supplementary Figure S2D 11 ).
We also analyzed each of the four traits separately, to reveal trait-specific enriched gene sets (Supplementary Table S5, Supplementary Figure S2C-E, Supplementary Figure S3C-D 11 , Methods).Overall, our network and pathway enrichment analyses provide insight into the biology underlying each glycemic trait and may facilitate the prioritization of specific genes or pathways across multiple different phenotypes.

Discussion
Here we have described large scale meta-analyses results for coding variant and gene-based associations for four glycemic traits, FG, FI, HbA1c and 2Glu, and the downstream pathways and networks that are regulated by the associated genes.
Our results identified three genes with novel single-variant associations with glycemic traits OBSL1 (FI), WDR78 (FG) and SVC2 (HbA1c).OBSL1 encodes a cytoskeletal protein related to obscurin, mutations in which have been shown to lead to an autosomal recessive primordial growth disorder (OMIM: 612921).Loss of OBSL1 leads to downregulation of CUL7, a protein known to interact with IRS-1, downstream of the insulin receptor signaling pathway 42 .WDR78 encodes a WD repeat-containing protein 78, the same variant rs1886686-C has been previously associated with a decrease in systolic blood pressure 43 .However, none of the OBSL1 (rs1983210, b = -0.018,p = 1.20 x 10 -4 , N = 144,114; rs3183099, b = -0.019,p = 1.36 x 10 -4 , N = 125,397) or WDR78 (rs1886686, b = -0.017,p = 3.83 x 10 -5 , N = 164,878) variants we detected here reached exome-wide significance in our recent large multi-ancestry study 1 .This, despite larger sample sizes and good genotype quality (info >0.8 for each of the variants for the majority of cohorts), suggesting caution in the interpretation of these findings, and the need for additional datasets testing these associations.The final variant, p.D543N in SV2C, was associated with HbA1c with p = 5.5 x 10 -5 in the European meta-analysis 1 , and with p = 1.37 x 10 -12 in UK biobank 44 .A second missense variant at this gene, p.T482S, is also strongly associated with HbA1c (p = 1.9 x 10 -16 ) and with red blood cell distribution width in UK biobank (p = 3.3 x 10 -11 ) 44 , and with mean corpuscular volume (p = 3 x 10 -11 ) 45 .Given that variation in red blood cell traits can influence HbA1c levels 1,41 , associations between these missense variants suggest SV2C as the likely effector gene at this locus.Also, the absence of evidence for association between this gene and other glycemic traits suggests its effect on HbA1c is independent of glycemia.
The novel gene-based association of G6PC with FG and FI was notable.Homozygous inactivating alleles in G6PC, including both p.R83C and p.Q347X which are contained in our gene-based association (Table S2), are known to give rise to glycogen storage disease type 1a (GSD1a).GSD1a is a rare autosomal recessive metabolic disorder 46,47 , but this is the first time that rare coding variants in G6PC have been shown to influence FG and FI levels in normoglycemic individuals.The other novel gene-based association was between TF and HbA1c.TF encodes transferrin, an iron-binding transport protein that circulates at high levels in blood plasma as an important biological carrier of iron.Dysregulation of iron concentrations due to reduced transferrin levels or function could affect the measurement of HbA1c independently of glycemia 48 .The presence of multiple coding variants within TF associated with red blood cell traits in UK biobank 44 lends additional support to this hypothesis.
Overall, our network and pathway analyses were highly concordant with each other and with other published data identifying processes related to glucose regulation and islet function, including insulin secretion and zinc transport associated with FG loci, and red blood cell biology processes amongst HbA1c associated loci 1 .The FG network revealed linking nodes (that are not among the association signals) with known links to glucose homeostasis and diabetes, such as GCK (encoding the beta cell glucose sensor glucokinase), GCG (encoding the peptide hormone glucagon secreted by the alpha cells of the pancreas) and GIP (encoding the incretin hormone gastric inhibitory polypeptide).Notably, lipid related pathways associated with fasting glucose.One gene within the FG cluster for lipid-related pathways is CERS2, which encodes ceramide synthase 2, an enzyme known to be associated with the sphingolipid biosynthetic process (Figure 1B, Supplementary Table S3 11 ).Although CERS2 is only nominally associated with FG and is significantly associated with HbA1c (rs267738: P FG = 3.54 × 10 -7 ; P HbA1c = 6.96 × 10 -10 ), it does not cluster together with any HbA1c-enriched pathway, suggesting that CERS2 is regulating FG and HbA1c indirectly through its role in lipid metabolism.

Conclusions
In conclusion, our results provided novel glycemic trait associations and highlighted pathways implicated in glycemic regulation.The summary statistics results are being made publicly available through various platforms so they can be harnessed with other data to aid effector gene identification.
• Table S2: Supplementary Table S2 -Full gene-based results including all variants included in the masks, for both novel and previously-established genes • Table S3: Supplementary Table S3 -All variants associated with FG, FI, HbA1c and/or 2hGlu in our analyses with P<10-5 • Table S4: Supplementary Table S4

COPSAC2000
The funding agencies did not have any influence on study design, data collection and analysis, decision to publish or preparation of the manuscript.No pharmaceutical company was involved in the study.We gratefully express our gratitude to the participants of the COPSAC2000 cohort study for all their support and commitment.We also acknowledge and appreciate the unique efforts of the COPSAC research team.

CROATIA_Korcula
We would like to acknowledge the contributions of the recruitment team in Korcula, the administrative teams in Croatia and Edinburgh and the people of Korcula.Exome array genotyping was performed at the Clinical Research Facility University of Edinburgh, Edinburgh, UK

DIABNORD
We are grateful to the study participants who dedicated their time and samples to these studies.We also thank the VHS, the Swedish Diabetes Registry and Umeå Medical Biobank staff for biomedical data and DNA extraction.We also thank M Sterner, G Gremsperger and P Storm for their expert technical assistance with genotyping and genotype data preparation.

EFSOCH
The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

EPIC-Norfolk
We are grateful to all the participants who have been part of the project and to the many members of the study teams at the University of Cambridge who have enabled this research EPIC-Potsdam Exome chip genotyping of EPIC-Potsdam samples was carried out under supervision of Per Hoffmann and Stefan Herms at Life & Brain GmbH, Bonn.We are grateful to the Human Study Centre (HSC) of the German Institute of Human Nutrition Potsdam-Rehbrücke, namely the trustee and the data hub for the processing, and the participants for the provision of the data, the biobank for the processing of the biological samples and the head of the HSC, Manuela Bergmann, for the contribution to the study design and leading the underlying processes of data generation.

EpiHealth
Genotyping was performed by the SNP&SEQ Technology Platform in Uppsala.We thank the EpiHealth participants for their dedication and commitment.
ERF STUDY ERF study is grateful to all study participants and their relatives, general practitioners and neurologists for their contributions and to P. Veraart for her help in genealogy, J. Vergeer for the supervision of the laboratory work and P. Snijders for his help in data collection.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscripts.

Fenland
We are grateful to all the volunteers for their time and help, and to the General Practitioners and practice staff for assistance with recruitment.We thank the Fenland Study Investigators, Fenland Study Coordination team and the Epidemiology Field, Data and Laboratory teams.

Open Peer Review
Current Peer Review Status: -Overall, the paper is well-written and provides a clear presentation of the background, methods, results, and conclusions.However, there are several points that I would like to address: -The main conclusions are clear, but their reproducibility might be questionable.The authors could strengthen their findings by validating the results using alternative approaches, preferably not relying solely on in silico methods but incorporating real experiments, such as molecular and cellular biology techniques.
-There are too many authors listed, and it is unclear "who did what."Authorship could be perceived as casual, with individuals included without clear contributions.However, I do not intend to intervene on this matter.
-Are there any potential biases or confounding factors that could have influenced the results?The inclusion and exclusion criteria might introduce selection bias, especially given the narrowly defined population.
-While the study includes participants from various ancestral groups, it is unclear how these individuals were selected or why the study predominantly focuses on certain ethnic groups (85% European).This could introduce bias and raises questions about the robustness of the findings in non-European populations.Will certain associations be more detectable in one group over another?What challenges arise from the smaller representation of other ethnicities?The generalizability of the findings is thus questionable, especially regarding whether ancestry might affect the genetic associations.
-The justification for using exome arrays is unclear.Are there specific advantages of exome arrays compared to whole-exome sequencing?This method is not considered cutting-edge technology.
-The descriptions of phenotypes are somewhat vague.While the metrics for FG, FI, 2hGlu, and HbA1c are provided, there is little detail on how these were measured across different cohorts.Was there any standardization across cohorts?
-The distance-based clumping method for defining loci (500 kb apart) lacks explanation.Why was this particular threshold chosen?Could it exclude significant associations that are closer together?
-What are the clinical interpretations and implications of these results?This aspect seems to be missing.The paper heavily emphasizes in silico data (statistical and computational findings), but more context is needed regarding the physiological and biological significance of the identified gene sets for glycemic traits.For instance, the results mention variants associated with traits but do not thoroughly discuss the clinical relevance or potential functional implications of these variants.How might the novel missense variant rs146886108 in ANKH, for example, influence FG or T2DM risk?
-The exclusion of individuals with diabetes is mentioned, but the rationale could be elaborated upon.Could this exclusion introduce bias?Does it ensure that the identified associations are specific to glycemic traits in non-diabetic individuals?
-The authors report identifying 62 distinct coding variant associations at 58 genes with exomewide significance.However, there is little detail on the methods used to control for false positives beyond the Bonferroni correction threshold of P<2.2×10−7.
-The association with HbA1c is considered significant, but there is insufficient discussion about potential confounding factors, such as the influence of red blood cell (RBC) traits on HbA1c.I would suggest softening the interpretation of SV2C as an effector gene, given the complex relationship between RBC traits and HbA1c.Experienced clinicians in this field would likely agree that it is not appropriate to base conclusions about glycemic control solely on HbA1c levels.

Toshimasa Yamauchi
Department of Diabetes and Metabolic Diseases, University of Tokyo Graduate School of Medicine, Tokyo, Japan In the presented study, the authors have undertaken a comprehensive exome-wide association study (ExWAS) to identify genetic loci linked to glycemic traits.A characteristic aspect of this research is the utilization of ExWAS meta-analysis to examine variants in coding regions, an approach that complements previous studies emphasizing non-coding variants.By analyzing data from a large participant pool, predominantly of European ancestry, the study pinpoints single coding variants and gene-based associations that could act as potential effector genes for glycemic traits such as glycated hemoglobin (HbA1c), fasting glucose (FG), fasting insulin (FI), and 2hr glucose post-oral glucose challenge (2hGlu).Additionally, the study extends to pathway analyses, offering insights into gene sets regulating these traits.The transparency and accessibility of the study are beneficial to the research community, with summary statistics made available on their website and through the GWAS catalog.
The study's methodology, while not novel, adheres to established conventions in the field, ensuring a foundation for their analyses.The discovery of a modest number of new loci and genes associated with glycemic traits, though limited in quantity, is worth reporting.These findings include the identification of four variants in three genes that represent novel associations, underscoring the potential for uncovering new pathways in glycemic regulation.The gene-based analysis further highlights six genes, including G6PC and TF, previously unlinked to glycemic traits.
The findings, while not groundbreaking, are biologically consistent.The study reveals a notable enrichment in blood-related pathways, especially those involving erythrocyte differentiation and heme metabolic processes.This enrichment, predominantly driven by HbA1c-associated variants, underscores the multifaceted influence on HbA1c levels, which are affected by both glycation and blood cell turnover.By excluding variants solely associated with HbA1c, the researchers effectively isolated 128 significant gene sets within 53 meta-gene sets (FDR <0.05).This refinement of analysis illuminated pathways more directly related to pancreatic islet function and metabolic tissues.
These pathways, including "abnormal glucose homeostasis", "peptide hormone secretion", and "Maturity Onset Diabetes of the Young", as well as those involved in glycogen regulation, incretin function, and carbohydrate metabolism, align with findings from fasting glucose-only analyses.Such insights could enhance our understanding of the complex genetic and biological mechanisms underlying glycemic control.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound?Yes

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate?Yes Are all the source data underlying the results available to ensure full reproducibility?Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Diabetes, Obesity, Genetics I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1 .
Figure 1.Network and pathway analyses identify relevant gene sets regulating glycemia using two different methods for variant associations with P <1 × 10 -5 .(A-B)The networks represent composite networks for (A) HbA1c and (B) FG, from the GeneMANIA analysis using genes with variant associations at P <1 × 10 -5 for each trait as input.Nodes outlined in red correspond to genes from the input list.Other nodes correspond to related genes based on 50 default databases.Based on the network, GO terms and Reactome pathways that were significantly enriched are depicted.To summarize these results, the most significant term of all calculated terms within the same group is represented.Barplots with the Bonferroni-adjusted -log10(p-values) of the most significant terms within each group are are shown.Each group was assigned a specific color; if a gene is present in more than one term, it is displayed in more than one color.(C-D) Heatmaps showing EC-DEPICT results from analysis of (C) all traits except HbA1c and (D) FG.The columns represent the input genes for the analysis.In (C), these are genes with variant associations of P <1 × 10 -5 for FG, FI, and/or 2hGlu, and in (D) these are genes with variant associations of P <1 × 10 -5 for FG.Rows in the heatmap represent significant meta-gene sets (FDR <0.05).The color of each square indicates DEPICT's z-score for membership of that gene in that gene set, where dark red means "very likely a member" and dark blue means "very unlikely a member."The gene set annotations indicate whether that meta-gene set was significant at FDR <0.05 or not significant (n.s.) for each of the other EC-DEPICT analyses.For heatmap intensity and EC-DEPICT P-values, the meta-gene set values are taken from the most significantly enriched member gene set.The gene variant annotations are as follows: (1) the European minor allele frequency (MAF) of the input variant, where rare is MAF <1%, low-frequency is MAF 1-5%, and common is MAF >5%, 2) whether the gene has an Online Mendelian Inheritance in Man (OMIM) annotation as causal for a diabetes/glycemic-relevant syndrome or blood disorder, 3) to 6) whether each variant was significant (P <2 × 10 -7 ), suggestively significant (P <1 × 10 -5 ), or not significant in Europeans for each of the four traits, and 7) whether each variant was included in the analysis or excluded by filters (see Methods).AWS: array-wide significant.

Version 1 Reviewer
Report 13 September 2024 https://doi.org/10.21956/wellcomeopenres.20795.r96202© 2024 Kutoh E. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Eiji Kutoh 1 Gyoda General Hospital, Saitama, Japan 2 Higashitotsuka Memorial Hospital, Yokohama, Japan I have completed the review of the research paper titled "Large-scale exome array summary statistics resources for glycemic traits to aid effector gene prioritization" by Dr. Willems et al.Below are my comments: the work clearly and accurately presented and does it cite the current literature?Yes Is the study design appropriate and is the work technically sound?Partly Are sufficient details of methods and analysis provided to allow replication by others?Partly If applicable, is the statistical analysis and its interpretation appropriate?Partly Are all the source data underlying the results available to ensure full reproducibility?Yes Are the conclusions drawn adequately supported by the results?Yes Competing Interests: No competing interests were disclosed.Reviewer Expertise: diabetology, molecular endocrinology, molecular and cellular biology I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.Reviewer Report 22 January 2024 https://doi.org/10.21956/wellcomeopenres.20795.r69837© 2024 Yamauchi T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table 1 . Single-point coding variant associations meeting the significance threshold for coding variants of P <2.2 × 10 -7 .
1his table includes all coding variants meeting this threshold, irrespective of whether they fall in completely new loci or in previouslyestablished loci, provided that the association at the established locus was not shown to be due to a non-coding variant (TableS2) or another coding variant at the same locus.Novel loci are highlighted in bold.HbA1c: glycated haemoglobin; FG: fasting glucose; FI: fasting insulin; 2hGlu: 2h glucose; Alleles E/O: effect allele/other allele; EAF: effect allele frequency; Effect (SE): effect size (standard error); P: p-value; N: number of samples in the analysis; Novel/previous glycemic trait association: Novel corresponds to a new association result in this study; Locus name of previous association -name used for previously reported locus.1Significant in the European-only analysis in our study.Genes in this table are listed in order of chromosomal position.

Table 2 . Gene-based results from broad (NSbroad mask) and strict (NSstrict mask) analyses. Genes
in bold are newly discovered from this effort.N var: total number of variants in that gene-based analysis; P burden : p-value from burden test which assumes all variants have the same direction of effect; P SKAT : p-value from SKAT test which allows for different directions of effect between variants.The lowest p-value is highlighted in bold.
approved by the Icelandic National Bioethics Committee, VSN: 00-063.The researchers are indebted to the participants for their willingness to participate in the study.ASCOT trial participants, physicians, nurses, and practices in the participating countries for their important contribution to the study.In particular we thank Clare Muckian and David Toomey for their help in DNA extraction, storage, and handling.This work forms part of the research programme of the NIHR Cardiovascular Biomedical Research Unit at BartsAthero-Express Biobank StudyClaudia Tersteeg, Krista den Ouden, Mirjam B. Smeets, and Loes B. Collé are graciously acknowledged for their work on the DNA extraction.Astrid E.M.W. Willems, Evelyn Velema, Kristy M. J. Vons, Sara Bregman, Timo R. ten Brinke, Sara van Laar, Louise M. Catanzariti, Joyce E.P. Vrijenhoek, Sander M. van de Weg, Arjan H. Schoneveld, Arnold Koekman, Arjan Boltjes, Petra H. Homoed-van der Kraak, and Aryan Vink are graciously acknowledged for their past and continuing work on the Athero-Express Biobank Study.We would also like to thank all the (former) employees involved in the Athero-Express Biobank Study of the Departments of Surgery of the St. Antonius Hospital Nieuwegein and University Medical Center Utrecht for their continuing work.Jessica van Setten is graciously acknowledged for her help in the quality assurance and quality control of the genotype data.Lastly, we would like to thank all participants of the Athero-Express Biobank Study; without you these kinds of studies would not be possible.CHSA full list of principal CHS investigators and institutions can be found at CHS-NHLBI.org.The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.