Cross-population analysis for functional characterization of type II diabetes variants

Background As Genome-Wide Association Studies (GWAS) have been increasingly used with data from various populations, it has been observed that data from different populations reveal different sets of Single Nucleotide Polymorphisms (SNPs) that are associated with the same disease. Using Type II Diabetes (T2D) as a test case, we develop measures and methods to characterize the functional overlap of SNPs associated with the same disease across populations. Results We introduce the notion of an Overlap Matrix as a general means of characterizing the functional overlap between different SNP sets at different genomic and functional granularities. Using SNP-to-gene mapping, functional annotation databases, and functional association networks, we assess the degree of functional overlap across nine populations from Asian and European ethnic origins. We further assess the generalizability of the method by applying it to a dataset for another complex disease – Prostate Cancer. Our results show that more overlap is captured as more functional data is incorporated as we go through the pipeline, starting from SNPs and ending at network overlap analyses. We hypothesize that these observed differences in the functional mechanisms of T2D across populations can also explain the common use of different prescription drugs in different populations. We show that this hypothesis is concordant with the literature on the functional mechanisms of prescription drugs. Conclusion Our results show that although the etiology of a complex disease can be associated with distinct processes that are affected in different populations, network-based annotations can capture more functional overlap across populations. These results support the notion that it can be useful to take ethnicity into account in making personalized treatment decisions for complex diseases.


Background
Genetic variations constitute an important part of the factors that contribute to many complex diseases. To identify genetic variations that are associated with specific complex diseases, Genome-wide and whole-genome association studies have been widely performed in recent years. These studies have identified many germline variants (in particular, single nucleotide polymorphisms or SNPs) associated with complex diseases in many different populations. Most of these identified variants exhibit subtle disease associations, and these variants usually have limited predictive power in risk assessment [1].
While the effect of a single genetic variation, like a SNP, could be small or large, the collective effect of many variations, and their relationships, provide valuable information into the mechanisms of complex diseases. To this end, there is an eminent need for studying the collective effect of many SNPs and their interrelationships as we try to characterize the functional underpinnings of the relationship between genotype and phenotype. One useful source of information in this regard is the disease associations identified by GWAS on different populations.
Since many genomic variants can be population-specific and their distribution can follow geographical patterns, GWAS on different populations offer different [2] and potentially complementary disease associations, which may enrich our understanding of disease mechanisms. Furthermore, rare variants, which are often thought to be the causal variants for many complex diseases, are usually population-specific [1]. Therefore, elucidation of the functional relationship between rare variants identified in different populations can be useful for the design of personalized treatment strategies in precision medicine. In the future, the comprehensive knowledge we collect from the use of multiple populations in whole-genome association studies can be utilized by healthcare systems in smoothing out and gradually eliminating health disparities [3,4].
Systems biology uses computational and mathematical approaches to model the complex interactions among genetic variations as related to a phenotype. Such a holistic approach of characterizing and integrating the widespread genetic variations and the interplay between them yields a more intuitive understanding of complex diseases. Type II Diabetes (T2D) is one of the complex diseases, where most disease-associated variants identified by GWAS are different across different populations. Since variants identified on different populations can be linked to T2D through similar functional mechanisms, trying to annotate the significant SNPs from each population separately will not fully utilize the information obtained from different populations. Furthermore, since there are distinct underlying disease processes and treatment regimens for T2D, analyzing the functional overlap between variants identified on different populations can shed light into the differences between different populations in terms of disease mechanisms.
In this paper, we propose a computational pipeline to systematically assess the functional overlap between genomic markers of complex diseases that are identified on different populations. For this purpose, we use T2D as a suitable model disease, and we compile results from GWAS that have been performed on 9 different populations across the globe. These results mainly represent genomic loci that are found to be associated with T2D on samples collected from these nine populations. Interestingly, there is little overlap between disease associated loci identified on different populations.
To systematically assess the functional overlap between these different sets of SNPs, we develop computational algorithms and statistical frameworks, expecting that the variants identified in certain populations correspond to similar biological processes. For this purpose, we develop a multi-layered framework, where genomic loci, protein-coding genes, biological pathways in which these proteins are active, and networks of physical and functional interactions between these proteins are systematically evaluated for potential overlap. Figure 1 illustrates the framework. Our results show that the overlap between different populations grow as the level of abstraction coarsens from genomic location to biological function. More interestingly, our results also show that differences in the biological processes that are implicated in different populations align with the targets of T2D first-line therapy in each population.
To further assess the generalizability of our methods, we repeat the same pipeline of analyses on datasets for another complex disease, prostate cancer as a second test case. Our results on prostate cancer also show that more functional overlap can be detected among populations as the level of abstraction coarsens.

Populations and datasets for type II diabetes
We use nine T2D Genome Wide Association datasets representing nine populations of Asian and European ethnic origins. The basic statistics of these studies and their results are shown in Table 1. The first two datasets are case-control studies for which the genotypes for all samples and genotyped loci are available. The other seven datasets are published results of case-control studies. These datasets consist of the list of significant loci that are identified by the study and the associated statistical significance figures.
The nine datasets are the following:  Fig. 1 Workflow of multi-level functional overlap analysis for disease-associated genomic variants identified in different populations. The computational pipeline takes as input a set of genomic loci that are found to be significantly associated with a disease of interest in each of the populations that are considered. It then maps these loci to functional elements with coarser granularity, first considering linkage disequilibrium, then mapping loci to protein-coding genes, subsequently mapping these genes to biological pathways, and finally identifying functionally associated subsets of these genes via network analysis. At each level, the resulting functional elements that are found to be associated with the disease are compared across populations, to systematically characterize the overlap between different populations Table 1 Descriptive statistics of the T2D studies, datasets, and results used in this study. The letter code shows the population from which the samples were obtained, Significant tagSNPs shows the number of tagSNPs that were found to be significantly associated with T2D at the significance threshold applied by the corresponding study (also shown in the table). Significant tagSNPs+LD shows the total number of these significant SNPs and the number of SNPs that are in linkage disequilibrium with these SNPs, but were not screened by the corresponding study  [19].
Functional overlap among genomic variants found to be associated with T2D in different populations In this section, we present the overlap between T2D-associated SNPs at five different functional levels. For each functional level, we compute (i) an overlap matrix and (ii) a cumulative overlap function. Each overlap matrix is a k × k matrix that represents the pairwise overlap between the disease-associated loci in pairs of populations based on a certain notion of functional overlap, where k is the number of populations. Each cumulative overlap function is a function in the form f:{1, …, k} → [0,1], assessing the fraction of biological entities (individual loci, loci in LD, genes, functions, subnetworks) that are found to be associated with the disease in at least a given number of the populations. We hierarchically cluster the populations using each of the five overlap matrices and visualize the overlap matrices as heatmaps with hierarchical clustering. To assess the statistical significance of the overlap functions, we report the results of permutation tests obtained through 1000 permutations (the procedure we use for the permutation tests is described in Methods). We compare the overlap function computed on the original dataset against the distribution of overlap functions computed using permutation tests, representing one thousand simulated runs.
The SNP overlap matrix and the SNP overlap function for T2D-associated SNPs in the nine populations are shown in Fig. 2(a) and Fig. 3(a), respectively. As seen in the figures, the overlap between any pair of populations is considerably low, but there is some overlap between pairs of populations (Chinese and Saudi Arabian, French and Lebanese, Finnish and Korean, Lebanese and Japanese). Although the pairwise overlap between T2D-associated SNPs is considerably low, the permutation test for the overlap function for k = 2 (two populations) suggests that the pairwise overlap is statistically significant (z-score = 23.8, p < 3.82E-125). In other words, a SNP that is found to be associated with T2D in one population is likely to be found to be associated with T2D in at least one other population. However, for values of k larger than 2, the overlap between T2D-associated SNPs is not statistically significant, i.e. the T2D-associated SNPs do not tend to be shared across 3 or more populations.
The linkage disequilibrium (LD) SNP overlap matrix and the LD-SNP overlap function for T2D-associated SNPs in the nine populations are shown in Fig. 2(b) and Fig. 3(b), respectively. When we take linkage disequilibrium into account (thereby considering loci with correlated genotypes as overlapping), we observe considerably larger overlap between pairs of populations in the corresponding heatmap. In particular, the Lebanese and French populations cluster together with very high overlap. Another cluster is formed by the Chinese, Japanese, Finnish, and Saudi Arabian populations. This cluster also exhibits considerable overlap with the French-Lebanese cluster. The permutation test for the LD-SNP overlap function suggests statistically significant overlap (p < 4.27E-17) for up to 7 populationsthus we can conclude that, at such genomic level, there is some overlap between the T2D-associated loci among most of the populations that are considered.
The gene overlap matrix and the gene overlap function for T2D-associated genes in the nine populations are shown in Fig. 2 Fig. 3(c), respectively. Here, a gene is considered T2D-associated in a population if at least one SNP in the genes' region of interest (ROI) is found to be significantly associated with T2D in that population. As seen in the heatmap, when genes are considered, more overlap is detected between pairs of populations, as compared to LD Analysis, e.g., there is solid pairwise overlap between populations like the Lebanese and French as well as the Chinese and Saudi Arabian. However, when more than two populations are considered, less overlap is detected in genes than in LD-SNPsand no gene that is common to 6 populations is identified. This is shown by the distribution of the values of the overlap function for the permutation test, but the gene overlap is statistically significant for k = 2,3,4,5 populations (p < 1.1E-245).
The functional overlap matrix and the functional overlap function for T2D-associated functions in the 9 populations are shown in Fig. 2(d) and Fig. 3(d), respectively. As seen in the heatmap, there are no functions that are enriched in SNPs found to be associated with T2D in the French and Lebanese populations, so the two populations are eliminated from further analysis. There is an improved statistically significant pairwise overlap (z-score = 433.8, which corresponds to a very small p-value), between the Finnish and Korean populations, however the extent of overlap decreases to less than five populations as shown by the distribution of the values of the overlap function for the permutation test. In other words, a function that is found to be associated with T2D in one population is likely to be found to be associated with T2D in another three populations. However, for values of k larger than 4, the overlap between T2D-associated functions is not statistically significant i.e. the T2D-associated functions do not tend to be shared across 5 or more populations.
The network overlap matrix for T2D-associated subnetworks and the network overlap function for the seven populations (excluding Fr and L) are shown in Fig. 2(e) and Fig. 3(e), respectively. As seen in the heatmap, the amount and extent of overlap between populations is considerably higher than all previous overlap analyses. There is very high overlap between the Saudi and Chinese populations, populations of European ethnic origin represented by UK, US and other parts of Europe, and between the Finish and Korean populations. Moreover, there is highly statistically significant overlap (z-score > 1294, which corresponds to a very small p-value) between up to 6 populations. This is suggested by the distribution of the values of the overlap function for the permutation test. In other words, a subnetwork that is found to be associated with T2D in one population is likely to be found associated with T2D in almost all of the populations that are considered.
In order to visualize the functional overlap across the populations from a different perspective, we input the entire set of loci from all populations to the Prix-Fixe network analysis tool [20]. This tool outputs the most Fig. 2 Overlap matrices at five functional levels for the genomic variants that are found to be associated with Type II Diabetes in 9 different populations. The plots from (a) to (e) show the overlap matrices for SNP, LD SNP, gene, functional and network overlap matrices respectively, depicted as heatmaps. Each heatmap has hierarchical clustering of the 9 populations (D = dbGAP, W=WTCCC, J = Japanese, F=Finish, K=Korean, Fr = French, L = Lebanese, C=Chinese, S=Saudi). The color intensity from white to brick, shows the degree of overlap, with brick being the highest significant functionally coherent subnetwork across all seven populations. Subsequently, we plot the resulting protein-protein interaction (PPI) network, color coded for different populations. The T2D PPI network across seven populations is shown in Fig. 4. As shown in the figure, the resulting most functionally coherent subnetwork has some strong inter-population interactions between T2D-associated proteins and the strength of interaction, represented by the width of the edges, suggests a strong overlap between the Asian populations; specifically Chinese, Saudi, Korean, Japanese and Finnish populations and some overlap between the Finnish and British as well as between the American and other European and the Korean populations. This conforms to the previous results from network analysis.

Correspondence between genes identified in different populations and T2D subtypes
According to Cantley J and Ashcroft [21], there are two main molecular mechanisms that underlie the etiology of T2D; insulin secretion deficiency and insulin resistance. Prasad and Groop [22] classify T2D associated genes according to their roles in these two subtypes. To investigate the correspondence between the genes that are found to be associated with T2D in each population and their association with these subtypes, we refined our The results of the analysis is shown in Fig. 5. In the figure, the number of genes that are found to be associated with T2D (based on the mapping of the variants identified in each population to the genes' regions of interest) is stratified according to T2D subtypes. Since analysis at the network level provides system-level information, we also repeat this analysis at the network level and we observe increased overlap with network-based analysis. At the network level, we consider a gene to be identified in a population if the most significant subnetwork identified in that population contains the gene. Therefore, a gene that does not have a significant variant in a population can be found to be associated with T2D in that population, if it is functionally associated with other genes that harbor significant loci. Similarly, a gene that harbors a significant locus in a population may not be considered as associated with T2D at the network level, if its network neighborhood is not enriched in genes that are significant in that population.
As seen in Fig. 5, the genes identified in Asian populations are mostly associated with insulin secretion deficiency at both levels of genomic granularities, as opposed to European and American populations, in which insulin resistance seems to be the prevailing predisposing genetic factor for T2D, at both individual gene and network levels.  Fig. 4 Cross-population protein-protein interaction network associated with Type II Diabetes. The proteins in the network are color-coded based on the population on which the corresponding gene is found to be associated with T2D. Blue:UK, Red:US+Other parts of Europe, Cyan:Finnish, Magenta: Korea, Purple:Japanese, Citric:Chinese, Dark Grey:Saudi, Light Grey is used for interconnection nodes. The edge width represents the interaction weight as calculated using Prix-Fixe network algorithm [20] Finnish population, are dominated by insulin resistance-implicated genes, while having some share of insulin secretion deficiency-related genes interacting with them. On the other hand, the Asian populations have more insulin secretion deficiencyimplicated genes at the level of individual genes, except for the Korean population, which shows an equal share of insulin secretion deficiency and insulin resistance-implicated genes. At the network level, all Asian populations show more insulin secretion deficiency as opposed to insulin resistance, including the Korean population. Interestingly, none of the insulin resistance associated genes are identified to be associated with T2D in the Japanese population, according to the network level analysis.

Application to prostate Cancer
In this section, we present the results obtained by applying the proposed framework to prostate cancer data. We use published results of Prostate Cancer Genome Wide Association Studies for seven populations representing seven ethnic origins. The basic statistics of these studies and their results are shown in Table 2. The seven datasets are the following: Fig. 5 The correspondence between genes identified as associated with Type II Diabetes in different populations and their role in T2D subtypes. The top panel shows the distribution of the genes that are found to be associated with T2D in each population with respect to their association with two common T2D subtypes. For each population, among the genes that are found to be associated with T2D in that population, yellow bars represent the number of genes that are associated with Insulin Secretion Deficiency, red bars represent the number of genes associated with Insulin Resistance subtype and the orange bars represent the number of genes that are associated with both subtypes. The bottom panel shows the corresponding distribution for network-based overlap, where a gene is considered to be identified in a population if it is in the most significant subnetwork identified by network analysis on that population. The color codes used in the right panel are identical to that in the left panel. The rightmost bar shows the number of all other T2D genes that were identified in at least one population, color-coded for the population  [37].
We report the results of the five cumulative overlap functions assessing the fraction of biological entities (individual loci, loci in LD, genes, functions, subnetworks) that are found to be associated with prostate cancer in at least a given number of the populations. We compare the overlap function computed on the original dataset against the distribution of overlap functions computed using permutation tests, representing one thousand simulated runs (the procedure we use for the permutation tests is described in Methods). The results in Fig. 6(a-e) show that more significant overlap between populations is realized as the level of abstraction coarsens, from genomic location to biological function.
The permutation test for the overlap function for k = 2 (two populations) suggests that the pairwise overlap is statistically significant (Fig. 6(a), z-score = 43.151). In other words, a SNP that is found to be associated with prostate cancer in one population is likely be associated with prostate cancer in at least one other population. However, for values of k larger than 2, the overlap between prostate cancer associated SNPs is not statistically significant, i.e. the prostate cancer associated SNPs do not tend to be shared across 3 or more populations. In Fig. 6(b) the permutation test for the LD-SNP overlap function suggests statistically significant overlap for up to 3 populations (z-score = 281.63). In Fig. 6(c) the distribution of the values of the gene overlap function for the permutation test, suggests that the overlap is statistically significant for up to 3 populations (z-score = 489.521). In Fig. 6(d) the distribution of the values of the functional overlap function for the permutation test suggests that the overlap is statistically significant for up to 4 populations (z-score = 489.031) and in Fig. 6(e) the distribution of the values of the network overlap function for the permutation test, suggests that the overlap is statistically significant for up to 5 populations (z-score = 361.762).

Discussion
According to literature [38,39], most T2D treatment regimens are based on one of two groups of medications. The first of these two groups is Sulfonylureas; used to improve insulin secretion, by targeting the ABBC8 and KCNJ11 genes and the other is Metformin, which is used to improve insulin sensitivity and targets the PRKAB1 [40][41][42][43]. The drug targets for Sulfonylureas are implicated in the Asian populations of this study, as well as the Finnish population. The drug target for Metformin and the genes that interact with it (PRKAG2 and PRKAG1) are implicated in the European, American and the Korean populations. Table 2 Descriptive Statistics for the genomic variants associated with Prostate Cancer. Significant tagSNPs shows the number of tagSNPs that were found to be significantly associated with Prostate Cancer at the significance threshold applied by the corresponding study (also shown in the table). Significant tagSNPs+LD shows the total number of these significant SNPs and the number of SNPs that are in linkage disequilibrium with these SNPs, but were not screened by the corresponding study In Fig. 7, we show the interrelationship between T2D subtypes, genes identified in each population in this study, first line T2D treatments in these populations, the drug targets of these first line treatments, and their implications in T2D subtypes. It is interesting that the first line treatment for T2D in each of the populations in this study conform to the population's unique T2D mechanisms. For example, literature confirms that decreased insulin secretion capacity takes a bigger role in the development of T2D in the Japanese population than insulin resistance. Furthermore, Sulfonylureas have been the most prescribed class of drugs, and has been the first line treatment in Japan until recently when it started to be supplemented with glucose lowering medications as well [44][45][46]. In China, the majority of oral anti-diabetic drugs belong to the Sulfonylureas class. This is the oldest of the anti-diabetic drug classes and the majority of hypoglycemic medicines on China's 2012 EDL are within this category in spite of the availability of new classes [47][48][49][50][51]. Sulfonylureas has consistently been the first line treatment for T2D in Korea, with no competition until 2010, when Metformin started getting popular in the Korean market and its consumption and sales increased by 2013 [52].
In contrast to the Asian populations, we found that the genes identified in the Finnish, European, and American populations are mostly related to insulin resistance. The Saudi population is also characterized, as Fig. 6 Cumulative overlap functions at five functional levels for the genomic variants that are found to be associated with Prostate Cancer in 7 different populations. The five cumulative overlap functions for prostate cancer, with increasingly statistically significant overlap as we go down the pipeline of analyses from SNP to Network Overlap analysis well as many Arab countries, by Adipocyte dysfunction which associates obesity to insulin resistance and diabetes [53]. Research shows that Metformin is the first line treatment for T2D in USA, UK, Finland and Saudi Arabia [54][55][56][57]. In fact Metformin has been on the list of the top ten prescription drugs in the USA for years, ranking four in 2018 and 2019 [58] and on the list of top 100 most prescribed drugs in the UK [59] and ranks fifth on the list of most prescribed drugs in Saudi Arabia from 2010 to 2015 [60].
In 2017, Metformin had the highest average antidiabetic drug prescriptions per physician in Finland which also falls under the top 10 most commonly prescribed medicine categories [61]. The Finnish Medicine Agency -Fimea [62] estimates what proportion of the population theoretically receives Metformin; in terms of drug daily dose per 1000 inhabitants per day. The Fimea 2017 report shows that Metformin consumption is highest among all antidiabetic drugs between 2014 and 2017 (31.65, 31.65,30.95 and 31.61 respectively with an increasing gap with Sulfonylureas; the latter shows a decreasing daily consumption of 3.94, 3.15, 2.38 and 1.76 for the same 4 years respectively). Metformin has also been associated with a good change in the gut microbiota, which improves insulin sensitivity [63].
T2D has a heterogeneous and multifactorial etiology, with many associated factors including gut microbiome, and possibly genetic subtypes that are yet to be uncovered. Although T2D treatments work at different degrees of efficiency from one person to another, the Fig. 7 Interrelationship between subtypes, populations, first line treatments and drug targets for Type II Diabetes. The figure shows the triangular interrelationships between T2D subtypes, genes identified in each population in this study, first line T2D treatments in these populations, the drug targets of these first line treatments, and their implications in T2D subtypes. For example, in the US, insulin resistance is the most common T2D subtype and Metformin is the first-line T2D treatment. The drug target of Metformin is PRKAB1 which is implicated in insulin resistance and is harbored by the US population in this study above analysis confirms previous research [64,65] indicating that, of the currently known T2D subtypes, certain subtypes seem to be most common in certain ethnicities, and that Asian populations are more characterized by decrease in insulin secretion capacity as opposed to American, European, and other Caucasian populations which have insulin resistance as the most common reason for T2D [66,67]. Also, the network-based T2D subtype analysis (Fig. 5) shows more overlap between the two subtypes of T2D than the gene-based analysis, which supports our results and previous findings [68].
The experimental results we obtained using prostate cancer data show consistency to T2D results in the sense that more statistically significant overlap is realized as we go through the pipeline from SNP to network overlap analysis, which supports our hypothesis and demonstrates the generalizability of the methods.

Conclusions
In this work, we developed computational algorithms and statistical frameworks to assess the functional overlap between disease-associated variants in different populations, expecting that the variants identified in certain populations correspond to similar biological processes. For this purpose, we developed a multi-layered framework, where genomic loci, protein-coding genes, biological pathways in which these proteins are active, and networks of physical and functional interactions between these proteins are systematically evaluated for potential overlap.
Our results, show that the overlap between different populations grow as the level of abstraction coarsens from genomic location to biological function. More interestingly, we were able to show that differences in the biological processes that are implicated in different populations align with the targets of first-line treatments of T2D in each population. We were also able to assess the generalizability of our method by testing its applicability to another complex disease. To this end, our results represent an innovative and potentially significant tool for preventing, curing, and treating disease, in that population-specific functional annotation of disease-associated genes can be used to design personalized treatment strategies in precision medicine.
It is important to note that the results presented here do not conclusively show a causal link between the genomic markers identified via GWAS and the first-line T2D treatments in these populations. Establishment of such a link would require further quantitative analysis to understand whether specific types of diabetes II are over represented in specific populations, and whether the medicine that was prescribed for each patient was appropriate. Furthermore, the prevalence of a specific kind of medicine in a country may not be related to the etiology of disease, but can rather be due to historical or political reasons. Further research is required to answer these questions.

Overlap matrices and cumulative overlap functions
The objective of this study is to characterize the functional overlap between loci that are identified to be significantly associated with a complex disease based on samples from different populations. To address this problem, we assume that we are given a collection L = {L 1 , L 2 , …, L k } of sets of genomic loci identified to be associated with the disease across k populations (in this study, we have k = 9), such that the set L i contains the loci that are found to be significant based on the samples obtained in the ith population. Based on this information, we compute five overlap matrices and five cumulative overlap functions. Each overlap matrix is a k × k matrix that represents the pairwise overlap between the diseaseassociated loci in pairs of populations based on a certain notion of functional overlap. Each cumulative overlap function is a function in the form f:{1, …, k} → [0,1], assessing the fraction of biological entities (individual loci, loci in LD, genes, functions, subnetworks) that are found to be associated with the disease in at least a given number of the populations.
We compute the following overlap matrices: 1. SNP Overlap Matrix [Σ ij ] kxk assesses the overlap between the loci that are found to be associated with the disease in populations i and j, where k is the number of populations. 2. LD SNP Overlap Matrix [Λ ij ] kxk assesses the overlap between the loci that are found to be associated with the disease in populations i and j, such that two loci are considered to be overlapping if they are in linkage disequilibrium. 3. Gene Overlap Matrix [Γ ij ] kxk assesses the overlap between genes that harbor loci that are found to be associated with the disease in populations i and j. 4. Functional Overlap Matrix [Φ ij ] kxk assesses the overlap between the biological processes that are enriched in genes harboring loci that are found to be associated with the disease in populations i and j. 5. Network Overlap Matrix [Ν ij ] kxk assesses the overlap between the subnetworks of protein-protein interaction networks that are enriched in genes harboring loci that are found to be associated with the disease in populations i and j.
In the following discussion, we explain how we compute each of these overlap matrices. The notation used in this section is provided in Table 3.

SNP overlap matrix
We define the SNP Overlap Ratio Σ ij between two populations i and j as the Jaccard coefficient of the overlapping SNPs between the populations, i.e., the fraction of common significant tagSNPs in the two populations among the number of significant tagSNPs in the two populations: Where S i and S j denote the sets of tagSNPs that are found to be significantly associated with the disxeases in populations i and j, respectively.
In order to quantify the overall overlap between the k populations, we define the cumulative SNP overlap function σ(l) for 1 ≤ l ≤ k as follows: Where S l denotes the set of tagSNPs that are found to be significantly associated with the disease in at least l populations. Observe that 0 ≤ σ(l) ≤ 1, σ(1) = 1, and σ(l) is a monotonically non-increasing function of l. All cumulative overlap functions we define below also exhibit these properties.
Since several probes map to the same SNP, while computing the sets S i for 1 ≤ i ≤ k, we first remove duplicate SNP lists in every population. For the same reason, we also compute the overlap ratio for a pair of populations as the number of common tagSNP over the number of all unique tagSNPs in the populations combined.

LD SNP overlap matrix
We define the linkage disequilibrium (LD) SNP overlap ratio Λ ij between two populations i and j as the fraction of common significant tagSNPs in the two populations that have significant LD partners in the other population among the number of significant tagSNPs in the two populations : Where L ij denotes the set of tagSNPs that are found to be significantly associated with the diseases in population i and have significant LD partners in the other population j, and L ji denotes the set of tagSNPs that are found to be significantly associated with the diseases in population j and have significant LD partners in the other population i.
In order to quantify the overall overlap between the k populations, while using LD to expand the definition of tagSNPs, we define the cumulative LD SNP overlap function λ(l) for 1 ≤ l ≤ k as follows: Where L l denotes the set of tagSNPs that have LD partners found to be significantly associated with the disease in at least l populations.
To find SNPs that are in linkage disequilibrium (LD), we input each population's tagSNPs into SNPsnap [69] for LD search using HapMap3 release 2 [70] dataset. For this purpose, we use the European panel (CEU) for the European and American populations (W and D), European and Toscani in Italia (TSI) for the French and Lebanese populations, and the Japanese and Chinese panels (JPT + CHP) for Finland and the Asian populations. We consider two SNPs to be in LD if they have an r 2 of at least 0.5 and they are within 500Kbs of each other on the genome.

Gene overlap matrix
We define the Gene Overlap Ratio Γ ij between two populations i and j as the fraction of common genes that harbor significant tagSNPs (in a region of interest of 100 kb  S k Set of tagSNPs that are significant in at least K populations L k Set of tagSNPs that have significant LD partners in at least K populations G k Set of genes that are associated with at least K populations F k Set of functions that are associated with at least K populations N k Set of genes, constituting the most significant networks, that are associated with at least K populations Λ ij ¼j L ij ∪L ji j = j S i ∪S j j upstream and 10 kb downstream) in the two populations among the number of genes in the two populations, i.e.: Where G i and G j denote the sets of genes that harbor significant tagSNPs that are found to be significantly associated with the diseases in populations i and j, respectively.
In order to quantify the overall overlap between the k populations, we define the cumulative gene overlap function γ(l) for 1 ≤ l ≤ k as follows: Where G l denotes the set of genes that harbor significant tagSNPs that are found to be significantly associated with the disease in at least l populations.
In order to map SNPs to Genes, all Refseq transcripts [71] for hg38 assembly are downloaded from UCSC Table  browser and extended 100 kb upstream and 10 kb downstream. Refseq IDs are translated to HGNC symbols. SNPs are mapped to their positions on hg38 assembly through biomaRt [72] and are intersected with gene coordinates. Such intersections result in gene lists matching each population.

Functional overlap matrix
We define the Functional Overlap Ratio Φ ij between two populations i and j as the fraction of common biological processes that are enriched in genes harboring significant tagSNPs (+ 100 kb/− 10 kb) in populations i and j among the number of biological processes that are enriched in genes harboring significant tagSNPs (+ 100 kb/− 10 kb) in the two populations, i.e.: Where F i and F j denote the sets of significant biological processes in populations i and j, respectively.
In order to quantify the overall overlap between the k populations, we define the cumulative functional overlap function ϕ(l) for 1 ≤ l ≤ k as follows: Where F l denotes the set of biological processes that are found to be significantly associated with the disease in at least l populations.
To identify biological processes that are enriched in T2D-associated SNPs for each population, we use enrichment analysis of gene sets using the WebGestalt R package [73,74]. Biological Process terms are used and False Discovery Rate (FDR) threshold is set to 5%.

Network overlap matrix
For each population, we feed the loci that are found to be associated with the disease to the Prix-Fixe [20] network analysis tool. The Prix-Fixe tool uses a network based disease-associated subnetwork identification algorithm that uses genome-scale shared function networks to identify the most functionally coherent subnetwork of genes spanning the disease associated loci. Using shared function networks as a reference, the algorithm evaluates gene combinations, constraining the choice to one gene from each disease associated locus, for their shared function, hence similar to choosing compatible food items from a prix fixe restaurant menu, with one dish from each course. The algorithm outputs a list of genes, with their disease association scores, such that the top scored genes constitute the most coherent disease associated subnetwork in each population.
We define the Network Overlap Ratio Ν ij between two populations i and j as the fraction of common genes constituting subnetworks of protein-protein interaction networks that are enriched in genes harboring loci that are found to be associated with the disease in populations i and j, among all genes in populations i and j, i.e.: Where N i and N j denote the sets of genes constituting the most significant subnetworks in populations i and j respectively.
In order to quantify the overall overlap between the k populations, we define the cumulative network overlap function ν (l) for 1 ≤ l ≤ k as follows: ν l ð Þ ¼j N l j = j N j Where N l denotes the set of genes constituting the subnetworks that are found to be most significantly associated with the disease in at least l populations.

Permutation tests
To test the statistical significance of our results, we create a thousand simulated data sets at each level of genomic granularity (SNPs, LD SNPs, Genes, Functions, Networks) with a simulated set size for each population matching that of the original data set. A thousand permutations are done and in each permutation, permuted overlap functions are calculated as described above. This results in 1000 overlap function vectors for k = 1 to 9 corresponding to each permuted data set. For each k, in each permuted set, the minimum, maximum and mean values of the overlap function across all permutations are calculated. These calculations are then used to visualize distribution of the overlap function in permuted datasets and assess the significance of the observed overlap based on this distribution.

Permutation tests for SNP overlap
In order to perform the permutation tests for SNP overlap, we prepare a SNP pool. For each study, the genotyping array model is identified. Array annotation files are downloaded from UCSC genome annotation database [75]. Further, SNPs from WTCCC and dbGAP populations are extracted and pooled together with SNPs from all arrays, this resulted in 2,028,276 unique SNPs. (Note: Imputation data is not used, since these studies do not provide the final imputation results after quality control). All the populations together with their SNPs constitute a SNP set, which is used as a pool for permutation tests. SNP sets are then randomly sampled, with replacement across populations, such that the size of each set matches that of the observed SNP set for each population.

Permutation tests for LD-SNP overlap
The SNP sets generated in each iteration of permutation tests for SNP overlap are provided to SNPsnap [69] tool for LD SNP extraction using the parameters used for the original data set.

Permutation tests for gene overlap
A gene pool is created from all human genes; all Refseq transcripts [71] for hg38 assembly are downloaded from UCSC Table browser and extended 100 kb upstream and 10 kb downstream and Refseq IDs are translated to HGNC symbols. All HGNC symbols matching Refseq IDs are combined and duplicates removed. The final HGNC list served as the pool for random gene sampling. In each of the thousand permutations, gene sets are randomly sampled from the gene pool, with replacement across populations, matching the size of the observed gene set.

Permutation tests for functional overlap
Enrichment analysis is done for one thousand randomly sampled gene sets. In each permutation enrichment analysis is performed with the same parameters as the actual enrichment analysis. At this point, two populations (L and Fr) were not enriched in any biological processes, so the two populations were dropped from the analysis and further analysis is done on k = 7 populations only.

Permutation tests for network overlap
The SNP sets generated in each of the permutation tests for SNP overlap are provided to Prix-Fixe [20] network analysis tool, which outputs the most coherent disease associated subnetwork for each population. This results in 1000 most coherent disease associated subnetworks for each population.