From SNPs to Genes: Disease Association at the Gene Level

Interpreting Genome-Wide Association Studies (GWAS) at a gene level is an important step towards understanding the molecular processes that lead to disease. In order to incorporate prior biological knowledge such as pathways and protein interactions in the analysis of GWAS data it is necessary to derive one measure of association for each gene. We compare three different methods to obtain gene-wide test statistics from Single Nucleotide Polymorphism (SNP) based association data: choosing the test statistic from the most significant SNP; the mean test statistics of all SNPs; and the mean of the top quartile of all test statistics. We demonstrate that the gene-wide test statistics can be controlled for the number of SNPs within each gene and show that all three methods perform considerably better than expected by chance at identifying genes with confirmed associations. By applying each method to GWAS data for Crohn's Disease and Type 1 Diabetes we identified new potential disease genes.


Introduction
Genome-Wide Association Studies (GWAS) link genetic variants to phenotypes. One common study design in human disease genetics is to compare a group of diseased individuals (cases) to a group of healthy individuals (controls) for a large number of Single Nucleotide Polymorphisms (SNPs). The frequency of each allele is compared between cases and controls using a x 2 statistic, which can be transformed into a measure for the probability of the data arising under no association between disease and SNP (p-value). Currently, GWAS are carried out using microarray technology, genotyping up to one million SNPs in parallel. Because a statistical test is performed for each SNP, careful multiple hypothesis testing procedures are employed to ensure the identification of association signals with genome-wide significance, typically with a p-value p,5N10 28 [1]. In most GWAS only a few SNPs pass this correction and although this approach has led to the discovery of several novel disease-linked variants, it ignores thousands of SNPs with ''suggestive'' p-values that fail to reach the stringent threshold for genome-wide significance, but may reflect evidence for association. Several approaches try to make use of these ''suggestive'' p-values through the incorporation of prior biological knowledge [2,3,4,5,6,7,8,9,10,11,12]. The best known is Gene Set Enrichment Analysis (GSEA) [3,13], which assesses whether predefined sets of genes are overrepresented within a sample. Genes that are members of the same gene-set are typically involved in a common biological process as defined by e.g. the Gene Ontology [14] or biological pathways as defined by databases such as KEGG [15]. In a similar way, protein networks have been consulted [10,11] with the objective of identifying subnetworks of interacting proteins. Individually none of the proteins within such a subnetwork might be significantly associated, but overall a subnetwork might show statistically significant association with a disease.
All of these studies face very similar methodological problems: GWAS report association for individual SNPs, whereas functional information typically exists for proteins or genes. Therefore SNPs have to be assigned to genes and their individual association signals combined. This can be done in different ways and one must take into consideration that the number of SNPs per gene can vary to a great extent. The most widely used approach is to take the most significant p-value per gene [2,3,4,5,6,7,8]; however this can introduce a substantial bias in the downstream analysis if the number of SNPs per gene is not controlled for [9]. In this work we systematically compare three methods to analyse GWAS data at the gene level. We also propose a way to control for differences in the number of SNPs per gene based on permutations of the disease status and demonstrate its effectiveness. Based on GWAS data for Crohn's disease (CD) and Type 1 Diabetes (T1D) genotyped by the Wellcome Trust Case Control Consortium [16], we evaluate the performance of the different methods using sets of disease genes that were identified and replicated by the most recent metaanalyses [17,18].

Quality Control and Association Testing
GWAS of seven diseases have been performed by the WTCCC [16]. Approximately 3,000 shared controls and 2,000 cases were genotyped for seven diseases, including Crohn's Disease (CD) and Type 1 Diabetes (T1D), on the Affymetrix GeneChip 500K Mapping Array Set. We re-analyzed the WTCCC I data using PLINK v1.06 [19]. In addition to SNPs and individuals in the exclusion lists provided with the genotyping data, we applied more stringent quality control criteria than the original study, because our analysis includes moderate associations which are more susceptible to study biases. Based on the pooled case/control dataset we excluded SNPs with Hardy-Weinberg equilibrium p,0.001, a minor allele frequency of less than 0.01 or genotyping call-rates of less than 0.97. Association testing was performed using the Cochran Armitage trend test (1df). We manually checked the most strongly associated SNPs for every disease to ensure consistency with the original WTCCC I results. To take into account inflated test statistics caused by population stratification we corrected test statistics using the genomic control metric l median [20]. The estimated l median (for simplicity denominated as l) for CD (l = 1.12) and T1D (l = 1.06) are in good agreement with the original values reported by the WTCCC (l = 1.11 and l = 1.05 for CD and T1D, respectively). For both diseases, 500,000 permutations of the disease status were performed using the PLINK max(T) permutation method and association p-values were calculated. Table 1 summarises the GWAS data analysis for CD and T1D.
To further assess the effect of population stratification on our analyses we performed principal component analysis (PCA) of the CD and T1D data using EIGENSTRAT [21]. We then performed association testing using logistic regression to incorporate the first two principal components as covariates. For both diseases, 1,000 permutations of the disease status were performed using logistic regression and the PLINK max(T) permutation method.

Gene to SNP assignment
A tab-delimited text-file (seq_gene.md) containing genomic coordinates for all genes was downloaded from the NCBI ftpserver [22] in November 2009. Only entries for the human reference sequence (NCBI assembly GRCh37) and protein-coding genes were retained. Genes mapping to sex-chromosomes, the mitochondrial chromosome, unassembled contigs or alternative haplotypes were discarded. SNPs on the GeneChip 500K Mapping Array Set were assigned to the remaining genes. Because this genotyping platform is based on the previous assembly of the human genome (NCBI 36) all SNP positions were converted to the latest assembly using the ''Lift-Over'' tool on the GALAXY website [23]. SNPs were assigned to a gene if they are located within its primary transcript or 40 kilobases (kb) upstream or downstream. These boundaries are chosen based on the distribution of association signal with respect to protein-coding genes [24]. When a SNP could be assigned to multiple genes because of overlapping flanking windows, the closest gene was chosen.
The WTCCC study found the strongest association signal for Type 1 Diabetes (T1D) within the Major Histocompatibility Complex (MHC) region on chromosome 6. The MHC region has high levels of linkage disequilibrium (LD) and harbours many genes. This causes the association signal to be spread over many genes, thereby artificially inflating the number of genes with associated SNPs. We therefore excluded the MHC region (chromosome 6, position 25,930,839 to position 33,495,825, NCBI assembly GRCh37) in all analyses of the T1D dataset, which removed 1,473 SNPs and 185 genes. In total, approximately 290,000 SNPs were assigned to 17,000 protein coding genes. Table 1 summarises the SNP to gene assignment for CD and T1D.

Assessment of LD on SNP to gene assignment
In order to assess the effect of LD we repeat our analyses, but take into account LD to extend the assignment of SNPs to genes. We use PLINK v1.06 [19] to obtain a list of SNP pairs in LD (r 2 .0.8) based on the GWAS data for CD and T1D [16]. SNPs are added to the initial assignment if they are in LD (r 2 .0.8) with a SNP in a gene or its 40 kb flanking windows, including SNPs that have already been assigned to other genes. Taking into account LD adds approximately 6,000 (2%) additional SNPs to the analyses.

Deriving a gene-wide test statistic for each gene
Each gene has n SNPs assigned to it with n M N 0 . Let the test statistics in the gene be T i , i = 1, … n. Under the null hypothesis of no association, T i has a x 1 2 distribution (x 2 distribution with one degree of freedom); high values of T i indicate evidence for association. To obtain a gene-wide test statistic, we use three summary statistics for T i : 1. maxT: the maximum value of T i (maximum x 1 2 value) for each gene is chosen; 2. meanT: the arithmetic mean test statistic (mean x 1 2 value) for each gene is calculated; 3. topQ: the highest quartile of all test statistics T i (highest quartile of all x 1 2 values) in a gene are selected and their mean is calculated. If n is not a multiple of 4 the number of SNPs considered for topQ is rounded up to the next integer (e.g. if a gene has 5 SNPs the mean of the largest two test statistics is calculated).
Deriving an empirical p-value (p emp ) for each gene We derive test statistics for each gene in the observed dataset and in 500,000 randomised datasets derived from permutations of the disease status. For each gene we tabulate the number of permuted data sets in which we observe a higher gene-wide test statistic than in the observed data set, thus deriving an empirical pvalue p emp .
Because we compare observed and permuted test statistics for every gene, a significantly associated gene requires a p emp value that is also controlled for the number of genes tested. Assuming there are approximately 20,000 protein-coding genes in the human genome, a Bonferroni correction requires a p-value threshold of p emp = 0.0561/20,000 = 2.5610 26 . In order to be able to obtain

Uncontrolled vs. empirical p-value
To compare the different methods we rank genes for each genewide test statistic method. This is done before and after deriving p emp values (i.e. controlling for the number of variants per gene and LD) resulting in six different sets of ranks. When p emp values are identical for two or more genes we use the gene-wide test statistics to resolve ties. Based on the ranks we calculate pairwise Spearman rank correlation coefficients between all six sets for the top 500 genes: For each gene, we sum the ranks across all six gene sets, and select the 500 genes with the highest summed ranks.
To analyse the effect of deriving p emp values for individual genes we convert the gene-wide test statistics to p-values assuming test statistics have a x 1 2 distribution. For each gene the uncontrolled p-value is plotted against the p emp value for all three methods.  [25]. In this analysis we used as positives a list of successfully replicated disease genes from meta-analyses of T1D [18] and CD [17]. We only chose loci that either contain a single gene or a for which a unique candidate gene has been proposed [17,18]. This results in 39 and 27 true positive genes for CD and T1D, respectively (Tables 2, S1 and S2). We assume that all other genes are negatives. We rank all genes within both lists (positives and negatives) by their p emp values, and used their gene-wide test statistics to resolve ties when p emp values are identical for two or more genes. For each gene the relative rank within the positives is plotted against the relative rank within the negatives to derive the ROC curve, and the areas under the curve (AUC) were calculated.
All scripts written for the analyses presented are available from authors upon request.

Number of SNPs per gene
The Affymetrix 500K GeneChip includes approximately 500,000 SNPs distributed over the whole genome. We assign these SNPs to their closest protein-coding gene if a SNP is located less than 40 kb from a gene. Approximately 290,000 SNPs were assigned to genes, of which 227,000 were left after QC for specific disease data sets (Table 1). Genes vary substantially in size, which leads to different numbers of SNPs assigned to each gene ( Figure 1). Of 20,919 protein-coding genes 17,006 have at least one SNP assigned; most of these genes (,77% or 13,083 genes) have fewer than 10 SNPs; 6.5% (1,097 genes) have more than 50 SNPs. The largest number of SNPs assigned to a single gene is 1,008 (CSMD1, gene length: 818 kb).
We performed analyses of GWAS data for both Crohn's Disease (CD) and Type 1 Diabetes (T1D). In the following section we present results for CD. Results for T1D are comparable and presented in supplementary material.

Deriving a gene-wide test statistic for each gene
To measure association of a SNP with the disease we compare genotype frequencies between cases and controls and calculate a genomic control-corrected test statistic based on an Armitage trend test for every SNP. To obtain a gene-wide measure of association we first derive three summary statistics: maxT (the maximum test statistic for each gene), meanT (the mean test statistic for each gene), and topQ (the mean of the highest quartile  of all test statistics in a gene). Here we illustrate how each summary statistic is subject to confounding factors that have to be controlled for. The gene-wide test statistic is correlated with the number of SNPs per gene, n (Figures 2 and S1), as follows.
N For maxT the test statistic increases approximately linearly with n (Pearson correlation coefficient r = 0.36). Even if there is no association, genes with many SNPs assigned are more likely to have a SNP with a high test statistic, by chance.
N A different effect occurs for meanT, whereby genes with many SNPs tend to have gene-wide test statistics close to one, whereas genes with few SNPs tend to be at the extremes of the distribution, i.e. to have either very low or very high gene-wide test statistics. Under the null hypothesis of no association, the test statistic has a x 1 2 distribution, with a mean of 1. When calculating meanT, genes with more SNPs are therefore likely to have gene-wide test statistics close to 1, whereas genes with few SNPs are more affected by individual SNPs with extreme test statistic.
N An effect similar to meanT is observed for topQ: Genes with fewer SNPs tend to have extreme gene-wide test statistics whereas genes with many SNPs tend to have a gene-wide test statistic close to x 2 <3. This value is higher than for the meanT method since only the top 25% of SNPs per gene are selected.

Deriving an empirical p-value for each gene
The distribution of the summary statistics for each gene is not known and impossible to derive analytically, since it depends on the pattern of LD within each gene. We therefore derive an empirical p-value p emp for each gene from permuted datasets (see Methods). By comparing the observed to the permuted test statistics we maintain LD structure and account for differences in the number of SNPs per gene. The observed p emp values are appropriately controlled for the number of SNPs per gene; we observe no correlation between the number of SNPs per gene and the p emp value (Figures 3 and S2).

Uncontrolled vs. empirical p-value
Although different methods yield different levels of association for a given gene, the results are correlated. Between the three methods to derive p emp values, we observe an average Spearman rank correlation coefficient of 0.74 when considering the top 500 genes (Tables S1 and S2). The average Spearman rank correlation coefficient between the three methods before deriving p emp values (i.e. controlling for the number of variants per gene and LD) is only 0.30, which reflects the different biases introduced by the methods to derive gene-wide test statistics The p emp values are controlled for the number of SNPs per gene and the correlation structure, but how does the control affect individual genes? To address this question, we convert the combined test statistics to p-values assuming test statistics have a N For the maxT method, genes with many SNPs (large n) are more likely to have a high test statistic and therefore a low uncontrolled p-value. When deriving p emp values we control for n. The control has very little impact on genes with n = 1 and in that case the empirical and the uncontrolled p-values are very similar (lying along the diagonal in Figures 4 and S3). For genes with higher n the control is stronger and p emp values are higher than the uncontrolled p-values.
N For meanT we observe a sigmoid-like distribution. That is explained by the effect of varying n: We compare permuted to observed test statistics. If there is no association the expected test statistic is 1. Therefore the expected meanT values for the permuted datasets are 1, i.e. with increasing n the permuted meanT is more likely to be 1. For genes with large n this leads to extreme p emp values when we compare observed to the permuted meanT. As a result the distribution for genes with large n shows a stronger curvature than for genes with small n. When the observed meanT value is 1 (uncontrolled p-value N The distribution for topQ is similar to maxT, but the gradient for genes with many SNPs is less steep.

Performance
To assess the performance of the different methods of combining test statistics we plot Receiver Operating Characteristic (ROC) curves for CD and T1D ( Figure 5) using two sets of confirmed disease genes [17,18] under the assumption that all other genes are not associated (see Methods). The known disease genes are based on meta-analyses CD [17] and T1D [18]. Based on genomic loci that successfully replicated the authors selected the most likely candidate gene considering known involvement in the immune system, association with other auto-immune disorders and location of the most strongly associated SNP. Although the resulting gene list may contain genes which are not associated with the trait, it is the best currently available dataset to assess the performance of our methods for measuring genetic association at the gene-level.
All three p emp methods give considerably better results than expected by chance. For both diseases the topQ method performs slightly better than maxT and meanT, although all three methods perform similarly with differences in the areas under the curve (AUC) of less than 2%. The performance of the different methods for the two diseases might depend on the number of SNPs assigned to the known disease genes. For genes with many SNPs the association signal can get diluted, as it is the case for the CD disease gene ZNF365, which has 91 SNPs ( Table 2). Its maxT is 23.74 which corresponds to p emp = 0.0001, but the meanT and the topQ for this gene are 2.46 (p emp = 0.0041) and 8.32 (p emp = 0.0010), respectively. Consequently the performances measured here by the AUCs depend on the properties of the known disease genes and we can only assume that they are characteristic for disease genes that have not been identified yet.
Several known disease genes were consistently ranked very low by all three methods ( Table 2). For some of these genes the associated SNPs are over 40 kb from the gene (e.g. PTPN22), or the associated SNP is located in the adjacent gene (e.g. ORMDL3). Other confirmed disease genes were ranked low because the associated SNP has not been genotyped by the WTCCC (e.g. JAK2) or did not show any association (e.g. PLCL1).

Linkage Disequilibrium
Our analysis is influenced by linkage disequilibrium (LD) and some of the top ranked genes ( Table 3) are part of the same LD region, reflecting the fact that a true association signal could extend over a large region of the genome if it falls into a large LD block. Most of the SNPs in such a region would appear to be associated with the phenotype which can result in several genes with significant empirical p-values. For example, CYLD and SNX20 have p emp values smaller than 5.4610 25 ; they are located upstream and downstream of NOD2 and are located in the same LD block as NOD2. Their association is most probably an artefact of the confirmed association of the NOD2 gene [26,27,28]. To further assess the impact of LD on our analyses we extended the initial gene to SNP assignment. In addition to SNPs located within the gene or a 40 kb flanking window we include SNPs in LD (r 2 .0.8) with any SNP in this region. This increases the average number of SNPs per gene to 15.5 (from 13.9) and the total number of SNPs assigned to genes to over 296,000 (from 290,000) ( Figure  S4). Including LD in the gene to SNP assignment has only a moderate effect: Although AUC values show a small increase for each method (,1.3%), only a small minority of genes is affected ( Figure S5). Gene ranks obtained with and without taking into account LD are highly correlated (Spearman rank correlation r = 0.98 for each method and disease). Only 3 genes out of the top 100 have a rank above 100 when including LD (maxT for CD) and all genes discussed here and shown in the tables only marginally change their rank or p-value.

Population Stratification
Our primary analysis method is testing for association with the Cochran Armitage Trend Test, with genomic control correction for population ancestry, as this makes performing large numbers of permutations computationally tractable. To assess the effect of population stratification on our analysis in more detail we performed Principal Component Analysis [21] for both datasets. We repeated association testing using logistic regression and adjusting for the first two principal components (PC-correction). This reduced the genomic control measure for CD from l = 1.12 to l = 1.08, with no reduction observed for T1D (l = 1.06). Adjusting for up to 10 PCs did not reduce l any further. The correlation between gene ranks of our primary analysis and after correction for population stratification was high (CD-maxT R = 0.932, CD-meanT R = 0.942, CD-topQ R = 0.940, T1D-maxT R = 0.997, T1D-meanT R = 0.998, T1D-topQ R = 0.998). Gene ranks for CD are more affected than for T1D: out of the top 100 genes of our primary analysis, 78 are within the top 100 genes after PC-correction, and all 100 are within the top 204 genes (maxT, Figure S6). For T1D, 86 out of the top 100 genes of our primary analysis are within the top 100 after PC-correction and all 100 are within the top 143 genes (maxT, Figure S6). Correcting for two principal components only marginally affects the performance of our methods: AUC values increased by ,0.6% for both CD and T1D.

Associated Genes
All genes discussed here only marginally change their rank or pvalue after correcting for two principal components or when considering LD for the SNP to gene assignment. For CD we find 7 out of 39 known disease genes (true positives) within the top 30 genes when we rank all genes based on p emp values (derived from maxT). We use their gene-wide test statistics to resolve ties when p emp values are identical for two or more genes ( Table 3). The genes STAT3 (maxT rank 27) and SBNO2 (maxT rank 26) are located within known disease loci, but are not part of the true positive list because the association signal extends over several genes [17]. Both loci did not reach genome-wide significance in the original WTCCC study and their association was only confirmed in a more recent large-scale meta-analyses. STAT3 and SBNO2 can be linked to the IL10/STAT3 anti-inflammatory pathway [29], which has been implicated with CD [2,17,30].
Another promising candidate for CD might be DAG1 (dystroglycan 1), ranked 23rd for maxT. It is located within a large LD block whose association has been replicated and that encompasses about 35 genes [17]. DAG1 is a cell surface receptor which is used by several known pathogens [31,32] and there has been speculation about a role for DAG1 in the uptake of Mycobacterium avium ssp. paratuberculosis and the aetiology of Crohn's Disease [33].
For T1D five out of 27 known disease genes are within the top 30 (based on maxT, Tables S3 and S4). Of the top 30 genes, 14 fall into a large LD region on chromosome 12 (position 111,348,628 to position 112,947,717), which contains 15 genes. According to Todd et al. [34] the most probable causal gene for this region is SH2B3. The authors detected a highly associated non-synonymous SNP in exon 3 of SH2B3, which had not been genotyped in the WTCCC study [16]. Two SNPs that were genotyped in the WTCCC are assigned to SH2B3 and show moderate association (p = 3610 25 and p = 7610 24 ). Since 40 other SNPs in the region show stronger association, SH2B3 is only ranked 26 (by maxT).

Discussion
Based on GWAS data for two common diseases we present three different methods to combine individual test statistics at a gene level. For all methods the gene-wide test statistic is correlated with the number of SNPs per gene. Based on permutations of the disease status we derive an empirical p-value for each gene and show that it is controlled for the number of SNPs within the gene. To assess the performances of the p emp methods we derive ROC curves based on two sets of disease genes that were replicated in the most recent meta-analyses [17,18]. The p emp methods distinguish different genetic architectures underlying a disease: for maxT a single mutation within a gene contributes to the disease (i.e. one SNP within a gene shows association); for meanT mutations spread all over the gene contribute to the disease (i.e. all or many SNPs within a gene show association): in the case of topQ only a few mutations within a gene contribute to the disease (i.e. a subset of the SNPs within a gene show association). All three methods performed substantially better than expected by chance at identifying these genes, thus justifying our approach. The performances of the three methods were similar, demonstrating the robustness of the permutation approach. This is also reflected by the correlations between empirical p-values for each method for the top 500 genes. For some genes however, results can vary across the methods, as illustrated by ZNF365 (Table 2). To identify all potentially associated genes, results from all methods should be considered. As the methods are correlated, integration results in a moderate increase in the number of genes. For example, the union of the top 500 genes for all three methods consists of 678 genes.
In this work we perform gene-wide analyses on two independent GWAS datasets. We observe the same overall properties for gene-wide test statistics and p emp -values. Furthermore for both datasets our methods successfully reproduced known disease associations showing the robustness of our approach. In addition to the methods presented here other methods have been proposed, including multi-marker association tests [35,36,37,38] and variations [39,40,41] of Fisher's method to combine p-values [42]. Recently, two studies proposed approaches to control for confounding factors (e.g. number of SNPs per gene) which do not require genotyping data [12,43]. Further studies will be required to determine how these methods compare. An open problem that still has to be addressed is the effect of LD. Correlation between the SNPs of a gene can impact the combined test statistic for meanT and topQ method. Because multiple associations can be caused by a single causal SNP a high meanT or topQ might not reflect several independent associations. Correlation between the SNPs of a gene can therefore change the nature of the method to combine test statistics. Furthermore LD makes it difficult to allocate association signal to the correct gene. A number of groups have proposed computational approaches to prioritize genes within LD blocks [6,44,45]. They have been shown to give reasonably good results and could be combined with our approach.
Another approach is to use imputed genotypes, which will increase the density of SNPs and therefore the proportion of genes that are captured. Hong et al. [9] were able to include over 800 additional genes (5%) in their gene-wide analysis of GWAS data, but levels of statistical significance for most other genes remain unchanged compared to using genotyped SNPs only. Assigning SNPs to genes is not straight forward as regulatory elements such as enhancers can be many kilobases away from the transcribed region. In addition some disease-associated variants are located in so-called gene deserts that cannot be linked to protein-coding genes or any other functional elements. Ultimately functional studies are necessary to determine which gene is implicated in a disease process. The methodology demonstrated here is instrumental in automatically identifying the relevant genes that might be implicated in inherited disorders and provides an unbiased ranked list of genes for experimental validation.
Currently GWAS are moving from microarray based technology towards next-generation sequencing (NGS). NGS, in principle, allows for the identification of all genetic variants. As the number of genetic variants in a given individual is far higher [46] than the number of SNPs genotyped using microarray technology, the number of tests is going to increase dramatically. There is a need for new analytical methods that combine association signals over several genetic variants or all variants within a gene, particularly for rare variants which may individually lack power to show significant association. Testing for combined association of all rare variants within a gene overcomes this problem, as demonstrated for simulated data and sequence data of previously known disease genes [47,48,49].
With the emergence of next-generation sequencing, GWAS will increasingly be analysed on gene level. Gene-level association measurements allow the application of gene-set enrichment analysis and related methods, which will ultimately improve the understanding of the underlying molecular mechanism. The methods proposed here provide an accurate and powerful approach to summarise evidence for association within genes and could be used to design functional follow-up studies. Figure S1 Confounding effect of the number of SNPs per gene (Type 1 Diabetes). Multiple test statistics are combined for each gene using three different methods (maxT, meanT, topQ). For each method, the gene-wide test statistic is correlated with the number of SNPs per gene. For these histograms, genes are binned according to their gene-wide test statistic (left axis). The red dots show the mean number of SNPs per gene for every bin (right axis). (TIFF) Figure S2 Distribution of empirical p-value (p emp ) for Type 1 Diabetes from 500,000 permutations of the disease labels. Genes were assigned to 50 bins according to their p emp . Histogram shows the number of genes with p emp values (left axis). The red line shows the mean number of SNPs per gene for every bin (right axis). In contrast to the gene-wide test statistics we observe no correlation of the number of SNPs per gene with p emp for any method. We observe an increase of genes with very low p emp values caused by the actual association signal. (TIFF) Figure S3 Empirical p-values vs. uncontrolled p-values (Type 1 Diabetes). For each gene the p emp is plotted against the uncontrolled p-value (based on the gene-wide test statistic). Each point represents a gene and is coloured according to the number of SNPs assigned to a gene (n). Genes with few SNPs have p emp values similar to the uncontrolled p-value and therefore cluster along the diagonal. For genes with higher number of SNPs the distribution depends on the method to combine test statistics. (TIFF) Figure S4 Distribution of the number of SNPs assigned to genes. We assigned SNPs on the Affymetrix 500K genotyping array to protein-coding genes. SNPs were assigned to a gene if they are located within the transcribed region or within a 40 kilobase flanking window around the transcribed region. In addition SNPs in linkage disequilibrium (LD, r2.0.8) with these SNPs were included. (TIFF) Figure S5 Effect of Linkage Disequilibrium (LD). Gene ranks after assigning SNPs to genes based on genomic distance only are plotted against gene ranks after assigning SNPs to genes based on genomic distance and linkage disequilibrium (LD, r2.