Statistical analysis for genome-wide association study

Abstract In the past few years, genome-wide association study (GWAS) has made great successes in identifying genetic susceptibility loci underlying many complex diseases and traits. The findings provide important genetic insights into understanding pathogenesis of diseases. In this paper, we present an overview of widely used approaches and strategies for analysis of GWAS, offered a general consideration to deal with GWAS data. The issues regarding data quality control, population structure, association analysis, multiple comparison and visual presentation of GWAS results are discussed; other advanced topics including the issue of missing heritability, meta-analysis, set-based association analysis, copy number variation analysis and GWAS cohort analysis are also briefly introduced.


Introduction
Genome-wide association study (GWAS) refers to study in which hundreds of thousands of single nucleotide polymorphisms (SNPs) are genotyped across the genome and tested for association with the phenotype of interest. In the past few years, numerous genetic susceptibility loci have been identified to be associated with many complex diseases via GWAS, including a variety of cancers [1][2][3][4][5][6][7] , bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type I and II diabetes [8][9][10][11][12][13] , inflammatory bowel disease [14] , non-obstructive azoospermia [15] , obesity [16] , and etc. GWAS discoveries can be found at the National Human Genome Research Institute (NHGRI) catalogue [17] . There are 1,724 publications and 11,680 SNPs until 01 October, 2013 ( Fig. 1) [17] . These findings offer new genetic insights into understanding the pathogenesis of diseases and disorders, and are expected to promote preventive strategies, diagnostic tools and treatments. However, the massive amount of GWAS data poses many statistical and computational problems as well as data storage and management issues [18][19][20][21][22] . GWAS can be conducted in both population-based and family-based manners, but here we focus primarily on case-control studies, which are currently the most common design. This article aimed to give an overview of the widely used approaches and strategies for GWAS, including quality control, treatment of population structure, association analysis, multiple comparisons, visual presentation of the results, and other advanced topics.

Data quality control
Data quality control becomes extremely important for GWAS, it is now an imperative step to check and clean the raw GWAS data stringently prior to any analysis [20,23] . Here, we only discuss the statistical procedures to ensure the validity of data, i.e., the data filtering procedures that are applied once one already has genotype calls. The checks are performed throughout individuals and SNPs, those not meeting the criteria are deleted. All of the quality control procedures mentioned below can be conducted through PLINK [24] , a freely available tool that implements large scale GWAS data managements and analyses in a computationally efficient manner.

Missing rate
Individuals with high missing call rate are implications of poor DNA quality. The missing call rate of individual is the proportion of SNPs whose genotypes are not called for a given individual. We generally remove the subjects with missing call rate .1%-5%. SNPs with high missing genotype rate (e.g., .5%) imply some problems with the genotyping process, so these SNPs are eliminated. The missing call rate of SNP is the proportion of individuals whose genotypes are not called for a given SNP.
An alternative way to handle missing SNPs is to perform data imputation, i.e., replacing the missing markers by their expectations conditional on information of the observed markers based on reference datasets, such as the panels of HapMap and 1,000 genomes. Imputing genotype data not only provides complete data for multiple markers analysis but also allows testing of untyped SNPs and combination of different data across genotyping platforms to conduct metaanalysis [11,16] . Further information regarding genotype imputation can be found in the literature [25][26][27][28][29] .

Minor allele frequency
In practice, SNPs with minor allele frequency (MAF) ,1% are excluded from subsequent analysis as current SNP-chips genotyping rare variants (i.e., locus with MAF,1%) is difficult and error-prone. Thus, very low frequency alleles are likely to represent genotyping error and may result in spurious associations [68] . Furthermore, the statistical power is extremely low for such rare SNPs [8,20,30] . For example, provided that the odds ratio (OR) is 1.30, a study of 6,000 cases and 6,000 controls provides only approximately 0% and 3% powers to detect disease susceptibility loci with MAF of 0.01 and 0.02, respectively [30] .

Independence of individuals
Sometimes the apparently independent subjects in GWAS have hidden relationship which may lead to spurious associations [31] . Independence among samples is also the fundamental assumption of case-control study. Thus, the related individuals are excluded or the relations should be taken into account during association analysis. The probable relatives and duplicates are detected based on pairwise identify-by-state (IBS) from which a variable called PIHAT is calculated via PLINK [ 2 4 ] . Individuals with high PIHAT (e.g., PIHAT . 0.25) are removed.

Hardy-Weinberg equilibrium
Since some association analysis approaches are developed under the assumption of linkage equilibrium. It is beneficial to check whether the SNPs deviate from Hardy-Weinberg equilibrium (HWE) for quality control. For this aim, each SNP is examined by using the asymptomatic chi-squared test or the exact one [32] , and then the SNPs with P-values less than 10 -5 -10 -6 are safely removed from further analysis. However, it is noted that departure from HWE can occur due to selection, population admixture, cryptic relatedness, genotyping error and genuine genetic association. Thus, checking HWE only in control is usually recommended [8,20] . In fact, deviation from HWE in case is typically regarded as a signal of true association.

Sex check
Sex is an important covariate in some GWAS such as lung cancer study [6] , where sex check becomes necessary. We use X chromosome data to estimate sex and compare it to self-reported sex. Discrepancy of reported and estimated SEX is further examined. Sometimes, the original documents should be retrieved to resolve this issue. Those records with discrepancies on sex are suggested to be removed if the discrepancy cannot be resolved.

Population outliers
The check of population outliers can be finished by performing an IBS-based nearest neighbor analysis in PLINK [24] . For each individual, the distance to its nearest neighbor is calculated, and then from the distribution of distance, we calculate a sample mean and variance and transform this measure into a z score. If an individual has an extremely low z score (e.g., , 4 standard deviations), this individual has significantly different genetic background from the rest of the study sample and should be excluded as outlier.
Other quality controls include the checks of Mendel error rate and heterozygosity rate. If the GWAS is family-based, the Mendel error rate can provide evidence of non-Mendelian transmission. The increased heterozygosity rate implies poor DNA quality [21] . It should be mentioned that there are no universally accepted thresholds for the exclusion criteria in quality control, but the values presented above are widely utilized in practical GWAS literature [6,8] . It is also noteworthy that the values of exclusion criteria used in quality control are case-specific and dependent on other factors, such as effect sizes of SNP, sample sizes and genotyping platform.

Population structure
In association analysis, population structure can cause spurious findings if not accounted for, and it is one of the most often cited reasons for non-replication of previously confirmed variants [33][34] . When the allele frequency differences between the case and control is due to systematic ancestral differences, it is said to have population stratification. Population structure also refers to population admixture, family structure, and cryptic relatedness. The population structure can occur in apparently homogenous study. A famous instance presented by Campbell et al. [35] , where it was reported that an SNP was strongly associated with the height of European Americans, which are usually considered a homogenous population, but the relationship was later proven to be attributed to stratification. Various approaches have been proposed to detect and correct for the possible population structure [33][34]36] .

Genomic control
Genomic control is a widely employed method to evaluate whether or not the population structure exists [37][38] . In the presence of population structure, genomic control assumes that the chi-squared statistic X 2 is inflated by a constant inflation factor l, which is defined as the empirical median of L unrelated statistics divided by the expected median under the null distribution l~m edian(X 2 1 ,X 2 2 ,:::,X 2 L ,) 0:456 ; where L is generally selected less than or equal to the number of SNPs m, and 0.456 is the expected median of the chi-squared distribution with d.f. 5 1. In theory, l should be equal to one in a homogeneous population. So a value greater than one implies population structure. Note that the inflation of statistics may be not due to population structure alone. For example, only part of the inflation is explained by population structure in terms of a recent study [23] , and it was found that there were other confounders such as differential bias or informative missingness, collectively leading to the inflation.
Genomic control corrects for population structure by rescaling each test statistic using uniform inflation factor, i.e., using X 2 /l in place of X 2 . It is easy and fast to compute, and can deal with cryptic relatedness as well as population stratification. However, some SNPs exhibit more differences in their allele frequencies than others; thus, the uniform adjustment is inappropriate and leads to a loss of power [39] . The validity of genomic control relies on several assumptions [37] , it is not known whether these assumptions hold in practice. It is also not clear how to choose appropriate threshold of inflation factor l to assess the effectiveness of adjustment, and empirically a value of less than 1.05 is deemed as safety.

Structured association
Structured association is a model-based clustering method. It firstly uses a subset of unlinked null SNPs to infer the population structure and allocate individuals to subpopulations according to their likelihoods, and then performs testing for association conditional on these allocations [40][41][42] . The advantages of structured association are that it explicitly infers the genetic ancestry and that it is based on a rigorous Bayesian clustering algorithm. However, this method is computationally intensive when applied to large scale GWAS data, and is sensitive to the number of clusters [31] . The structured association is carried out by the software STRUCTURE.

Principal component analysis
Principal component analysis (PCA) is frequently applied to account for population stratification [31,39] . The basic idea of PCA is to explicitly capture the hidden ancestry genetic background by extracting the top several independent axes of variation. Specifically, it suggests that individuals with similar principal components (PCs) are likely from the same subpopulation. The PCs are calculated using the singular value decomposition on the genotype matrix G.
By extensive simulations, it has been demonstrated that the PCA method, called EIGENSTRAT, has the following merits [39] . The PCA performs well even under mismatching of case and controls; it can implicitly and automatically match cases and controls to extract the maximum possible amount of power from the data while avoiding false positives due to stratification. It is computationally feasible on GWAS data. Secondly, the continuous axes of variation can be used as covariates to correct for stratification in multi-marker association analysis, and it is not sensitive to the number of axes of variation used as long as there are a sufficient number of axes to capture true population structure effects [43] . Thirdly, it is robust to inclusion or exclusion of the causal SNPs.
EIGENSTRAT is executed by the online software EIGENSOFT [39] . Fig. 2 shows the scatter plot of two top PCs and the PCA correction for population structure using a simulated case-control data. The top PCs in EIGENSTRAT may be not able to capture the complicated covariance structure due to the family structure or cryptic relatedness in the sample, for which the novel mixed models that explicitly utilize the kinships among the subjects provide an effective control [31,[44][45][46] .

Multidimensional scaling
PLINK also provides an approach to population stratification by clustering based on pairwise IBS distance [24] . Specifically, PLINK first considers every individual as a separate cluster, then clusters individuals into homogeneous subsets, and finally performs a multidimensional scaling (MDS) analysis to visualize substructure. Subsequent association analyses are conducted in each cluster if some clear evidence of population stratification is observed.

Association analysis
Single SNP scan Association analysis by comparing allele or genotype frequency between the case and the control is central to GWAS. Although considerable efforts have been made in developing strategies for association analysis of GWAS, single SNP scan is still the most commonly utilized approach [18] . It proceeds by testing each SNP sequentially with the null hypothesis of no association. The additive genetic model, implying that each additional number of copies of the minor allele increases the risk by the same amount, is often employ ed for a ssociation a nalysis a lthoug h other genetic models are also considered [18,20] . Let G ij be the genotypes AA, Aa, and aa for the j th SNP (j 5 1, 2, ..., m) on the i th individual (i 5 1, 2, ..., n), and a is the minor allele and A is the major allele. The additive genetic model corresponds to AA 5 0, Aa 5 1, and aa 5 2 with 1 degree of freedom The additive genetic model is tested using the Cochran-Armitage trend test [8,37,47] , which is equivalent to the score test in the logistic regression [48] . Logistic regression is another popular method for single SNP scan. Let p ij be the disease risk of the j th SNP on the i th individual, the logistic regression is [49] log it(p ij )~log(p ij (1-p ij ))~b 0 zb 1 G ij ; Here, b 1 5 0 corresponds to the null hypothesis of lack of dependence. One of the three asymptotically equivalent tests, i.e., likelihood ratio test, score test, and Wald test, can be applicable. Under the null hypothesis, any of the three test statistics has an chisquared distribution with d.f. 5 1 [48] . Logistic regression offers a flexible tool that can accommodate interaction effects, covariates (e.g., sex, age, and smoking) and PCs adjusting for population structure [18] .
Similarly, linear regression, analysis of variance, and t-test are natural choices for association analysis if the phenotypes are quantitative, and survival models (e.g., Cox proportional hazards regression) are considered if the phenotypes are survival data.

Multi-marker analysis
Complex diseases are determined by a group of loci in conjunction with environmental factors. Despite its success in identifying numerous loci associated with disease traits, single SNP scan is underpowered for multifactorial diseases. Thereby, it is more preferable to implement multiple markers analysis to capture the joint effects of SNPs [18,20,50] . In GWAS analysis, however, one of the main challenges is to examine hundreds of thousands of SNPs simultaneously. In a typical setting, the number of SNPs far exceeds the sample size, which makes it impossible to analyze the data using traditional multiple regressions, so placing regularizations on the regression coefficients is necessary [51] .
Recently, a number of penalization-based statistical techniques have been developed to deal with the high dimensional GWAS data, among which the least absolute shrinkage and selection operator (Lasso) has turned out to be a promising method [44,[52][53][54][55][56] . Let where l is a non-negative tuning parameter determining the strength of the penalty. The second term in the formula above is called L 1 -penalty, due to which the Lasso can shrink some small values of b to be exactly zeros; hence, parameter estimation and variable selection are achieved simultaneously. The optimal l value is selected by crossvalidation (CV), generalized cross-validation (GCV), and Bayesian information criterion (BIC) [51][52] . The Lasso can deal with genome-wide SNPs at the same time, but performing a full Lasso regression is not practical because the computational burden is intensive. Therefore, it is more efficient to first reduce the total number of SNPs to a manageable level via a screening procedure, and look for causal loci among these passing the prespecified threshold. This screening-modeling strategy is theoretically and empirically reasonable for the large scale analysis [22,[57][58][59] .

Multiple comparisons
Single SNP scan for GWAS analysis is computationally practical. However, it suffers from serious multiple comparison problems due to implementing a large number of hypotheses at the same time. Type I error will be inflated if no measure is taken. Assume that each SNP is tested at the traditional significance level a 5 0.05, then the total type I error is 1-(1-a) m , which will approach 1 quickly as the number of marker m increases. For example, if m 5 100,000, it is expected that about 5,000 false positive associations are observed by chance even none of SNPs is diseaserelated. Thus, multiple comparison is an important consideration in GWAS analysis, and must be handled appropriately [18] .

Bonferroni correction
Bonferroni correction offers a convenient way to control the family-wise error rate (FWER) by dividing a by m provided that the markers are uncorrelated. The FWER is the probability of rejecting at least one null hypothesis when all the nulls are correct [60][61] . The resulting significance level is a/m, an SNP is then considered to be statistically significant if its P-value is less than the adjusted significance level. However, it is well known that Bonferroni correction is conservative. Perhaps none of SNPs can achieve such small threshold; in this case, a few SNPs with relatively small P-values can be chosen for further investigation.
Bayes factor has recently been applied to the measurement of significance as an alternative to Pvalue [62][63][64] , which avoids adjustment for multiple comparisons.

False discovery rate
False discovery rate (FDR) [65] is another commonly used error measure, which provides a less conservative way by controlling the expected proportion of false rejections of nulls among all rejections. FDR is controlled through the linear step-up procedure. For a given FDR level (e.g., a 5 0.05) find the largest k value that satisfies P (k) # ak/m, where P (k) is the order P-value from the smallest to the largest, and the SNPs with P-values # P (k) are declared as significance. The step-up procedure is valid when the tests are independent or positively dependent, and it often leads to a lower threshold and thus improves the power [65][66] .
The Bonferroni correction, false discovery rate procedures, and other adjustment methods can be performed using PLINK [24] .

Independent validation
To minimize the risk of false positive associations in GWAS, statistically significant SNPs should be further validated by independent replications [20,67,68] . The main objective of replications is to evaluate systemically whether or not the discovered SNPs in initial GWAS are spurious signals. The replication samples should be collected the same way as the original study. Association analysis has to be based on the same genetic model as being employed in the original study to ensure consistency and robustness of associations.
The effect sizes of markers should show the same signs in both the replicated and original study. Mulitple criteria for establishing positive replication are suggested by Chanock et al. [67] .
However, the failures of replication do not necessarily mean the original findings of association are false positive because there are many reasons for non-replications. For example, hidden population structures are not taken into account in both the replicated and original samples [34] , the found markers have very small effects and cannot be rediscovered easily, and the replications have small sample sizes leading to low power to confirm the initial outcomes [67][68] . In conclusion, it has now been agreed that any significant associations in GWAS must be validated strictly by follow-up studies or biological interpretations before being reported.

Visual presentation
Visual presentation is helpful to understand the results of GWAS, a large number of visual tools have been developed. Here, we briefly introduce a few of them.

Quantile-quantile plot
It has been realized that the statistics will be biased due to population structure. It is a routine pattern to apply the quantile-quantile (Q-Q) plot to evaluate the existence of population structure before and after correction. The Q-Q plot is constructed as a scatter plot of the observed ranked chi-squared statistics from the smallest to the largest against the theoretical values under the null hypothesis of no association. If the statistics come from null distribution, the plot should go along the diagonal linearly. The Q-Q plot can be conveniently implemented by R package using the outputs of PLINK [69] .
The Q-Q plots for population structure using a simulated data with genomic control and PCA adjustments are shown in Fig. 2. The simulation settings are similar to those in the report of Price, et al. [39] . Fig. 2 shows that the population structure leads to inflation of test statistics, and both the genomic control and PCA provide effective corrections. The 95% confidence interval (CI) of inflation factor is calculated using the method of bootstrap [70] , repeating 1,000 runs.

Manhattan plot
The P-values of GWAS are generally shown by Manhattan plot. This plot is produced by scattering the P-values in -log 10 scale in the vertical axis and the physical position of SNP along chromosomes in the horizontal axis. Different chromosomes are generally distinguished with colors. Using -log 10 scale is to highlight the small P-values, which suggest potential disease-related SNPs. The Manhattan plot is executed through the R package or the program Haplo View [69,71] . For example, see Fig. 3.

Haploview plot
Once significant susceptibility loci are found, the computation of linkage disequilibrium among their neighboring SNPs upstream and downstream and understanding the population haplotype structure are of interest. Haploview provides such analyses in a visually appealing and interactive interface (Fig. 4) [71] .

LocusZoom plot
LocusZoom, a web-based plotting tool, provides visually regional information such as the strength and extent of the association signal relative to genomic position, local linkage disequilibrium and recombination patterns and the positions of genes in the region (Fig. 5) [72] .

Other advanced topics
Although the above methods for analyzing the GWAS data have been widely employed in practice, they have some limitations and shortcomings, for example, low power and lack of interpretability and repeatability for single marker scan. Substantial efforts have been made to overcome these difficulties. Now, we discuss some advanced topics for further analyzing the GWAS data.

Missing heritability
In spite of GWAS success for most common diseases, the discovered disease susceptibility loci explain only a remarkably small part of the overall phenotypic variation [22,[73][74][75] . Several reasons have been proposed for the missing heritability. (I) The genetic effect sizes on the phenotypes are fairly weak. It is reported that the relative risks (or odds ratios, OR) of most of the related loci are typically on a scale of 1.10-1.20 even for GWAS with very huge sample size. A large number of markers with much weaker effect sizes cannot be detected using current statistical approaches. (II) Part of the heritability is attributable to the interaction effects of gene-gene and gene-environment [76][77][78][79][80] , whereas most of the published GWAS literature only considers the main effects. (III) Gene-based and pathway-based analyses can improve the power [81][82][83][84][85][86][87] ; these considerations result from the fact that multiple SNPs nearby are often involved in the same biological pathway and act collectively. (IV) The effectiveness of the current GWAS largely depends on the hypothesis of common disease common variant (CDCV) [18] , i.e., the common complex diseases are mainly attributable to a number of common variants (i.e., locus with MAF. 1%). However, recently, it has been recognized that the rare variants (i.e., locus with MAF ,1%) are expected to contribute to susceptibility to the common complex diseases substantially [75,[88][89][90][91][92] , i.e., the hypothesis of common disease rare variant (CDRV) [93] .
All of the issues above are not trivial, and actually they can be more complicated and challenging. Discovering markers with weaker effect sizes requires much larger sample size, which will lead to more expenses. Although the importance of epistasis has been recognized, the power of detecting epistatic effect is rather low due to relatively small sample size, and the calculation of interaction effect is much time-consuming. For example, to look for pairwise interactions from 500,000 SNPs, a total of 1.25 6 10 11 tests need to be implemented. The issue of multiple testing also arises in epistasis model, higher order interactions are more complex, and how to model the biological epistasis statistically is not completely clear. For rare variants, the commonly used methods break down due to the extremely low MAF; therefore, developing powerful statistical approaches for rare variants is an urgent demand [94][95][96][97][98][99][100][101][102][103][104][105] .

Meta-analysis
As mentioned before, current single marker scan of GWAS data is underpowered and limited by the weak to modest effect sizes of related common variants. Fortunately, several independent teams of investigators around the world perform similar GWAS on the same disease [11,106] . This provides an opportunity to combine these datasets via the method of meta-analysis. An attractive aspect of meta-analysis is that the statistical power can be improved by increasing the sample size. However, various studies typically contain different sets of SNPs; therefore, the first step of meta-analysis for GWAS is imputing the untyped SNPs. Then, the classical meta-analysis can be employed, such as the fixed effects model and mixed effects model. The former assumes that the genetic effects are the same across the individual studies and the observed differences are duo to sample error; while the latter assumes that the genetic effects vary between studies and the differences include both sample error and substantial distinctions across various GWAS. The mixed effects model is more conservative than the fixed effects model; thus, determining which models are employed is important. If the evidence of between-study heterogeneity is present, the mixed effects model is used, for example, the DerSimonian and Laird model; otherwise, the fixed effects model is used. The heterogeneity may be due to variable linkage disequilibrium across the studies, winner's curse, gene-gene interaction and other possibilities [107] . Many metrics have been proposed to test the heterogeneity, among those the statistic I 2 is widely applied, which is a measure of the proportion of total variation between studies attributable to heterogeneity beyond the sample error [108] . The value of I 2 ranges from 0% to 100%, and a larger I 2 indicates more heterogeneity. A general guideline of I 2 for detecting heterogeneity is that 0% to 25% represents ignorable heterogeneity, 25% to 50% represents low heterogeneity, 50% to 75% represents moderate heterogeneity, and 75% to 100% represents high heterogeneity [109] .

Set-based association analysis
To overcome the limitation of single SNP analysis, set-based association analyses have been developed, where multiple related SNPs (e.g., within the same gene, pathway or functional group) are grouped into an SNP set and collectively examined [81][82][83][84][85][86][87] . SNP set analysis enjoys many advantages compared to single SNP analysis, e.g., identifying multi-marker effects, decreasing the number of multiple comparisons, allowing for epistatic effects, and making inference on biologically meaningful units [110] . The reasonable logics behind the set-based testing procedures exist in two aspects: (I) the statistical power increases significantly due to enriching association signals by grouping; (II) it has been observed that multiple SNPs are often jointly related to diseases. The often used set-based tests include principal components analysis, kernel machine testing, the global test and others [110] .

Copy number variant analysis
Beyond SNP, copy number variant (CNV) is another important genetic variation existing in the human genome. CNV is typically defined as a submicroscopic variation of DNA segments, ranging from kilobases to megabases in size, including deletions, insertions, duplications, and complex multi-site variants. CNV is related to common complex diseases and is widely believed as one of the causes of missing heritability [111,112] . Several methods are developed for detecting CNV, e.g., BAC Array Comparative Genomic Hybridization (CGH), Representational Oligonucleotide Microarray Analysis (ROMA) and Agilent CGH. It is also possible to infer CNV using the GWAS data; however, the optimal strategy for evaluating such data is not clear. Additionally, the approaches for inferring rare and de novo CNV also need to be developed [113] . Furthermore, the next-generation sequencing technologies create new challenges for analyzing CNV since there have been no accepted standard protocols and quality control measures so far. The challenges can come from mappability, GC-content bias, quality control measures of reads and difficulty in identifying duplications [114] .

GWAS cohort analysis
Current GWAS is mainly cross-sectional. This may be limited in investigating the causality between genetic variants and diseases. GWAS cohort can not only provide further insight into causality but also a solution of missing heritability. In cohort study, individuals are collected longitudinally; accordingly, temporal changes in biological properties will offer insight into disease diagnosis, progression, and recovery. However, the temporal effects between different times create great statistical challenge, especially in such high-dimensional data, and should be taken into account [115] . Association analysis for GWAS cohort data is a relatively new field although some explorations have been made. Recently proposed methods for cohort GWAS data include, dependent on the type of dataset, for example, the longitudinal support vector machine and the penalized mixed effects model [115] .

Discussion
The success of GWAS relies on the progresses of technologies, effective collaborations of researchers, well designs and implementations as well as subsequent analyses and interpretations. This article gives a comprehensive overview of genome-wide association analysis, the basic strategies commonly used in GWAS are introduced, and the shortcomings of these approaches are also emphasized.
Stringent quality control ensures the results of GWAS are believable, some of which will become less important in the future. For example, owing to nextgeneration sequencing, there are fewer genotyping e r r o r s a n d n o n e e d t o r e m o v i n g S N P s w i t h MAF , 1%; instead, the rare variants are of important interest.
When association is analyzed, it requires paying more attention to the issues of multiple testing and population structure because false positive or false negative associations arise. For population structure, it is important either to demonstrate that it can be negligible [6,8] , or to adequately adjust for by methods such as genomic control and PC analysis [21] .
There are many user-and-interface friendly tools to implement the visualizations. It has become routine in GWAS analysis that one presents the Manhattan plot to show the potential disease-related loci and the Q-Q plot to show the inflations of test statistics. Other plots such as LocusZoom plot and Haploview plot offer valuable complements.
The genome-wide era creates exciting and challenging opportunities to the scientific world. Statistical approaches play very important roles in GWAS. We have noted that some other novel statistical tools are ignored, such as the Bayesian approaches and the machine learning [21][22]51,58,62,79,[116][117] . We hope that the present overview is helpful for GWAS researchers to obtain a clear picture of analyzing such large scale data, and that the future improvement of statistical methods will enable us to overcome these challenges and to make a better understanding of the genetic basis of the complex diseases.