gap : Genetic Analysis Package

A preliminary attempt at collecting tools and utilities for genetic data as an R package called gap is described. Genomewide association is then described as a speciﬁc example, linking the work of Risch and Merikangas (1996), Long et al. (1997) for family-based and population-based studies, and the counterpart for case-cohort design established by Cai and Zeng (2004). Analysis of staged design as outlined by Skol et al. (2006) and associate methods are discussed. The package is ﬂexible, customizable, and should prove useful to researchers especially in its application to genomewide association studies.


Introduction
Approaches to understanding the genetic basis of human diseases have been widely discussed, e.g., Morton et al. (1983), Khoury et al. (1993), Thomas (2004).Methods include the assessment of familial aggregation for heritability, identification of major gene effect, study of cosegregation of genetic marker with putative disease-predisposing loci in the so-called linkage studies and association studies in search of frequency differences between cases and controls and/or correlation between genotype and phenotype as a quantitative trait.Recently, owing to the availability of large number of genetic variants and particularly single nucleotide polymorphisms (SNPs), attention has focused on association designs including both families and unrelated individuals from general populations.Three initiatives of interest are: 1.The hapmap project (http://www.hapmap.org/), a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals.
2. The Wellcome Trust Case Control Consortium (WTCCC, http://www.wtccc.org.uk/), a collaboration of human geneticists who analyse thousands of DNA samples handling of large data, multiple testing, among others.The implementation of analytical approaches to these problems has so far scattered and not integrated with established software systems.Preliminary reviews have shown that R is a strong alternative which offers many desirable facilities and can be used in conjunction with other software systems (Zhao and Tan 2006).It is possible that in the near future, many tools for genetic data analysis will be available on these software systems.
We have recently carried out a study of obesity using Affymetrix (http://www.affymetrix.com/) 500K and Illumina (http://www.illumina.com/)317K systems following an earlier pilot using Perlegen (http://www.perlegen.com/)platform with about 250K SNPs.The samples were based on European Prospective Investigation of Cancer (EPIC) Norfolk study (http://www.srl.cam.ac.uk/epic/).Our particular concern is cost-efficiency, not only because the experiment is expensive but we wish to take full advantage of it.We adopted a two-stage case-cohort design which is advantageous in cost-efficiency over the standard casecontrol design.While seeking for appropriate methods for joint analyses, we have recast the problem within the missing data framework.We believe our approach will be invaluable to other colleagues.
Here, I describe a genetic analysis package, gap, created and maintained by the author and is available from the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/), taking advantages of the portable environment of R for data management, analysis, graphics and object-oriented programming.The functions implemented for genetic association analysis include Hardy-Weinberg equilibrium tests, measures of linkage disequilibrium between SNPs or multiallelic markers and haplotype analysis.I will give a brief overview of the package, and a specific example concerning design and analysis of genetic association, which was used in the design of the case-cohort study aforementioned.I will also use a dataset from our case-control association study of Type-2 diabetes with chromosome 20, to illustrate joint analysis within R. It is clear that many statistical issues can be handled with ease and used in conjunction with other packages available from CRAN.The calculation as in some original work is also given for reference, with results not seen in the genetic analysis literature.
At the end I will provide a brief summary.

Overview of implemented methods
This package was designed in 2003 to integrate some programs in C/Fortran and SAS (SAS Institute Inc. 2004) I have written or used over the years, by taking advantage of the ability of R to use foreign language routines in C/Fortran or its facilities similar to SAS.The recent implementations are more native to R. The collection therefore somewhat reflects the evolution of (or more exactly my appreciation of) statistical genetics (or genetic epidemiology) over years.The description of the main functions is given in Table 1, while a list of data sets is given in Table 2.They can be conveniently obtained with R command library(help = "gap") once gap is installed.Some of the functions are highlighted as follows: • BFDP: Bayesian false-discovery probability according to Wakefield (2007).
• hap, hap.em, mia: a multiallelic version of progressive EM algorithm for haplotype inference as reported in Zhao (2004).
• tbyt, kbyl: linkage disequilibrium statistics between two biallelic or multiallelic markers, including standard error for D as reported in Zhao (2004).
• makeped: a utility to prepare for post-MAKEPED format file as required by LINKAGE (Lathrop et al. 1984) or other computer programs, a version distributed with LINKAGE by Wentian Li. • pedtodot: a utility to produce Graphviz (AT&T Research Lab 2008) command file based on a GAWK script by David Duffy.
• tscc: power of two-stage case-control design under several models, similar to Skol et al. (2006).
• muvar, whscore: utility functions to obtain mean and variance under two-locus model, and score (Whittemore and Halpern 1994).
A brief summary of a given function, including appropriate reference(s), can be obtained with help(function-name, package = "gap") within R.There are a number of example datasets.For instance, the cystic fibrosis data (Kerem et al. 1989) has been used in many papers on fine-mapping.

Example: Study design and analysis of genetic association
Since the range of problems to be tackled by gap is quite broad, I now give specific examples of how some functions have been developed to facilitate our design and analysis of genetic association.The three popular designs outlined earlier are shown in Figure 1.
Because of the importance of work by Risch and Merikangas (1996) and Long et al. (1997), functions pbsize and fbsize have been written to implement their methods and correct a programming error in Risch and Merikangas (1996).The table to be obtained could be used as a quick reference and the code be tailored to particular designs.Power for case-cohort design established by Cai and Zeng (2004) is given in function ccsize.The method by Skol et al. (2006) for two-stage design has been implemented in function tscc.

Some preliminaries
A widely used general model consists of a disease susceptible locus with two alleles (A and a) and population frequencies p, q = 1 − p, and penetrances as f 0 , f 1 and f 2 (i.e., the conditional probability of getting the disease, the subscripts 0, 1 and 2 are copies of the disease-predisposing allele A).Hardy-Weinberg equilibrium states that the distribution of the three genotypes, aa, Aa, and AA are according to (p + q) 2 , so that the population prevalence of the disease K = p 2 f 2 + 2pqf 1 + q 2 f 0 .Other statistics, which are often used, are additive respectively.It is also common to specify further the relationship between the three genotypes.
As for multiplicative models to be considered here, the genotypic relative risk (GRR) is γ and γ 2 for genotypes Aa and AA compared to aa as baseline, the population disease prevalence, the additive and dominance variances are therefore K = p 2 γ 2 + 2pqγ + q 2 = (pγ + q) 2 , V A = 2pq(γ − 1) 2 (pγ + q) 2 and V D = p 2 q 2 (γ − 1) 4 , respectively.These can be used to define offspring/sibling relative risks, which are the probability ratios of a offspring/sibling of an affected parent/child being affected relative to the general population and they are shown to be Linkage and association designs using family data were the state-of-the-art designs for localising disease-predisposing loci.A key concept relates to genes shared by relatives from common ancestor and is called identity by descent (IBD), which is associated with a model-free approach of linkage whereby allele-sharing between affected members in a pedigree is compared to test for linkage between a marker locus and a disease locus.The simplest allele-sharing method uses affected sib-pairs.Assuming the recombination fraction between the disease locus and a marker to be θ, and defining Suarez et al. (1978) derived the IBD probabilities for affected sib-pair, respectively.Under no linkage, the probabilities of affected sib pair sharing 0, 1, and 2 alleles IBD are 1/4, 1/2, and 1/4.This result can be used to form a nonparametric test for linkage.The corresponding expressions under multiplicative model can also be obtained, where probabilities of siblings sharing none or one allele by descent is z 0 = 1/(4λ S ) and When a candidate or dense collection of markers is available, one can use nuclear families in which affected offsprings and their parents are genotyped, the transmitted versus nontransmitted alleles from parents to offsprings are compared in the so-called transmission disequilibrium test (TDT), which can be based on nuclear families with affected singletons only.
In the following standard results related to normal distribution will be used.For a set of N independent identically distributed random variables B i with mean and variance being 0 and 1 under the null hypothesis, µ and σ 2 under the alternative hypothesis, the statistic N i=1 B i /N has mean 0 and variance 1 under the null but mean √ N µ and variance σ 2 under the alternative.The sample size (N ) for a given significance level α and power 1 − β can be estimated by (Z α − σZ 1−β ) 2 /µ 2 .In the actual calculation below, the type I error rate (α) and type II error rate (β) are 5 × 10 −8 and 0.2, respectively.

Power/sample size for family and case-control/case-cohort designs
We recall some results on affected sib-pair linkage analysis, and define the alleles shared and nonshared from the ith parent as a random variable B i , i = 1, . . ., N , scoring 1 and −1.The mean (µ) and variance (σ 2 ) of B i are 0 and 1 under the null hypothesis as the shared and nonshared each has probability 0.5, and 2Y − 1 and 4Y (1 − Y ) under the alternative.Assuming sharing of alleles from both parent to be independent, the required sample size (N ) for affected sib-pair under θ = 0 and no linkage disequilibrium is (Z α − σZ 1−β ) 2 /2µ 2 , where Y and w are defined as before.
As with TDT, we assume that the disease locus and a nearby locus are in complete disequilibrium, the number of transmissions of allele A are scored from heterozygous parents, where the probability (h) of a parent of an affected child being heterozygous is given by pq(γ + 1)/(pγ + q) and pq(γ + 1) 2 /[2(γp + q) 2 + pq(γ − 1) 2 ] for singletons and sib-pairs, respectively.For singletons, a random variable B i is defined taking values 1/ √ h if parent is heterozygous and transmits A, 0 if parent is homozygous, −1/ √ h if parent is heterozygous and transmits a.The mean and variance of B i are 0 and 1 under the null hypothesis, √ h(γ − 1)/(γ + 1) and 1 − h[(γ − 1)/(γ + 1)] 2 under the alternative.When sib-pairs instead of singletons are used in TDT analysis, the same formula for sample size calculation can be applied and the required number of families is half the expected number since there are two independent affected sibs.
The results of example("fbsize", package = "gap") are shown Table 3. Column N L contains the correct calculation corresponding to the original paper.The Alzheimer's disease model is based on Scott et al. (1997).
It turns out that with γ ≤ 2, the expected marker-sharing only marginally exceeds 50% for any allele frequency (p).The use of linkage would need large sample size, but direct tests of association with a disease locus itself can still be quite strong although it may involve large amount of statistical testing of associated alleles.Clearly, it is most favorable for diseases that are relatively common, which has important implications for complex traits.Now we consider the case-control design with a statistic directly testing association between marker and disease.Following Long et al. (1997) and supposing we have a randomly ascertained population sample, the frequencies of the three disease genotypes AA, Aa and aa in cases are πγ 2 p 2 , 2πγpq, and πq 2 , respectively, where π is the "baseline" probability that an individual with aa genotype being affected.Similarly, the three frequencies in controls are (1 − πγ 2 )p 2 , 2(1 − πγ)pq and (1 − π)q 2 .A unit χ 2 statistic can be constructed using Table 4 as X 2 = (O − E) 2 /E, where Es indicate expected unit frequencies (πp(γp + q) 2 , πq(γp + q) 2 , p − πp(γp + q) 2 and q − πq(γp + q) 2 ), and the discrepancies between observed and expected frequencies all have factor πpq(γp + q)(γ − 1) but with negative sign before the second and the third items.The statistic X 2 has χ 2 1 distribution under the null hypothesis (γ = 1), and noncentral χ 2 1,δ distribution with noncentrality parameter where Z is a preassigned standard normal deviate, Φ(.) is the cumulative normal distribution function.
The power for random case-control ascertainment with three different disease prevalences as from example("pbsize", package = "gap") is shown in Table 5.
So the most favorable scenario is when both the genotypic relative risk (γ) and allele frequency (p) are high for a given locus, and the disease is common.
Lastly we turn to case-cohort design.Cai and Zeng (2004) showed that power of a case-cohort design can be obtained by Φ(Z α + m 0.5 θ p 1 p 2 p D /q + (1 − q)p D ), where α is the significance level, θ is the log-hazard ratio for two groups, p j , j = 1, 2, are the proportion of the two groups in the population, m is the total number of subjects in the subcohort, p D is the proportion of the failures in the full cohort, and q is the sampling fraction of the subcohort.Alternatively, the sample size required for the subcohort is and n is the size of cohort.The method has been implemented in function ccsize.Skol et al. (2006) examined tests of allele frequency differences between cases and controls in a two-stage design and is described here.The usual test of proportions can be written as

Joint analysis in staged design and related issues
, where p 1 and p 2 are the allele frequencies, n 1 and n 2 are the sample sizes, π samples is the proportion of samples to be genotyped at stage 1.The test statistics for stage 1, for stage 2 as replication and for stages 1 and 2 in a joint analysis are then C 2 , and C j be the thresholds for these statistics, Skol et al. (2006) derived the false positive rates according to P (|z Table 5: Estimated sample sizes required for association detection using population data. C j ||z 1 | > C 1 ) for replication-based and joint analyses, respectively.As our primary interest is the power for the two types of analyses, we implement it in function tscc whose format is tscc(model, GRR, p1, n1, n2, M, alpha.genome,pi.samples, pi.markers, K) which requires specification of disease model (multiplicative, additive, dominant, recessive), genotypic relative risk (GRR), the estimated risk allele frequency in cases (p 1 ), total number of cases (n 1 ) total number of controls (n 2 ), total number of markers (M ), the false positive rate at genome level (α genome ), the proportion of markers to be selected (π markers , also used as the false positive rate at stage 1) and the population prevalence (K).Note the disease risks involving the three genotypes for additive, dominant and recessive models are calculated as (1, GRR, 2 • GRR − 1), (1, GRR, GRR), (1, 1, GRR), respectively.Power for a number of scenarios for two-stage designs can be obtained with slight modification of example code in the documentation of tscc.Interestingly, the R implementation is much shorter than the C program by Skol et al. (2006).Lin (2006) recently proposed a method of analysis for two-stage designs.Like other contributions, the focus was on the markers genotyped at both stages.Given that in most studies such as the case-cohort design outlined above, there is a rich collection of covariate information and perhaps an equally important aspect is to use all markers and covariates at both stages in a unified analysis.This amounts to a standard analysis with missing independent variables with which statistical methods have been well developed, e.g., Ibrahim (1990), Schafer (1997a), van Buuren et al. (1999).The packages for multiple imputation available from CRAN, such as cat, mix, norm, pan (Schafer 1997b,c,d,e), mice (van Buuren and Oudshoorn 2007), mitools (Lumley 2006) can be used.
A useful point relates to prospective versus retrospective methods.The retrospective counterpart requires more attention but far from clear (Elston and Anne Spence 2006), although the equivalence of retrospective versus prospective methods is known (Prentice and Pyke 1979;Seaman and Richardson 2004).It has been shown (Kraft and Thomas 2000;Tan et al. 2005) that prospective likelihood can give biased estimate due to over representation of cases or the extremes of the traits compared with the general population, so that a retrospective model of allele/genotype/haplotype frequencies conditional on the disease phenotype via logistic model is preferred.The model is applicable to a wide range of traits and provides unbiased estimates.This is in contrast to prospective methods (Schaid et al. 2002;Seaman and Muller-Myhsok 2005).
These two aspects are well illustrated with a concrete example.We have recently conducted a study of Type-2 diabetes involving 5,013 Ashkenazi and four UK populations using twostaged design and 4,570 SNPs across a 10Mb region of chromosome 20q.A subset of 2,502 individuals were genotyped on these SNPs at stage one, from which SNPs with significance level 0.1 together and those within regions of interest (e.g., HNF4A) were further genotyped on remaining sample at stage two.In the analysis, a meta-analysis was applied to take into account the heterogeneity between populations.The two-stage method by Lin (2006) would be rather complicated.A useful prospective formulation for a particular marker not genotyped at stage two is to treat it as missing and considered in a weighted regression on trait, where the weight is associated with the three genotypes, similar to Schaid et al. (2002).For the retrospective method, one can augment the missing genotype similarly, in much the same way as Burkett et al. (2006).However, recall that this is exactly the kind of problem multiple imputation will deal with, such that we simply apply the appropriate packages aforementioned which accounts for the uncertainty due to missing genotypes.This has not been used in staged design although it was recognized recently in haplotype analysis (Sorensen et al. 2006;Cordell 2006).Now with our data contained in chr20, and stage one marker rs1419383 is to be analyzed, we can use mice as follows, R> data("chr20") R> imp <-mice(chr20) R> fit <-lm.mids(cc~rs1419383 + ethnicity, data = imp) R> pool(fit) The function mice by default performs five imputations and therefore five analyses whose results are combined with pool command.We can also load Bioconductor (http://www.bioconductor.org/)package RSNPper (Carey 2006) and report metadata (e.g., genome annotation) and population information based on hapmap, as follows.
R> library("RSNPper") R> mysnp <-SNPinfo("rs1419383") R> popDetails(mysnp) R> geneInfo("HNF4A") Markers involved in a multistage design can largely be seen as monotonic missing by design and can be dealt with similarly.For large data from a typical genomewide association involving much more SNPs, we can use database connection mechanisms such as open database connectivity (ODBC).

The EPIC-Norfolk study of obesity
Our EPIC-Norfolk genomewide association study of obesity serves as a good example which many principles just outlined apply.The study was initially designed as a case-control study with the following definition of cases and controls: cases are those with body mass index (BMI) > 30kg/m 2 and controls with 20kg/m 2 ≤ BMI < 25kg/m 2 .This leads to all obese individuals of the EPIC-Norfolk cohort (3425), and 3400 controls.Half of the samples are to be genotyped at stage one, to be followed by the remaining half at stage two.When adapting this into a case-cohort design, we wished to know the power/sample size required and this turned to be straightforward to obtain.Part of calculation in Cai and Zeng (2004) and the required sample size or power in our case can be obtained with example("ccsize", package = "gap").It turned out a comparable case-cohort design has approximately 2500 controls.Interestingly, this also corresponds to about 10% of the size of our cohort (n), which would expect to give stable estimate for a range of covariates measured in the cohort.

Summary
This short report shows that the design and implementation of the gap package allows for a wide range of functions available in a unified fashion.The package is still under development and can be seen as a prototype for a future more fully established environment.As a reviewer pointed out that "A strength of the 'gap' package is that it aims to be the seed point for a general compilation of such methods, which means it implements many different kinds of methods, not just the ones developed by the author", the author would like to take this opportunity to invite comments, suggestions and contributions to consolidate the package.Because of the broad scope of applications, we focus on genomewide association as a particular application.Part of Table 3, Table 5, and the use of multiple imputation in staged design, have not been seen in the literature.The latter requires more elaborate development with respect to probability weighting.Function tscc is a much more transparent implementation than the original authors' C/C++ software.For power calculation, it is desirable to implement other commonly used models such as additive, dominant and recessive models.Recently, Skol et al. (2007) further examined the design issues associated with two-stage design.I am aware of related efforts which may be useful, such as pbatR (Hoffmann and Lange 2006), SNPassoc (Gonzalez et al. 2007), GenABEL (Aulchenko et al. 2007) and snpMatrix (Clayton and Leung 2007).Among others, the RSNPper (Carey 2006) is useful for accessing SNP metadata while biomaRt (Durinck 2006) enables a large collection of biological data to be available in R. The focus here is largely single point analysis, but multilocus methods such as haplotype analysis have been implemented in gap.For instance, the function hap.score is an adaptation of haplo.scorefunction based on Schaid et al. (2002) which can easily take haplotype frequency estimates and haplotype assignments from other source and use them within the generalized linear models framework.
Although the availability of SNP data is overwhelming, the gap package was conceived such that it will handle not only SNPs but multiallelic loci and X-chromosome data.Examples include genecounting and hap, with the former being flexible enough to include features such as X-linked data and the latter being able to handle large number of SNPs.But they are unable to recode allele labels automatically, so functions gc.em and hap.em are in haplo.emformat and used by a modified function hap.score in association testing.Owing to limitation in timing, we have used SAS for many tasks in analyzing our obesity data (Zhao et al. 2007) which partly contributes to the delay in consolidating SNP specific analyses in the package.
Nevertheless, it will be of value to adapt some of the utilities developed in SAS.
After the package was posted to CRAN, other packages have appeared with somewhat overlapping functions, notably powerpkg (Weeks 2005); a recent version of genetics package (Warnes 2003) also contains a function power.casectrl.It would be useful to make some comparisons.Another point relates to the sequence of development of functions for a package.In retrospect, it would have been easier if all codes were implemented in native R language and avoiding foreign language calls, but the latter may be advantageous in speed given some tuning is done.Notably, the following functions need further work: bt, kin.morgan, twinan90 (Williams et al. 1992), gcontrol (Devlin and Roeder 1999), mtdt (Sham 1997), s2k (Hirotsu et al. 2001).The overall target would involve better memory management, maintaining numerical precision or extensions, and use more native .Call and S4 classes of R. It is likely that existing functions and data would continue to evolve.
The major driving force to integrate many tools for genetic data analysis is in line with the observation that statistical genetic methods or genetic epidemiology is increasingly becoming part of the general epidemiology with focus on both genetic and nongenetic factors for diseases, where statistical and computational methods are more established.On this point, I wish to emphasize the benefit of shifting to or working on R from my own experience, which has been a very rewarding one, e.g., Zhao (2005) and Zhao (2006) in relation to package kinship (Therneau 2003).I noted a similar but more formal effort has been launched (http: //rgenetics.org/)but certainly an informal approach also has its place, as for example in the case of haplo.stats(Sinnwell and Schaid 2007).The best data structures have yet to be established and collaborative work would be beneficial.I hope eventually this will be part of a bigger effort to fulfill most of the requirements foreseen by many e.g., Guo and Lange (2000).

Figure 1 :
Figure 1: Three common genetic association designs involving unrelated individuals (left), nuclear families with affected singletons (middle) and affected sib-pairs (right).Males and females are denoted by squares and circles with affected individuals filled with black and unaffect individuals being empty.

Table 3 :
Comparison of linkage and association in nuclear families required for identification of disease gene: γ=genotypic risk ratio; p=frequency of disease allele A; Y =probability of allele sharing; N L =number of ASP families required for linkage; P A =probability of transmitting disease allele A; H 1 , H 2 =proportions of heterozygous parents; N tdt =number of family trios; N asp/tdt =number of ASP.families

Table 4 :
Expected frequencies for allele by case/control genotypes.