with SNP

Complex medical disorders, such as heart disease and diabetes, are thought to involve a number of genes which act in conjunction with lifestyle and environmental factors to increase disease susceptibility. Associations between complex traits and single nucleotide polymorphisms (SNPs) in candidate genomic regions can provide a useful tool for identifying genetic risk factors. However, analysis of trait associations with single SNPs ignores the potential for extra information from haplotypes, combinations of variants at multiple SNPs along a chromosome inherited from a parent. When haplotype-trait associations are of interest and haplotypes of individuals can be determined, generalized linear models (GLMs) may be used to investigate haplotype associations while adjusting for the effects of non-genetic cofactors or attributes. Unfortunately, haplotypes cannot always be determined cost-effectively when data is collected on unrelated subjects. Uncertain haplotypes may be inferred on the basis of data from single SNPs. However, subsequent analyses of risk factors must account for the resulting uncertainty in haplotype assignment in order to avoid potential errors in interpretation. To account for such uncertainty, we have developed hapassoc, software for R implementing a likelihood approach to inference of haplotype and non-genetic effects in GLMs of trait associations. We provide a description of the underlying statistical method and illustrate the use of hapassoc with examples that highlight the flexibility to specify dominant and recessive effects of genetic risk factors, a feature not shared by other software that restricts users to additive effects only. Additionally, hapassoc can accommodate missing SNP genotypes for limited numbers of subjects.


Introduction and Background
The identification of genetic factors influencing susceptibility to complex diseases such as cancer and diabetes is important for improving our understanding of disease pathways.Genetic factors are measured at multiple sites of known genomic location (genetic markers) to determine if variants at these sites are associated with the disease trait.A common type of genetic marker in association studies is the single nucleotide polymorphism (SNP).SNPs generally have only two different variants, called alleles, which can be labelled as 0 or 1.For each marker, an individual inherits one allele from their mother, and one from their father.The genotype is the particular two alleles that are inherited.The three possible genotypes for a SNP marker are 0/0, 0/1 and 1/1, where "/" is used to separate the alleles inherited from each parent.The 0/0 and 1/1 genotypes are called homozygous (both alleles are the same) and the 0/1 genotype is called heterozygous (the two alleles are different).
A haplotype denotes the alleles at multiple markers that are inherited together on the same chromosome from a parent.For example, if an individual inherits a 0 allele at one SNP and a 0 allele at another SNP from their mother, the haplotype they inherit is 00.Because the SNPs may be close together on the same chromosome, their alleles are not necessarily independent.The haplotype phase is the combination of the two haplotypes that an individual inherits.For example, the phased genotype 00/01 indicates that the two haplotypes inherited were 00 and 01; the genotype at the first SNP is 0/0 and the genotype at the second SNP is 0/1.When associations between haplotypes and disease outcomes or traits are of interest, a potential difficulty is that haplotype phase is not necessarily known because, typically, genotypes are only measured at individual markers.For a subject who is heterozygous at one or fewer markers, the haplotype phase can be inferred from the genotypes at individual markers.However, for a subject who is heterozygous at two or more markers, the phase can not be determined with certainty.This is illustrated in figure 1.Current cost-effective genotyping technology gives the observed genotypes at the two SNPs as (0/1, 0/1).The observed genotypes could involve the inheritance of haplotypes 00 and 11, or of haplotypes 01 and 10.Therefore, two possible haplotype phasings, 00/11 and 01/10, are compatible with the observed data.The analysis of haplotype-trait associations therefore involves handling the missing phase data.To describe such associations, Lake et al. (2003) and Burkett et al. (2004) adopt a generalized linear model (GLM) framework (McCullagh and Nelder 1983).GLMs provide the flexibility to incorporate non-genetic risk factors or potential confounding variables such as age or sex as covariates.They also allow the incorporation of interaction between genetic and environmental risk factors, a current research focus in the study of complex diseases.Finally, GLMs can accommodate a variety of disease outcomes including dichotomous, count and continuous traits.The method of Burkett et al. (2004) is implemented in hapassoc, a contributed package for R. Standard population-and quantitative-genetic assumptions, such as population Hardy-Weinberg equilibrium and independence of haplotypes and nongenetic covariates, are made.Parameter estimates are obtained by use of an expectationmaximization (EM) algorithm, called the method of weights (Ibrahim 1990), for missing categorical covariates.Standard errors that account for the extra uncertainty due to missing phase are calculated using Louis' method (Louis 1982).The statistical approach implemented in hapassoc is briefly described and the use of the program is illustrated with examples.

Statistical Description
Let y i be the measured trait and x i be the vector of covariates for the ith subject, assuming known haplotype phase.The covariates x i provide information about the individual's haplotypes and non-genetic cofactors.Let P(x i | γ) be the probability of covariates x i , parametrized by γ.For notational convenience, suppose γ is a column-vector.A GLM specifies the conditional probability of the trait given the covariates as where ν i is the canonical parameter for a probability distribution in the exponential family, φ is a dispersion parameter, a i (φ) = φ/m i for a known constant m i , and b(•) and c(•, •) are known functions.A link function g(µ i ) = η i relates the linear model η i = x T i β to the mean trait value µ i = b (ν i ).The regression coefficients β describe the effects of haplotypes and nongenetic cofactors, including possible interactions, on the mean trait value.The parameters that describe the trait and covariate data are thus θ = (β T , φ, γ T ) T .However, we only consider φ and γ as needed for inference about β, which is of primary interest.
If haplotype phase were known and complete covariate information were available on all subjects, the complete-data log-likelihood for the ith individual would be Haplotypes are not necessarily known, and therefore for some subjects the genetic components of the covariate vector x i are not available.However, the observed genotypes at individual SNPs constrain the potential haplotype configurations.Hereafter, let x obs,i be the observed covariate information for the ith subject and x j i be the jth possible covariate vector (i.e.covariates that code haplotype information and non-genetic covariates) consistent with x obs,i .
We account for missing haplotype phase by use of the EM algorithm (Dempster et al. 1977), in which maximum likelihood estimates θ are obtained by iteratively maximizing the conditional expectation of the complete-data log-likelihood, Q(θ | θ t ), given the observed data and current parameter estimates θ t to determine new parameter estimates θ t+1 .Assuming independent subjects, Q(θ Our EM algorithm is based on the method of weights (Ibrahim 1990), an implementation for generalized linear models with categorical covariates that may be missing (see Horton and Laird 1999, for a review).Let l j ci (θ) = log P(y i , x j i | θ) be the complete-data log-likelihood for the ith subject if that subject had covariate vector x j i .Then where w ij (θ t ) = P(x j i | y i , x obs,i , θ t ) are the "weights".Note that each subject's contribution Q i (θ | θ t ) to Q(θ | θ t ) is summed over the set of covariate vectors consistent with their observed data (y i , x obs,i ).The EM algorithm involves computing weights for each possible complete-data covariate vector (E-step), and computing new parameter estimates by maximizing a weighted log-likelihood function (M-step).Upon convergence, standard errors can be calculated.We implemented this EM algorithm and the calculation of standard errors in R (http://www.r-project.org) as described in detail below.Our implementation is freely available on the Comprehensive R Archive Network (CRAN) as a package called hapassoc.

EM algorithm implemented in hapassoc
Our implementation of the algorithm is summarized by the following steps.
1.For each individual with unknown haplotypes, add a "pseudo-individual" for each possible haplotype combination consistent with that individual's observed genotypes.The genetic data for each pseudo-individual is a different haplotype configuration; the nongenetic data is the same for each pseudo-individual.Initial values for regression coefficients and haplotype frequencies are set.
2. At the tth iteration: (a) E-step: Compute the expected log-likelihood conditional on the observed data and the current parameter estimates θ t = (β T t , φ t , γ T t ) T .This step simplifies to weighting each pseudo-individual by the conditional probability of the corresponding haplotype configuration given the observed data: .
Calculation of P(y i | x j i , β t , φ t ) is straightforward using the GLM and the estimates β t and φ t from fitting the model in the M-step at iteration t.To specify the covariate distribution P(x j i | γ t ), two standard assumptions from the genetics literature are made.The first is independence of the genetic and non-genetic factors: where x j gi and x j ei code, respectively, the haplotype and non-genetic covariate information.If only the haplotype information is missing, this assumption means that so that only the distribution P(x j gi | γ t ) has to be specified.Such a simplification is convenient because specifying P(x j ei | γ t ) can be difficult for non-genetic covariates that are continuous, for example.In general, the frequencies of haplotype configurations are not identifiable from the genotypes at individual SNPs.For example, haplotype configurations resulting in two or more heterozygous SNP genotypes are never observed unambiguously.However, under Hardy-Weinberg proportions (HWP), the probability of a haplotype configuration can be written in terms of the haplotype frequencies, allowing both the frequencies of haplotypes and of haplotype configurations to be estimated from genotypes at individual SNPs.The assumption of HWP holds if the haplotype inherited from the mother is independent of the haplotype inherited from the father.Assuming HWP also leads to a considerable reduction in the number of parameters γ g describing the haplotype covariate distribution P(x gi | γ g ).For example, when considering haplotypes of 3 SNPs, there are 2 3 = 8 possible haplotypes compared to 8(9)/2 = 36 possible phase configurations.The substantive reduction in the number of genetic parameters leads to faster convergence of the EM algorithm and greater stability of estimation.
Under the assumption of independence of the genetic and non-genetic covariates, Q(θ | θ t ) may be re-expressed as a sum of components involving one of either (β, φ), γ g or γ e : Therefore, each component is maximized separately to update the estimates of β, φ and γ g .The summand Q(β, φ | β t , φ t , γ gt ) is a weighted log-likelihood for regression and dispersion parameters in a GLM.The R glm function is used to estimate β t+1 with the w ij (β t , φ t , γ gt ) input to the glm function through the "weights" option.The summand Q(γ g | β t , φ t , γ gt ) is a weighted multinomial log-likelihood and so its maximization is also straightforward.Since estimates of γ e are not required for the weights or for updating estimates of the other parameters, γ e can be ignored when finding maximum-likelihood estimates of the regression parameters β.
3. Repeat E-and M-steps until convergence.
4. Calculate the standard errors using values from the final GLM.
The variance-covariance matrix of θ can be approximated by the inverse of the observed information matrix I( θ). Louis (1982) gives an expression for the observed information in missing data problems.With ambiguous haplotype configurations, the expression consists of a term for the expected complete-data information given the observed genotypes minus a penalty for unknown phase (Schaid et al. 2002).Factorization of the likelihood implies that only the observed information I( β, φ, γ g ) is required to obtain standard errors for β.The variance of β is approximated by the appropriate submatrix of I −1 ( β, φ, γ g ).Appendix A provides details of the derivation of I −1 ( β, φ, γ g ).

Using hapassoc and its features
Our contributed R package hapassoc can be downloaded from the CRAN webpage using the install command at the R command line (>): > install.packages("hapassoc") hapassoc does not depend on any optional R packages.After it has been installed, the package may be loaded with:

> library(hapassoc)
There are four main functions for the user of hapassoc: pre.hapassoc, hapassoc, summary.hapassocand anova.hapassoc.The function pre.hapassoc takes the input dataset, with columns of genotype data, and outputs an augmented dataset consisting of all pseudoindividuals.The function hapassoc fits the user-specified generalized linear model, with haplotypes as covariates.The summary function provides a summary of the results while anova returns a likelihood ratio test of two nested models fit with hapassoc.Further details on each of these functions are now given.
The function pre.hapassoc pre-processes the input dataset for hapassoc.This pre-processing involves converting the genotype data into haplotype data and adding rows to the input dataset representing haplotype configurations compatible with the observed genotype data for a subject.If the haplotype configuration of a subject is known, no extra rows are added for this subject.The pre-processing also involves obtaining initial estimates for haplotype frequencies based on the genotype data from all subjects combined.Haplotypes with nonnegligible frequency below a user-defined pooling tolerance (pooling.tol)are then grouped together.The function requires the input dataset to have the following form: 1.Each row represents an individual.Columns represent the trait (response), non-genetic covariate information and genotype information.
2. The columns of genotype information may have one of two formats.For the "allelic" format, each genotype is denoted by two columns, one column for each allele, with alleles coded as either 0 or 1.For the "genotypic" format, the genotype at each locus is given in one column by a character string or factor.Only two possible variants or alleles are permitted, forming three possible genotypes.Genotypes with more than two alleles will cause pre.hapassoc to issue an error and stop execution.Examples of both allelic and genotypic formats are given in section 4.
3. If there are n SNPs, the final n columns of the "genotypic" dataset or 2n columns of the "allelic" dataset contain the genotype data.By default, pre.hapassoc prints a list of genotype variables used to form haplotypes and a list of other "non-genetic" variables passed to the function.
4. There are no restrictions on the non-genetic covariate columns, except that missing data should be coded as NA.Note that only missing genotype data is handled by hapassoc; individuals with missing non-genetic data will be removed from the dataset and a warning will be given to the user.
5. Missing allelic data should be coded as NA or ' '.Missing genotypic data should be coded as, e.g., 'A', if one allele is missing and the other allele is the nucleotide adenine (say), and ' ' or NA if both alleles are missing.
This function has a number of optional parameters that may be set by the user.The option maxMissingGenos specifies the maximum number of SNPs which may be missing genotypes for a subject in order for that subject to be included in the analysis, with a default of 1.Each SNP with a missing genotype leads to more haplotype configurations being added for that subject in the augmented dataset.To avoid convergence problems due to haplotypes of low frequency, pooling.tolcan be used to specify a threshold frequency below which haplotypes will be pooled into a single category.The threshold is applied to the initial haplotype frequencies provided by pre.hapassoc.The "pooled" category returned by pre.hapassoc is not updated with the haplotype frequencies from subsequent iterations of the EM algorithm in hapassoc.
The default pooling threshold is set to 0.05.Similarly, the parameter zero.tolgives the frequency below which it is assumed that a haplotype does not exist.The default threshold is the inverse of ten times the total number of haplotypes in all subjects.Pseudo-individuals carrying haplotypes of estimated frequency below this zero threshold will be removed from the augmented dataset.The verbose flag, if TRUE (the default), causes pre.hapassoc to print a list of the genotype variables used to form haplotypes and a list of other non-genetic variables.
The function pre.hapassoc will return a list with components required for hapassoc.The components include two data frames, haploMat and haploDM, with the haplotype information.
The data frame haploMat consists of two columns, one for each of the inferred haplotypes, and a row for each pseudo-individual in the dataset.The data frame haploDM consists of a column for each of the haplotypes inferred to be present in the dataset and a row for each pseudo-individual.Each pseudo-individual is given a code of 0, 1 or 2 in each column for the number of copies they have of that haplotype.Therefore, all rows of haploDM should sum to 2, as each individual can have only two haplotypes.The non-haplotype data is returned in the data frame nonHaploDM.Some rows of nonHaploDM may be identical, as they represent the trait value and non-genetic covariate information for the same individual who has multiple haplotype configurations compatible with their observed genotype data.
Other components of the list returned by pre.hapassoc include a vector initFreq of initial estimates of all haplotype frequencies (including pooled haplotypes), vectors zeroFreqHaplos and pooledHaplos of the zero-frequency and pooled haplotypes, respectively, and a vector wt of the initial estimates for the weights.For more information on this function, type > help(pre.hapassoc) The function hapassoc uses the list returned from pre.hapassoc as input, and fits the linear model provided in the model-formula argument.The model formula is specified the same way as it would be for the glm or lm functions in R. The elements of the model formula must have names from among the names of columns in the data frames nonHaploDM and haploDM.The availability of haploDM provides the user with the flexibility to specify recessive, dominant or other types of genetic risk models.See the examples in section 4 below for more details on specifying the model equation.As with glm, the family of the GLM is specified with the family parameter.Currently the binomial, gaussian, poisson, and Gamma families are supported by hapassoc.Unlike glm, we have opted for a binomial family as the default, since in most of the biomedical applications we have encountered binary traits (e.g.diseased/not diseased) have been investigated.Other optional parameters can be set including the initial estimates of haplotype frequencies and of regression coefficients used to start the EM algorithm, the maximum number of iterations for the algorithm, the tolerance for convergence, and a verbose flag that controls printing of the iteration number and the value of the covergence criterion at each iteration.
The function returns a list of class "hapassoc" with components such as the estimated regression coefficients, the estimated haplotype frequencies and an estimate of their joint variancecovariance matrix.The associated help file, found by typing help(hapassoc), provides a detailed description of the components of the returned list.The function summary.hapassoc is the summary method for the class and provides a nicer way of viewing the results.The summary includes a table of the estimated regression coefficients and the associated standard errors, Z scores and p-values, a table of the estimated haplotype frequencies and their standard errors, and the estimated dispersion parameter (when appropriate).Likelihood ratio tests to compare two nested models fit with hapassoc may be performed with anova.hapassoc.

Examples
This section illustrates the use of hapassoc with examples of logistic and linear regression, when input genotypes are in "allelic" and "genotypic" format, respectively.

Logistic regression with input genotypes in "allelic" format
The "hypoDat" dataset from the hapassoc package will be used for this example.The dataset can be loaded into R with the command

> data(hypoDat)
The first five rows of the dataset are shown The data consists of eight columns, the affection status (0/1), a continuous non-genetic covariate, and information on three diallelic genetic markers.The genotype data is in "allelic" format, so there are six genotype columns for the three SNPs.The genotype of the first SNP is represented by the two columns M1.1 and M1.2.For example, the fourth individual has genotype 1/1 (two copies of allele 1) at the first SNP.
The genotype data is converted to haplotype data by the pre.hapassoc function.Data frames passed to pre.hapassoc are assumed to contain the response, an arbitrary number of nongenetic covariates and the genotype data.The user must tell the function explicitly how to separate the genetic and non-genetic data by specifying numSNPs and by passing a data frame whose last 2×numSNPs (allelic format) or last numSNPs (genotypic format) columns comprise the data on genotypes.The data frame hypoDat is in allelic format and so the appropriate call is > example1.haplos<-pre.hapassoc(hypoDat,numSNPs = 3) Haplotypes will be based on the following SNPs (allelic format): SNP 1: M1.We prefer the first of these calls since it will exclude the data on the first SNP, M1.1 and M1.2, from the non-haplotype data frame nonHaploDM output by pre.hapassoc.The second call includes the columns M1.1 and M1.2 describing the first SNP in nonHaploDM.
The first two individuals in the hypoDat dataset have haplotype configurations that can be inferred with certainty.However, the haplotype configuration of the third individual is uncertain.This subject either has haplotypes 000/011 or 001/010.The uncertain haplotype phasing for this subject is reflected in the data frames haploDM and nonHaploDM returned by pre.hapassoc.The rows of haploDM corresponding to the first five rows of hypoDat are Rows three and four of haploDM correspond to row three of hypoDat for the third subject with the uncertain haplotype configuration.The column for the "pooled" category of haploDM is comprised of haplotypes 101, 110 and 111 (see the estimated haplotype frequencies below).
The rows of nonHaploDM corresponding to the first five subjects in hypoDat are: Logistic regression is now performed to determine the effects of haplotypes on affection status, after adjusting for the non-genetic covariate "attr".> example1.regr1<-hapassoc(affected ~attr + h000 + h010 + h011 + + h100 + pooled, example1.haplos,family = binomial()) The same regression can be fit with > example1.regr<-hapassoc(affected ~., baseline = "h001", example1.haplos,+ family = binomial()) where the "." in the model formula is an R short form for the other columns in the data matrix.Haplotype h001 was chosen as the baseline haplotype in this example because it is the most frequent (the default in hapassoc).Users are free to choose their own baseline haplotype, but are cautioned that using a rare haplotype (or the pooled category, if rare) as the baseline can lead to unstable parameter estimates.
Other formula syntax is supported in the hapassoc function.For example, > example1.regr<-hapassoc(affected == 0 ~attr + h000 + h010 + + h011 + h100 + pooled, example1.haplos,family = binomial()) can be used to specify that the probability that affection status is 0 is modelled, rather than 1.This feature can be useful if the binary outcome is specified by character strings, such as "yes"/"no", since the formula may then be specified as hapassoc(affected == 'yes' ...) Additionally, different models for haplotype effects can be fit using R formula syntax.The default is an additive effect for each copy of the haplotype on the scale of the linear predictor.
The following command fits recessive effects for haplotypes h000 and h001 (i.e., the haplotype has an effect only if the individual has two copies).

Linear regression with input genotypes in "genotypic" format
An example of the "genotypic" input format is found in the dataset hypoDatGeno.The first five rows are shown below.In this case, the alleles at all three SNPs are now represented with A's and C's rather than 0's and 1's.Additionally, the genotype at each SNP is given in one column.
To illustrate how hapassoc handles missing genotype data, we introduce missing genotypic information for SNPs M1 and M2, as shown below.We modify the first subject's genotype at M1 so that it is completely missing.We modify the second subject's genotype at M2 so that only one allele, an 'A', is known.Since the data frame hypoDataGeno is in genotypic format, M2 is a factor.Therefore, before modifying the second subject's M2 genotype, we must add a level 'A' to M2: The data in hypoDatGeno is now pre-processed by pre.hapassoc.For illustration, the optional parameter maxMissingGenos has been changed from its default value of one to two.> example2.haplos<-pre.hapassoc(hypoDatGeno,3, maxMissingGenos = 2, + allelic = F) $frequencies Estimate Std.Error f.hAAA 0.248400660 0.03590094 f.hAAC 0.257565353 0.03806208 f.hACA 0.241325932 0.03550967 f.hACC 0.090867626 0.02673329 f.hCAA 0.101119710 0.02355280 f.hCAC 0.030216413 0.01611531 f.hCCA 0.009153699 0.00911681 f.hCCC 0.021350608 0.01213278 The dispersion reported by the summary function is the maximum likelihood estimate of the error variance in the linear model.

Summary
hapassoc is an R package for haplotype-trait associations in the presence of uncertain haplotype phase, for dichotomous or continuous traits.It uses an EM algorithm called the method of weights to handle the missing haplotype-phase information.Non-genetic attributes, environmental covariates and haplotype-environment interactions can be included in generalized linear models of trait associations.Assumptions of Hardy-Weinberg proportions and independence of genetic and non-genetic covariates are made.In addition to additive effects, dominant and recessive effects of the haplotype risk factors can also be specified in hapassoc, in contrast to similar haplotype-trait association software.Missing genotype data is also handled by hapassoc so that those individuals with a limited amount of missing genotype data are not removed during the analysis.hapassoc relies on many of the R base functions, and for that reason it uses much of the same syntax.For someone familiar with the R glm and lm functions, using hapassoc should be relatively straightforward, as the specification of model formulae is identical, the output of the summary method is similar and the anova method is similar when used to compare two nested models.Currently anova.hapassocdiffers from the anova methods for lm and glm in that anova.hapassocrequires exactly two models as input.By contrast, anova.lmand anova.glmtake either one fitted model, in which case a sequence of possible sub-models are compared, or an arbitrary number of nested models.Work to make anova.hapassocbehave more like anova.lm and anova.glm is underway and will be included in a future version of hapassoc.As an alternative to likelihood ratio tests, users may construct the appropriate Wald statistics in the usual way, with the estimated coefficients and variance-covariance matrix returned by the hapassoc function.
version to an R package, and Sigal Blay for code optimization and package maintenence.We would also like to thank the hapassoc users for their valuable feedback in bug reports.
where θ 0 is the "true" value of θ and l(θ) is the observed-data log-likelihood.The usual estimate of I(θ 0 ) is the observed information I( θ).
Under independence of genetic and non-genetic covariates, the observed-data likelihood L(θ) factors into a term involving β, φ and γ g and another term involving γ e : The observed information I(θ) is therefore block-diagonal: , and so Thus, only I(β, φ, γ g ) is required for inference of β.
Information matrix for β, φ and γ g Louis (1982) derived an expression for the observed information in missing data problems in terms of complete-data score vectors and information matrices.Let S c (β, φ, γ g ) and S ci (β, φ, γ g ) be the complete-data score functions for all subjects and for the ith subject, respectively.Define I c (β, φ, γ g ) and I ci (β, φ, γ g ) to be the corresponding observed information matrices.Further, let S j ci (β, φ, γ g ) and I j ci (β, φ, γ g ) be, respectively, the complete-data score function and observed information for the ith subject with covariate vector x j i .Under the linear model, recall η j i = x j i T β and µ j i = g −1 (η j i ) = b (ν j i ).Define d ij = y i ν j i − b(ν j i ) and let c (y, φ) and c (y, φ) denote the first and second derivatives of c(y, φ) with respect to φ.Then S j ci (β, φ, γ g ) T = S j ci (β) T , S j ci (φ), S j ci (γ g ) T where S j ci (β) = The observed information I j ci (β, φ, γ g ) has three submatrices along its diagonal.In the top left-hand corner, , where g (µ j i ) is the derivative of the link function with respect to µ, evaluated at µ j i .The next diagonal submatrix is Finally, in the bottom right-hand corner ∂ 2 ∂γ g ∂γ T g log P (x j i | γ g ).
The three off-diagonal submatrices are I j ci (β, γ g ) = I j ci (φ, γ g ) = 0 and The expressions for S j ci (γ g ) and I j ci (γ g , γ g ) depend on the distribution of genetic covariates.For example, suppose that the jth possible haplotype configuration consistent with the observed SNP genotypes of the ith subject has x j gi = (n 1 , . . ., n H+1 ), where n h = 0, 1, 2 codes the number of copies of haplotype h and there are H + 1 haplotypes in the population.Then, under HWP, , where γ gh is the population frequency of haplotype h.The score vector is where J is an H × H matrix of ones. Louis (1982) writes the observed information as   Lipsitz and Ibrahim 1996;Burkett 2002) It can be shown that the expected value of j w ij I j ci (β, φ) is a vector of zeroes.Thus, to estimate the Fisher information, we may replace the I j ci (β, φ) submatrix of I j ci (β, φ, γ g ) in equation ( 1) by a vector of zeros and take I j ci (β, φ, γ g ) to be a block-diagonal matrix.

Figure 1 :
Figure 1: Illustration of unknown haplotype phase for certain observed genotypes As with HaploDM, rows three and four of nonHaploDM are pseudo-individuals that correspond to the subject in row three of hypoDat, and therefore have the same trait and non-genetic covariate data.The initial estimates of haplotype frequencies are > example1.haplos$initFreqh110, and h111 all have frequencies below the default pooling threshold of 0.05 and so they comprise the pooled category in example1.haplos$haploDM.

(
When there are no more than maxMissingGenos SNPs with missing genotype data, rows corresponding to haplotype configurations compatible with the missing genotype data for a subject are added to the end of the augmented data matrices haploDM and nonhaploDM returned by pre.hapassoc.For example, rows 142 to 145 of nonHaploDM correspond to the first subject in hypoDatGeno and rows 146 to 148 of nonHaploDM correspond to the second subject in hypoDatGeno.When there are more missing genotypes than are allowed by the user, as in the following command, a warning message is given to indicate that some individuals have been removed from the dataset.With SNP data in genotypic format, the haplotypes are now denoted by combinations of the A and C alleles at each SNP, as can be seen in the display of the initial haplotype frequency estimates.To illustrate linear regression, the continuous non-genetic covariate attr will be used as a continuous trait in the call to hapassoc: This research was supported by Natural Sciences and Engineering Research Council of Canada grants 227972-00 and 213124-03, by Juvenile Diabetes Research Foundation International grant 1-2001-873, by Canadian Institutes of Health Research grants NPG-64869, ATF-66667 and GEI-53960, and in part by the Mathematics of Information Technology and Complex Systems, Canadian National Centres of Excellence.JG is a Scholar of the BC Michael Smith Foundation for Health Research.