The R Package geepack for Generalized Estimating Equations

This paper describes the core features of the R package geepack , which implements the generalized estimating equations (GEE) approach for ﬁtting marginal generalized linear models to clustered data. Clustered data arise in many applications such as longitudinal data and repeated measures. The GEE approach focuses on models for the mean of the correlated observations within clusters without fully specifying the joint distribution of the observations. It has been widely used in statistical practice. This paper illustrates the application of the GEE approach with geepack through an example of clustered binary data


Introduction
Generalized Estimating Equations (GEE) (Liang and Zeger 1986) are a general method for analyzing data collected in clusters where 1) observations within a cluster may be correlated, 2) observations in separate clusters are independent, 3) a monotone transformation of the expectation is linearly related to the explanatory variables and 4) the variance is a function of the expectation.It is essential to note that the expectation and the variance referred to in points 3) and 4) are conditional given cluster-level or individual-level covariates.
There are several approaches to extend generalized linear models to clustered data.Mixed effect models and transition models (Diggle, Liang, and Zeger 1994, Chapter 7, 9-10) fully specify the joint distribution within clusters via latent variables or conditional dynamics.With the presence of random effects, likelihood estimation necessitates the integration over the random effects distributions, which may be numerically intractable.Lee and Nelder (1996, The R Package geepack for Generalized Estimating Equations 2001) introduced hierarchical generalized linear models and showed that the integration may be avoided by working on the h-likelihood.Compared to these approaches, the method of GEE fits marginal mean models with the advantage that only correct specification of marginal means is needed for the parameter estimator to be consistent and asymptotically normal.This approach has become an important tool in analyzing longitudinal data or repeated measures arising in a wide variety of applications.For a discussion on the relation between marginal and mixed effects models, see Heagerty and Zeger (2000) and Nelder and Lee (2004).
Several implementations of GEE have become available (Horton and Lipsitz 1999).The basic approach of Liang and Zeger (1986) is available in SAS (SAS Institute Inc. 1999, proc genmod), Stata (StataCorp LP 2005), XLISP-STAT (Lumley 1996) and in S-PLUS by the packages oswald (Smith 1998) and gee or yags (Carey 2002(Carey , 2004)).The last two packages have been ported to R (R Development Core Team 2005).The R package geepack implements the basic approach and some extensions (Yan 2002;Yan and Fine 2004).Three features of geepack distinguish it from other implementations: 1) There is an interface function geeglm which is designed to be as similar to glm as possible; 2) A jackknife variance estimator is available as an alternative to the sandwich estimator; and 3) Covariates can be incorporated into the scale and correlation parameters in a similar fashion to the mean modeling.In this paper, we illustrate the aspects of geepack with the focus on the first two features.
The paper is organized as follows.In Section 2 we introduce an example dataset on repeated measures of binary data.In Section 3 we outline the GEE approach and in Section 4 we describe the features of the geeglm function that implements the approach in geepack.We close the article with an analysis of the data in Section 5 and a conclusion in Section 6.

An example data set
To illustrate the type of problems well suited for the GEE approach we consider a data set on respiratory illness.The data is provided in geepack and detailed information about the data can be found in Koch, Carr, Amara, Stokes, and Uryniak (1990).Briefly, the data comes from a clinical study in which the effect of a treatment of patients with respiratory illness was
Table 1 shows the number of patients for the response patterns across the four visits classified by baseline status and treatment.Patients with baseline respiratory status equal to 0 seem either to have a low or large number of positive responses whereas patients with a baseline of 1 tend to respond positively.Table 2 describes the distribution of the number of positive responses per patient for sex and center.
Figure 1 presents the plot of age against the proportion of positive responses for each patient.
It indicates a quadratic relationship between the proportions and the age.We fit a logistic model to the data (which would be appropriate if there were no time effect and no spread in the response probabilities for patients with the same covariate values): R> m.glm <-glm(outcome ~baseline + center + sex + treat + age + I(age^2), + data = respiratory, family = binomial) The correlation matrix of the Pearson residuals within a patient based on the glm-fit is shown in Table 3

Theory of GEE
For the regression analysis of correlated observations, Liang and Zeger (1986) introduced the GEE approach.This approach generalized the estimation method of quasi-likelihood of q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 10 20 30   Wedderburn (1974) to correlated data.An alternative generalization was proposed by Lee andNelder (1996, 2001).An extensive review of the development of the GEE approach is given by Ziegler, Kastner, and Blettner (1998).
Consider a sample of i = 1, . . ., K independent multivariate observations Y i = (Y i1 , . . ., Y it , . . ., Y in i ).Here i may represent a cluster with n i observations.The expectations E(Y it ) = µ it are related to the p dimensional regressor vector x it by the mean-link function g g(µ it ) = x it β. (1) where φ is a common scale parameter and a it = a(µ it ) is a known variance function.Let R i (α) be a working correlation matrix completely described by the parameter vector α of length m.Let be the corresponding working covariance matrix of Y i , where A i is the diagonal matrix with entries a it .For given estimates ( φ, α) of (φ, α) the estimate β is the solution of the equation Liang and Zeger (1986) suggest to use consistent moment estimates for φ and α.This yields an iterative scheme which switches between estimating β for fixed values of φ and α and estimating (φ, α) for fixed values of β.This scheme yields a consistent estimate for β.Moreover, K covariance matrix Σ = lim Replacing β, φ and α by consistent estimates and the covariance matrix 3) yields a so called sandwich estimate Σ of Σ.The estimate Σ is a consistent estimate of Σ even if the working correlation matrices R i (α) are misspecified.

The geeglm function
The geeglm function largely follows the syntax of the glm function and many of the methods available for glm objects are also available for geeglm objects.We discuss in the following the most important arguments of the geeglm function and the anova method for comparing models by Wald tests.These will be exemplified in Section 5.

Variance and link functions (family)
The variance function is specified by the family argument in geeglm and is identified by the name of the corresponding distribution in a generalized linear model.The available variance functions are given in Table 4.The available link functions for the mean are the same as those in glm with the exception of the cauchit link for the binomial family.

Working correlation (corstr, zcor)
Four pre-defined working correlation structures are available and are specified via the argument corstr (Table 5).It is also possible to provide a correlation matrix the entries of which remain fixed under the computation.This may be necessary if the estimate for α does not yield a positive definite R(α) (Chaganty 1997).Using a fixed working correlation (corstr="fixed") may still yield efficient estimates (Chaganty and Joe 2004) for β.Additionally, a user-defined correlation structure (corstr="userdefined") can be provided by expressing the correlation parameters as linear combinations of covariates.Given a n i × m matrix X i of covariates, the upper diagonal correlations parameters r i = (r i,12 , r i,13 , . . ., r i,1n i , r i,23 , . . ., r i,n i−1 n i ) of the working correlation matrix R i (α) can be written as r i = X i α.The zcor argument takes the concatenated matrices (X 1 , . . ., X K ) as the design matrix for the working correlation.
For the fixed correlation matrix one has simply m = 1 and X i = r i .Some useful correlation structures are defined as simple linear restrictions of the unstructured working correlation matrix for which the corresponding design matrix is provided by the genZcor function of the name Table 5: Working correlations in geeglm.geepack package.For example, the entries in each off-diagonal of a Toeplitz correlation matrix are equal, r i,kl = r i|k−l| .Hence, the corresponding design matrix is obtained by adding those columns in the unstructured design matrix with the same difference in the k, l indices.

Missing values (waves)
In case of missing values, the GEE estimates are consistent if the values are missing completely at random (Rubin 1976).The geeglm function assumes by default that observations are equally separated in time.Therefore, one has to inform the function about different separations if there are missing values and other correlation structures than the independence or exchangeable structures are used.The waves arguments takes an integer vector that indicates that two observations of the same cluster with the values of the vector of k respectively l have a correlation of r kl .

Jackknife variance estimates (std.err)
For a small number of clusters (K ≤ 30) the sandwich variance estimator exhibits bias and Paik (1988) recommended using the jackknife variance estimator.It is defined as where p is the number of parameters in the mean structure and β−i are the estimates of β leaving out the ith cluster.By default the geeglm function returns the sandwich estimates.Specifying std.err="fij" the fully iterated jackknife estimate is returned.Additionally, the computationally less demanding approximate jackknife estimate (std.err="jack") or a one-step jackknife estimate (std.err="j1s")can be obtained.Simulation studies (Ziegler, Kastner, Brunner, and Blettner 2000;Yan and Fine 2004) indicate that the approximate jackknife estimates are in many cases in good agreement with the fully iterated estimates.

anova method
The anova method allows either to produce a table of tests for sequentially adding terms to a model or to compare two nested models.The test statistic for testing the difference between model M1 and a submodel M2 is the Wald test statistic.If β is the parameter of model M1 and the submodel M2 is obtained from M1 by setting Cβ = 0, where C is a rank q contrast matrix and Σ β is a consistent estimate of the covariance matrix of β, the test statistic is

Analysis of the respiratory data
In the initial description of the respiratory data we saw appreciable within-patient correlation.
We now fit the logistic model with the same mean structure and the binomial variance function a(µ it ) = µ it (1 − µ it ) using the GEE approach.We estimate β under the four pre-defined working correlations and a user-defined Toeplitz working correlation.For example, the fit with the exchangeable working correlation is obtained by R> m.ex <-geeglm(outcome ~baseline + center + sex + treat + age + I(age^2), + data = respiratory, id = interaction(center, id), + family = binomial, corstr = "exchangeable") The design matrix for the Toeplitz working correlation was constructed by first obtaining the design matrix for the unstructured correlation using the genZcor function and then adding the appropriate columns.The use of the waves argument is not necessary because there are no missing observations.R> zcor <-genZcor(clusz = c(xtabs(~id + center, data = respiratory)), + waves = respiratory$visit, corstrv = 4) R> zcor.toep<-matrix(NA, nrow(zcor), 3) R> zcor.toep[,1]<-apply(zcor [,c(1,4,6) this paper we described the geeglm function of the package that is close in syntax to the glm function and implements the estimating equation approach of Liang and Zeger (1986).
Additionally, the function allows to use a fixed user-defined correlation structure and several jackknife variance estimators.The latter are preferable to the sandwich estimator in case of a small number of clusters.
geepack has additional features which allow the scale parameters and the parameters of the working correlation matrix to be modeled as functions of explanatory variables, see Yan (2002) and Yan and Fine (2004).Using additional estimating equations for these parameters the approach yields consistent estimates for β, the scale and correlation parameters.The function ordgee of geepack allows the analysis of ordinal data according to the method proposed by Heagerty and Zeger (1996).

Figure 1 :
Figure 1: Relation of age to the proportion of positive responses.The smooth line is a spline.

Table 2 :
Number of patients for the number of positive responses across the four visits for sex and center.

Table 3 :
and indicates an appreciable correlation within patient measurements.Correlation matrix for the measurements at different visits based on the Pearson residuals from the logistic model.

Table 6 :
Comparison of parameter estimates.