Criteria to Select a Working Correlation Structure for the Generalized Estimating Equations Method in SAS

The generalized estimating equations (GEE) method is popular for analyzing clustered and longitudinal data. It is important to determine a proper working correlation matrix when applying the GEE method since an improper selection sometimes results in ineﬃcient parameter estimates. In this paper, we provide the CriteriaWorkCorr macro in SAS to calculate the criteria proposed by Pan (2001), Hin, Carey, and Wang (2007), Hin and Wang (2009), and Gosho, Hamada, and Yoshimura (2011) for selecting the working correlation structure when the GEE method is applied. We illustrate the implementation and an example of the macro.


Introduction
The generalized estimating equations (GEE) method is one of the most popular ways to analyze clustered and longitudinal data. To apply the GEE method, a working correlation structure-independent, exchangeable, and first-order autoregressive (AR(1))-must be specified. If the working correlation structure is correctly specified, the GEE provides a best asymptotically normal (BAN) estimator of mean parameters. Fitzmaurice (1995) and Wang and Carey (2003) show, however, that the asymptotic relative efficiency of the parameter estimates of the GEE method is likely to be low when the working correlation structure is misspecified. Fitzmaurice (1995), Mancl and Leroux (1996), and Sutradhar and Das (2000) also point out that the misspecification of the correlation structure lowers the relative efficiency of the estimate even when the sample size is finite. To address this concern, some researchers have proposed new criteria to select a working correlation structure. Pan (2001) proposes a modification of Akaike's information criterion (AIC) called the "quasi-likelihood under the independence model criterion (QIC)." In addition, Hin et al. (2007) apply a method by Rotnitzky and Jewell (1990) as a criterion to select the working correlation structure. Rotnitzky and Jewell (1990) describe an approach to appraise the adequacy of the assumed correlation matrix using the fact that the asymptotic distribution of a modified working Wald statistic is the linear combination of independent χ 2 1 random variables. We call this criterion "Rotnitzky and Jewell's criterion (RJC)." Furthermore, Hin and Wang (2009) propose a correlation information criterion (CIC) that modifies QIC and substantially improves its performance. Moreover, Gosho et al. (2011) devise an objective criterion for evaluating the appropriateness of the correlation structure. The proposed criterion measures the discrepancy between the covariance matrix estimator and the specified working correlation matrix. Hereafter, this criterion is referred as DEW. Hin et al. (2007), Wang (2009), andGosho et al. (2011) compare the performances of these criteria for selecting the working correlation structure.
For most longitudinal data in biological applications, Wang and Carey (2003), Ziegler and Vens (2010), and Vens and Ziegler (2012) show that the AR(1) structure is preferable over banded correlation structures, e.g., the 1-dependent correlation structure. Furthermore, mdependent correlation structures are not biologically plausible. Ziegler and Vens (2010) and Vens and Ziegler (2012) also point out that the investigators should choose a working correlation structure for both statistical and biological reasons. The statistical criteria for selecting the working correlation structure can be helpful tools to decide the most reasonable structure for the investigators.
In this paper, we present a SAS (SAS Institute Inc. 2003a) macro to calculate the values of the criteria (QIC, RJC, CIC, and DEW) for selecting the working correlation structure when the GEE method is applied to longitudinal data. The SAS macro (CriteriaWorkCorr) was developed and tested on Microsoft Windows XP and 7 operating systems and requires SAS 9.1 (at least the SAS/BASE, SAS Institute Inc. 2003b, the SAS/IML, SAS Institute Inc. 2003c, and the SAS/STAT, SAS Institute Inc. 2003d, components) or above.

Summary of the generalized estimating equations method
Assume that an n i × p matrix of covariate values X i = (x i1 , . . . , x in i ) is adjoined to the outcome vector Y i = (Y i1 , . . . , Y in i ) on clusters i = 1, . . . , K and observations t = 1, . . . , n i per cluster. To simplify notation, we suppose that n i = n. The expected value and variance of the outcome variable are assumed to be , respectively, where h is a specified link function, β is a regression parameter (pvector) to be estimated, φ is a scale parameter, and υ denotes a variance function to indicate mean-variance relation. The working covariance matrix of Y i , V i is assumed to have the form φA is the working correlation matrix parameterized by α, an association parameter (q-vector).
The GEE method identifies the estimatorβ of the regression parameter β as the solution to where D i is an n × p matrix defined by A covariance estimator V r ofβ by the GEE method, which is referred to as the robust variance, is given by Equation 2: 3. Criteria for selecting a working correlation structure

Quasi-likelihood under the independence model criterion
AIC is a well-known criterion for likelihood-based model selection. However, we cannot apply a criterion such as AIC to the GEE approach, since the GEE is not likelihood-based. Pan (2001) proposes a criterion based on quasi-likelihood, named QIC, to select the proper mean model or the working correlation structure.
The quasi-likelihood function on cluster i and observation t evaluated at the regression parameters β is given by Table 1.
Under the assumptions that the clusters and observations are independent, QIC can be expressed as where tr refers to the sum of the diagonal elements of the matrix and Ω Rotnitzky and Jewell (1990) propose the test statistics to support the hypothesis that the vector of regression coefficients equals a given β. In the theorem pertaining to the test Table 1: Canonical link function, variance function, and quasi-likelihood for commonly used exponential family distributions.

Rotnitzky-Jewell's criterion
statistics, Ψ 0 , Ψ 1 , and Ψ were, respectively, defined as follows: When the working correlation structure is correctly specified, Ψ should be close to an identity matrix. Hin et al. (2007) describe the Rotnitzky-Jewell's criterion (RJC) to select the working correlation structure as Hin and Wang (2009) propose CIC in Equation 4 as a modification of QIC to improve its performance.

Correlation information criterion
CIC is constructed using only the second term that represents the penalty of QIC in Equation 3. The first term in QIC denotes the sum of quasi-likelihoods for all observations under the assumption that the subjects and time points are independent. It makes sense to ignore the first term when comparing different working correlation structures, since the term mostly does not depend on the specified R. Gosho et al. (2011) propose to select the correlation structure that minimizes DEW(R) as defined by the working correlation structure represented by Equation 5:

Gosho's criterion
where I is the identity matrix. In Equation 5, DEW(R) is the criterion that directly measures the discrepancy between the covariance matrix estimator and the specified working covariance matrix.

Program description and usage
The CriteriaWorkCorr macro is included with this article in the file CriteriaWorkCorr.sas. The arguments taken by the macro are summarized in Table 2. As shown in Table 2, the first five arguments, i.e., the dataset name (INDS), the name of the variable identifying each cluster (ID), the name of the visit variable in each cluster (VISIT), the name of the outcome variable

Argument Description Note INDS
Name of the SAS dataset. Specify the input dataset that contains the cluster/subject, visit, outcome variables, and any additional covariates. The data structure is illustrated in Table 3.

ID
Name of the variable identifying each cluster (subject).
The types character and numerical are available (SAS Institute Inc. 2003b).
VISIT Name of the visit variable (within a cluster).
The types character and numerical are available (SAS Institute Inc. 2003b).

OUTCOME
Name of the outcome variable. The outcome can be continuous, binary, or count.
DIST Name of the distribution of the outcome variable.

COVCONT
List of continuous explanatory variables.
Do not separate variable names by comma.

COVNOMI
List of nominal explanatory variables.
Do not separate variable names by comma. Dummy variables with two levels (0 or 1) are automatically generated.
SCALEPAR Name of the estimate method of φ.
Specify one of fixed, pearson, or deviance. fixed, fixed of 1; pearson, based on Pearson residuals; deviance, based on deviance residuals. (OUTCOME), and the name of the distribution of the outcome variable (DIST), are essential for implementing the macro. In addition, COVCONT and COVNOMI are the names of the continuous and nominal explanatory variable lists, respectively. A user should generally at least specify either COVCONT or COVNOMI to implement the macro. SCALEPAR requests the estimate method of the scale parameter φ.
The CriteriaWorkCorr macro consists of three nested macros, DataHandling, CalCri, and ResultDs. The DataHandling macro generates the analysis dataset. When one or more nominal explanatory variables are specified as COVNOMI, corresponding dummy variables are automatically generated by the macro. The CalCri macro provides the regression parameter estimates, the robust standard errors, the 95% confidence intervals, and the p values of the z test in the case of applying the GEE method. It also provides the values of the criteria (QIC, RJC, CIC, and DEW) for selecting the working correlation structure that are given in Section 3. The ResultDs macro creates the combined output of the results derived from the CalCri macro for each working correlation structure.
A user can call the CriteriaWorkCorr macro by inputting the arguments listed in Table 2. Before a user implements the macro, he/she needs to prepare a SAS dataset for analysis. Part of an example dataset (Wheeze data) provided by Hardin and Hilbe (2003) is shown in Table 3. These data study the effect of air pollution on the health of 16 children.
The Wheeze data include the case number, case; a within-subject observation identifier (time point), t; a binary indicator for whether the subject wheezes, wheeze; a binary indicator for whether the observation is in Kingston, kingston; the age of the subject in years, age; and a measure of the smoking habits of the subject's mother, smoke. In this case, wheeze is an outcome variable and kingston, age, and smoke are explanatory variables. The nominal explanatory variables kingston and smoke, which take the value zero or one and the value zero, one, or two, respectively, are turned into dummy variables. The dummy variable kingston1, derived from kingston, takes a value of zero if kingston is zero and a value of one otherwise. In addition, the two dummy variables smoke1 and smoke2, derived from smoke, take a value of one if smoke is equal to one and zero otherwise and a value of one if smoke is equal to two and zero otherwise, respectively. The minimum value of a nominal explanatory variable becomes a reference level for the corresponding dummy variables in the macro.
The macro creates an output table that includes the regression parameter estimates, the robust standard errors, the 95% confidence intervals, and the p values of the z test in the case where the GEE method is applied and three working correlation structures (independent, exchangeable, and AR(1) structures) are specified using the GENMOD procedure in SAS. In addition, the output table contains the values of the criteria (QIC, RJC, CIC, and DEW) for selecting the working correlation structure for each such structure using the IML procedure in SAS.
There are some limitations of the CriteriaWorkCorr macro. The macro can be applied to incomplete longitudinal data only when the incompleteness follows a monotone missing pattern; that is, a subject missing in one follow-up must also fail to participate in subsequent follow-ups. Another limitation is that the macro assumes the link function is the canonical link as listed in Table 1.

Example
In this section, the implementation of the CriteriaWorkCorr macro is demonstrated using the Wheeze data. As mentioned earlier, the outcome variable is a binary indicator for whether or not the subject wheezed, and is measured consistently four times yearly at ages 9, 10, 11, and 12. We fitted the following logistic model to the data: where Y it is the binary indicator for whether or not subject i wheezed at time t; age it ∈ {9, 10, 11, 12} denotes the child's age; kingston1 i ∈ {0, 1} indicates whether the child is a resident of Portage or Kingston, respectively; and smoke1 it and smoke2 it are dummy variables for the smoking habits of the child's mother, that take a value of one if smoke is equal to one and zero otherwise and a value of one if smoke is equal to two and zero otherwise, respectively. Three structures-independent, exchangeable, and AR(1)-are adopted as candidates for the working correlation structure.
To implement the CriteriaWorkCorr macro, a user inputs values for the various arguments as shown in the code below (also see Table 2) and invokes the macro.
\%CriteriaWorkCorr (INDS = wheeze, ID = case, VISIT = t, OUTCOME = wheeze, DIST = binomial, COVCONT = age, COVNOMI = kingston smoke, SCALEPAR = fixed); Table 4 shows the output results of the macro. In Table 4, "WorkingCorr" is the specified working correlation structure, and "Parm" and "Level" are the names of the parameters and the parameterized level, respectively. "Estimate" and "RobustSE" refer to the parameter estimates and the robust standard errors, respectively. "LowerCL" and "UpperCL" are the lower and upper limits of the 95% confidence intervals, respectively. "ProbZ" gives the p values of the z test. "QIC," "RJC," "CIC," and "DEW" refer to the criteria values for selecting the working correlation structure given in Section 3.

Conclusion
In this paper, we provide the CriteriaWorkCorr macro to calculate the values of the criteria (QIC, RJC, CIC, and DEW) for selecting a working correlation structure at the time of applying the GEE method to analyze clustered and longitudinal data.