%QLS SAS Macro: A SAS Macro for Analysis of Correlated Data Using Quasi-Least Squares

Quasi-least squares (QLS) is an alternative computational approach for estimation of the correlation parameter in the framework of generalized estimating equations (GEE). QLS overcomes some limitations of GEE that were discussed in Crowder (1995). In addition, it allows for easier implementation of some correlation structures that are not available for GEE. We describe a user written SAS macro called %QLS, and demonstrate application of our macro using a clinical trial example for the comparison of two treatments for a common toenail infection. %QLS also computes the lower and upper boundaries of the correlation parameter for analysis of longitudinal binary data that were described by Prentice (1988). Furthermore, it displays a warning message if the Prentice constraints are violated. This warning is not provided in existing GEE software packages and other packages that were recently developed for application of QLS (in Stata, MATLAB, and R). %QLS allows for analysis of continuous, binary, or count data with one of the following working correlation structures: the first-order autoregressive, equicorrelated, Markov, or tri-diagonal structures.


Introduction
Quasi-least squares (QLS, Chaganty 1997;Shults and Chaganty 1998;Chaganty and Shults 1999) is a two-stage approach for fitting longitudinal data that are correlated over time using an assumed working correlation structure in the framework of generalized estimating equations (GEE, Liang and Zeger 1986). Both QLS and GEE estimate the regression parameter β via solution of the same estimating equation which contains an additional unknown correlation parameter α. These methods alternate between updating the estimate of β, and updating the estimate of α iteratively, but they differ with respect to estimation of α. GEE uses a moment estimate of α based on the Pearson residual, while QLS solves estimating equations for α that are orthogonal to the GEE estimating equation of β.
There are primarily two important reasons to consider the implementation of QLS over GEE. Crowder (1995) pointed out that there may be no feasible estimate of α under misspecification of the working correlation structure, which could result in a failure to converge for the iterative GEE estimation procedure. On the other hand, the QLS estimates of α are guaranteed to be feasible for several structures, including the first-order autoregressive (AR(1)), equicorrelated, Markov, and tri-diagonal structures. The other main advantage of the application of QLS is that it allows for relatively straightforward implementation of more complex and/or biologically plausible correlation structures than are currently available in GEE. For example, Shults and Chaganty (1998) and Chaganty and Shults (1999) described estimating equations for α for the Markov correlation structure that is a generalization of the AR(1) structure for measurements that are unequally spaced in time. Furthermore, QLS has been successfully implemented in multi-outcome longitudinal studies which contain multiple sources of correlations, e.g., correlations between multiple outcomes as well as within subject for each outcome over time, using Kronecker product-based correlation structures (Shults and Morrow 2002;Chaganty and Naik 2002;Shults et al. 2004;Shults and Ratcliffe 2009). In addition, Shults et al. (2006) implemented a banded Toeplitz structure that had previously not been implemented for GEE.
There has been a considerable effort to disseminate and promote the use of QLS via the development of user written QLS programs in some of the major statistical software packages. For example, Shults et al. (2007) developed xtqls using Stata's proprietary higher programming language (StataCorp. 2003). Xie and Shults (2009) provide a user written QLS program called qlspack using R. These programs estimate the regression parameter β by calling existing GEE procedures, e.g., xtgee in Stata and the user written package geepack (Halekoh et al. 2006) in R, and then solving the stage one and two estimating equations for α iteratively. Lastly, Ratcliffe and Shults (2008) developed GEEQBOX software in MATLAB for the implementation of both GEE and QLS. Prior to introduction of GEEQBOX, there was no software available for the implementation of GEE in MATLAB. All these QLS programs can be downloaded from http://www.cceb.upenn.edu/~sratclif/QLSproject.html, while GEEQBOX is also available for download on the website of the Journal of Statistical Software provided in Ratcliffe and Shults (2008).
This manuscript reflects our ongoing effort to develop QLS software in SAS (SAS Institute Inc. 2003) which is one of the most widely used software packages, due to its versatility including varied and powerful procedures for data management and statistical analysis. In this manuscript, we present our user written SAS macro called %QLS for the implementation of QLS developed under SAS version 9.1 using SAS/IML, an interactive matrix language in SAS. We demonstrate %QLS using a randomized clinical trial reported by Backer et al. (1996) for the comparison of two treatments for a common toenail infection. (See Section 4 for more description of the study). %QLS can be used for analysis of continuous, binary, or count data, with an AR(1), equicorrelated, Markov, or tri-diagonal correlation structure to describe the pattern of association among the repeated measurements on each subject, or cluster.
A unique feature of %QLS, which is not available in current GEE packages and other userwritten QLS software, is that it computes the so called 'Prentice boundary' (Prentice 1988) of the estimate of α for analysis of binary data (see Section 2.5 for more details).
Our outline for this manuscript is as follows: Section 2 describes each working correlation structure implemented in %QLS. Then it briefly describes QLS and the Prentice constraints for α. Section 3 describes a complete list of the parameters in %QLS. Finally, Section 4 demonstrates %QLS using the clinical trial data (toenail data) for testing the time-averaged difference between the two treatment groups. This includes a demonstration of an analysis that results in a violation of the Prentice constraints, in addition to an application of the Markov structure that currently is unavailable for GEE.

Description of quasi-least squares
In this section, we provide a brief description of QLS similar to that provided in Shults et al. (2007); Ratcliffe and Shults (2008) and Xie and Shults (2009). For a more through treatment of the development of QLS and its related asymptotic distributional properties, see Chaganty (1997) for a description of stage one of QLS for balanced and equally spaced data, Shults and Chaganty (1998) for stage one of QLS for unbalanced and unequally spaced data, and Chaganty and Shults (1999) for the second stage of QLS. Prior to Chaganty and Shults (1999), the QLS estimate of α was in general not consistent, even if the correlation structure was correctly specified. As noted earlier, QLS is a method in the framework of GEE. For an excellent text on GEE, see Hardin and Hilbe (2002).
We consider outcome measurements y i = (y i1 , . . . , y in i ) with associated covariates x ij = (x ij1 , . . . , x ijk ) collected at measurement occasions j = 1, . . . , n i , for each subject i = 1, . . . , m. For the model specification, we assume that the mean and variance of the outcome variable satisfy E(y ij ) = g −1 (x ij β) = µ ij and Var(y ij ) = φh(µ ij ), respectively, where φ is known as the dispersion parameter. Unlike the generalized linear model (GLM), both GEE and QLS assume that the covariance matrix Cov(y i ) = φA . . , h(µ in i )), and R i (α) is known as the working correlation matrix that describes the pattern of association among the repeated measurements on each subject.

Working correlation structures implemented in %QLS
%QLS currently allows for application of the AR(1), equicorrelated, Markov, and tri-diagonal structures that are described as follows: 1. The first-order autoregressive (AR(1)) structure: This structure assumes that Corr(y ij , y ik ) = α |j−k| , with feasible region (−1, 1). The feasible region for α is defined as the interval on which α yields a positive definite correlation matrix.
2. The equicorrelated structure: This structure assumes that Corr(y ij , y ik ) = α. The feasible region for this structure is (−1/(n m − 1), 1), where n m is the maximum value of n i over i = 1, 2, . . . , m. Note that for large n m , this structure may be unrealistic unless all pairwise correlations are roughly identical across all time points.
4. The tri-diagonal correlation structure: This structure assumes that Corr(y ij , y ik ) = α for |j − k| = 1 and is zero otherwise. The feasible region for this structure is and n m is the maximum value of n i over i = 1, 2, . . . , m. This interval is approximately (−1/2, 1/2) for large n and contains (−1/2, 1/2) for all n.
%QLS does not allow for application of the independent structure (identity matrix) because QLS is identical to GEE for this structure. In addition, application of the unstructured matrix is complex for QLS. In SAS, we therefore suggest application of PROC GENMOD with the repeated statement and the option corr=ind for application of the independent correlation structure, or corr=un for application of the unstructured correlation matrix for GEE.

Estimating equations and the algorithm for quasi-least squares
QLS is a two-stage procedure for estimation of the regression parameter β and the correlation parameter α. In stage one it alternates between updating the GEE estimating equation for β and the estimating equation for α. After convergence in stage one, an updated estimate of α is obtained by solving the stage two estimating equation for α. A final estimate of β is then obtained by solving the GEE estimating equation for β. The estimating equations are as follows: The GEE estimating equation for β: The stage one estimating equation for α: where Z i (β) = (z i1 , . . . , z in i ) and z ij = (y ij − µ ij )/h(µ ij ).
The stage two estimating equation for α (evaluated at the stage one estimatê α): Note that the stage one and two estimating equations (2) and (3) involve the first derivative of the inverse of R i (α), which may not be easily obtained for some correlation structures, e.g., the tri-diagonal structure. However, it can be easily shown that where ∂R i (δ) ∂δ is computed by taking the derivative of each element in R i (α). For example, for the tri-diagonal structure, ∂R i (δ) ∂δ is an n i × n i matrix with ones on the off-diagonal and zeros elsewhere, i.e., the (j, k)th element of ∂R i (α) ∂α is 1 if |j − k| = 1 and is 0 otherwise.
%QLS involves application of the following algorithm to obtain the estimates of β and α: 1. Fit a generalized linear model with an appropriate link and variance function using PROC GENMOD in SAS to obtain an initial estimate for β.
2. In stage one of QLS, repeat the following steps until a pre-specified convergence criterion is met for estimating β and α.
(i) Compute the Pearson residuals at the current estimate of β, where the jth Pearson residual on subject i is given by (ii) Compute the estimate of α by solving the QLS stage one estimating equation (2) for α using a pre-specified working correlation structure.
(iii) Update the estimate of β by solving the GEE estimating equation (1) for β at the current estimate of α.
3. After convergence in the estimates of β and α in stage one, obtain an updated estimate of α by solving the stage two estimating equation (3) for α. Then obtain a final estimate of β by solving the GEE estimating for β that is evaluated at the stage two estimate of α.

Algebraic expressions of the estimating equations for β and α
At each iteration, %QLS uses the Gauss-Newton method to compute the current estimate β * using the previous estimates β and α as follows: Although (4) can be used to obtain solutions to the stage one and two estimating equations (2) and (3), explicit expressions for α are preferred since their application is computationally more efficient. Except for the tri-diagonal structure, %QLS implements explicit expressions for α that were obtained by solving the stage one and two estimating equations (2) and (3) for α, for particular working structures.
For the AR(1) structure with unbalanced data, Shults and Chaganty (1998) proved that the feasible stage one estimateα A-ONE can be expressed as: while Chaganty and Shults (1999) showed that the stage two estimateα A-TWO is given bŷ For the equicorrelated structure with unbalanced data, Shults (1996) showed that the stage one estimating equation of α has a unique feasible solutionα A-ONE given by where I n i is the identity matrix and e i is a n i × 1 column vector of ones. Shults and Morrow (2002) obtained the stage two estimateα E-TWO given by For the Markov structure with unbalanced data, Shults (1996) obtained the QLS stage one estimating equation for α as follows: The stage two estimating equation of the Markov structure (Chaganty and Shults 1999) is then given by

Estimates of the covariance matrix
%QLS provides two types of estimated covariance matrices of the estimated regression parameter, the model-based and robust sandwich-based estimates. The robust sandwich-based estimate of the covariance matrix is the default matrix. It is often preferred when there is any uncertainty in the choice of working correlation structure. However, the standard errors are not necessarily smaller for the sandwich covariance matrix. Therefore, application of both the model based and sandwich based covariance matrices might be considered in an analysis.
%QLS computes the model-based covariance matrix as follows: whereφ is the estimate of the dispersion parameter with or without the bias correction given byφ respectively, where p is the dimension of the regression parameter β and N = m i=1 n i . By default, %QLS provides the bias corrected estimate of the dispersion parameterφ BC . For the robust sandwich-based estimate of the covariance matrix, %QLS computes %QLS also computes the (1 − α)100% confidence interval, and a p value for testing each individual regression parameter β j = 0, based on either the model-based or the robust sandwichbased estimate of the covariance matrix of β.

Prentice boundary of the estimate of α for analyzing binary data
Consider longitudinal binary measurements y i1 , . . . , y in i with expected values E(y ij ) = Pr(y ij = 1) = p ij , with q ij = Pr(y ij = 0) = 1 − p ij , and the correlation between measurements y ij and y ik denoted by Corr(y ij , y ik ). An important feature of correlated binary data is that the p ij , q ij and Corr(y ij , y ik ) completely determine the bivariate distribution of y ij and y ik because the pair-wise probabilities pr(y ij , y ik ) can be expressed as Prentice (1988) pointed out that the probabilities in (5) will be non-negative, i.e., Pr(y ij , y ik ) ≥ 0, only if the correlations satisfy the following constraints that depend on the marginal means: for all i = 1, . . . , n.
A main problem with violation of the Prentice constraints is that this leads to a lack of theoretical justification for the existence of a valid joint probability mass function, given the estimated parameter values of β and α. However, as Molenberghs and Verbeke (2005) pointed out, the estimate of the working correlation is "merely a device to provide consistent and asymptotically normal point estimates for the marginal regression parameters which should not the made a part of formal inference." Rochon (1998) also noted that violation of the Prentice constraints appears to cause no difficulty in practice, although the potential for violation should be taken into account in the design phase of a study. For example, when conducting sample size calculations, values of α that satisfy the Prentice constraints should be considered.  also suggest that a severe violation of bounds might be used to remove a particular working structure from a list of candidate working structures.
As noted in Section 1, the existing GEE and user-written QLS software (in Stata, MATLAB, and R) do not provide the estimated Prentice constraints (6) for analysis of binary data. %QLS, on the other hand, computes the estimated Prentice constraints for each correlation structure, and alerts users to a potential problem by providing a warning message if the estimate of α does not fall within the interval.

List of parameters in the macro
A complete list of the parameters in %QLS is as follows: %QLS (data, y, x, id, time, link, corr, robust, dispersion, alpha, initialout, stage1out, stage2out, cmatrix, reference, converge, maxiter) where data is the name of the data set in the usual longitudinal data format to be read in PROC GENMOD. The data set must not contain any missing values.
y is the outcome variable.
x are the predictors (covariates) in the regression model.
id is the ID variable.
time is the time variable.
link equals 1 for the identity link, 2 for the logit link, and 3 for the log link (default is 1).
robust equals 1 for robust sandwich-based standard errors, 2 for model-based standard errors (default is 1).
alpha is the significance level to be used in testing each regression coefficient (default is 0.05).
initialout equals 1 creates a SAS permanent data set in the current work space for the initial output, 0 otherwise (default is 0).
stage1out equals 1 creates a SAS permanent data set in the current work space for the stage 1 output, 0 otherwise (default is 0).
stage2out equals 1 creates a SAS permanent data set in the current work space for the stage 2 output, 0 otherwise (default is 0).
cmatrix equals 1 creates a SAS permanent data set in the current work space for the stage 2 correlation matrix, 0 otherwise (default is 0).
reference equals 1 prints out the reference information, 0 otherwise (default is 0).
converge is the convergence criterion for estimation of β and of α (default is 0.0001).
maxiter is the maximum number of allowable iterations for estimation of β and α (default is 100).
Note that many of the parameters have default values, so that they do not have to be specified. %QLS assumes the usual longitudinal data format to be read in PROC GENMOD without any missing observation contained in the data. If there are missing observations in the data that are coded as missing, these must be deleted prior to implementation of %QLS. This is equivalent to assuming that the observations are 'missing completely at random' (MCAR), as in the usual GEE analysis implemented by PROC GENMOD with the repeated statement. Backer et al. (1996) reported a 12-week, randomized, double-blind, multi-center trial for the comparison between the standard oral drug (terbinafine 250mg daily) and the experimental oral drug (theritraconazole 200mg daily) in the treatment of a common toenail infection called dermatophyte toe onychomycosis (DTO) which affects more than 2% of the British population (Roberts 1992). The data was also described in Molenberghs and Verbeke (2005), is available along with this manuscript, and can also be downloaded from http://www.cceb.upenn.edu/ sratclif/QLSproject.html. A total of 189 patients were randomized to each treatment group and followed over 12 weeks, with measurements taken at baseline, and at months 1, 2, 3, 6, 9, and 12. The primary outcome measure was the severity of the toe nail infection, that was defined as 1 if the infection was severe, and 0 otherwise. For the purpose of demonstration, we first consider a simple logistic regression model for comparison of the time-averaged treatment difference between the standard treatment group and the experimental treatment group.

Clinical trial example
The toenail data, toenail.txt, contains 4 variables: time, treatment, y, and id, where time is the time variable, treatment is the treatment indicator (1 for the standard arm, and 0 otherwise), y is the outcome variable (1 if the infection is severe, and 0 otherwise), and id is the ID number assigned to each patient. Let y ij follow the Bernoulli distribution with pr(y ij = 1) = p ij such that where treatment i is the treatment indicator that equals 1 if the ith subject is assigned to the standard drug and 0 otherwise. One advantage of implementing the model in (7) is that the upper limit of the Prentice constraint for α is always equal to 1. In general, any model that involves covariates that do not vary within clusters will have an upper Prentice boundary (for α) of 1, due the fact that the estimated probabilities p ij will not vary within subjects (clusters) when only cluster level covariates are considered in the model.
Before we demonstrate %QLS to fit the model (7), first we assume that the data, toenail.txt, is read into the current SAS workspace, e.g., data toenail; infile "D:\toenail.txt"; input time treatment y id; run; where we assume that the toenail.txt is stored in the D:\ directory.

Application of the AR(1) correlation structure
The following codes can be used to analyze the toenail data using the QLS regression model (7) with the AR1 structure: %QLS(data = toenail, y = y, x = treatment, id = id, time = time, link = 2, corr = 1); The estimated standard errors are the robust sandwich-based estimates that are set by default.
The outputs from the code are as follows: The output of %QLS contains the model information followed by the estimates of the stage one and two estimates of β and of α. As noted earlier, the upper limit of the Prentice interval is equal to 1 in the above output. From the stage two output, the p value corresponding to the time-averaged treatment effect is equal to 0.38, which suggests that there is no significant time-averaged treatment difference between the standard drug versus the experimental drug.

Application of the Markov correlation structure
Here we demonstrate application of the Markov correlation structure, which is currently unavailable for GEE. This is important because the toenail data is unequally spaced in time, e.g., the variable time in this data set indicates the month of measurement (post baseline) and takes value in {0, 1, 2, 3, 6, 9, 12} for each subject. Therefore, the Markov correlation structure is preferable for the analysis of this trial. The following code can be used to fit the model (7) with the Markov correlation structure: %QLS(data = toenail, y = y, x = treatment, id = id, time = time, link = 2, corr = 3); Here we omit the stage one output, and present the initial and stage two outputs.

Quasi-Least Squares SAS
The results are similar to those for the AR(1) structure, with an estimated α in stage two of α = 0.79 versusα = 0.74 for the AR(1) structure. Further, the p value with respect to the time-averaged treatment effect is 0.30. Hence, the same conclusion follows as with the AR(1) structure.

Application of the equicorrelated and tri-diagonal structures
Although the equicorrelated and tri-diagonal structures may not be the best candidate correlation structures for the toenail data, we include the implementation of theses structures for the purpose of demonstration. Here we only present the codes for fitting the model (7) with the equicorrelated and tri-diagonal correlation structures, but omit their outputs. For the equicorrelated correlation structure, we have %QLS(data = toenail, y = y, x = treatment, id = id, time = time, link = 2, corr = 2); For the tri-diagonal correlation structure, we have %QLS(data = toenail, y = y, x = treatment, id = id, time = time, link = 2, corr = 4);

Violation of the Prentice boundary
Here we briefly demonstrate violation of the Prentice constraints using the toenail data. Consider the following model for testing the treatment effect over time: where treatment i is the treatment indicator that equals 1 if the ith subject is assigned to the standard drug and 0 otherwise, time ij represents the time of the measurement collected on subject i at the jth measurement occasion, and treatment i × time ij is the treatment by time interaction.
To fit the model in (8)  To fit the model (8) with the AR(1) structure, we use %QLS(data = toenail, y = y, x = treatment time interaction, id = id, time = time, link = 2, corr = 1); Here we provide the initial and stage two outputs for the AR(1) structure.   From the stage two output, the estimated stage two α is 0.71, which exceeds the upper limit (0.31) of the Prentice constraints. It is also important to note that although the results are not shown here, application of GEE for the AR(1) structure would also result in a violation of the Prentice constraints.

Quasi-Least
The above results suggest something different than the time-averaged model, which is that there is a difference in the likelihood of high severity between the two treatment conditions. However, graphical displays (not shown) suggest that the assumption of linearity in the logit is not appropriate for these data. For an extensive discussion of approaches for assessment of the linearity in the logit assumption, see Hilbe (2009).
For demonstration purposes, we now present a model that did not violate the linearity in the logit assumption, and that also did not result in a violation of the Prentice bounds for α. This model contains indicator variables for the second (1 month), third (2 month), fifth (6 month), and seventh (12 month) measurements on each subject, an indicator variable for the standard treatment, and a visit seven (12 month) by treatment indicator variable. (All other treatment by visit indicator variables did not differ significantly from zero). The corresponding data set, toenail2.txt, can be also downloaded from www.cceb.upenn.edu/~sratclif/QLSproject. html. Here we present the code, and the initial and stage two outputs. First, we read in the data: data toenail2; infile "D:\toenail2.txt" delimiter="," ; input y time2 time3 time5 time7 treatment time7_trt id time; run; Next, we fit the QLS model. %QLS(data = toenail2, y = y, x = time2 time3 time5 time7 treatment time7_trt, id = id, time = time, link = 2, corr = 1);    For the above model, the Prentice constraints are not violated. In addition, the results seem more in agreement with the time-averaged model, which also did not identify a significant difference between the two treatment conditions with respect to severity of toenail infection.

Discussion
%QLS can fit a model to longitudinal data using the method of quasi-least squares, and can consider data which follows the normal, Bernoulli, or Poisson distribution with the AR(1), Markov, equicorrelated, and tri-diagonal structures. The syntax and the output of %QLS are similar to the existing GEE procedures in SAS, i.e., PROC GENMOD with the repeated statement, that would be familiar to SAS users. %QLS assumes that there are no missing observations in the dataset. Hence, any observations that are coded as missing should be deleted prior to the implementation of the macro. As noted earlier, this is equivalent to assuming that the data is missing completely at random (MCAR), which is a typical assumption in a GEE analysis.
Further updates of %QLS will be made to allow for implementation of other structures that are not currently available for GEE, including the familial structure, and Kronecker product-based structures which account for multiple sources of correlation, e.g., multi-outcome longitudinal data for which the sources of correlations come from within subjects over time, and between multiple outcomes. It will also be of interest to develop and compare existing methods for selection of a working correlation structure and for assessment of goodness of fit of QLS (and GEE) models.