SIMEX R Package for Accelerated Failure Time Models with Covariate Measurement Error

It has been well documented that ignoring measurement error may result in substantially biased estimates in many contexts including linear and nonlinear regressions. For survival data with measurement error in covariates, there has been extensive discussion in the literature with the focus typically centered on proportional hazards models. The impact of measurement error on inference under accelerated failure time models has received relatively little attention, although these models are very useful in survival data analysis. He et al. (2007) discussed accelerated failure time models with error-prone covariates and studied the bias induced by the naive approach of ignoring measurement error in covariates. To adjust for the eﬀects of covariate measurement error, they described a simulation and extrapolation method. This method has theoretical advantages such as robustness to distributional assumptions for error prone covariates. Moreover, this method enjoys simplicity and ﬂexibility for practical use. It is quite appealing to analysts who would like to accommodate covariate measurement error in their analysis. To implement this method, in this paper, we develop an R package for general users. Two data sets arising from clinical trials are employed to illustrate the use of the package.


Introduction
There has been extensive interest in discussing inference methods for survival data with covariates subject to measurement error. It is known that standard inferential procedures may produce biased estimation if measurement error is not properly taken into account (e.g., Carroll et al. 2006). With proportional hazards models a number of methods have been proposed to correct bias induced by measurement error (e.g., Prentice 1982;Li and Lin 2003;Yi and Lawless 2007).
Although the impact of covariate measurement error on inferential procedures is well understood for proportional hazards models, there is little discussion about its impact under accelerated failure time (AFT) models, which have proved to be useful in survival analysis (e.g., Lawless 2003).
Unlike the proportional hazards model that focuses modeling on the hazard function, an AFT model directly facilitates the relationship between the failure time (or its transformation) and covariates via a regression model. This formulation allows a direct and transparent interpretation of covariate effects on the change of the failure time. As noted by D. R. Cox (Reid 1994, p. 450), an AFT model is "in many ways more appealing because of its quite direct physical interpretation". In certain applications AFT models could provide better fit to data than proportional hazards models (e.g., Zeng and Lin 2007).
Under Weibull regression models, Giménez et al. (1999Giménez et al. ( , 2006 investigated inference methods using the corrected score approach discussed by Nakamura (1992). With general AFT models, He et al. (2007) discussed inference procedures to account for effects of covariate measurement error using a simulation-extrapolation (SIMEX) approach. The developed SIMEX method for AFT models is simple to implement and flexible to cover many applications. Moreover, this method is robust in a sense that distributions of covariates, including error-prone covariates, are left unspecified. Because of those features, this method becomes quite appealing to analysts who would like to accommodate covariate measurement error in their analysis of survival data.
Despite that there have been great advances on methodology of addressing covariate measurement error for survival analysis, the methods developed in the literature have not enjoyed widespread use in practice. Reluctance to adopt these methods may be due, in part, to the lack of available software to implement these methods. To address this practical issue, we develop an R package (R Development Core Team 2011), entitled simexaft, to implement the SIMEX method discussed in He et al. (2007) so that this method can be accessible for general users.
The remainder is organized as follows. Section 2 introduces the notation and model formulation. In Section 3 we describe the SIMEX method and its implementation in R. The developed R package is illustrated in Section 4 with two survival data sets: one arising from a subset of Busselton Health Study (Knuiman et al. 1994), and the other from a multi-center clinical trial (Fuchs et al. 1994). General discussion is included in the last section. The developed R package is available from the Comprehensive R Archive Network (CRAN) at http//CRAN.R-project.org/package=simexaft.

Notation and framework
For i = 1, 2, . . . , n, let T i and C i be the failure and censoring times for subject i, respectively, and δ i be the censoring indicator variable taking value 1 if T i ≤ C i and 0 otherwise. Denote t i = min(T i , C i ). Independent censoring is assumed. Let x i = (x i1 , x i2 , . . . , x ip ) be the p × 1 covariates subject to possible measurement error, and z i the vector of covariates free of error.
Response variable Y i = log(T i ) is characterized by the AFT model, given by where β = (β x , β z ) is the vector of regression parameters of interest, and β z may include the intercept coefficient. Here i assumes a distribution G(·) with parameters α. Common choices of the distribution G(·) include Weibull, exponential, Gaussian, logistic, log-normal and log-logistic distributions (e.g., Lawless 2003).
Let W i be an observed version of covariate x i . W i and x i are assumed to follow a classical additive measurement error model. That is, conditional on x i and z i , where e i follows, independent of x i and z i , a normal distribution with mean 0 and covariance matrix Σ e = [σ jk ] p×p . Here we assume nondifferential measurement error, i.e., . This mechanism says that the contribution from the observed W i is not informative, given the true covariates x i and z i . This mechanism applies commonly to many practical problems, especially when the true and observed covariates occur at a fixed time point and the response is measured at a later time (Carroll et al. 2006, p. 36).
The parameters in Σ e can be estimated, for example, when repeated measurements for x i are available. In other situations, the parameters in Σ e may be assumed known which is, for instance, based on prior knowledge or other similar studies. When conducting sensitivity analysis to assess the impact of different degrees of measurement error on estimation of the response parameters, the parameters in Σ e are typically specified to be known based on background information about the measurement process.
Let θ = (β , α ) be the parameters for the response model and q be its dimension. Primary interest often centers on estimating parameters β in order to study the relationship between the response Y i and covariates (x i , z i ) . Under the true response model (1) let be the likelihood contributed from subject i, where g(·) is the density function corresponding to the distribution function G(·), and y i = log(t i ). Denote the log likelihood as If there is no measurement error present in covariates, then the maximum likelihood estimator θ is obtained by solving and this estimator is consistent for θ and has an asymptotic normal distribution. However, when error is present in covariates, the resulting estimator can be substantially biased (e.g., Li and Lin 2003;Yi and He 2006;He et al. 2007).

SIMEX algorithm
To conduct valid inference for θ in the presence of covariate measurement error, He et al. (2007) developed a SIMEX method. The basic idea of SIMEX adjustment is to add additional variability to the observed measurement W i in order to establish the trend how measurement error-induced bias may be related to the variance of induced measurement error, and then extrapolate this trend back to the case without measurement error. (Carroll et al. 2006, p. 97). This method is robust in a sense that the distribution of x i is unspecified. Moreover, it is easy to implement. In this subsection we describe the SIMEX method developed in He et al. (2007). Typically, we consider two practical cases for the parameters in Σ e : (i) the parameters in Σ e are given as fixed values; and (ii) the parameters in Σ e are not known, but repeated measurements of x i are available. The SIMEX method applies to both cases with the same steps except for the first step of data simulation. Details are given as follows.

Simulation step
In case (i) in which the parameters in Σ e are known, we generate, for each i = 1, . . . , n, a sequence of variables The array {W i (b, λ)} of artificially simulated data will be used in the next step for estimation of the parameters.
In case (ii) in which the parameters in Σ e are unknown but repeated measurements for x i are available, we need to modify Equation 4 by making it be computable because Σ e contains unknown parameters. To be specific, let V ij , j = 1, . . . , m i , denote the repeated measurements for true covariate x i , i.e., V ij and x i are linked by the model where Σ e is unknown, and e ij 's are independent of x i , z i , Y i . Instead of using Equation 4 to generate W i (b, λ), we set, for given b and λ, is to use a normal variate generation. That is, for each b and i = 1, . . . , n, independently generate m i normal random variates d ij (b), j = 1, . . . , m i , from a standard normal distribution N (0, 1), (Devanarayan and Stefanski 2002).

Estimation step
For given λ and b, we obtain an estimate θ(b, λ) by solving Equation 3 with x i replaced by W i (b, λ). For r = 1, 2, . . . , q, let θ r (b, λ) denote the rth component of θ(b, λ), and Ω r (b, λ) be the corresponding variance estimate, given by the rth diagonal element of Journal of Statistical Software -Code Snippets

5
The following quantities are then calculated:
Although the theory of the SIMEX method is not trivial, an example from simple linear regression can well illustrate the idea of this method. Suppose the response variable Y and the covariate x is modeled as where has mean 0. If replacing x with its observed measurement W , modeled by W = x + e where e has mean 0 and variance σ 2 e , and is independent of and x, then the resulting least squares estimator β * x for β x converges in probability to β * x is the variance of x. To see how the bias may be related to the degree of measurement error in x, we perturb W by adding additional error to create )β x . This expression indicates the dependence of the asymptotic bias on the magnitude of measurement error -the less degree of measurement error (equivalently, a smaller λ), the smaller asymptotic bias. In particular, if λ shrinks to 0, β x (b, 0) recovers the naive estimator β * x ; but if λ takes value -1, then the limit β * x (b, −1) is identical to the true parameter β x .
The SIMEX method was initially proposed by Cook and Stefanski (1994) for analyzing complete univariate data with error-prone covariates under parametric models. He et al. (2007) generalized this method to handle survival data for which censoring is a typical feature. The SIMEX approach is very appealing because of its simplicity to implement and no requirement of modeling the true covariates x i (often not observable). To implement this method, we need to address a few issues. The specification of B or Λ is not unique. Technically speaking, a larger value of B leads to a better SIMEX estimator in the sense that Monte Carlo sampling error in the simulation step can be reduced. For practical use, however, choosing B = 50, 200 or 500, and taking Λ to be the equal cut points of interval [0, 1] or [0, 2] with M = 5, 10 or 20, can often lead to fairly reasonable SIMEX estimates (e.g., Carroll et al. 2006). Another source of variation in obtaining SIMEX estimators lies in the choice of an extrapolation function. The exact extrapolation function is usually not known. Instead, a user-specified approximation is employed, hence SIMEX estimators are usually approximately consistent. Linear regression or quadratic regression function tends to be the most widely used replacement of the exact extrapolation function. Although SIMEX estimators are often not exactly consistent, they greatly outperform naive estimators for which measurement error is not properly taken into account. The performance of the SIMEX method has been shown superior in some highly nonlinear models (e.g., Carroll et al. 1996;Wang et al. 1998).

Implementation in R
The SIMEX procedures described above are implemented in the package simexaft (which depends on packages survival, Therneau and Lumley 2011, and mvtnorm, Genz et al. 2011and Genz and Bretz 2009).
Specifically, the function simexaft produces the SIMEX estimates for interesting parameter β and other parameters along with their associated SIMEX standard errors and p values. The form of calling function simexaft is given by simexaft(formula = formula(data), data = parent.frame(), SIMEXvariable = indicator, repeated = FALSE, repind = list(), err.mat = err.mat, B = 50, lambda = seq(0, 2, 0.1), extrapolation = "quadratic", dist = "weibull") with the arguments being described as follows: formula: specifies the model to be fitted, with the variables coming with data. This argument has the same format as the formula argument in the function survreg from survival, taking the form Surv(time, censoring indicator)~covariates.
SIMEXvariable: the index of the covariate variables that are subject to measurement error.
repeated: set to TRUE or FALSE to indicate if there are repeated measurements for the mis-measured variables, i.e., corresponding to cases (i) and (ii) in Section 3.1, respectively.
repind: the index of the repeated measurement variables for each mis-measured variable. It is of an R list form. If repeated = TRUE, repind must be specified.
err.mat: specifies the covariance matrix in error model (2). If repeated = FALSE, err.mat must be specified.
B: the number of simulated samples for the simulation step. The default is set to be 50.
extrapolation: specifies the function form for the extrapolation step. The options are "linear" and quadratic". The default is set to be "quadratic".
dist: specifies a parametric distribution that is assumed in AFT model (1). This argument is the same as the dist option in the function survreg, and it can take distributions such as "weibull", "exponential", "gaussian", "logistic", "lognormal", and "loglogistic".

Examples
To illustrate the usage of the developed R package simexaft, we apply the package to two real data sets, corresponding to cases with or without repeated measurements for errorcontaminated covariates.
The first example is based on a subset of the real data arising from the Busselton Health Study (Knuiman et al. 1994). The original data were analyzed in He et al. (2007). The data set analyzed here includes survival information for a randomly selected subset of 100 females. The survival time is taken as the age at death, as in He et al. (2007). Systolic blood pressure (x i1 ), cholesterol level (x i2 ), age at registration (z i1 ), body mass index (z i2 ) and smoking status are risk factors related to mortality. Systolic blood pressure is rescaled as log(x i1 − 50), as in Carroll et al. (2006, p. 113). Smoking status is classified by two dummy indicators, denoted by z i3 and z i4 , where z i3 = 1 indicates an individual is an ex-smoker, and 0 otherwise; z i4 = 1 represents that an individual is a current smoker, and 0 otherwise. It is known that measurements of risk factors x i1 and x i2 are subject to substantial error due to the nature of these covariates.
The logarithms of the failure times are postulated by model where error i follows a specific distribution. The standard extreme value distribution is assumed for an illustration. We assume that errors in both risk factors x i1 and x i2 can be represented by model (2).
The developed R package simexaft can be downloaded and installed from CRAN. The package can then be loaded in R:

R> library("simexaft")
Next, load the data that are properly organized with the variable names specified. In the example here, the data set named as BHS is included by issuing R> data("BHS") R> dataset <-BHS R> dataset$SBP <-log(dataset$SBP -50) For illustrative purposes, we use settings with B = 50, λ M = 2 and M = 20. In this example, we assume the parameters in Σ e are known. This is a typical case when conducting sensitivity analysis. Here set σ 2 11 = σ 2 22 = 0.75 2 and σ 12 = σ 21 = 0 as an example. The naive AFT approach without considering measurement errors in covariates gives the output: R> formula <-Surv(SURVTIME,DTHCENS)~SBP + CHOL + AGE + BMI + + SMOKE1 + SMOKE2 R> out1 <-survreg(formula = formula, data = dataset, dist = "weibull") R> summary(out1) To adjust for possible effects of measurement error in variables SBP and CHOL, we call the developed function simexaft for the analysis: R> set.seed(120) R> ind <-c("SBP", "CHOL") R> err.mat <-diag(rep(0.5625, 2)) R> out2 <-simexaft(formula = formula, data = dataset, SIMEXvariable = ind, + repeated = FALSE, repind = list(), err.mat = err.mat, B = 50, + lambda = seq(0, 2, 0.1), extrapolation = "quadratic", dist = "weibull") R> summary (out2 Now we demonstrate the use of simexaft for the case that the parameters in Σ e is unknown, but repeated measurements for error-prone covariates are available. This is illustrated by an example from a study of pulmonary exacerbations and rhDNase. Fuchs et al. (1994) reported on a double-blind randomized multicenter clinical trial designed to assess the effect of rhDNase, a recombinant deoxyribonuclease I enzyme, versus placebo on the occurrence of respiratory exacerbations among patients with cystic fibrosis. The rhDNase operates by digesting the extracellular DNA released by leukocytes that accumulate in the lung as a result of bacterial infection, and so it was expected that aerosol administration of rhDNase would reduce the incidence of exacerbations (Cook and Lawless 2007, p. 365).
Six hundred and forty five patients were recruited in this trial. Each subject was randomly assigned to treatment or placebo group, and was followed up approximately 169 days for pulmonary exacerbations. Data on the occurrence and resolution of all exacerbations were recorded. The forced expiratory volume (FEV) was considered a risk factor and was measured twice at randomization. The response is defined as the logarithm of the time from randomization to the first pulmonary exacerbation.
To investigate the effect of the FEV on the time to first pulmonary exacerbation, we postulate the model where trt is the indicator of treatment, and error i follows a specific distribution. The standard extreme value distribution is taken again for illustrations. We assume that measurement errors in risk factors FEV can be represented by model (2).
First, load the data, named rhDNase, into R by issuing R> data("rhDNase") Two repeated measurements for covariate FEV , fev and fev2, are called in simexaft using the option repeat = TRUE, along with a list of index of the repeated measurements.
Existing survreg can provide the analysis with no measurement error effects properly taken into account, by merely taking the FEV measurement as the average of the two repeated observations: R> rhDNase$fev.ave <-(rhDNase$fev + rhDNase$fev2)/2 R> output1 <-survreg(Surv(time2, status)~trt + fev.ave, data = rhDNase, + dist = "weibull") R> summary(output1) Similar analysis results can be obtained if using the simexaft function to accommodate covariate error effects. In this example, we note that variation in the two repeated measurements of FEV is too minor to suggest different results obtained from the methods of ignoring or accounting for covariate measurement error. Here we perturb the two repeated observations by adding additional noise, e.g., 15% of sample standard error, and then apply the developed R function to produce the output. This artificial procedure may not be customary when one focuses on a genuine data analysis. However, it is useful for illustration purposes. Moreover, this approach can provide some insights if conducting sensitivity analyses is of prime interest.

Discussion
The impact of measurement error in covariates is well documented for survival data that are typically postulated by proportional hazards models, but there is relatively little discussion on AFT models. Yi and He (2006) explored the measurement error problem for bivariate q q q q q q q q q q q q q q q q q q q q q survival data under AFT models, but their discussion focused on the AFT models with normal error distributions. To accommodate general distributional forms, He et al. (2007) describe a simulation based method that is simple to implement. For practical interest, we develop an R package simexaft to adjust for biases induced by covariate measurement error under AFT models. Our illustrations show that the developed package is simple to use. It is anticipated that such development is of great interest to data analysts when handling survival data with covariate measurement error. The R package simexaft is available from the Comprehensive R Archive Network at http//CRAN.R-project.org/package=simexaft.