Categorizing a Continuous Predictor Subject to Measurement Error

Epidemiologists often categorize a continuous risk predictor, even when the true risk model is not a categorical one. Such categorization is thought to be more robust and interpretable, and thus their goal is to fit the categorical model and interpret the categorical parameters. We address the question: with measurement error and categorization, how can we do what epidemiologists want, namely estimate the parameters of the categorical model that would have been estimated if the true predictor had been observed? We develop a general methodology for such an analysis, and illustrate it in linear and logistic regression. Simulation studies are presented, and the methodology is applied to a nutrition data set. A discussion of alternative approaches is also included.


Introduction
Fitting models by categorizing a continuous risk predictor is a common practice in epidemiology. Among many recent examples, see [20,19,1,5,10] and [25]. A look at current issues of epidemiology journals will uncover many more examples. An important issue is that, generally in these problems, there are many covariates other than the main risk predictor.
The appeal of categorization in interpreting results is clear. If we have a risk predictor X, and we categorize it into J levels (C_1, ..., C_J), one can compare the highest level of the predictor, C_J, to the lowest level, C_1, and if they are statistically significantly different, one can then conclude that it is better to be in the class that has the lowest risk, and quantify how much better.
One important technical point is that categorization implicitly posits an induced model based on the categorized variable X. In some cases, the induced model actually fits the data, e.g., when the response Y actually depends on X only through its categorized version, or when there are no other covariates; see the next paragraph. In other cases, and generally, the induced model does not fit the data, and we call this model misspecified. In particular, suppose that there are covariates other than X, say Z. Consider a binary response Y, let H(·) be the logistic distribution function, and suppose that the true risk model in the continuous scale is pr(Y = 1|X, Z) = H{m(X, Z, β)} for some continuous function m(·). Then, even if there is no measurement error, if any of the covariates Z are related to Y in this continuous model, or if there is an interaction of X and Z on Y, categorizing X into J levels and plugging that into m(X, Z, β) in place of X leads to a misspecified model as we have defined it. Measurement error in this context makes things even more difficult. When there is no measurement error, [26] gives a characterization of what is actually being estimated in misspecified models: while we do not emphasize it, our paper extends this characterization to the measurement error case. A relevant paper that first solved this particular problem is [14], which was also cited in [26].
This slightly different terminology is motivated by the following example. Suppose that Y is binary, there are no additional covariates Z, and simply define π_j = pr(Y = 1|X* = j), where X* is the categorized predictor. Then we can write, correctly, that pr(Y = 1|X* = j) = H(θ_j) by making the obvious identification θ_j = H^{-1}(π_j). Thus, categorization does result in an induced correctly specified logistic model, just not the one in the continuous scale. A logistic regression analysis of Y on the categories of X* then will estimate the θ_j consistently.
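As a quick numerical check of this identity, the sketch below (with illustrative values of π_j and our own variable names) simulates Y from category-specific probabilities and recovers θ_j = H^{-1}(π_j); with a saturated indicator design, the logistic maximum likelihood estimate of each θ_j is exactly the logit of the within-category sample proportion, so no iterative fitting is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Illustrative category probabilities pi_j = pr(Y = 1 | X* = j).
pi = np.array([0.10, 0.20, 0.30, 0.45])
x_star = rng.integers(0, 4, size=n)        # categorized predictor X*
y = rng.binomial(1, pi[x_star])            # Y | X* = j ~ Bernoulli(pi_j)

# With a saturated indicator design, the logistic MLE for theta_j is
# exactly the logit of the within-category sample proportion of Y = 1.
phat = np.array([y[x_star == j].mean() for j in range(4)])
theta_hat = np.log(phat / (1 - phat))
theta_true = np.log(pi / (1 - pi))         # theta_j = logit(pi_j)
```

With n this large, theta_hat matches theta_true to about two decimal places, illustrating that the categorized logistic model is correctly specified in this setting.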
Our point is not to try to get epidemiologists to change their common practice. Instead, we study the effect of measurement error when a continuous predictor variable subject to measurement error is categorized. Our goal is to answer the question: with measurement error in this context, how can we (a) obtain consistent estimates of what epidemiologists would have obtained if X were actually observed; and (b) develop consistent standard errors.
We answer the question above in a general way. Section 2 gives basic technical background. Section 3 provides a general methodology for answering questions (a) and (b) above. Section 4 presents simulation studies for linear and logistic regression that show the good behavior of our methodology, both in terms of bias and confidence interval coverage. Section 5 shows applications of our approach by using data from the Eating at America's Table Study [23]. Section 6 presents a discussion of other potential approaches to categorization and how those approaches compare to ours. Sketches of technical arguments are in the appendix.

Remark 1. As discussed above, categorization leads to a misspecified model. It is also well-known that such categorization generally leads to differential measurement error [11,13,3], and thus to additional complications over simply fitting a measurement error model. Chapters 6.1-6.2 of [13] have a detailed discussion of the case that the continuous variable is dichotomized, calling the result differential by dichotomization. We are thus assuming that the true risk model in a continuous variable X is not categorical in X. If it were, consult [13] and [3], who also discuss the issue of doing a measurement error analysis in this case, especially the difficult issues of computation and identifiability, both theoretical and practical.

Illustration: A special case of linear regression
It is instructive to consider a special case, namely linear regression. Doing so will set the stage for our general method. The response is Y, the scalar predictor subject to error is X, the observed scalar predictor is W, there are predictors Z measured without error, and we define Z = (1, Z^T)^T to allow for an intercept. The regression model in the continuous predictor X is Y = Xβ_1 + Z^Tβ_2 + ε, where ε has mean zero and is independent of (W, X, Z). There are j = 1, ..., J categories (C_1, ..., C_J): the number of categories J is set by the investigator, and is generally 3 (tertiles), 4 (quartiles) or 5 (quintiles), depending on the scientific field and the investigator's interests. Here M(X, Z) = {I(X ∈ C_1), ..., I(X ∈ C_J), Z^T}^T. If X could be observed, then we would also immediately obtain an estimate of β = (β_1, β_2^T)^T. By [26], when X is observed, what epidemiologists estimate by using the categorized M(X, Z) is Θ, where, based on the normal equations for the categorized predictor,

0 = E[M(X, Z){Y − M^T(X, Z)Θ}]. (1)

The estimate of Θ is the solution to 0 = n^{-1} Σ_{i=1}^n M(X_i, Z_i){Y_i − M^T(X_i, Z_i)Θ}, and this is a consistent estimate of Θ. Comparisons between categories j and k for j, k ≤ J, say, are θ_j − θ_k. However, when X is not observable, estimating the solution to (1) has to be based solely on (Y, W, Z). In (1), it makes sense that if one believes the true regression model is linear in (X, Z), then, at some point, an estimate of β can be obtained via a measurement error analysis if there are sufficient data to do so.
Solving (1) based only on the observed W, though, is not so easy, and it is clear that some part of the relationship between W and X given Z is going to need to be specified, as it does in any general measurement error analysis. One way to do this is to define

G(X, Z, Θ, β) = M(X, Z){Xβ_1 + Z^Tβ_2 − M^T(X, Z)Θ}, (2)

and then define Q(W, Z, Θ, β) = E{G(X, Z, Θ, β)|W, Z}. Since 0 = E{Q(W, Z, Θ, β)}, Θ can be estimated by solving 0 = n^{-1} Σ_{i=1}^n Q(W_i, Z_i, Θ, β̂). Hence, in this simple case, for j = 1, ..., J we will need to be able to calculate expectations of XI(X ∈ C_j) given (W, Z) and the probability that X ∈ C_j given (W, Z). As we will see, in general problems, we will need to estimate the expectations of other functions of X given (W, Z). So, to summarize, to get a general solution, it appears that we will need to estimate (β_1, β_2) by a measurement error analysis and estimate expectations of specified functions of X given (W, Z).
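For the classical additive error model with X and U normal, an assumption we also use later in the simulations, these two conditional quantities have closed forms from truncated-normal moments. The sketch below is a minimal illustration in the case of no Z; the function name and default parameter values are our own.

```python
import numpy as np
from scipy.stats import norm

# Classical additive error W = X + U, X ~ N(mu_x, s2x), U ~ N(0, s2u):
# then X | W is normal with mean mu_x + lam*(W - mu_x) and variance
# lam*s2u, where lam = s2x / (s2x + s2u) is the attenuation coefficient.
def conditional_pieces(w, a, b, mu_x=0.0, s2x=1.0, s2u=1.0):
    """Return pr(X in [a, b] | W = w) and E{X I(X in [a, b]) | W = w}."""
    lam = s2x / (s2x + s2u)
    m = mu_x + lam * (w - mu_x)
    s = np.sqrt(lam * s2u)
    alpha, beta = (a - m) / s, (b - m) / s
    prob = norm.cdf(beta) - norm.cdf(alpha)
    # Truncated-normal partial mean: E{X I(a < X <= b) | W = w}.
    ex_ind = m * prob + s * (norm.pdf(alpha) - norm.pdf(beta))
    return prob, ex_ind

# Degenerate check: over the whole line, prob = 1 and the partial mean
# reduces to E(X | W = w) = lam * w when mu_x = 0.
prob, ex_ind = conditional_pieces(w=0.5, a=-np.inf, b=np.inf)
```

In an actual analysis one would evaluate these pieces at the category boundaries of each C_j, for each observed W_i.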

Remark 2.
Following on Remark 1, it is obvious that in the unlikely event that the true risk model is actually categorical in X, so that E(Y|X, Z) = M^T(X, Z)β, then model misspecification and differential measurement error both disappear, and one really needs just the probabilities that X is in the categories given (W, Z). As [13] and [3] discuss in detail, estimating such models is difficult because of model identifiability concerns. Often, papers dealing with this issue assume the existence of a validation data set, where X is actually observed on a subset of the data. [13] is a particularly good source for the difficulties we have mentioned and remedies using replication data. [3], page 314, states that estimates of the misclassification rates are "most likely coming from internal validation data", and also has a nice discussion.

Assumptions
Our work is very general, but even so, the algorithm is basically the same as in Section 2.1. Our methodology requires three basic assumptions, described below. We let X be the continuous predictor subject to measurement error, Z the covariates measured exactly, W the mismeasured version of X, and Y the response.

Assumption 1. When X is observed, the true response model in the continuous scale has parameters β, such that there is an estimating function Φ_true(Y, X, Z, β) that identifies β and satisfies

0 = E{Φ_true(Y, X, Z, β)|X, Z}. (3)

Assumption 1 occurs in at least two circumstances.

Example 1.
(A) There are functions m_1(X, Z, β) and m_2(X, Z, β) such that E(Y|X, Z) = m_1(X, Z, β), and the unbiased estimating function that would be used if X were observable is

Φ_true(Y, X, Z, β) = m_2(X, Z, β){Y − m_1(X, Z, β)}. (4)

(B) There is a parametric model for Y given (X, Z).
Example 1(A) is very general, in that it includes traditional quasilikelihood models, nonlinear regression, generalized linear models, probit regression, etc. Crucially, it does not require a fully parametric model for the distribution of Y given (X, Z).
In our approach, as in linear regression in Section 2.1, we may need to obtain information about moments of specified functions of X given (W, Z). To do this, we will consider the setting in which there may be an external data set of N observations giving information on one set of parameters of the joint distribution, Λ_ext: if there is no external study, N = 0 and Λ_ext does not exist. In addition, there is another set of parameters, Λ_int, that is estimated from the n observations in the internal data set.

Assumption 2.
When X is not observed, either (a) the distribution of X given (W, Z) is known up to parameters Λ ext and Λ int as described above, or (b) there is a function, G(X, Z, Θ, β) defined at (11) below, whose conditional expectation given (W, Z) depends on parameters Λ ext and Λ int and can be estimated. The parameter Λ ext cannot be estimated by internal data, while the parameter Λ int can be estimated by internal data. For both, there are unbiased estimating functions V ext,m (Λ ext ) for the external data and V int,i (Λ int , Λ ext ) for the internal data such that E{V ext,m (Λ ext )} = 0 and E{V int,i (Λ int , Λ ext )} = 0.
For linear regression, G(X, Z, Θ, β) is given in (2). If there are external data and N > 0, we estimate Λ_ext by solving the estimating equation

0 = Σ_{m=1}^N V_ext,m(Λ_ext). (5)

In the internal data set, we estimate Λ_int by solving the estimating equation

0 = Σ_{i=1}^n V_int,i(Λ_int, Λ̂_ext), (6)

where Λ̂_ext solves (5). There is also a very subtle issue that needs to be made explicit. The issue of when parameters are transportable from an external data set to the internal data set is discussed in Chapters 2.2.4-2.2.5 of [4]. As they state, it is much better if there are sufficient internal data that external data need not be used, but this is not always the case.

General observations when X is observed
As argued in Section 1, the goal is to fit a model when X is categorized into J levels (C_1, ..., C_J), and so we define the dummy variables and Z together as M(X, Z) = {I(X ∈ C_1), ..., I(X ∈ C_J), Z^T}^T: our formulation allows more complex forms, including interactions. Suppose there are i = 1, ..., n subjects in the primary/main/internal study, and suppose further that we observe (Y_i, X_i, Z_i). If X is observed, the analysis done on these categories will be based on replacing (X, Z) in (3)-(4) by M(X, Z), and to make clear the categorization, we define a parameter Θ, and obtain the estimate Θ̂ by solving

0 = n^{-1} Σ_{i=1}^n Φ_cat{Y_i, M(X_i, Z_i), Θ}. (7)

More complex forms of (7) are easily accommodated. Unlike in Assumption 1 and (3)-(4), except in the rare case that the categorized model is actually true, 0 ≠ E[Φ_cat{Y, M(X, Z), Θ}|X, Z], a conditional expectation. This is a key part of the work in [26].
Despite the fact that the categorized model does not fit the data conditional on (X, Z), by standard estimating equation theory [26], the estimate formed by solving (7) has a limit as n → ∞, Θ, which is the solution to

0 = E[Φ_cat{Y, M(X, Z), Θ}]. (8)

It is important to observe that (8) is an unconditional expectation, not a conditional one. If, instead of observing X, we observe its mismeasured version W, and if we replace X by W, we will of course generally inconsistently estimate both β and Θ.
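To make (7) and its limit (8) concrete, the following sketch (our own illustrative setup: linear regression, no Z, quintile categories of a standard normal X) solves (7) when X is observed and checks the limit against the solution of (8), which here is β_1 E(X|X ∈ C_j), computed from truncated-normal means.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, beta1 = 200_000, 1.0
x = rng.standard_normal(n)
y = beta1 * x + rng.standard_normal(n)

# Quintile categories of X (population quintiles of N(0, 1)).
cuts = norm.ppf([0.2, 0.4, 0.6, 0.8])
cat = np.searchsorted(cuts, x)          # category index 0..4

# With only the J indicator columns, the least-squares solution of the
# sample normal equations is the within-category mean of Y ...
theta_hat = np.array([y[cat == j].mean() for j in range(5)])

# ... and its large-sample limit, the solution of the population normal
# equations, is beta1 * E(X | X in C_j), a truncated-normal mean.
edges = np.concatenate([[-np.inf], cuts, [np.inf]])
ex_given_cat = (norm.pdf(edges[:-1]) - norm.pdf(edges[1:])) / 0.2
theta_limit = beta1 * ex_given_cat
```

With n this large, theta_hat agrees with theta_limit to a few hundredths, and neither equals β_1 itself: the categorized analysis targets Θ, not β.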

Estimating the true parameter β
In our approach, as in Section 2.1 for linear regression, we must estimate β in (3). There is of course a large literature on how to do this [13,4,3,27]. Borrowing on that literature, from Assumptions 1-2, for an estimating function Φ(Y, W, Z, β, Λ_int, Λ_ext), the estimate β̂ is the solution to

0 = n^{-1} Σ_{i=1}^n Φ(Y_i, W_i, Z_i, β, Λ̂_int, Λ̂_ext), (9)

where (Λ̂_int, Λ̂_ext) are obtained from equations (6) and (5), respectively. Of course, the details and the form of Φ(·) differ from case to case.

Methodology: General case
The methodology is simple to explain at the general level. The target Θ is defined as the solution to (8). However, we can rewrite (8) as

0 = E(E[Φ_cat{Y, M(X, Z), Θ}|X, Z]). (10)

Since the distribution of Y given (X, Z) depends on β, for notational completeness we define

G(X, Z, Θ, β) = E[Φ_cat{Y, M(X, Z), Θ}|X, Z]. (11)

Making the usual nondifferential measurement error assumption, i.e., that Y and W are independent given (X, Z),

Q(W, Z, Θ, β, Λ_int, Λ_ext) = E{G(X, Z, Θ, β)|W, Z} = E[Φ_cat{Y, M(X, Z), Θ}|W, Z]. (12)

Critically, (12) depends only on the observed covariates. Thus, if we have consistent estimates (β̂, Λ̂_int, Λ̂_ext) of (β, Λ_int, Λ_ext), then a consistent estimate Θ̂ of Θ solves

0 = n^{-1} Σ_{i=1}^n Q(W_i, Z_i, Θ, β̂, Λ̂_int, Λ̂_ext). (13)

In some cases, we do not have external data. Then we do not have V_ext and Λ_ext, and V_int and Θ̂ depend only on Λ_int.

Remark 3.
The key question is how to compute G(X, Z, Θ, β) in (11). In the fully general case (3), we require a parametric model for the distribution of Y given (X, Z), as in Example 1(B). However, in standard regression models of the form (4) in Example 1(A), great simplification occurs, because in that case Φ_cat is linear in Y, so that

G(X, Z, Θ, β) = Φ_cat{m_1(X, Z, β), M(X, Z), Θ}, (14)

and no full model for the distribution of Y given (X, Z) is needed. C.3 gives detailed formulae for linear and logistic regression.
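As an illustration of computing G in (11) and its conditional expectation Q in (12), the sketch below evaluates Q(·) for logistic regression with no Z under the classical normal error model, using Gauss-Hermite quadrature over the normal distribution of X given W; the function name, category cut points, and quadrature size are our own choices.

```python
import numpy as np

H = lambda t: 1.0 / (1.0 + np.exp(-t))    # logistic distribution function

def q_logistic(w, theta, cuts, beta0, beta1, mu_x=0.0, s2x=1.0, s2u=1.0,
               n_nodes=80):
    """Q(w, Theta, beta) for logistic regression with no Z, classical error.

    Per category j, computes E[ I(X in C_j) {H(beta0 + beta1 X) - H(theta_j)} | W = w ]
    by Gauss-Hermite quadrature over X | W ~ N(m, v).
    """
    lam = s2x / (s2x + s2u)
    m, v = mu_x + lam * (w - mu_x), lam * s2u
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    x = m + np.sqrt(v) * nodes             # quadrature points for X | W = w
    p = weights / weights.sum()            # normalized N(0, 1) weights
    edges = np.concatenate([[-np.inf], cuts, [np.inf]])
    out = np.empty(len(theta))
    for j in range(len(theta)):
        ind = (x > edges[j]) & (x <= edges[j + 1])
        out[j] = np.sum(p * ind * (H(beta0 + beta1 * x) - H(theta[j])))
    return out

# Sanity check: if beta1 = 0 and every theta_j = beta0, the integrand
# vanishes identically, so Q must be exactly zero in every component.
q_zero = q_logistic(0.3, theta=np.full(5, -0.42),
                    cuts=np.array([-0.8416, -0.2533, 0.2533, 0.8416]),
                    beta0=-0.42, beta1=0.0)
```

Stacking these per-subject Q vectors and solving for the root in Θ gives the estimator; the quadrature replaces the closed forms that are available for linear regression.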

Remark 4.
Our method is closely related to the expectation-correction method of [27], Chapter 2.5.2, and less closely to the general corrected score methods first introduced by [17]. [27] has an excellent and comprehensive discussion of the correction methods in the literature. We do not have a score function per se, but we have a function Φ_cat{Y, M(X, Z), Θ} with the property that

0 = E[Φ_cat{Y, M(X, Z), Θ}].

Instead of our (11)-(12), the expectation-correction method uses as its estimating function

Q*(Y, W, Z, Θ, β, Λ_int, Λ_ext) = E[Φ_cat{Y, M(X, Z), Θ}|Y, W, Z].

The obvious distinction is that our function Q(·) does not involve Y explicitly, while the expectation-correction function Q*(·) does. We used Q(·) and (11) because our assumptions allow G(·) to be calculated explicitly, especially in Example 1(A), so that implementation is somewhat easier. In addition, in Example 1(A), there does not need to be a full likelihood, as would be required in the expectation-correction method, so there are actual differences in the methods.

Asymptotic Theory
Asymptotic theory for the parameter estimates is easily derived. Let Ω = (Θ, β, Λ_int, Λ_ext) denote the collected parameters, with true value also denoted by Ω, and let Ω̂ denote the estimate.
It is neater notation in this section to let i = 1, ..., n denote the internal data, and i = n + 1, ..., n + N denote the external data. For i ≤ n, define

Ψ_i(Ω) = [Q^T(W_i, Z_i, Θ, β, Λ_int, Λ_ext), Φ^T(Y_i, W_i, Z_i, β, Λ_int, Λ_ext), V_int,i^T(Λ_int, Λ_ext), 0^T]^T,

while for i > n, define

Ψ_i(Ω) = {0^T, 0^T, 0^T, V_ext,i^T(Λ_ext)}^T.

If there are no external data, then N = 0, Ω = (Θ, β, Λ_int), and the zero element and Λ_ext in the definition of Ψ_i(Ω) are removed.
By standard estimating equation results, we have the following results, which are shown in Appendices A.1 and A.2.

Lemma 1.
If there are external data, i.e., N > 0, make Assumptions 1-3, and suppose that N → ∞ and n → ∞ such that n/(n + N) → κ with 0 < κ < 1. Then

(n + N)^{1/2}(Ω̂ − Ω) → Normal{0, A^{-1}B(A^{-1})^T},

where A = κE{∂Ψ_1(Ω)/∂Ω^T} + (1 − κ)E{∂Ψ_{N+n}(Ω)/∂Ω^T} and B = κcov{Ψ_1(Ω)} + (1 − κ)cov{Ψ_{N+n}(Ω)}. In the definitions of A and B, the expectation and covariance matrix for Ψ_1(Ω) are computed in the internal data, while the expectation and covariance matrix for Ψ_{N+n}(Ω) are computed in the external data. Let C_ext be the sample covariance matrix of Ψ_i(Ω̂) for i = n + 1, ..., n + N and let C_int be the sample covariance matrix of Ψ_i(Ω̂) for i = 1, ..., n. Consistent estimates of A and B are easily seen to be

Â = (n + N)^{-1} Σ_{i=1}^{n+N} ∂Ψ_i(Ω̂)/∂Ω^T;  B̂ = (nC_int + NC_ext)/(n + N).

Lemma 2.
If there are no external data, i.e., N = 0, make Assumptions 1-2. As n → ∞,

n^{1/2}(Ω̂ − Ω) → Normal{0, A^{-1}B(A^{-1})^T},

where A = E{∂Ψ_1(Ω)/∂Ω^T} and B = cov{Ψ_1(Ω)}. In the definitions of A and B, the expectation and covariance matrix for Ψ_1(Ω) are computed in the internal data. Let C_int be the sample covariance matrix of Ψ_i(Ω̂) for i = 1, ..., n. Consistent estimates of A and B are easily seen to be

Â = n^{-1} Σ_{i=1}^n ∂Ψ_i(Ω̂)/∂Ω^T;  B̂ = C_int.

Remark 5.
While the calculations used in Lemmas 1-2 are standard, as a referee has pointed out, we are making the following kinds of assumptions to carry them through; weaker conditions can be constructed. All these conditions hold in our examples of linear and logistic regression with additive measurement error. There is a parameter, which in this subsection we have called Ω = (Θ, β, Λ_int, Λ_ext). For i = 1, ..., n + N, we have defined estimating functions Ψ_i(Ω) in such a way that E{Ψ_i(Ω)} = 0 for i = 1, ..., n + N: the expectations are unconditional, although in implementing the estimators we have exploited our Assumptions 1-3 to simplify the numerical calculations. Having done all this, we are now in the realm of estimating equation theory. Sufficient but not necessary conditions for our asymptotic theory to hold are the following.
• The parameter space is compact. This is not necessary, but it is convenient for proving consistency.

Remark 6. The major new item in verifying the assumptions mentioned in Remark 5 is the set of differentiability assumptions having to do with Q(W, Z, Θ, β, Λ_int, Λ_ext) in (12). Let the conditional density/mass function of Y given (X, Z) be f_{Y|X,Z}(·, β, Λ_int, Λ_ext) and the conditional density/mass function of X given (W, Z) be f_{X|W,Z}(·, Λ_int, Λ_ext). Let dν(y) and dν(x) be integrals/counts as the case requires. Then (12) can be written out as

Q(W, Z, Θ, β, Λ_int, Λ_ext) = ∫∫ Φ_cat{y, M(x, Z), Θ} f_{Y|X,Z}(y|x, Z, β, Λ_int, Λ_ext) f_{X|W,Z}(x|W, Z, Λ_int, Λ_ext) dν(y) dν(x).

Then the non-standard differentiability assumptions in Remark 5 are really about the differentiability of Φ_cat{y, M(x, Z), Θ}, f_{Y|X,Z}(·, β, Λ_int, Λ_ext) and f_{X|W,Z}(·, Λ_int, Λ_ext) with respect to the parameters.
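The estimated sandwich covariance Â^{-1}B̂(Â^{-1})^T, scaled by the sample size, can be assembled mechanically once the stacked Ψ_i values and a Jacobian estimate are available. A minimal sketch follows (our own function name; the toy one-dimensional check uses Ψ_i(μ) = X_i − μ, for which the sandwich reduces to the usual var(X)/n).

```python
import numpy as np

def sandwich_cov(psi_vals, a_hat):
    """Estimated covariance of Omega-hat from stacked estimating functions.

    psi_vals : (n, p) array, row i holding Psi_i evaluated at the estimate.
    a_hat    : (p, p) estimate of A = E{d Psi_1 / d Omega^T}.
    """
    n = psi_vals.shape[0]
    # Sample covariance of the Psi_i rows; reshape handles p = 1.
    b_hat = np.cov(psi_vals, rowvar=False).reshape(a_hat.shape)
    a_inv = np.linalg.inv(a_hat)
    return a_inv @ b_hat @ a_inv.T / n

# Toy check: Psi_i(mu) = X_i - mu gives A = -1 and covariance var(X)/n.
rng = np.random.default_rng(3)
x = rng.standard_normal(500)
cov_mu = sandwich_cov((x - x.mean())[:, None], np.array([[-1.0]]))
```

In the general case psi_vals would stack the Q, Φ, V_int and V_ext components described above, and a_hat would be the averaged numerical Jacobian at Ω̂.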

Scenarios
For simplicity, we do our simulations in the case that there is no Z. For logistic regression, we assume that the true model is

pr(Y = 1|X) = H(β_0 + Xβ_1), (15)

where H(·) is the logistic distribution function. Then we generate data as

W = X + U, X ~ Normal(μ_x, σ²_x), U ~ Normal(0, σ²_u), (16)

where X and U are independent. We set β_0 = −0.42 and β_1 = log(1.5) in Table 1. We set (μ_x = 0, σ²_x = 1, σ²_u = 1), so that the measurement error variance is the same as the variance of X, and the classical attenuation coefficient is λ = σ²_x/(σ²_x + σ²_u) = 0.50. Solving (8) numerically, we find that Θ = (−0.98, −0.64, −0.42, −0.21, 0.14)^T. In both cases, the main study sample size is n = 500.
We used the quintiles of the distribution of X to define the categories. This is because, as stated in the introduction, our goal is to obtain consistent estimates of what epidemiologists would have obtained if X were actually observed; here, the categories are the quintiles of X.
We did simulations in two cases:
1. External-Internal Data: The internal data have no replicates, and the external data set has size N = 300 with K = 2 replicates for each observation. The nuisance parameters are Λ_ext = σ²_u and Λ_int = (μ_x, σ²_x). We estimated σ²_u from the external data with replicates, and estimated (μ_x, σ²_x) using the internal data without any replicates. Standard errors were computed as in Lemma 1.
2. Internal Data Only: The internal data have R = 2 replicates and there are no external data (K = 0). The nuisance parameters are Λ = Λ_int = (μ_x, σ²_x, σ²_u). We estimated (μ_x, σ²_x, σ²_u) from the internal data with replicates. Standard errors were computed as in Lemma 2.
C.3 provides details of implementation.
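In both cases the nuisance-parameter estimates come from simple method-of-moments calculations on the replicates. A sketch for the external-data configuration follows (our own variable names; K = 2 replicates, true values μ_x = 0 and σ²_x = σ²_u = 1 as in the simulation design).

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, s2u_true = 300, 2, 1.0
x = rng.standard_normal(N)                      # X ~ N(0, 1)
w = x[:, None] + np.sqrt(s2u_true) * rng.standard_normal((N, K))

# Method-of-moments estimates from replicate data:
wbar = w.mean(axis=1)
s2u_hat = np.mean(np.var(w, axis=1, ddof=1))    # within-person variance
mu_x_hat = wbar.mean()
s2x_hat = np.var(wbar, ddof=1) - s2u_hat / K    # var(Wbar) = s2x + s2u/K
```

These moment equations are exactly of the unbiased estimating-function form V_ext and V_int required by Assumption 2, so they slot directly into the stacked system used for the standard errors.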

Results
The results given below are similar, and indeed even more impressive, when the main study sample size n increases to 1,000, 2,000 and 3,000, and thus those results are not displayed here. The results are also similar when β_1 is either smaller or larger. The same qualitative results are also found for the components of Θ = (θ_1, ..., θ_5)^T individually (results not shown).
We fit the new approach and compare it with the naive method for both cases described above. Our main interest is in estimating the log relative risk θ_5 − θ_1, which compares the effect of category 5 with that of category 1. In the two simulations, we computed (a) the log relative risk pretending that X is observed; (b) our method; and (c) the naive method that ignores measurement error. In the scenario of internal data with R = 2, the predictor used by the naive method was the sample mean of the replicates.
Based on 1000 simulated data sets, in Table 1, we report the empirical average mean bias, asymptotic standard error, standard deviation, root mean squared error, and coverage rate of the nominal 95% confidence interval across the simulations.
From Table 1, we observe the following.
• The estimator using the true X and our method both have little bias and provide near-nominal coverage.
• The naive estimator that ignores the measurement error is badly biased and attenuated towards zero. Consequently, the coverage probabilities are near zero and the root mean squared errors are quite inflated.
• With no internal replicates, i.e., R = 1, the root mean squared error of our method is naturally higher than if X had been observed, but not as high as would be expected in a continuous analysis. Indeed, in a continuous analysis with attenuation λ = 0.50, as in our simulation, one would expect a doubling of the root mean squared error.

Scenarios
In this section, we do simulations based on simple linear regression with no Z, including homoscedastic and heteroscedastic cases. We assume that the true model is

Y = β_0 + Xβ_1 + ε,

where ε has mean zero; in the homoscedastic case ε is normally distributed with constant variance, while in the heteroscedastic case the variance of ε depends on X. As in Section 4.1, we generate the observed data as W = X + U, with X ~ Normal(μ_x, σ²_x) and U ~ Normal(0, σ²_u) independent.

Results
As before, our main interest is in estimating θ_5 − θ_1, which compares the effect of category 5 with that of category 1. In the two simulations, we computed θ_5 − θ_1 (a) pretending that X is observed; (b) using our method; and (c) using the naive method that ignores measurement error. For the naive method, in internal data with R = 2, the predictor used is the sample mean of the replicates.
Based on 1000 simulated data sets, in Table 2, we report the empirical average mean bias, asymptotic standard error, standard deviation, root mean squared error, and coverage rate of the nominal 95% confidence intervals across the simulations.
From Table 2, we see that conclusions similar to those in Section 4.1 can be drawn. However, an interesting feature arises in the heteroscedastic case, where the variance of the noise is related to X. Even pretending that X is observed, the coverage rate of the nominal 95% confidence intervals is low, because the heteroscedasticity is ignored. Using our method, we can get close to nominal coverage without any knowledge of the noise distribution. Thus, this example shows that our method is very general, as we stated in Example 1(A).

Data description
We illustrate our methods using data from the Eating at America's Table (EATS) Study [23], in which 964 participants completed multiple 24-hour recalls of diet. We consider the variable Fat Density, which is the percentage of calories coming from fat. The response Y is either (i) the indicator of obesity, meaning that a subject's body mass index (BMI, weight in kilograms divided by the square of height in meters) is 30 or greater; or (ii) the actual body mass index. We assume that W is unbiased for usual intake X, and that W = X + U. It is reasonable in these data to take (a) X to be normally distributed; (b) U to be normally distributed; and (c) X and U to be independent, as we now describe. We used the methods described in [9] and Chapter 1.7 of [4], which also give the rationale for these methods. Specifically, for (a), as they suggest, a qq-plot of the individual means for Fat Density looked acceptably normal, with skewness −0.06 and kurtosis 3.02, see the top panel of Figure 1. For (b), as they suggest, we took differences of the first and second Fat Density measurements, which had skewness (theoretically 0) of −0.14 and kurtosis of 3.40: the somewhat higher kurtosis here is seen to be minor on the qq-plot, see the middle panel of Figure 1. Finally, for (c), they suggest analyzing the correlation between the individual-level mean and standard deviation, which was 0.06, and there was no obvious strong pattern when we plotted the latter against the former, see the bottom panel of Figure 1.
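These three checks can be scripted directly from replicate data. The sketch below (our own function name, run here on simulated replicates with normal X and U rather than on the EATS data) computes the skewness and kurtosis of the individual means and of the within-person differences, and the correlation between the individual-level mean and standard deviation.

```python
import numpy as np
from scipy.stats import skew, kurtosis, pearsonr

def replicate_diagnostics(w):
    """Model checks from two-replicate data w of shape (n, 2).

    Returns skewness/kurtosis of the individual means (normality of X),
    of the within-person differences (normality of U), and the correlation
    of individual means and SDs (independence of X and U).
    """
    means = w.mean(axis=1)
    diffs = w[:, 0] - w[:, 1]
    sds = w.std(axis=1, ddof=1)
    return {
        "mean_skew_kurt": (skew(means), kurtosis(means, fisher=False)),
        "diff_skew_kurt": (skew(diffs), kurtosis(diffs, fisher=False)),
        "mean_sd_corr": pearsonr(means, sds)[0],
    }

# Simulated stand-in for the data: 964 subjects, two replicates each.
rng = np.random.default_rng(5)
w = rng.standard_normal(964)[:, None] + rng.standard_normal((964, 2))
out = replicate_diagnostics(w)
```

Under normality the skewness values should be near 0 and the (non-excess) kurtosis values near 3, matching the scale of the values reported above; qq-plots of the means and differences complete the graphical checks.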
For numerical stability, our analysis in the continuous scale uses W centered and standardized as (15W − 5)/√0.5. To illustrate an example of an internal and an external study, we randomly selected N = 200 subjects as the external study, using their first two 24-hour recalls, while using the remaining data as the main internal study. As in the simulations, we either set the number of recalls to R = 1 and K = 2, meaning the external study data were used to estimate the measurement error variance, or to R = 2 and K = 0, in which case the external data were not used.

Logistic regression
As described in Section 4.1, we assume the true model defined by (15)-(16), and the respective two cases. In this application we again estimate the log relative risk θ_5 − θ_1. We fit both our new approach and the naive model that ignores measurement error, when external data are and are not used.
In Table 3, we observe that when using the external data and only one observation per subject in the internal data, the estimate of the log relative risk θ_5 − θ_1 from our approach is 108% greater than the naive estimate, while when using internal data with two replicates, the estimate from our approach is 32% greater than the naive estimate. This makes sense, because the second case uses the mean of two replicates, and hence has smaller measurement error variance, so the naive estimate will be closer to our method.
In both cases, the asymptotic standard error from our new method is greater than that of the naive method, leading to wider confidence intervals. This makes sense, because with a scalar covariate measured with error, correcting for measurement error bias usually increases estimated standard errors, while of course reducing bias.

Linear regression
Next we consider the linear model with body mass index as the response. All assumptions for W , X and U are the same as in Section 5.1. Moreover, we maintain the standardization and sampling scheme in Section 5.1: the results are presented in Table 4.
From Table 4, we draw conclusions similar to those in the logistic regression case. One point of particular interest is that in both scenarios (external-internal or internal data only), our estimator converges theoretically to the same value, and this is seen in the results. The naive method that ignores measurement error estimates different parameters in the two scenarios, because the measurement error variance is twice as large in the external-internal case as it is in the internal-only case.

Other approaches
We emphasize once more that it is common practice in epidemiology to categorize a continuous predictor, and we have given numerous citations of this practice. Generally, this practice results in a misspecified model.
Our goal is to correct the analysis so as to reproduce, asymptotically, the estimators that would have been obtained if there were no measurement error.
The problem has not been considered previously in the context that a continuous predictor has been categorized. Such categorization generally leads to differential measurement error [11,13,3], and thus to additional complications over simply fitting a measurement error model. While our paper is the first to consider how to correct an analysis to account for a continuous predictor that is categorized, there are of course other possible approaches, but none of them really avoids the basic issues we have discussed of what is needed to obtain consistent estimators with asymptotically correct inference in the case of measurement error.
• For example, one could assume that the true risk model is based upon the categorized truth, even if this is implausible in most contexts. One could further assume that the misclassification is nondifferential, which is incorrect if the true risk model is in the continuous scale [11,13,3]. There is a small literature on this problem. [13], especially Chapter 6.1, has remarks on the bias induced when a binary predictor is misclassified. [3], Chapter 6.7.7 and Chapter 6.14, has a detailed discussion of the issue, and provides a number of references to the problem. Both [13] and [3] show that a measurement error correction will require a distribution for the categorical X given (W, Z), sometimes called the reclassification rate, and both indicate that there are substantive issues, including identifiability, involved with estimating these models. For replication studies wherein W is measured repeatedly on a subset of the data, there is some evidence that 3 replicates will result in identifiability. However, both books emphasize the use of internal validation substudies, wherein one actually observes X in a substudy. If X_cat is the categorized truth, then one might attempt an analysis based on assuming a joint distribution of (Y, W, X_cat) given Z, but as in any measurement error model [4], the joint distribution requires (a) a distribution for Y given (X_cat, W, Z), and (b) the distribution of (W, X_cat) given Z. However, (a) actually depends on W, and thus the modeling presents additional complications. In addition, (b) is no easier than our approach, can be implausible, and does not make fewer assumptions than we have done.
• Simulation-extrapolation, or SIMEX, [6,22,4] is a well-known approach to the creation of approximately, but not fully, consistent estimators for additive measurement error models of the form W = X + Z^Tα + U, where U is independent of Z, can be homoscedastic or, given replicates, heteroscedastic [8], and is generally taken to be normally distributed. This literature attempts to dispense with distributional assumptions for X in the continuous case, but is at best approximately correct. The fact that a categorized risk model is implausible, leading to differential measurement error, may also cause complications, but the use of SIMEX in this context is a worthwhile topic for further study. We also mention the MC-SIMEX procedure [16], which is appropriate for misclassified data where the misclassification probabilities can be estimated.
• It is also possible to change the paradigm entirely and avoid categorization, and all the issues related to categorization, by instead using B-splines. Indeed, part of the reason sometimes given for categorizing a continuous predictor, rather than modeling the response linearly in the continuous X, is that a linear model could lead to unduly extreme comparisons of risk between the lowest and the highest values of X. The general thought is that this can be overcome by replacing the linear X by a B-spline in X. There are papers involving B-splines and measurement error [2,12,18], and it appears that regression calibration can possibly be used by calibrating each spline basis function. After the fitting, one could compare the B-spline fits at the 10th, 30th, 50th, 70th and 90th percentiles of X to form versions of the tables found in epidemiology papers, but the interpretations are not fully comparable.
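For concreteness, a minimal SIMEX sketch for a linear regression slope under classical normal error follows (all names and settings are our own; the quadratic extrapolant removes only part of the attenuation bias, which is the sense in which SIMEX is approximately, not fully, consistent).

```python
import numpy as np

rng = np.random.default_rng(6)
n, beta1, s2u = 5000, 1.0, 1.0
x = rng.standard_normal(n)
y = beta1 * x + 0.5 * rng.standard_normal(n)
w = x + rng.standard_normal(n)                 # classical error, variance s2u

def naive_slope(v, y):
    return np.polyfit(v, y, 1)[0]

# SIMEX: inflate the error variance by zeta * s2u with simulated pseudo-errors,
# track the naive slope, then extrapolate a quadratic in zeta back to zeta = -1
# (the error-free case).
zetas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 50                                          # pseudo-error replications
slopes = [np.mean([naive_slope(w + np.sqrt(z * s2u) * rng.standard_normal(n), y)
                   for _ in range(B)])
          for z in zetas]
coef = np.polyfit(zetas, slopes, 2)
simex_slope = np.polyval(coef, -1.0)            # partially bias-corrected
naive = naive_slope(w, y)                       # attenuated (here roughly halved)
```

Here the naive slope is close to λβ_1 = 0.5, and the SIMEX extrapolant recovers much, but not all, of the gap back to β_1 = 1, illustrating the approximate nature of the correction.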
We showed how to solve this problem and gave consistent estimators with asymptotically correct standard errors. Assumption 2 is reasonable in contexts other than ours, for example, when X has a mixture-of-normals distribution and U is normally distributed [7].

Assumptions in the simulations and example
Readers of an initial version of this paper noted that our simulations and data example use the assumption that the distribution of X given (W, Z) is normal, but misinterpreted this fact as meaning that the approach is applicable only in that case. For the data example in Section 5, we justified the assumptions using known methods for model checking in measurement error models. Assumption 2 is widely used and reasonable in many contexts other than our numerical work, for example, when X has a mixture-of-normals distribution and U is normally distributed [7]. Modeling via mixture distributions is a reasonable way to extend what we have done in the classical error case. See also [21] for the homoscedastic and heteroscedastic cases when the variance function and the distributions of X and U are modeled as mixture distributions.
Many papers in the literature also rely on the existence of validation data, where X is actually observed in a subset of the main data set. In that case, Assumption 2 is easily checked by model fitting and validation on the observed validation data subset.

Categorization
In Section 2.1, we stated that the number J of categories is set by the investigators; usually J = 3, 4 or 5, as seen in the examples cited in the Introduction. In addition, setting the category limits is also an art, and may be based on (a) limits in the literature; (b) limits based on the error-prone instrument, such as the quintiles of a food frequency questionnaire or 24-hour recall; or (c) limits based on a measurement error analysis. Since our goal is to construct the analysis that would have been done if X could be observed, we use the latter.
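As one hedged illustration of option (c), if X is taken to be normal and W = X + U with known error variance, then category limits at the quintiles of the true X can be estimated from simple moments of the observed W. The function name and tolerance below are our own illustrative choices, not the paper's notation.

```python
import numpy as np
from scipy.stats import norm

def x_quantile_cutpoints(w, sigma_u2, probs=(0.2, 0.4, 0.6, 0.8)):
    """Category limits based on a measurement error analysis: quantiles of
    the estimated distribution of the true X, assuming X ~ Normal(mu_x,
    sigma_x^2) and W = X + U with known error variance sigma_u2."""
    mu_x = np.mean(w)
    # var(W) = sigma_x^2 + sigma_u^2, so subtract the error variance
    sigma_x2 = max(np.var(w, ddof=1) - sigma_u2, 1e-12)
    return norm.ppf(probs, loc=mu_x, scale=np.sqrt(sigma_x2))
```

Using the quintiles of W itself would spread the limits too far apart, because the variance of W exceeds that of X by the measurement error variance.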

A.1. Argument for Lemma 1
We consider the case that there are external data used to estimate Λ_ext and that there are internal parameters Λ_int. As in Section 3.2, the data for i = 1, ..., n are the internal data, while those for i = n + 1, ..., n + N are the external data, if such external data exist and are used. The functions Ψ_i(Ω) are also defined in Section 3.2; for logistic regression and linear regression, their forms can be found in Appendix C. By a Taylor series expansion of the estimating equation $0 = \sum_{i=1}^{n+N} \Psi_i(\widehat{\Omega})$,
$$(n+N)^{1/2}(\widehat{\Omega} - \Omega) = -\Big\{(n+N)^{-1}\sum_{i=1}^{n+N} \partial \Psi_i(\Omega)/\partial \Omega\Big\}^{-1} (n+N)^{-1/2}\sum_{i=1}^{n+N} \Psi_i(\Omega) + o_p(1).$$
It is obvious that $(n+N)^{-1}\sum_{i=1}^{n+N} \partial \Psi_i(\Omega)/\partial \Omega = A + o_p(1)$, and immediate that $(n+N)^{-1/2}\sum_{i=1}^{n+N} \Psi_i(\Omega) \to \mathrm{Normal}(0, B)$, where A and B are defined in Lemma 1. Hence $(n+N)^{1/2}(\widehat{\Omega} - \Omega) \to \mathrm{Normal}\{0, A^{-1}B(A^{-1})^{\mathrm T}\}$.

A.2. Argument for Lemma 2
We consider the case that there are only internal parameters Λ_int. As in Section 3.2, the data for i = 1, ..., n are the internal data, and the functions Ψ_i(Ω) are also defined there. Then, by the same Taylor series argument as in A.1, $n^{-1}\sum_{i=1}^{n} \partial \Psi_i(\Omega)/\partial \Omega = A + o_p(1)$ and $n^{-1/2}\sum_{i=1}^{n} \Psi_i(\Omega) \to \mathrm{Normal}(0, B)$, where A and B are defined in Lemma 2, so that $n^{1/2}(\widehat{\Omega} - \Omega) \to \mathrm{Normal}\{0, A^{-1}B(A^{-1})^{\mathrm T}\}$.

Table 1. Simulation study for logistic regression in Section 4.1 with sample size n = 500 and, where applicable, an external study of sample size N = 300 with 2 replicates, while β_0 = −0.42 and β_1 = log(1.5). The target parameter is Θ = (θ_1, ..., θ_5)^T, where θ_j is the parameter for the jth category. Displayed are results for the estimation of the log relative risk.

Here we consider only two cases among numerous possibilities. The first is that the internal data consist of (Y_i, W_i, Z_i) for i = 1, ..., n, and σ²_u is estimated from the external data using replicates W_ik for k = 1, ..., K and i = n + 1, ..., n + N. The second is that the replicates are in the internal data.

C.1.1. External-internal data
For specificity, we consider the first case, in which the external data have no responses Y and are independent of the internal data. Suppose that we use the external data only to estimate σ²_u, and we observe W_ik = X_i + U_ik for k = 1, ..., K and i = n+1, ..., n+N. We use the internal data, without replicates, to estimate (μ_x, σ²_x). In the external data, let

C.1.2. Internal data only
Suppose there are no external data, and we have replicates W_ir for r = 1, ..., R in the internal data. Now we use the internal data to estimate Λ = (μ_x, σ²_x, σ²_uR), and we observe W_ir = X_i + U_ir for r = 1, ..., R and i = 1, ..., n.
Define $\overline{W}_{i\cdot} = R^{-1}\sum_{r=1}^{R} W_{ir}$, define $\widehat{\sigma}^2_{u,i}$ to be the sample variance of the W_ir within subject i, and define σ²_uR = σ²_u/R. The estimating equations are, for μ_x, σ²_x and σ²_uR respectively,
$$0 = \sum_{i=1}^{n}(\overline{W}_{i\cdot} - \mu_x), \qquad 0 = \sum_{i=1}^{n}\big\{(\overline{W}_{i\cdot} - \mu_x)^2 - \sigma^2_x - \sigma^2_{uR}\big\}, \qquad 0 = \sum_{i=1}^{n}\big(\widehat{\sigma}^2_{u,i}/R - \sigma^2_{uR}\big).$$

Since the two cases we consider are the same in linear regression and logistic regression, the way we estimate Λ_int and Λ_ext is exactly the same. We therefore give details only for the estimating equations for β and Θ below.
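These estimating equations amount to simple method-of-moments estimators, which can be sketched as follows. The function name and simulation settings are illustrative, assuming the classical model W_ir = X_i + U_ir with normal X and U.

```python
import numpy as np

def replicate_moments(w):
    """Method-of-moments estimates of (mu_x, sigma_x^2, sigma_uR^2) from an
    n x R matrix of replicates W_ir = X_i + U_ir, where sigma_uR^2 =
    sigma_u^2 / R is the error variance of the within-subject mean."""
    n, R = w.shape
    wbar = w.mean(axis=1)                     # W_i. (within-subject mean)
    s2_within = w.var(axis=1, ddof=1)         # per-subject sample variance
    sigma_uR2 = s2_within.mean() / R          # solves sum(s2_i/R - sigma_uR2) = 0
    mu_x = wbar.mean()                        # solves sum(Wbar_i - mu_x) = 0
    sigma_x2 = wbar.var(ddof=1) - sigma_uR2   # var(Wbar) = sigma_x^2 + sigma_uR^2
    return mu_x, sigma_x2, sigma_uR2
```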
Let Z = (1, Z^T)^T. Here we consider the simple case of linear regression with the classical measurement error model in both the external and internal data sets to be Y = (X, Z^T)β + ε; W = X + U; X = Normal(μ_x, σ²_x); U = Normal(0, σ²_u).

C.2.3. The forms of Φ cat (·) and Q(·)
Since we assume the true model is Y = (X, Z^T)β, it is easy to see the form of the categorical estimating function. We used the integrate function in the R package stats to compute the integrals appearing in it and in the estimating functions for β = (β_0, β_1) and for Θ. Asymptotic standard errors were estimated as in Lemma 1 and Lemma 2.
As before, let H(·) denote the logistic distribution function and let Z = (1, Z T ) T . Here we consider the special case of linear logistic regression with the classical measurement error model in both the external and internal data sets to be pr(Y = 1|X, Z) = H(Xβ 1 + Z T β 2 ) = H{(X, Z T )β}; W = X + U ; X = Normal( Z T α, σ 2 x ); U = Normal(0, σ 2 u ).
In contrast to the linear case in Section C.2, we now consider the case where X depends on another covariate Z. There are numerous possible data structures, but here we present the external-internal and internal-data-only cases.

C.3.2. Settings
There are two settings of interest.
• There is no information about σ²_u in the internal data, so that the external parameter is the measurement error variance, Λ_ext = σ²_u, while the internal parameters are Λ_int = (α^T, σ²_x)^T.
• There are no external data, so that Λ_ext is null, and the internal data with replicates allow estimation of Λ_int = (α^T, σ²_u, σ²_x)^T.
In both cases, σ²_u (or σ²_uR in the internal data only case) is estimated exactly as in C.1.1 and C.1.2. Since W given Z is Normal(Z^Tα, σ²_x + σ²_u), the estimating equations for (α, σ²_x), with i = 1, ..., n, are
$$0 = \sum_{i=1}^{n} Z_i(W_i - Z_i^{\mathrm T}\alpha), \qquad 0 = \sum_{i=1}^{n}\big\{(W_i - Z_i^{\mathrm T}\alpha)^2 - \sigma^2_x - \sigma^2_u\big\}.$$

C.3.3. Estimating β
In this section, we implement our method and give all estimating equations in the case where we have both external and internal data. In the other case, where we use only internal data with replicates, all results below remain valid upon removing Λ_ext. Define λ = σ²_x/(σ²_x + σ²_u). Then, given (W, Z), X is normally distributed with mean μ(W, Z, Λ_ext, Λ_int) = Z^Tα + λ(W − Z^Tα) and variance λσ²_u. We write this conditional density as f_x|w,z(x, w, z, Λ_int, Λ_ext).
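The conditional moments just stated can be checked numerically. The following small sketch uses scalar Z and hypothetical parameter values (α = 1.5, σ²_x = 2 and σ²_u = 1 are ours, not the paper's): regressing X − Zα on W − Zα recovers the slope λ, and the residual variance recovers λσ²_u.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha, sigma_x2, sigma_u2 = 1.5, 2.0, 1.0
lam = sigma_x2 / (sigma_x2 + sigma_u2)   # lambda = 2/3

z = rng.standard_normal(n)
x = alpha * z + np.sqrt(sigma_x2) * rng.standard_normal(n)   # X | Z
w = x + np.sqrt(sigma_u2) * rng.standard_normal(n)           # W = X + U

# E(X | W, Z) = Z*alpha + lam*(W - Z*alpha): the regression slope of
# X - Z*alpha on W - Z*alpha should be lam, with residual variance
# equal to lam * sigma_u2
slope = np.polyfit(w - alpha * z, x - alpha * z, 1)[0]
resid_var = np.var((x - alpha * z) - lam * (w - alpha * z))
```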
There are multiple ways to estimate β from the observed data. Here we describe two of them.
• A second possibility, the one that we used, is the following. By simple calculations, pr(Y = 1|W, Z) = p(W, Z, β, Λ_int, Λ_ext), where $p(W, Z, \beta, \Lambda_{\mathrm{int}}, \Lambda_{\mathrm{ext}}) = \int H\{(x, Z^{\mathrm T})\beta\} f_{x|w,z}(x, W, Z, \Lambda_{\mathrm{int}}, \Lambda_{\mathrm{ext}})\,dx$, a quantity that is easily computed using the integrate function in the R package stats. Denote p_i = pr(Y_i = 1|W_i, Z_i). Thus, the loglikelihood is $\sum_{i=1}^{n}\{Y_i \log p_i + (1 - Y_i)\log(1 - p_i)\}$. We then use the optim function in the R package stats to minimize the negative loglikelihood to estimate β.
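The paper's computation uses integrate and optim in R; the same calculation can be sketched in Python with scipy. For simplicity we take scalar Z with no intercept, so that β = (β_1, β_2); all names and values here are illustrative, not the paper's code.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import expit   # the logistic distribution function H

def p_obs(w, z, beta, alpha, sigma_x2, sigma_u2):
    """pr(Y = 1 | W, Z): integrate H(x*beta1 + z*beta2) against the normal
    density of X given (W, Z)."""
    lam = sigma_x2 / (sigma_x2 + sigma_u2)
    mu = z * alpha + lam * (w - z * alpha)
    sd = np.sqrt(lam * sigma_u2)
    dens = lambda x: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    f = lambda x: expit(beta[0] * x + beta[1] * z) * dens(x)
    # integrate over +/- 8 conditional standard deviations
    return quad(f, mu - 8 * sd, mu + 8 * sd)[0]

def neg_loglik(beta, y, w, z, alpha, sigma_x2, sigma_u2):
    """Negative observed-data loglikelihood, to be minimized over beta."""
    p = np.array([p_obs(wi, zi, beta, alpha, sigma_x2, sigma_u2)
                  for wi, zi in zip(w, z)])
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Here scipy.optimize.minimize(neg_loglik, beta_start, args=(y, w, z, alpha, sigma_x2, sigma_u2)) would then play the role of optim.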

C.3.4. The forms of Φ cat (·) and Q(·)
Since we assume the true model is pr(Y = 1|X, Z) = H{(X, Z^T)β}, it is easy to see the form of the categorical estimating function. We used the integrate function in the R package stats to compute the integrals.
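Because the conditional distribution of X given (W, Z) is normal, the integrals of the category indicators that arise in Φ_cat(·) and Q(·) reduce to differences of normal distribution functions. The following sketch, with a hypothetical function name and cutpoints of our choosing, shows this reduction.

```python
import numpy as np
from scipy.stats import norm

def category_probs(w, z, cuts, alpha, sigma_x2, sigma_u2):
    """pr(X in C_j | W, Z) for j = 1, ..., J, with categories defined by
    interior cutpoints c_1 < ... < c_{J-1}, using the normal conditional
    distribution of X given (W, Z)."""
    lam = sigma_x2 / (sigma_x2 + sigma_u2)
    mu = z * alpha + lam * (w - z * alpha)   # E(X | W, Z)
    sd = np.sqrt(lam * sigma_u2)             # sd(X | W, Z)
    edges = np.concatenate(([-np.inf], np.asarray(cuts, float), [np.inf]))
    cdf = norm.cdf(edges, loc=mu, scale=sd)
    return np.diff(cdf)   # length J, nonnegative, sums to 1
```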