Model averaging for varying-coefficient partially linear measurement error models

Abstract: In a 2003 paper, Hjort and Claeskens proposed a framework for studying the limiting distributions and asymptotic risk properties of model average estimators under parametric models. They also suggested a simple method for constructing confidence intervals for the parameters of interest estimated by model averaging. The purpose of this paper is to broaden the scope of the aforementioned study to include a semi-parametric varying-coefficient partially linear measurement error model. Within this context, we develop a model averaging scheme for the unknowns, derive the model average estimator's asymptotic distribution, and develop a confidence interval procedure for the unknowns with an actual coverage probability that tends toward the nominal level in large samples. We further show that confidence intervals constructed from the model average estimators are asymptotically the same as those obtained under the full model. A simulation study examines the finite sample performance of the model average estimators, and a real data analysis illustrates the application of the method in practice.


Introduction
Model selection has always been an integral part of statistical analysis. Well-known criteria for model selection include the AIC [1], Mallows' C_p [29], Cross Validation [32], BIC [31], Generalized Cross Validation [8], RIC [13], and FIC [5], among others. The search for the "best" model recognizes the existence of more than one plausible model structure, implying a level of uncertainty associated with the choice of model. However, this uncertainty is usually ignored when it comes to making an inference contingent on the chosen best model, and this results in overconfident inference about the unknowns [20,9,6,27]. It is also well known that many model selection techniques can be highly influenced by slight variations in the data.
One method for incorporating model uncertainty in statistical analysis is model averaging; instead of selecting a single model, model averaging compromises across the set of plausible models, weighted by some criteria that reflect the degree to which each model is trusted. Bayesian model averaging (BMA) has been promoted in a range of disciplines as a means of incorporating model uncertainty. Excellent surveys of the vast BMA literature can be found in [10,30,22], and [7]. A key component of BMA is the use of prior distributions of the unknowns and models. While this provides a formal framework for incorporating prior knowledge of the process being modeled, any poor handling of prior distributions can lead to undesirable behavior of the posterior distributions and model average estimator. Frequentist model averaging (FMA), on the other hand, precludes the need to specify any prior distribution, although how to determine an optimal weight choice by a data-driven approach is arguably the biggest challenge for the frequentist formulation.
Compared to the immense amount of BMA literature, the literature on FMA is more recent; nonetheless, a great deal of work has been invested in developing model weighting schemes for FMA estimators and investigating their properties. The early work of [2] described an approach that uses the exponent of the negative of the AIC value as the weight for an individual model. [36] and [38] developed an adaptive regression by mixing (ARM) algorithm, while [25] proposed a weight choice criterion based on risk minimization. More recently, [14,15] and [33] developed FMA estimators based on Mallows' criterion. Of particular relevance to the current study is the work of [20], who developed an asymptotic theory for frequentist model averaging in parametric models based on a local mis-specification framework, showing that FMA generally results in an estimator with a non-normal asymptotic distribution. They also suggested a simple method for constructing confidence intervals for the unknown parameters. Hjort and Claeskens' (2003) analysis has been extended to several other models, including Cox's hazard regression model [21], general semi-parametric models [4], the generalized additive partial linear model [40], and the censored regression model [41]. A summary of these recent developments can be found in [34].
The current paper extends Hjort and Claeskens' (2003) investigation to the varying-coefficient partially linear measurement error (VCPLE) model. The varying-coefficient partially linear (VCPL) model [39,12] allows the different covariates in the model to interact in a flexible way and has been an important development in the semi-parametric literature in recent years. It also covers many other semi-parametric models including the varying-coefficient model [18] and the partially linear model [11] as special cases. The VCPLE model considered in this paper is a version of the VCPL model where the covariates in the parametric component of the model are measured with additive errors. The VCPLE model was previously considered by [37], who suggested an alternative estimation procedure that leads to consistent estimators of the parametric and non-parametric components of the model. In this paper we are concerned with model averaging within the VCPLE framework; in particular, we focus on the derivation of the model average estimator's asymptotic distribution, and develop a method for constructing confidence intervals of the unknowns along the lines of [20]. We demonstrate that these confidence intervals have a coverage probability that tends toward the nominal level in large samples. We also prove that the FMA-based confidence intervals are asymptotically the same as the confidence intervals based on the full model.
The remainder of the paper is organized as follows. Section 2 presents the model setup and discusses the estimation method of the unknowns in each candidate model. Section 3 describes the model averaging scheme and presents the main theoretical results. Section 4 reports the results of a simulation study that examines the finite sample performance of the model average estimator. Section 5 applies the proposed method to a real data set on dietary intake measurements. Section 6 presents the conclusion. The appendix contains the proofs of lemmas and theorems.

Model setup and estimation methods
Consider the i.i.d. samples (Y_i, W_i, Z_i, T_i), i = 1, . . . , n, and the following VCPLE model:

Y_i = X_i^⊤ θ + Z_i^⊤ α(T_i) + ε_i,    W_i = X_i + U_i,    (1)

where Y_i is the response variable, (X_i, Z_i, T_i) are covariates, θ = (β^⊤, γ^⊤)^⊤ with β and γ being p- and q-dimensional coefficient vectors respectively, α(·) = {α_1(·), . . . , α_r(·)}^⊤ is an r-dimensional vector of unknown coefficient functions, and ε_i is a random error with mean 0 and variance σ², independent of (X_i, Z_i, T_i). As in [12] and [37], we assume that the dimension of T_i is one. Here, it is assumed that X_i cannot be observed; instead its surrogate W_i is observed, where U_i is a vector of random errors with mean 0 and covariance matrix Σ_u; further, U_i is independent of (X_i, Z_i, T_i) and ε_i. For analytical convenience we assume throughout our theoretical analysis that Σ_u is known. This last assumption entails no loss of generality because all results continue to hold if Σ_u is replaced by a consistent estimator when Σ_u is unknown. Clearly, when U_i ≡ 0, the VCPLE model reduces to the VCPL model, which contains many common models as special cases. For example, when θ ≡ 0, it reduces to the varying-coefficient model; when r = 1 and Z_i ≡ 1, it becomes the partially linear model.
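To make the setup concrete, the following sketch simulates data from a toy VCPLE model. The coefficient functions, dimensions, and error scales are illustrative assumptions of ours, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_vcple(n, beta, gamma, sigma_u=0.5, sigma_eps=1.0):
    """Draw one sample from a toy VCPLE model.

    Illustrative choices: alpha_1(t) = sin(2*pi*t), alpha_2(t) = cos(2*pi*t),
    Gaussian covariates, and a uniform index variable T.
    """
    p, q = len(beta), len(gamma)
    theta = np.concatenate([beta, gamma])            # theta = (beta', gamma')'
    X = rng.normal(size=(n, p + q))                  # latent covariates
    Z = rng.normal(size=(n, 2))                      # covariates with varying coefficients
    T = rng.uniform(size=n)                          # scalar index variable
    alpha = np.column_stack([np.sin(2 * np.pi * T), np.cos(2 * np.pi * T)])
    eps = rng.normal(scale=sigma_eps, size=n)
    Y = X @ theta + np.sum(Z * alpha, axis=1) + eps  # model (1)
    U = rng.normal(scale=sigma_u, size=(n, p + q))   # additive measurement error
    W = X + U                                        # observed surrogate for X
    return Y, W, Z, T
```

Only (Y, W, Z, T) are returned, mirroring the fact that X is unobserved.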
When there are no measurement errors, the profile least-squares method described in [12] can be used to estimate θ. To estimate α_j(t), write, for any given θ, Y_i^* = Y_i − X_i^⊤θ. Then model (1) becomes the general varying-coefficient model Y_i^* = Z_i^⊤α(T_i) + ε_i, and the following local linear approximation can be used to estimate α_j(t): α_j(t) ≈ α_j(t_0) + α_j′(t_0)(t − t_0) ≡ a_j + b_j(t − t_0), for any t in the neighborhood of t_0. Denote a = (a_1, . . . , a_r)^⊤ and b = (b_1, . . . , b_r)^⊤. Then a and b can be estimated by the local weighted least-squares method based on the criterion

Σ_{i=1}^n [ Y_i − X_i^⊤θ − Σ_{j=1}^r {a_j + b_j(T_i − t_0)} Z_ij ]² K_h(T_i − t_0),

where K_h(·) = h^{−1}K(·/h), K is a kernel function, and h is a bandwidth. Write the solution to this minimization problem as {â_1(t), . . . , â_r(t), h·b̂_1(t), . . . , h·b̂_r(t)}^⊤. Substituting {α̂_1(t), . . . , α̂_r(t)}^⊤ into model (2), we obtain Ŷ = SY and X̂ = SX, where S is the smoothing matrix determined by the local linear fit. Denote Ỹ = (I_n − S)Y and X̃ = (I_n − S)X, where I_n is an n×n identity matrix. Then model (3) reduces to Ỹ = X̃θ + ε, a standard linear regression model for which the ordinary least squares method can be used to estimate θ. Now, when the X_i's are measured with errors, [37] suggested the modified profile least-squares estimator θ̂, which is the solution to θ that minimizes

(Ỹ − W̃θ)^⊤(Ỹ − W̃θ) − n θ^⊤ Σ_u θ,

where W̃ = (I_n − S)W.
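As a rough illustration of the steps above, the sketch below builds a local-linear smoother S in T, forms the partialled-out quantities (I_n − S)Y and (I_n − S)W, and solves the attenuation-corrected normal equations. The Epanechnikov kernel, the explicit loop, and the function name are our own illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def profile_ls_me(Y, W, Z, T, Sigma_u, h):
    """Modified profile least-squares sketch for the linear coefficients.

    Each row of S is the local-linear varying-coefficient fit at T_i;
    the final step solves (Wt'Wt - n*Sigma_u) theta = Wt'Yt, the
    measurement-error-corrected normal equations.
    """
    n = len(Y)
    S = np.zeros((n, n))
    for i in range(n):
        u = (T - T[i]) / h
        k = np.maximum(0.75 * (1.0 - u**2), 0.0)        # Epanechnikov kernel weights
        D = np.hstack([Z, Z * (T - T[i])[:, None]])     # local-linear design at T_i
        coef = np.linalg.pinv((D * k[:, None]).T @ D) @ (D * k[:, None]).T
        S[i] = np.concatenate([Z[i], np.zeros(Z.shape[1])]) @ coef
    M = np.eye(n) - S
    Yt, Wt = M @ Y, M @ W
    # corrected normal equations; with Sigma_u = 0 this is ordinary profile LS
    return np.linalg.solve(Wt.T @ Wt - n * np.asarray(Sigma_u), Wt.T @ Yt)
```

With Sigma_u set to a zero matrix the sketch reduces to the error-free profile least-squares estimator.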

Estimation and inference based on model averaging
Unlike [37], who concentrated on the estimation of the coefficients in the linear component of the model based on a single candidate model, we consider estimation based on model averaging. We follow the local mis-specification framework suggested by [20] by setting the true value of θ to θ_true = (β^⊤, γ_true^⊤)^⊤ with γ_true = δ/√n, where the parameter vector δ = (δ_1, . . . , δ_q)^⊤ represents the degree of the model's departure from the narrow model in which θ = θ_0 = (β^⊤, 0^⊤)^⊤. Local parameterization was first introduced by [24], and has been a useful tool for asymptotic analysis.
The results in this section depend on the following technical conditions, which are also used in [12] and [37].

Estimation of coefficients under the full and partially restricted models
Partition B, Σ_u, W̃, Ũ = (I_n − S)U and X̃ conformably with the dimensions of β and γ, writing, for example, X̃ = (X̃_1 | X̃_2). Direct calculations then lead to closed-form expressions for the full-model estimators β̂ and γ̂. Altogether there are 2^q partially restricted models, one for each subset S of {1, . . . , q}; that is, while a partially restricted model includes every element of β, it contains only certain elements of γ_true. The full model corresponds to S = {1, . . . , q}, while the narrow model corresponds to S = ∅. Denote the coefficients of the partially restricted model in S by β_S and γ_S. We then have β_S = β and γ_S = Π_S^⊤ γ_true, where Π_S^⊤ is an |S| × q selection matrix in which the element matching γ_S in any given row takes the value of unity and zero otherwise, and |S| is the number of components of γ_true in the partially restricted model. Similarly, we let X̃_S, W̃_S and Σ_uS denote matrices in the partially restricted model S with definitions analogous to the corresponding matrices in the full model, and partition these matrices conformably with β_S and γ_S. These manipulations enable the derivation of the regression coefficient estimators β̂_S and γ̂_S of the partially restricted model in S. From equations (5)–(8), we obtain a set of equations characterizing the relationship between estimators under the full and partially restricted models. The following lemma establishes the asymptotic properties of estimators under the full and restricted models.
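The enumeration of the 2^q partially restricted models and their selection matrices Π_S^⊤ can be sketched as follows; zero-based indices for the elements of γ are a coding convention of ours.

```python
from itertools import chain, combinations
import numpy as np

def all_submodels(q):
    """Enumerate the 2^q subsets S of {0, ..., q-1} together with the
    |S| x q selection matrix Pi_S' satisfying gamma_S = Pi_S' @ gamma."""
    subsets = chain.from_iterable(combinations(range(q), k) for k in range(q + 1))
    models = []
    for S in subsets:
        Pi_t = np.zeros((len(S), q))
        for row, j in enumerate(S):
            Pi_t[row, j] = 1.0          # unit entry in each row picks out gamma_j
        models.append((S, Pi_t))
    return models

# q = 3 gives 2^3 = 8 candidate models, from the narrow model S = ()
# up to the full model S = (0, 1, 2), whose Pi_S' is the identity.
models = all_submodels(3)
```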
Lemma 1. Suppose conditions (C1)–(C6) hold and that U_i, ε_i and (X_i, Z_i, T_i) are mutually independent. Then the estimators under the full and partially restricted models converge jointly in distribution as n → ∞, where A, C_S and G_S are the limits of A_n, C_nS and G_nS respectively, and R^⊗2 = RR^⊤ for any matrix R.
From the proof of the results in the appendix, a consistent estimator F̂ of F is available; hence a consistent estimator of the asymptotic variance follows. The bias vector can be estimated by replacing A, C_S and δ by A_n, C_nS and δ̂ respectively.

Estimation by model averaging
In this subsection, we consider the estimation of the parameter µ_true = µ(β, γ_true) by model averaging. We assume that the parameter of interest µ does not depend on the non-parametric component because the estimator of this component is not √n-consistent. Let the estimator based on the partially restricted model in S be μ̂_S = µ(β̂_S, γ̂_S). The following theorem can be obtained, where H_S has the same form as H_nS except that A_n in H_nS is replaced by A in H_S.
The asymptotic bias and variance of μ̂_S follow from this result. Note that Var(Λ_S) can be estimated consistently by using θ̂_full in µ_β and µ_γ, and replacing G_S and P by Ĝ_n and P̂_n respectively.
With each partially restricted estimator being a submodel estimator, the model average estimator has the form

μ̂_avg = Σ_S c(S|δ̂) μ̂_S,    (11)

where the weight functions c(S|δ̂) sum to one. Theorem 2 depicts the asymptotic properties of the estimator μ̂_avg.
Theorem 2. Assume that µ is differentiable at θ_0, that the weight functions c(S|d) are continuous almost everywhere, and that conditions (C1)–(C6) hold. Then √n(μ̂_avg − µ_true) converges in distribution to a limit determined by Λ and D. If there were no measurement errors, Λ and D would be independent, and the variance would simplify accordingly. This theorem reveals that when there are no measurement errors, the model average estimator under the VCPL model framework has asymptotic mean and variance expressions similar to those of the model average estimators discussed in [20,21] and [4].

Interval estimation based on model averaging
Note from Theorem 2 that the asymptotic distribution of the model average estimator is non-normal. This concurs with the observation under parametric models in [20]. Here, we follow Hjort and Claeskens' (2003) approach of constructing confidence intervals based on the model average estimator. We demonstrate that the actual coverage probability of the interval converges to the intended level in large samples; furthermore, we prove that such a confidence interval based on the model average estimator is asymptotically equivalent to that constructed from the full model estimator, which follows an asymptotically normal distribution. The latter result concurs with the findings of [23] under a parametric set-up.
Assume that the conditions for Theorem 2 hold. Consider the confidence limits (low_avg, up_avg), where z is a standard normal quantile and ω̂ and κ̂ are consistent estimators of ω and κ. Then Pr{µ_true ∈ (low_avg, up_avg)} = Pr{−z ≤ T_n ≤ z} is the probability of the confidence interval containing the true parameter µ_true. As √n(μ̂_avg − µ_true) is an almost surely continuous function of M_n and δ̂, the Continuous Mapping Theorem and Slutsky's Theorem imply that the limiting variable of T_n follows a standard normal distribution. Hence Pr{−z ≤ T_n ≤ z} → 2Φ(z) − 1, where Φ is the standard normal distribution function. If we denote by μ̂_full the estimator of µ under the full model, then by formula (10) in Theorem 1, √n(μ̂_full − µ_true) converges in distribution to N(0, κ²). Accordingly, confidence limits (low_full, up_full) of µ_true can be constructed from μ̂_full. From the definition of μ̂_avg and equation (9), and by using Taylor series expansions, a comparison of equations (12), (13) and (15) shows that low_avg = low_full + o_P(1/√n) and up_avg = up_full + o_P(1/√n). Thus, the two confidence intervals, based on the model average estimator and the full model estimator respectively, are asymptotically identical.
More specifically, if µ is a linear combination of β and γ, then the remainder in (14) vanishes. Furthermore, as κ and ω are quantities relevant to the full model only, the estimators κ̂ and ω̂ are the same for the full model as for the model average. This means that if the parameter of interest is a linear combination of regression coefficients, the confidence interval developed from the model average (i.e., equation (12)) will be exactly identical to that obtained from the full model (i.e., equation (13)). Thus, if the investigator's main concern is interval estimation rather than point estimation, then the confidence interval based on the full model already serves the purpose, and model averaging provides no additional useful information. The interval constructed under the full model also has the advantage of being computationally simple.

Relationship between FMA and model selection estimators
This subsection studies the relationship between traditional model selection estimators based on information criteria and FMA under the setup of the VCPLE model. Along the lines of [26], we define the AIC, BIC, and RIC under the VCPLE framework, where θ̂_S represents the estimator of the regression coefficients in the reduced model. Given the estimator γ̂ under the full model, the relative magnitudes of an information criterion (say, the AIC) across different submodels are determined by the selection matrix, which is a function of the set S. Therefore, asymptotically, the AIC model selection estimator can be viewed as a model average estimator in the form of equation (11) with indicator functions as its weights: assuming that there are no ties among the AIC values, the AIC model selection estimator places a weight of one on the submodel with the smallest AIC value and zero on all others. The same result holds for the BIC and RIC model selection criteria. Evidently, the variance of μ̂_AIC differs from the variance of μ̂_S for each set S because the indicator function is also random. However, typically, the investigator uses the variance of the estimator from the chosen model (i.e., Var(μ̂_S) = Var(μ̂_AIC | AIC_S is the smallest)). We call this the naive approach, to distinguish it from the post-selection approach of [20], in which the variation of the indicator function is also taken into account.
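The contrast between smooth exponential-AIC weighting and selection-as-degenerate-averaging described above can be sketched as follows; the function names are ours.

```python
import numpy as np

def saic_weights(aic_values):
    """Smoothed-AIC weights c(S) proportional to exp(-AIC_S / 2), normalised
    to sum to one; S-BIC weights follow by substituting BIC values."""
    a = np.asarray(aic_values, dtype=float)
    w = np.exp(-(a - a.min()) / 2.0)   # subtract the minimum for numerical stability
    return w / w.sum()

def aic_selection_weights(aic_values):
    """Model selection as a degenerate model average: all weight goes to the
    submodel with the smallest AIC (assuming no ties)."""
    a = np.asarray(aic_values, dtype=float)
    w = np.zeros_like(a)
    w[a.argmin()] = 1.0
    return w

# Either weight vector combines submodel estimates as  mu_avg = weights @ mu_sub.
```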

Finite sample analysis by simulations
In this section, we evaluate the finite sample performance of the FMA estimator through simulations. The implementation of our method requires the selection of bandwidth for the non-parametric component of the model. This is an important yet unsolved problem for semi-parametric modeling [12]. We will not elaborate upon this problem here, because we focus primarily on the estimation of parameters in the linear component of the model, which is insensitive to the choice of the bandwidth. In our simulations we use a cross-validation method to choose the bandwidth parameter.
We refer to the ratio of the MSE for a given method to the MSE of the full model estimator as the relative MSE (RMSE). Thus, an RMSE smaller than unity indicates that the given method is superior to the full model estimator, and vice versa. Tables 1 and 2 report the results for ρ = 0.5 and ρ = 0 respectively. To facilitate readability, the smallest RMSE in each panel is flagged by a "†".
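The RMSE used in the tables can be computed from Monte Carlo replications as follows; the function name and inputs are our own illustrative choices.

```python
import numpy as np

def relative_mse(estimates, full_estimates, mu_true):
    """RMSE of a method: its MSE over Monte Carlo replications divided by
    the MSE of the full-model estimator; values below one favour the method."""
    mse = np.mean((np.asarray(estimates) - mu_true) ** 2)
    mse_full = np.mean((np.asarray(full_estimates) - mu_true) ** 2)
    return mse / mse_full
```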
Note that the full model estimator is always unbiased even if it over-fits the true model, but its variance can be larger than those produced by estimators that are biased.
The following observations may be noted from the results. First, in all cases considered, model averaging invariably delivers a superior RMSE to its model selection counterpart. Although there are exceptions, this superiority is generally more marked when σ_u = 0.1 than when σ_u = 0.5, and when ρ = 0 than when ρ = 0.5, ceteris paribus. Second, in the case of δ = δ^(1), no matter the values of ρ, σ_u and n, the full model estimator always yields the worst estimates; this is hardly surprising as the full model is grossly over-fitted when δ = δ^(1). For the other two choices of δ, full model estimation, with few exceptions, remains inferior to model averaging, but it can be a better strategy than model selection in a good number of cases. The improved performance of the full model estimator for these choices of δ is no surprise: when δ = δ^(2) or δ = δ^(3), the full model is either only mildly over-fitted or correctly specified. However, although the full model estimator is always asymptotically unbiased, in most cases its variance remains larger than those produced by the other strategies. Thus, the full model estimator frequently remains worse than the other estimators even when the full model is the true model or close to being the true model. That being said, the AIC and BIC model selection estimators both perform poorly when ρ = 0, δ = δ^(3), and σ_u = 0.1, having MSEs that are larger than that of the full model estimator for all three estimands and both values of n. Interestingly, it is also under these choices of ρ, δ and σ_u that model averaging is sometimes found to perform worse than full model estimation. This can be partially explained by noting that when δ = δ^(3), γ_1 = γ_2 = γ_3 = 1, and with the majority of submodels in the model average having at least one γ_j = 0, the model average estimator will likely be substantially biased.
When σ_u = 0.5, model selection typically has an edge over full model estimation irrespective of δ. Third, in the large majority of cases considered, the S-BIC model average estimator yields the smallest MSE; in the remaining cases where S-BIC model averaging is not the best strategy, the most accurate estimates are invariably produced by S-AIC averaging. In other words, in every case considered, the dominating estimator is either the S-AIC or the S-BIC model average estimator. Fourth, of the two model selection estimators, the BIC estimator is generally preferred to the AIC estimator, and there are a good number of instances where the BIC model selection estimator has a smaller MSE than the S-AIC model average estimator. Finally, the RMSE comparisons of estimators for n = 100 and n = 200 are reasonably similar.

Analysis of real data
Here, we apply our method to a subset of data obtained from the Continuing Survey of Food Intakes by Individuals (CSFII) conducted by the U.S. Department of Agriculture in 1985 and 1986. This data set contains dietary intake and related information for n = 1827 individuals between the ages of 25 and 50. Using the available data, we specify the following model for calorie intake, denoted by y, where x_1, x_2, x_3, x_4 and x_5 represent intake levels of fat, protein, carbohydrates, Vitamin A and Vitamin C respectively, x_6 is an indicator variable for alcohol consumption, x_7 is body mass index, z is income and t is age. As we believe that fat, protein and carbohydrates are the key determinants of calories, and we are primarily interested in the effects that these variables have on calorie intake, we treat x_1, x_2 and x_3 as mandatory in the parametric component of the model. Indeed, statistical results based on the full model reveal that only x_1, x_2, x_3 and x_6 are significant, and the coefficient estimates (in absolute value) of x_1, x_2 and x_3 are at least three-fold those of the other variables. As we are less interested in the effects of x_4, x_5, x_6 and x_7 on y, we treat this second group of variables as optional. One key role of the optional variables is to improve the estimation of the coefficients of the mandatory variables. This approach of distinguishing between mandatory and optional explanatory variables is adopted from [28] and [9].
We are interested in the estimation of the following four estimands: µ_1 = β_1, µ_2 = β_2, µ_3 = β_3 and µ_4 = β_1/β_2, based on five alternative estimation methods: FMA by S-AIC and S-BIC, model selection by AIC and BIC, and full model estimation. The estimands µ_1, µ_2 and µ_3 are of obvious interest because they represent the marginal effects that each of the mandatory explanatory variables has on calorie intake. The estimand µ_4 is also of interest as it measures the effect of fat relative to that of protein. Tables 3 and 4 present the point and interval estimation results. We observe from the tables that the two model selection methods produce identical results: for this data set, both the AIC and BIC select the model that contains x_1, x_2, x_3, x_4 and x_6. Results produced by the two FMA estimators are also quite similar; both S-AIC and S-BIC model averaging yield estimates of µ_1 and µ_4 that are larger, and estimates of µ_2 and µ_3 that are smaller, than the corresponding estimates obtained from model selection and full model estimation. The relatively large µ_4 estimates produced by the two model average estimators indicate that, among the estimation approaches considered, model averaging most accentuates the common belief that calorie intake is associated with fat consumption more than with protein consumption. As for interval estimation, note that for µ_1, µ_2 and µ_3, model averaging and full model estimation produce the same interval estimates because these estimands are all linear in the parameters. Table 4 shows that model selection generally results in wider confidence intervals than do full model estimation or model averaging.
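Because µ_4 = β_1/β_2 is nonlinear in the coefficients, its interval estimate relies on a delta-method linearization, which can be sketched as follows. The inputs (point estimates and an estimated asymptotic covariance of the √n-scaled estimators) are hypothetical, and the function is our own illustration rather than the paper's procedure.

```python
import numpy as np

def ratio_ci(b1, b2, cov, n, z=1.96):
    """Delta-method confidence interval for mu_4 = beta_1 / beta_2.

    cov is the estimated asymptotic covariance of sqrt(n)*(b1, b2);
    the gradient of b1/b2 with respect to (b1, b2) is (1/b2, -b1/b2**2).
    """
    mu = b1 / b2
    grad = np.array([1.0 / b2, -b1 / b2**2])
    se = np.sqrt(grad @ cov @ grad / n)     # standard error of the ratio
    return mu - z * se, mu + z * se
```

For the linear estimands µ_1, µ_2 and µ_3 no such linearization is needed, which is why their model-average and full-model intervals coincide exactly.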

Concluding remarks
In this section, we summarize our main findings and point to some directions for future research.
• In the context of the VCPLE model, we have considered frequentist model averaging in the manner of [20]. We have derived the asymptotic distribution of the FMA estimator of the unknown parameters of interest, and developed a confidence interval procedure based on the FMA estimator. Asymptotically, the resultant interval achieves the target nominal coverage probability, and is identical to the confidence interval obtained from the estimation of the full model. More remarkably, if the parameter of interest is a linear combination of regression coefficients, then the equivalence between the FMA-based and full-model-based confidence intervals also holds in finite samples. In view of the simulation findings suggesting that FMA generally has an advantage over full model estimation in point estimation, alternative methods of interval estimation based on the FMA approach that result in more efficient estimates likely exist, and this is an area that undoubtedly deserves more study.
• Throughout this paper, we assume that Σ_u is known. To estimate Σ_u when it is unknown, it is usually assumed that replicated observations of X_i are available, such that W_ij = X_i + U_ij, j = 1, . . . , J_i, i = 1, . . . , n are observed [3,26]. Then Σ_u can be consistently and unbiasedly estimated by

Σ̂_u = Σ_{i=1}^n Σ_{j=1}^{J_i} (W_ij − W̄_i)(W_ij − W̄_i)^⊤ / Σ_{i=1}^n (J_i − 1),

where W̄_i = Σ_{j=1}^{J_i} W_ij / J_i. The substitution of Σ_u by Σ̂_u does not complicate the theoretical analysis in any substantial way; all asymptotic results continue to hold when Σ_u is replaced by a consistent estimator. Having said that, since Ū_i = Σ_{j=1}^{J_i} U_ij / J_i has a smaller variance than U_ij, an arguably better way to proceed would be to modify model (1) by replacing W_i with the replicate mean W̄_i. In this case, the distributions of the Ū_i's are different if the J_i's are not all identical, and the expression of F in Lemma 1 should be modified accordingly, assuming that the relevant limit exists.
• The only type of measurement error we have considered is one where the errors are present in the linear part of the model.
Cases where covariates in the non-parametric part are measured with errors, or where measurement errors arise in a more general framework, such as the generalized varying-coefficient partially linear model g(EY_i) = X_i^⊤θ + Z_i^⊤α(T_i), are definitely worthy of study.
• While we considered only model averaging based on weights constructed from values of the AIC and BIC, other weight choice techniques exist [14,15,16,33,17]. The consideration of these alternative weight choice mechanisms in the context of the VCPLE model also warrants future study.
• While we assumed i.i.d. observations, the extension to the non-i.i.d. situation will be a fruitful avenue for future research. [35] recently considered model averaging with non-i.i.d. observations in a linear measurement error model, which is a special case of the more general VCPLE model framework examined here.
• It should be mentioned that although the FMA strategy being studied produces a √n-consistent estimator of the parametric component of the model, this strategy, when applied to the non-parametric component, does not yield an estimator that converges to the unknown function at the rate of 1/√n. It is for this reason that throughout the paper we focused only on the estimators in the parametric component. It remains for future research to develop an FMA strategy for the non-parametric component that possesses optimal properties.
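The replicate-based estimator of Σ_u discussed in the concluding remarks can be sketched as follows; the function and the data layout (a list of per-subject replicate arrays) are our own illustrative choices.

```python
import numpy as np

def estimate_sigma_u(W_reps):
    """Estimate Sigma_u from replicated surrogates W_ij = X_i + U_ij.

    W_reps is a list of (J_i x d) arrays, one per subject.  The estimator
    pools within-subject deviations from the subject means W_bar_i and
    divides by sum_i (J_i - 1), as in the replication formula above.
    """
    d = W_reps[0].shape[1]
    num = np.zeros((d, d))
    dof = 0
    for Wi in W_reps:
        dev = Wi - Wi.mean(axis=0)     # W_ij - W_bar_i
        num += dev.T @ dev
        dof += Wi.shape[0] - 1
    return num / dof
```

Subjects with a single replicate (J_i = 1) contribute nothing to either the numerator or the degrees of freedom, as the formula requires.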
From the Continuous Mapping Theorem and Slutsky's Theorem, the above equation converges in distribution to the stated limiting variable.

Proof of Theorem 2. From the definition of the FMA estimator μ̂_avg in equation (11), we have √n(μ̂_avg − µ_true) = Σ_S c(S|δ̂) √n(μ̂_S − µ_true).    (18)
From the proof of Theorem 1, √n(μ̂_S − µ_true) on the right-hand side of (18) can be represented as a linear function of M_n and δ̂. As c(S|d) is almost surely continuous, √n(μ̂_avg − µ_true) is an almost surely continuous function of M_n and δ̂. Thus, applying the Continuous Mapping Theorem, Slutsky's Theorem, and Theorem 1, we obtain the required result.