Empirical likelihood for quantile regression models with response data missing at random

Abstract This paper studies quantile linear regression models with response data missing at random. A quantile empirical-likelihood-based method is first proposed for such models. A class of quantile empirical log-likelihood ratios for the regression parameters is defined, comprising the quantile empirical likelihood ratio with complete-case data, the weighted quantile empirical likelihood ratio and the imputed quantile empirical likelihood ratio. Then, a bias-corrected quantile empirical log-likelihood ratio is constructed for the mean of the response variable at a given quantile level. These quantile empirical log-likelihood ratios are proved to be asymptotically $\chi^2$ distributed. Furthermore, a class of estimators for the regression parameters and the mean of the response variable is constructed, and the asymptotic normality of the proposed estimators is established. The results can be used directly to construct confidence intervals (regions) for the regression parameters and the mean of the response variable. Finally, simulation studies are conducted to assess the finite-sample performance, and a real-world data set is analyzed to illustrate the application of the proposed method.


Introduction
Since the seminal work of Koenker [1], quantile regression (QR) has become an indispensable and versatile tool for statistical research owing to its strong performance and elegant mathematical properties. It immediately attracted considerable attention, resulting in numerous papers (e.g., [1][2][3][4][5][6][7]) devoted to theoretical extensions of the topic. Moreover, QR has been widely applied in fields such as economics, finance, biology, and medicine. Compared with the mean regression model, which is commonly fitted by traditional least squares (LS) methods, QR directly estimates the effects of the covariates at different quantiles of the response variable and therefore provides more information about the distribution of the response. Furthermore, QR is less sensitive to outliers in the response, because the quantile loss grows linearly rather than quadratically in the residual.
Despite significant theoretical advances and a rapidly growing literature on QR, only scant attention has been paid to QR when the sample contains missing values, which may substantially distort the results. In fact, missing data are commonplace in practice for various reasons, such as loss of information caused by uncontrollable factors, unwillingness of some sampled units to supply the desired information, and failure on the part of investigators to gather correct information. Since the early 1970s, spurred by advances in computer technology that made laborious numerical calculations feasible, the literature on statistical analysis of real data with missing values has flourished in applied work; see [8][9][10][11][12]. Although missing data analysis has a long history in statistics, little work on QR has taken missing data into account. Recently, Yoon [13] proposed an imputation method in which the imputed values are drawn from the conditional quantile function of the incompletely observed response, but this method is valid only under independent and identically distributed (i.i.d.) errors. Wei [7] developed an iterative imputation procedure for covariates with missing values in a linear QR model that is valid under non-i.i.d. error terms. Lv [5] discussed smoothed empirical likelihood analysis with missing response in partially linear quantile regression. Sherwood [6] proposed an inverse probability weighting QR approach for analyzing health care cost data when the covariates are missing at random (MAR). Sun [14] studied QR for competing risk data when the failure type was missing. Chen [4] discussed efficient QR analysis with missing observations. Shu [15] proposed some imputation methods for quantile estimation under MAR.
On the other hand, the empirical likelihood (EL) method, introduced by Owen [16,17], has many advantages over normal approximation methods for constructing confidence intervals. For example, the EL method produces confidence intervals or regions whose shape and orientation are determined entirely by the data, and the EL regions are range preserving and transformation respecting. Many authors have used this method for linear, nonparametric and semiparametric regression models. In the quantile setting, Chen [18] constructed EL confidence intervals for population quantiles. Tang [19] developed an EL approach for estimating equations with missing data. Wang [20] considered EL for quantile regression models with longitudinal data. Whang [21] proposed a smoothed EL and discussed its higher-order properties with cross-sectional data. Otsu [22] studied the first-order approximation of a smoothed conditional EL approach. The EL method has also been used for the analysis of censored survival data; see, for example, Zhao [23].
In this paper, an empirical-likelihood-based method is proposed to study quantile linear regression models with response data missing at random. A class of quantile empirical log-likelihood (QEL) ratios for the regression parameters is first defined, which includes the QEL ratio with complete-case data, the weighted QEL ratio and the imputed QEL ratio. Statistical inference on the mean of the response at a given quantile level is then studied, leading to a bias-corrected QEL ratio for this mean. It is proved that the QEL ratios of both the regression parameters and the mean of the response at a given quantile level are asymptotically $\chi^2$ distributed. To compare the quantile empirical likelihood method with a normal approximation method, we also construct a class of estimators for the regression parameters and the mean of the response at a given quantile level. It is shown that these estimators are asymptotically normal. Furthermore, we derive consistent estimators of the asymptotic variances, so that confidence intervals (regions) for the regression parameters and the mean of the response at a given quantile level can be constructed directly.
The rest of this paper is organized as follows. In Section 2, a class of QEL ratios and estimators for the regression parameters are constructed with missing response data and their asymptotic distributions are derived. In Section 3, a bias-corrected QEL ratio and the maximum quantile empirical likelihood estimator for the mean of the response at a given quantile level are proposed and their asymptotic properties are studied. A simulation study is conducted in Section 4 to demonstrate the finite-sample performance of the proposed method. A real-world data set is analyzed in Section 5 to illustrate the application of the proposed method. The proofs of the main results are postponed to Section 6.

Quantile empirical likelihood (QEL) method with missing response
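Consider the quantile linear regression model
$$Y_i = X_i^T \beta_\tau + \varepsilon_i, \quad i = 1, 2, \ldots, n, \tag{1}$$
where $Y_i$ is the $i$th observation of the response $Y$, $X_i$ is the $i$th observation of the covariates $X$ and is a $d \times 1$ vector, $\tau \in (0, 1)$ is the quantile level of interest, $\beta_\tau$ is a $d \times 1$ vector of unknown quantile regression parameters, and $\varepsilon_i$ is the error satisfying $P(\varepsilon_i < 0 \mid X_i) = \tau$ for $i = 1, 2, \ldots, n$.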
For model (1), we focus on the situation where some observations of $Y$ in a sample of size $n$ may be missing while $X$ is observed completely. As a consequence, we have an incomplete sample $\{X_i, Y_i, \delta_i\}_{i=1}^n$ with $\delta_i = 0$ if $Y_i$ is missing and $\delta_i = 1$ otherwise. Throughout this paper, we assume that the observations of $Y$ are missing at random (MAR), which implies that $\delta$ and $Y$ are conditionally independent given $X$; that is, $P(\delta = 1 \mid X, Y) = P(\delta = 1 \mid X) = p(X)$. As pointed out in [24], MAR is a common assumption for statistical analysis with missing data and is reasonable in many practical situations. Hereafter, we simply write $\beta_\tau$ as $\beta$ whenever no confusion arises.
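To make the sampling scheme concrete, the following is a minimal sketch of drawing an incomplete sample under MAR. It assumes a scalar covariate, standard normal errors (so that the quantile level is 0.5) and a logistic selection probability; the function names and the logistic form are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def selection_prob(x):
    """Hypothetical selection probability p(x) = P(delta = 1 | X = x)."""
    return 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * x)))

def simulate_mar_sample(n, beta, rng):
    """Draw (X_i, Y_i, delta_i) with Y_i missing at random given X_i."""
    x = rng.normal(size=n)        # covariate, always observed
    eps = rng.normal(size=n)      # P(eps < 0 | X) = 0.5, i.e. tau = 0.5
    y = beta * x + eps            # linear model with d = 1
    # delta depends on X only, never on Y: this is the MAR mechanism
    delta = (rng.uniform(size=n) < selection_prob(x)).astype(int)
    y_obs = np.where(delta == 1, y, np.nan)  # missing responses stored as NaN
    return x, y_obs, delta

rng = np.random.default_rng(0)
x, y_obs, delta = simulate_mar_sample(500, beta=1.0, rng=rng)
print("observed fraction:", delta.mean())
```

Because the indicator is generated from the covariate alone, the conditional independence of $\delta$ and $Y$ given $X$ holds by construction in this sketch.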

Quantile empirical likelihood with complete-case data
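In model (1), a vector $\hat\beta_Q$ is called the complete-case quantile regression estimator of $\beta$ if
$$\hat\beta_Q = \arg\min_{\beta} \sum_{i=1}^n \delta_i \rho_\tau(Y_i - X_i^T\beta), \tag{2}$$
where $\rho_\tau(u) = u(\tau - I(u < 0))$ is the quantile loss function and $I(\cdot)$ is the indicator function. In addition, $\beta$ satisfies the estimating equation
$$E\{\delta X \psi_\tau(Y - X^T\beta)\} = 0,$$
where $\psi_\tau(u) = \tau - I(u < 0)$ is the quantile score function. The quantile empirical log-likelihood ratio function for $\beta$ with complete-case data is defined as
$$l_c(\beta) = -2 \max\Big\{ \sum_{i=1}^n \log(n p_i) : p_i \ge 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i Z_{ic}(\beta) = 0 \Big\},$$
with $Z_{ic}(\beta) = \delta_i X_i \psi_\tau(Y_i - X_i^T\beta)$, and $\hat\beta_{QEL} = \arg\max_\beta \{-l_c(\beta)\}$ is called the maximum quantile empirical likelihood estimator of $\beta$ with complete-case data.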
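As an illustration of how an empirical log-likelihood ratio of this kind can be evaluated numerically, here is a minimal sketch. It assumes the standard Owen-type dual form $2\sum_i \log(1 + \lambda^T Z_i)$, takes the auxiliary variables to be $\delta_i X_i$ times the quantile score of the residual, and uses a scalar covariate; the function names, the damped Newton solver and the constant selection probability of 0.7 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantile_score(u, tau):
    """Quantile score: tau - I(u < 0)."""
    return tau - (np.asarray(u) < 0).astype(float)

def el_log_ratio(z, max_iter=100, tol=1e-10):
    """Empirical log-likelihood ratio 2 * sum(log(1 + lam' z_i)) for E[Z] = 0.

    lam maximizes the concave function sum(log(1 + lam' z_i)) and is found by
    a damped Newton iteration. Assumes 0 lies inside the convex hull of z.
    """
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    lam = np.zeros(z.shape[1])
    value = 0.0                                    # objective at lam = 0
    for _ in range(max_iter):
        zw = z / (1.0 + z @ lam)[:, None]
        grad = zw.sum(axis=0)                      # gradient of the objective
        if np.linalg.norm(grad) < tol:
            break
        step = np.linalg.solve(zw.T @ zw, grad)    # Newton ascent direction
        t = 1.0
        while t > 1e-12:                           # halve until feasible and not worse
            w_new = 1.0 + z @ (lam + t * step)
            if np.all(w_new > 0) and np.log(w_new).sum() >= value:
                break
            t *= 0.5
        else:
            break                                  # no acceptable step found
        lam = lam + t * step
        value = np.log(1.0 + z @ lam).sum()
    return 2.0 * value

# Hypothetical data: scalar covariate, true beta = 1, tau = 0.5
rng = np.random.default_rng(1)
n, tau = 300, 0.5
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)
delta = (rng.uniform(size=n) < 0.7).astype(float)  # assumed selection probability

def complete_case_ratio(beta):
    # rows with delta = 0 are exactly zero, so missing responses never contribute
    z = delta * x * quantile_score(y - beta * x, tau)
    return el_log_ratio(z)

print(complete_case_ratio(1.0), complete_case_ratio(2.0))
```

The ratio should be small near the true parameter and grow quickly away from it, which is what makes it usable for confidence regions. The solver does not check the convex hull condition, so a production version would need to handle candidate values of $\beta$ for which no solution exists.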

Weighted quantile empirical likelihood
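Using the method in Section 2.1, a weighted quantile empirical log-likelihood ratio function for $\beta$ is defined as
$$l_w(\beta) = -2 \max\Big\{ \sum_{i=1}^n \log(n p_i) : p_i \ge 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i Z_{iw}(\beta) = 0 \Big\}, \tag{4}$$
where $Z_{iw}(\beta) = \frac{\delta_i}{p(X_i)} X_i \psi_\tau(Y_i - X_i^T\beta)$, $\psi_\tau(u) = \tau - I(u < 0)$ is the quantile score function, and $p(x) = P(\delta = 1 \mid X = x)$. Here $p(x)$ is called a selection probability function. Note that the selection probability in (4) is regarded as known. If the selection probability is unknown, it can be estimated by a kernel smoothing method. An estimator of $p(x)$ can be defined by
$$\hat p(x) = \frac{\sum_{i=1}^n \delta_i K\big((x - X_i)/h_n\big)}{\sum_{i=1}^n K\big((x - X_i)/h_n\big)}, \tag{5}$$
where $K(\cdot)$ is a kernel function and $h_n$ controls the amount of smoothing used in the estimation. Here, $\{h_n\}_{n=1}^{\infty}$ is a sequence of positive numbers tending to zero. Consequently, by replacing $p(x)$ in (4) with $\hat p(x)$, a weighted quantile empirical log-likelihood ratio function with estimated selection probability is obtained.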
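A kernel smoothing estimate of the selection probability can be sketched as follows. This is a minimal illustration that assumes a scalar covariate, a Nadaraya-Watson (local constant) form and the Epanechnikov kernel, which is bounded and supported on $[-1, 1]$; the function names, the bandwidth 0.3 and the logistic truth used to generate the data are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def estimate_selection_prob(x_eval, x, delta, h):
    """Kernel-weighted average of delta around each evaluation point."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    u = (x_eval[:, None] - np.asarray(x, dtype=float)[None, :]) / h
    k = epanechnikov(u)
    denom = k.sum(axis=1)
    numer = (k * np.asarray(delta, dtype=float)[None, :]).sum(axis=1)
    out = np.full(x_eval.shape, np.nan)   # NaN where no sample point is within h
    ok = denom > 0
    out[ok] = numer[ok] / denom[ok]
    return out

# Hypothetical data with a known selection probability for comparison
rng = np.random.default_rng(2)
x = rng.normal(size=2000)
p_true = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * x)))
delta = (rng.uniform(size=x.size) < p_true).astype(int)
p_hat = estimate_selection_prob([0.0, 1.0], x, delta, h=0.3)
print(p_hat)
```

In a weighted ratio the estimate enters through $\delta_i / \hat p(X_i)$, so values of $\hat p$ close to zero inflate individual terms; this is one practical reason a lower bound on the selection probability is usually assumed.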

Quantile empirical likelihood with imputed values
For the quantile empirical likelihood with complete-case data and the weighted quantile empirical likelihood, the information contained in the data is not fully exploited. Since incomplete cases are discarded in constructing the empirical likelihood ratio, the coverage accuracy of the confidence regions is reduced when there are many missing values. To resolve this issue, the missing responses are imputed by quantile regression. We introduce auxiliary random variables $Z_{iI}(\beta)$, which depend on the estimator $\hat\beta_Q$, and a quantile empirical log-likelihood ratio function based on imputed values is defined by using $Z_{iI}(\beta)$ in the construction of Section 2.1. This ratio function is more appropriate than the weighted quantile empirical likelihood ratio function because it makes fuller use of the information contained in the data. In addition, $\hat\beta_Q$ in $Z_{iI}(\beta)$ can be replaced by $\hat\beta_{QEL}$.

Asymptotic properties
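In this section, some theoretical results on the asymptotic distributions of the quantile empirical likelihood ratios and of the estimators proposed in Sections 2.1-2.3 are established. We first give the asymptotic distributions of the ratios.
Theorem 2.1. Suppose that Conditions C1-C6 in the Appendix all hold. If $\beta$ is the true parameter, then each of the quantile empirical log-likelihood ratios of Sections 2.1-2.3, denoted generically by $l(\beta)$, satisfies $l(\beta) \xrightarrow{L} \chi^2_d$.
It follows from Theorem 2.1 that an approximate $1 - \alpha$ confidence region for $\beta$ can be formulated by $\{\beta : l(\beta) \le c_\alpha\}$, where $c_\alpha$ satisfies $P(\chi^2_d \le c_\alpha) = 1 - \alpha$. The following theorem demonstrates that $\hat\beta_Q$ and $\hat\beta_{QEL}$ have the same asymptotic normal distribution.
Theorem 2.2. Suppose that Conditions C1-C6 in the Appendix hold. Then $\sqrt{n}(\hat\beta_Q - \beta) \xrightarrow{L} N(0, D)$ and $\sqrt{n}(\hat\beta_{QEL} - \beta) \xrightarrow{L} N(0, D)$.
In order to construct the confidence region of $\beta$, the asymptotic covariance matrix $D$ can be estimated by a consistent estimator $\hat D$, which yields $\hat D^{-1/2}\sqrt{n}(\hat\beta - \beta) \xrightarrow{L} N(0, I_d)$ for $\hat\beta = \hat\beta_Q$ or $\hat\beta_{QEL}$. Therefore, the confidence regions of $\beta$ can be constructed by using (7).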
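As a worked instance of the empirical likelihood confidence region, the display below writes the region for a generic ratio, here denoted $l(\beta)$ as a placeholder symbol for any of the ratios of this section, and lists the usual chi-square critical values at the 95% level for one and two parameters. It assumes the degrees of freedom equal the dimension $d$ of $\beta$.

```latex
\[
  \mathcal{R}_{\alpha} = \{\beta : l(\beta) \le c_{\alpha}\},
  \qquad P(\chi^2_d \le c_{\alpha}) = 1 - \alpha ,
\]
\[
  d = 1,\ \alpha = 0.05:\ c_{\alpha} \approx 3.841;
  \qquad
  d = 2,\ \alpha = 0.05:\ c_{\alpha} \approx 5.991 .
\]
```

For a scalar parameter the region is therefore the set of values whose ratio does not exceed about 3.841, and no variance estimate is needed to form it, in contrast to the normal approximation.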

Quantile empirical likelihood for the mean of the response
This section first provides methods for inference on the mean of the response, $\theta$, using empirical likelihood. A weighted quantile regression imputation is then used to construct a weighted-corrected quantile empirical likelihood ratio for $\theta$ that has an asymptotic $\chi^2$ distribution.

Weighted-corrected quantile empirical likelihood (WCQEL)
First, we introduce an auxiliary random variable for each observation, which depends on $\beta$ and on the selection probability $p(X_i)$. When $\beta$ is the true parameter, a quantile empirical log-likelihood ratio function $l(\theta)$ can be defined from these auxiliary variables.
By analogy with the work of Owen [17], it is easy to see that $l(\theta)$ is asymptotically $\chi^2$ distributed with one degree of freedom, i.e., $l(\theta) \xrightarrow{L} \chi^2_1$. However, $\beta$ and $p(\cdot)$ are usually unknown, and hence $l(\theta)$ cannot be used directly to make statistical inference on $\theta$. Accordingly, $p(\cdot)$ is replaced by its estimator defined in (5), and $\beta$ is computed by the following procedure.
(a) Simulate $\tau_j \sim \mathrm{Uniform}(0, 1)$ independently for $j = 1, 2, \ldots, J$.
(b) For each $j = 1, 2, \ldots, J$, calculate $\hat\beta(\tau_j)$ as defined in (2).
As a result, an estimator of $Y_i$, denoted by $\hat Y_i$, can be obtained by substituting $\beta$ and $p(X_i)$ with $\hat\beta_{SQ}$, the estimator produced by this procedure, and $\hat p(X_i)$, respectively. Then a weighted-corrected quantile empirical log-likelihood ratio function $\hat l(\theta)$ for $\theta$ can be defined with $\hat Y_i$ in place of the auxiliary variable. The following theorem shows that $\hat l(\theta)$ and $l(\theta)$ have the same asymptotic distribution.

Normal approximation
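Based on $\hat Y_i$ defined in (8), an estimator $\hat\theta_{QWI}$ of $\theta$ is constructed. Meanwhile, we call $\hat\theta_{QME} = \arg\max_\theta \{-\hat l(\theta)\}$ a maximum quantile empirical likelihood estimator of $\theta$. The asymptotic normality of $\hat\theta_{QWI}$ and $\hat\theta_{QME}$ is given in the following theorem.
Theorem 3.2. Suppose that Conditions C1-C6 in the Appendix hold. Then $\sqrt{n}(\hat\theta_{QWI} - \theta)$ and $\sqrt{n}(\hat\theta_{QME} - \theta)$ are asymptotically normal with mean zero and asymptotic variance $V$. Furthermore, a consistent estimator of $V$ can be formulated.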

A simulation study
In this section, a simulation study is carried out to investigate the finite-sample performance of the proposed approaches. We consider two models, one of which is heteroscedastic.
In both models, the observations $X_i$ ($i = 1, 2, \ldots, n$) of the covariate $X$ were drawn from $N(0, 1)$, $\varepsilon_i$ ($i = 1, 2, \ldots, n$) are i.i.d. $N(0, 1)$, and the remaining model parameter and $\beta$ were set to 0.5 and 1, respectively. In the simulation study, we focus on $\tau = 0.5, 0.8$. We considered three selection probability functions proposed by Wang and Rao (see [8]), referred to below as Cases 1, 2 and 3. The results are reported in Tables 1-4.
Tables 1-4 show the following results. Firstly, for Case 1, QIEL yields slightly longer intervals but higher coverage probabilities than the other methods. For Cases 2 and 3, QIEL performs better than the other methods in the sense that its confidence intervals have uniformly shorter average lengths and higher coverage probabilities, which indicates that quantile regression imputation is needed when the missing rate is large. Secondly, both QCEL and QWEL result in slightly longer intervals but higher coverage probabilities than NA($\hat\beta_{QEL}$) and NA($\hat\beta_Q$). In addition, the confidence intervals obtained by NA($\hat\beta_{QEL}$) and NA($\hat\beta_Q$) show nearly equal lengths and coverage accuracies in the same case. Thirdly, as expected, all the interval lengths decrease and the empirical coverage probabilities increase as $n$ increases for every given missing rate. The missing rate also affects the interval length and the coverage probability: in general, the interval length increases and the coverage probability decreases as the missing rate increases for every fixed sample size. However, the two values change only slightly for the QIEL method because it uses quantile regression imputation. Furthermore, QIEL still performs much better than the other methods for the heteroscedastic model.
It is seen from Table 5 that WCQEL produces slightly longer intervals but higher coverage probabilities than NA does. All the coverage probabilities increase and the average lengths decrease as $n$ increases. In addition, the coverage probabilities and average lengths depend on the selection probability function $p(x)$ and the quantile level $\tau$.
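The two summary measures compared in the simulation, empirical coverage probability and average interval length, can be computed from replicated intervals as in the following minimal sketch. The function name and the four toy intervals are hypothetical and serve only to show the calculation; they are not results from the paper.

```python
import numpy as np

def coverage_and_length(lower, upper, true_value):
    """Empirical coverage probability and average length of replicated intervals."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= true_value) & (true_value <= upper)
    return covered.mean(), (upper - lower).mean()

# Four hypothetical replications of an interval for a parameter whose true value is 1
lower = [0.8, 0.9, 1.1, 0.7]
upper = [1.2, 1.3, 1.5, 1.1]
coverage, avg_length = coverage_and_length(lower, upper, true_value=1.0)
print(coverage, avg_length)
```

In a full study each row would come from one simulated data set, with the interval endpoints obtained either by inverting an empirical likelihood ratio or from a normal approximation.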

A real-data example
The data originally presented in [25] to support the proposition that food expenditure constitutes a declining share of personal income are investigated in this section. The data set, which contains no missing values, consists of 235 budget surveys of 19th century European working class households. More details of the discussion on these data can be found in [26]. We consider the following linear QR model: $Y_i = \beta_0(\tau) + \beta_1(\tau) X_i$, $i = 1, 2, \ldots, 235$, where $Y$ is the centered annual household food expenditure and $X$ is the centered annual household income in Belgian francs. In order to use the data set to illustrate our method, artificial missing data were created by deleting some of the response values at random. We assume that 25% of the response values in the data set are missing. The missing indicator $\delta$ is generated from the probability function $p(x) = 0.9 - 0.2|x - 1|$ if $|x - 1| \le 4.5$, and $0.1$ otherwise. We now present the estimator and the 95% confidence interval of $\beta$ based on the proposed QIEL method and on the normal approximation (NA) method based on Theorem 2.2, with $\tau = 0.4$ and $0.7$. The results are shown in Table 6. From Table 6, we can see that the confidence interval obtained by the QIEL method is much shorter than that obtained by the NA method, which favors the former method in this example.
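The artificial deletion step can be sketched as follows. This is a minimal illustration that reads the stated probability function as $0.9 - 0.2|x - 1|$ on $|x - 1| \le 4.5$ and $0.1$ elsewhere (the minus sign is an assumption about the intended formula); the function names are hypothetical, and the covariate values are simulated stand-ins rather than the actual income data.

```python
import numpy as np

def selection_prob(x):
    """Assumed p(x): 0.9 - 0.2|x - 1| if |x - 1| <= 4.5, and 0.1 otherwise."""
    d = np.abs(np.asarray(x, dtype=float) - 1.0)
    return np.where(d <= 4.5, 0.9 - 0.2 * d, 0.1)

def draw_missing_indicator(x, rng):
    """delta_i = 1 (response kept) with probability p(x_i), else 0 (deleted)."""
    x = np.asarray(x, dtype=float)
    return (rng.uniform(size=x.shape) < selection_prob(x)).astype(int)

# Stand-in covariate values; the real example would use the 235 centered incomes
rng = np.random.default_rng(3)
x = rng.normal(size=235)
delta = draw_missing_indicator(x, rng)
print("missing fraction:", 1.0 - delta.mean())
```

The realized missing fraction depends on the distribution of the covariate, so it need not equal 25% for the stand-in values used here.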

Proofs of the main results
Let $r \ge 2$ be an integer. $g(x)$, $f(\cdot \mid x)$ and $F(\cdot \mid x)$ denote the density function of $X$, and the density and distribution functions of $\varepsilon$ conditional on $X = x$, respectively. Let $c$ be a positive constant which is independent of $n$ and may take different values in different places. The following conditions will be used in this section.
(C1) $\{(Y_i, X_i) : i = 1, 2, \ldots, n\}$ are independent and identically distributed random vectors.
(C2) Both $p(x)$ (the selection probability function) and $g(x)$ have bounded derivatives up to order $r$ almost surely, and $\inf_x p(x) > 0$.
(C3) $K(\cdot)$ is a kernel function of order $r$ and is bounded and compactly supported on $[-1, 1]$. Furthermore, there exist positive constants $C_1$, $C_2$ and $\rho$ such that $C_1 I[\|u\| \le \rho] \le K(u) \le C_2 I[\|u\| \le \rho]$.
(C4) $P(\|X\| > M_n) = o(n^{-1/2})$, where $0 < M_n \to \infty$ as $n \to \infty$.
(C5) The positive bandwidth parameter $h$ satisfies $nh^{2r} \to 0$ as $n \to \infty$.
In (11) and (12), $Z_i(\beta)$ is taken to be $Z_{iw}(\beta)$, with known or estimated selection probability, or $Z_{iI}(\beta)$, $A = E\{\omega(X) f(0 \mid X) XX^T\}$ and $B = \tau(1 - \tau)E\{\omega(X) XX^T\}$, with $\omega(x) = 1/p(x)$ when $Z_i(\beta)$ is one of the two weighted versions and $\omega(x) = 1$ when $Z_i(\beta) = Z_{iI}(\beta)$.
Proof. (a) The case $Z_i(\beta) = Z_{iw}(\beta)$, $i = 1, 2, \ldots, n$, is proved first. Some simple calculation gives a term $J$ for which it is easy to obtain $E(J) = 0$ and $\mathrm{Cov}(J) = B$. It then follows from the central limit theorem that (11) holds. In a similar way, we can prove (12).
(b) Similarly to the proof of Theorem 3 in [28], the required approximation follows from Conditions C2, C3 and C5, and (11) is then obtained immediately. The proof of (12) is similar to that in case (a) and hence is omitted here.
(c) When $Z_i(\beta) = Z_{iI}(\beta)$ for $i = 1, 2, \ldots, n$, direct calculation, together with an easily shown bound on the term arising from the incomplete cases, gives a representation with which we can prove (11) and (12).
In the next lemma, $Z_i(\beta)$ is taken to be $Z_{iw}(\beta)$, with known or estimated selection probability, or $Z_{iI}(\beta)$, and $B = \tau(1 - \tau)E\{\omega(X) XX^T\}$ with $\omega(x) = 1/p(x)$ when $Z_i(\beta)$ is one of the two weighted versions, and $\omega(x) = 1$ when $Z_i(\beta) = Z_{iI}(\beta)$.
Proof. (a) When $Z_i(\beta) = Z_{iw}(\beta)$ for $i = 1, 2, \ldots, n$, some simple calculation and the law of large numbers give the result immediately. The remaining cases follow from (16) and the same methods as in (a) and (b), and therefore we can obtain the result.
Lemma 6.4. Suppose that Conditions C1-C6 hold. If $\theta$ is the true parameter of model (1), then (19), (20) and (21) hold.
Proof. We prove (19) only; (20) and (21) can be proved similarly. The quantity in (19) is decomposed into several terms, and by Lemma 6.1 it is easy to show that $A_{22} = o_p(1)$.
For the empirical log-likelihood ratios, simple calculation yields $l(\beta) = 2\sum_{i=1}^n \log\{1 + \lambda^T(\beta) Z_i(\beta)\}$, where $\lambda(\beta)$ is a $d \times 1$ vector given as the solution of the equation $\frac{1}{n}\sum_{i=1}^n \frac{Z_i(\beta)}{1 + \lambda^T(\beta) Z_i(\beta)} = 0$. According to Lemma 6.2 and the arguments in the proof of (2.14) in Owen [16], the order of $\lambda(\beta)$ can be bounded. Applying the Taylor expansion to (23) and invoking the preceding lemmas, the conclusion follows.
Proof of Theorem 3.2. The required expansion follows from (9) and (10). This together with (19) proves Theorem 3.2.