Efficient estimation in expectile regression using envelope models

Abstract: As a generalization of the classical linear regression, expectile regression (ER) explores the relationship between the conditional expectile of a response variable and a set of predictor variables. ER with respect to different expectile levels can provide a comprehensive picture of the conditional distribution of the response variable given the predictors. We adopt an efficient estimation method called the envelope model ([7]) in ER, and construct a novel envelope expectile regression (EER) model. Estimation of the EER parameters can be performed using the generalized method of moments (GMM). We establish the consistency and derive the asymptotic distribution of the EER estimators. In addition, we show that the EER estimators are asymptotically more efficient than the ER estimators. Numerical experiments and real data examples are provided to demonstrate the efficiency gains attained by EER compared to ER, and the efficiency gains can further lead to improvements in prediction.


Introduction
The classical linear regression is a commonly used method when we want to study the relationship between a response variable Y and a p-dimensional predictor vector X. It depicts the linear dependence between the conditional mean of Y and X. However, it assumes that the conditional distribution of Y given X is homoscedastic, which is not always satisfied in real datasets. While the conditional mean gives important information on the conditional distribution of Y given X, it does not provide a complete picture of the distribution, especially for skewed distributions arising from price or income data. As a result, quantile regression (QR), which can overcome these limitations, has gained considerable interest in recent years. QR studies the conditional quantile of Y given X. It is a distribution-free method and does not impose any assumption on the conditional distribution of the response Y. Investigation of multiple quantile levels gives us a comprehensive description of the distribution of Y conditional on X.
An alternative to QR is expectile regression (ER). The idea of expectiles was first studied in [32]. The πth expectile of Y, denoted by f_π(Y), is defined as

f_π(Y) = arg min_{f ∈ R} E{φ_π(Y − f)},

where φ_π(z) = |π − I(z < 0)| z^2 is called the asymmetric least squares loss function. When π = 0.5, φ_π(z) is the least squares loss and f_π(Y) = E[Y]. ER studies the conditional expectile of the response Y given the predictors X. It has been explored extensively in the statistics and econometrics literature. [53] proposed a local linear polynomial estimator of the conditional expectiles with a one-dimensional predictor and established the corresponding asymptotic properties. [25] developed the conditional autoregressive expectile models (CARE), which investigated the impact of previous asset returns on the conditional expectile-based Value at Risk (EVaR) of current asset returns, and allowed different effects from positive and negative returns. They established the asymptotic normality of the CARE estimators, and extended the results in [32] to stationary and weakly dependent data. [50] proposed a varying-coefficient expectile model to estimate the conditional EVaR of asset returns. This approach allows the coefficients to change with an effect-modifying risk factor and provides a more flexible way to model the data. They showed that the varying-coefficient expectile model yields more stable estimators and more accurate prediction intervals compared to the CARE models. In recent years, many advances have taken place on model selection in expectile regression. [19] studied sparse expectile regression under high-dimensional settings where the penalty functions include the Lasso and nonconvex penalties. [43] introduced a variety of model selection methods in semiparametric expectile regression. [28] provided asymptotic distributions of penalized expectile regression with SCAD and adaptive LASSO penalties for both i.i.d. and non-i.i.d. random errors.
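To make the definition concrete, the π-expectile of a sample can be computed by an iteratively reweighted average, since the minimizer of the empirical asymmetric least squares loss satisfies a weighted-mean fixed point. The following numpy sketch is our own illustration (the name `expectile` is ours), not the paper's implementation:

```python
import numpy as np

def expectile(y, pi, tol=1e-10, max_iter=100):
    """pi-th expectile of a sample: the minimizer of the empirical
    asymmetric least squares loss phi_pi(z) = |pi - I(z < 0)| z^2.
    Illustrative sketch, not the paper's implementation."""
    y = np.asarray(y, dtype=float)
    f = y.mean()  # start at the mean, which is the 0.5-expectile
    for _ in range(max_iter):
        # asymmetric weights |pi - I(y_i - f < 0)| at the current fit
        w = np.where(y < f, 1.0 - pi, pi)
        f_new = np.sum(w * y) / np.sum(w)
        if abs(f_new - f) < tol:
            break
        f = f_new
    return f

y = [1.0, 2.0, 3.0, 4.0, 10.0]
m = expectile(y, 0.5)  # for pi = 0.5 the weights are constant: the sample mean
```

For π = 0.5 every observation receives the same weight, so the iteration returns the sample mean, matching f_0.5(Y) = E[Y].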
In addition, semiparametric ([37,41,40]) and nonparametric ([52,51,16,22]) expectile estimation methods have been proposed in the literature.
There are many connections between QR and ER. Similar to QR, ER does not impose any distributional assumption on the response Y, and it can provide complete information on the conditional distribution of Y. Therefore, both QR and ER are able to overcome the previously mentioned limitation of the classical linear regression. Moreover, [53] pointed out that there exists a one-to-one mapping between expectiles and quantiles. Hence ER can be interpreted as a flexible QR. In addition, expectiles can be used to estimate a quantile-based risk measure called expected shortfall [46]. However, at the same time, we should not ignore some different properties of QR and ER. Recall that the αth quantile of the response Y, denoted by q_α(Y), is defined as q_α(Y) = arg min_{q ∈ R} E{τ_α(Y − q)}, where τ_α(z) = |α − I(z < 0)| |z| is the check loss function. Therefore quantiles minimize the expected check loss τ_α while expectiles minimize the expected asymmetric least squares loss φ_π. The check loss function is not differentiable, while the asymmetric least squares loss function is differentiable because of the quadratic term. The difference in the loss functions gives QR and ER their respective advantages. The main advantage of QR is that its results are easier to interpret and it is more robust to outliers than ER, while the main advantage of ER over QR is that ER is more computationally friendly, especially in a semiparametric model, as pointed out by [41]. Moreover, ER is more sensitive to the extreme values in datasets because ER takes the distance between an observation and the estimated expectile into account, while QR only considers whether an observation is greater or less than the estimated quantile. This characteristic makes ER more desirable in many applications. One example is on the risk measures in the fields of econometrics and finance. Value at Risk (VaR) is a popular measure for evaluating portfolio risk based on quantiles.
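The two loss functions discussed above are easy to compare side by side. The sketch below is illustrative only (the names `check_loss` and `als_loss` are ours); it implements τ_α and φ_π as defined in the text, and the quadratic term is what makes φ_π differentiable where the check loss has a kink at zero:

```python
import numpy as np

def check_loss(z, alpha):
    # QR check loss: tau_alpha(z) = |alpha - I(z < 0)| |z|
    z = np.asarray(z, dtype=float)
    return np.abs(alpha - (z < 0)) * np.abs(z)

def als_loss(z, pi):
    # ER asymmetric least squares loss: phi_pi(z) = |pi - I(z < 0)| z ** 2
    z = np.asarray(z, dtype=float)
    return np.abs(pi - (z < 0)) * z ** 2

# both losses weight positive residuals by pi (resp. alpha) and negative
# residuals by 1 - pi, but the ALS loss grows quadratically, so large
# residuals are penalized much more heavily than under the check loss
```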
It provides crucial information about the potential loss; however, it is insensitive to the severity of more extreme realizations since it does not depend on the tail shape of the distribution ([11,15]). [25] proposed the risk measure EVaR and indicated that EVaR can reflect the magnitude of extreme losses for the underlying risk. Therefore, given a set of risk factors, studying the conditional EVaR rather than the conditional VaR may lead to a more appropriate response to a catastrophic loss. Finally, for the special case with α = 0.5, QR degenerates to the conditional median regression, and for the special case with π = 0.5, ER degenerates to the standard linear regression.
In this paper we propose a new ER method, the envelope expectile regression (EER), for efficient estimation of the parameters in ER. It is motivated by the fact that some linear combinations of the predictors may be irrelevant to the distribution of the response. Our method takes this data structure into account, which results in more efficient estimation compared with the standard ER. We call those irrelevant combinations the immaterial part and the remaining part the material part. To identify the material part and the immaterial part, we employ a nascent technique called the envelope method. The immaterial part is then excluded from the subsequent analysis, leading to efficiency gains in estimation. Note that the immaterial part is different from the subset of inactive predictors in the popular penalized variable selection methods. Both the penalized variable selection methods and the envelope method reduce the number of free parameters in the model, but they are based on different assumptions. The penalized variable selection methods assume that a subset of individual predictor variables are irrelevant to the regression and thus have zero coefficients, while the envelope method assumes that some linear combinations of the predictors are irrelevant to the regression but all predictors can have nonzero coefficients. There are different application scenarios for the penalized variable selection methods and the envelope method. For example, suppose that a biologist wants to analyze the relationship between genes and a certain disease, given a dataset with measurements on the severity of the disease (response) and the expression levels of different genes (predictors). If the biologist believes only a few genes in the dataset have effects on the disease, then he would apply the penalized variable selection methods to identify those genes.
On the other hand, if the biologist believes that each gene in the dataset is related to the disease in some way, then he would use dimension reduction methods such as principal component regression, partial least squares or the envelope model to find the linear combination(s) of the genes that affect the disease. The envelope method was first proposed in [8] under the context of multivariate linear regression for efficient estimation. It has since been applied to many contexts in multivariate analysis including partial least squares ([6]), generalized linear models ([9]), elliptical linear regression ([17]), reduced rank regression ([7]), variable selection ([45]), Bayesian analysis ([24]), matrix- and tensor-valued responses ([12,27]), and quantile regression ([13]). In this paper, we apply the idea of the envelope method to expectile regression. The parameters are estimated by the generalized method of moments (GMM; [21]). We demonstrate both theoretically and numerically that the EER estimators have smaller asymptotic variance than the ER estimators. This paper is organized as follows. We derive the EER model in Section 2. Estimation of the EER model is discussed in Section 3. In Section 4, we investigate the theoretical properties of the EER estimators, and prove that the EER estimators are asymptotically more efficient than the ER estimators. We use simulations to demonstrate the performance of the EER model in estimation and prediction in Section 5. State murder rate data and S&P 500 index data are analyzed in Section 6 as examples. The extension to semiparametric settings is discussed in Section 7. Section 8 concludes the paper with a discussion. All technical proofs are deferred to the supplementary material [5].
The following notations will be used in our discussion. Let R^{p×u} be the set of all p × u matrices. For any matrix A ∈ R^{p×u}, span(A) represents the subspace spanned by the columns of A. Let P_A be the projection matrix onto span(A) and Q_A = I_p − P_A be the projection matrix onto its orthogonal complement span(A)^⊥. For any subspace S in R^p, P_S represents the projection matrix onto S and Q_S represents the projection matrix onto S^⊥. Let "vec" represent the vectorization operator that vectorizes a matrix column-wise and "vech" represent the half-vectorization operator that vectorizes the lower triangle of a symmetric matrix. We use ‖ · ‖ to represent the Frobenius norm and † to denote the Moore-Penrose inverse. We write →_d for convergence in distribution and →_p for convergence in probability. Moreover, a subspace R in R^p is a reducing subspace of a matrix M ∈ R^{p×p} if and only if M can be decomposed as M = P_R M P_R + Q_R M Q_R.
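For readers who prefer code, the vec, vech and projection operators above can be sketched in a few lines of numpy (an illustration under our own naming, not part of the paper):

```python
import numpy as np

def vec(A):
    # "vec": stack the columns of A into a single vector
    return np.asarray(A).flatten(order="F")

def vech(S):
    # "vech": stack the lower triangle (with diagonal) column-wise
    S = np.asarray(S)
    return np.concatenate([S[j:, j] for j in range(S.shape[0])])

def proj(A):
    # projection matrix onto span(A): P_A = A (A^T A)^{-1} A^T
    A = np.asarray(A, dtype=float)
    return A @ np.linalg.inv(A.T @ A) @ A.T
```

P_A is symmetric and idempotent, and Q_A = I − P_A projects onto span(A)^⊥.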

Expectile regression
Consider a univariate response Y and a p-dimensional predictor vector X. The standard ER models the πth conditional expectile of Y as a linear function of X:

f_π(Y | X) = μ_π + β_π^T X,   (2.1)

where f_π(Y | X) represents the πth conditional expectile of Y given X, μ_π is the intercept and β_π ∈ R^p contains the coefficients. When π = 0.5, f_0.5(Y | X) degenerates to the conditional mean E(Y | X) and ER degenerates to the standard linear regression.
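As an illustration of model (2.1), a linear expectile regression can be fitted by iteratively reweighted least squares; this is a common simple algorithm, not the GMM procedure the paper develops later, and all names below are ours:

```python
import numpy as np

def expectile_reg(X, y, pi, tol=1e-10, max_iter=200):
    """Fit f_pi(Y|X) = mu_pi + beta_pi^T X by iteratively reweighted
    least squares. A simple illustrative algorithm; the paper itself
    estimates the envelope version by GMM."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    W = np.column_stack([np.ones(n), X])         # design with intercept
    coef = np.linalg.lstsq(W, y, rcond=None)[0]  # start from OLS
    for _ in range(max_iter):
        r = y - W @ coef
        w = np.where(r < 0, 1.0 - pi, pi)        # asymmetric weights
        coef_new = np.linalg.solve(W.T @ (w[:, None] * W), W.T @ (w * y))
        if np.max(np.abs(coef_new - coef)) < tol:
            coef = coef_new
            break
        coef = coef_new
    return coef  # (mu_pi, beta_pi)
```

For π = 0.5 the weights are constant and the procedure reduces to ordinary least squares, mirroring the degeneracy of (2.1) to linear regression.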

Envelope expectile regression
EER is derived from the motivation that certain linear combinations of the predictors are irrelevant to some conditional expectiles, e.g. E(Y|X), of the response. For example, a stock index may be related to only a few combinations of the economic factors, while these combinations are uncorrelated with the other combinations that are not responsible for the variation in the index. Following this observation, we suppose that there exists a subspace S_π in the full predictor space R^p such that the πth conditional expectile of the response Y is related to the predictor vector X only through P_{S_π} X. Specifically, we assume that

(2.3a)  f_π(Y | X) = f_π(Y | P_{S_π} X),
(2.3b)  cov(P_{S_π} X, Q_{S_π} X) = 0.

The conditions in (2.3) imply that Q_{S_π} X does not provide any information about the πth conditional expectile of Y, either by itself or through its association with P_{S_π} X. We call P_{S_π} X the material part of X and Q_{S_π} X the immaterial part of X. When π = 0.5, the conditions in (2.3) are closely connected with partial least squares (PLS) regression ([1,4,23]). For general π ∈ (0, 1), the conditions in (2.3) establish a general asymmetric envelope-based PLS regression framework, which improves estimation efficiency and enhances prediction performance over ER both theoretically and numerically, as evidenced in Sections 4-6.
Under the parameterization of ER (2.1), condition (2.3a) is equivalent to (i) β π ∈ S π , and condition (2.3b) holds if and only if (ii) S π is a reducing subspace of Σ X ([8]), where Σ X is the covariance matrix of X. If we find such a subspace S π , we can identify the immaterial part Q Sπ X when evaluating the relationship between f π (Y |X) and X. Then estimation efficiency gains can be achieved by accounting for the immaterial variation in subsequent analysis. There may exist more than one subspace satisfying (i) and (ii). For instance, the full space R p always satisfies (i) and (ii). To achieve the most efficiency gains, we focus on the smallest subspace (i.e., the subspace with smallest dimension) that satisfies (i) and (ii), such that we can identify all the immaterial information. We call this subspace the Σ X -envelope of β π , and denote it by E Σ X (β π ) or E π if it appears in subscripts. The dimension of the envelope subspace E Σ X (β π ) is denoted by u π (0 ≤ u π ≤ p). [8] discussed the existence and uniqueness of the envelope subspace E Σ X (β π ).
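The reducing-subspace condition (ii) is easy to check numerically: S_π reduces Σ_X exactly when Σ_X = P_{S_π} Σ_X P_{S_π} + Q_{S_π} Σ_X Q_{S_π}, or equivalently P_{S_π} Σ_X Q_{S_π} = 0. The numpy sketch below (our illustration, with hypothetical dimensions) verifies this for a subspace spanned by eigenvectors of Σ_X, which always reduces Σ_X, and shows that a generic random subspace does not:

```python
import numpy as np

rng = np.random.default_rng(0)
p, u = 5, 2

# a generic covariance matrix
B = rng.standard_normal((p, p))
Sigma = B @ B.T

# a subspace spanned by eigenvectors of Sigma: always a reducing subspace
evecs = np.linalg.eigh(Sigma)[1]
G = evecs[:, :u]                 # orthonormal basis of S
P = G @ G.T                      # projection onto S
Q = np.eye(p) - P
assert np.allclose(Sigma, P @ Sigma @ P + Q @ Sigma @ Q)

# a random subspace, in contrast, almost surely does not reduce Sigma
R = rng.standard_normal((p, u))
PR = R @ np.linalg.inv(R.T @ R) @ R.T
QR = np.eye(p) - PR
assert not np.allclose(Sigma, PR @ Sigma @ PR + QR @ Sigma @ QR)
```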
Before deriving the EER model, we first discuss the parameterization of the envelope subspace E_{Σ_X}(β_π). The envelope subspace can be determined by its basis. However, there can be many bases of E_{Σ_X}(β_π). To make the parameters identifiable, we define one representative basis Γ_π for each envelope subspace. Take an arbitrary basis G_π of the envelope subspace E_{Σ_X}(β_π). Since the dimension of the envelope subspace E_{Σ_X}(β_π) is u_π, G_π has rank u_π and we can find u_π rows in G_π that constitute a nonsingular matrix. If there are multiple combinations of such u_π rows, we take the rows with smallest indices. Without loss of generality, we assume that the first u_π rows of G_π constitute a nonsingular matrix, and we write G_π = (G_{π1}^T, G_{π2}^T)^T, where G_{π1} ∈ R^{u_π×u_π} is nonsingular. The representative basis is then defined as Γ_π = G_π G_{π1}^{-1} = (I_{u_π}, A^T)^T, where A = G_{π2} G_{π1}^{-1} ∈ R^{(p−u_π)×u_π}. It can be shown that the representative basis defined as above is unique for E_{Σ_X}(β_π). Therefore we have established a one-to-one correspondence between A and the envelope subspace E_{Σ_X}(β_π): if we know A, we can completely determine E_{Σ_X}(β_π). The following lemma shows that a basis of the orthogonal complement of E_{Σ_X}(β_π) can also be expressed as a function of A, which is useful for establishing the EER model. The proof is given in Section 1.1 of the supplementary material.
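The construction of the representative basis can be mirrored in code: scale an arbitrary basis G_π by the inverse of its leading u_π × u_π block so that the result has the form (I_{u_π}, A^T)^T while spanning the same subspace. A numpy sketch (our illustration, with hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
p, u = 5, 2

G = rng.standard_normal((p, u))   # an arbitrary rank-u basis
G1 = G[:u, :]                     # leading u x u block, assumed nonsingular
Gamma = G @ np.linalg.inv(G1)     # representative basis (I_u, A^T)^T
A = Gamma[u:, :]                  # the free (p - u) x u parameter block

assert np.allclose(Gamma[:u, :], np.eye(u))  # top block is the identity
assert np.allclose(Gamma @ G1, G)            # same column span as G
```

Because right-multiplication by a nonsingular matrix preserves the column span, Gamma and G span the same subspace, and the subspace is fully encoded by A.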
Using the representative bases Γ_π and Γ_{0π}, we reparametrize β_π and Σ_X to derive the EER model. Since β_π ∈ E_{Σ_X}(β_π), there exists a u_π-dimensional coordinate vector η_π such that β_π = Γ_π η_π. Because E_{Σ_X}(β_π) is a reducing subspace of Σ_X, Σ_X can be decomposed as Σ_X = P_{E_π} Σ_X P_{E_π} + Q_{E_π} Σ_X Q_{E_π}, where P_{E_π} Σ_X P_{E_π} is the variance of the material part P_{E_π} X and Q_{E_π} Σ_X Q_{E_π} is the variance of the immaterial part Q_{E_π} X. With Γ_π and Γ_{0π} defined in Lemma 1, the coordinate form of the EER model is as follows:

f_π(Y | X) = μ_π + η_π^T Γ_π^T X,   Σ_X = Γ_π Ω_π Γ_π^T + Γ_{0π} Ω_{0π} Γ_{0π}^T.   (2.4)

In the EER model, we can see that the predictor vector X affects the conditional expectile f_π(Y|X) only through its linear combinations Γ_π^T X. The number of the linear combinations is u_π because Γ_π has u_π columns. Therefore, u_π represents the number of relevant linear combinations of the predictors, rather than the number of active predictors as in penalized models. The u_π-dimensional vector η_π carries the coordinates of β_π relative to Γ_π. The positive definite matrix Ω_π ∈ R^{u_π×u_π} carries the coordinates of Σ_X relative to Γ_π, and the positive definite matrix Ω_{0π} ∈ R^{(p−u_π)×(p−u_π)} carries the coordinates of Σ_X relative to Γ_{0π}. The parameter vector in the EER model is then

ζ = (μ_π, η_π^T, vec(A)^T, vech(Ω_π)^T, vech(Ω_{0π})^T, μ_X^T)^T.

Here, we include μ_X in the parameter vector since it is involved in the estimating equations in Section 3. The total number of parameters is 1 + u_π + p + p(p + 1)/2. The parameter count is as follows: μ_π has 1 parameter, η_π has u_π parameters, A has u_π(p − u_π) parameters, Ω_π and Ω_{0π} are both symmetric matrices with u_π(u_π + 1)/2 and (p − u_π)(p − u_π + 1)/2 parameters respectively, and μ_X has p parameters. When u_π = p, the EER model degenerates to the ER model (2.1). The parameters in ER are μ_π and β_π. In this paper, we also consider Σ_X and μ_X as parameters in ER to make it comparable with the EER model when the asymptotic covariance of the estimators is concerned.
The parameter vector in ER is θ = (μ_π, β_π^T, vech(Σ_X)^T, μ_X^T)^T, and the total number of parameters is 1 + 2p + p(p + 1)/2. Comparing the numbers of parameters, we can see that the EER model reduces the number of parameters by p − u_π.
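The parameter counts above can be verified mechanically. The sketch below (our helper names) encodes both counts and confirms that the reduction is exactly p − u_π for every admissible u_π:

```python
def n_params_er(p):
    # mu_pi (1) + beta_pi (p) + vech(Sigma_X) (p(p+1)/2) + mu_X (p)
    return 1 + 2 * p + p * (p + 1) // 2

def n_params_eer(p, u):
    # mu_pi (1) + eta_pi (u) + A (u(p-u))
    # + Omega_pi (u(u+1)/2) + Omega_0pi ((p-u)(p-u+1)/2) + mu_X (p)
    return (1 + u + u * (p - u)
            + u * (u + 1) // 2 + (p - u) * (p - u + 1) // 2 + p)

# the reduction in free parameters is exactly p - u_pi
for p in range(1, 10):
    for u in range(0, p + 1):
        assert n_params_er(p) - n_params_eer(p, u) == p - u
```

The identity holds because the blocks A, Ω_π and Ω_{0π} together contribute exactly p(p + 1)/2 parameters, the same as vech(Σ_X).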

Estimation
In this section, we derive the EER estimators. Since there is no distributional assumption on the predictors or the response, maximum likelihood estimation is not applicable for the EER model. Instead we adopt the generalized method of moments (GMM; [21]) to obtain the EER estimators. We first present the estimating equations under ER, and then reparameterize the equations for the estimation of the EER model. Let Z = (X^T, Y)^T and W = (1, X^T)^T, and let ψ_π(z) = |π − I(z < 0)| z denote the derivative of the asymmetric least squares loss φ_π up to the constant 2. The estimating equations under ER are constructed from (2.2) and the moment conditions of X:

(1/n) Σ_{i=1}^n ψ_π(Y_i − μ_π − β_π^T X_i) W_i = 0,
(1/n) Σ_{i=1}^n vech{(X_i − μ_X)(X_i − μ_X)^T − Σ_X} = 0,   (3.1)
(1/n) Σ_{i=1}^n (X_i − μ_X) = 0.

Then we can build a map ψ between the EER parameter vector ζ and the ER parameter vector θ, i.e., ψ(ζ) = θ, and reparameterize (3.1) to get the estimating equations for the EER model:

e_n^*(ζ) = (1/n) Σ_{i=1}^n s(Z_i; ψ(ζ)) = 0,   (3.2)

where s(Z; θ) is the population version of the moment conditions in (3.1):

s(Z; θ) = ( ψ_π(Y − μ_π − β_π^T X) W^T, vech{(X − μ_X)(X − μ_X)^T − Σ_X}^T, (X − μ_X)^T )^T.   (3.3)

Besides ζ, the dimension of the envelope subspace u_π is also an important parameter. Here we first discuss the estimation of ζ assuming u_π is known. In the estimating equations (3.2), there are 1 + 2p + p(p + 1)/2 equations and 1 + u_π + p + p(p + 1)/2 unknown parameters. As a result, it is possible that no exact solution exists. We then apply the GMM approach ([21]) to obtain the EER estimator ζ̂. The estimation procedure can be summarized in the following steps.
Step 1: Obtain a preliminary estimator ζ̃ by minimizing e_n^*(ζ)^T e_n^*(ζ), i.e., GMM with the identity weight matrix.

Step 2: Compute the scale matrix Δ̂ = {(1/n) Σ_{i=1}^n s(Z_i; ψ(ζ̃)) s(Z_i; ψ(ζ̃))^T}^{−1}.

Step 3: Obtain the GMM estimator ζ̂ by minimizing e_n^*(ζ)^T Δ̂ e_n^*(ζ).

Once we obtain ζ̂, the EER estimators of β_π and Σ_X are β̂_π = Γ̂_π η̂_π and Σ̂_X = Γ̂_π Ω̂_π Γ̂_π^T + Γ̂_{0π} Ω̂_{0π} Γ̂_{0π}^T. We use θ̃ to denote the resulting EER estimator of θ. An analysis of the computational burden of the estimation procedure is given in Section 8 of the Supplement.
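The stepwise procedure above is the standard two-step GMM recipe: a preliminary fit with the identity weight matrix, then an efficient fit weighted by the inverse covariance of the moments. The toy numpy sketch below applies the same recipe to a deliberately simple, overidentified linear moment problem with a single unknown coefficient, so every step has a closed form; it is not the EER estimating equations, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.standard_normal(n)
e = rng.standard_normal(n)
y = 1.5 * x + e                          # true coefficient: 1.5

# two instruments for one unknown -> overidentified moment conditions
# E[ z (y - theta x) ] = 0 for z in {x, sign(x)}
Z = np.column_stack([x, np.sign(x)])
d = Z.T @ x / n                          # "slope" of the sample moments
c = Z.T @ y / n                          # sample moments are c - d * theta

# Step 1: preliminary estimator with identity weight matrix
theta1 = (d @ c) / (d @ d)

# Step 2: scale matrix = inverse covariance of the moments at theta1
Gm = Z * (y - theta1 * x)[:, None]
Delta = np.linalg.inv(Gm.T @ Gm / n)

# Step 3: efficient GMM estimate minimizing (c - d t)^T Delta (c - d t)
theta2 = (d @ Delta @ c) / (d @ Delta @ d)
```

In the EER setting the minimizations in Steps 1 and 3 have no closed form and are performed numerically over ζ.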

Remark 2.
In some envelope literature, e.g., [8], a different parameterization is adopted for Γ_π: Γ_π is required to be a semi-orthogonal matrix, i.e., Γ_π^T Γ_π = I_{u_π}. In this case, span(Γ_π) is a point on a p × u_π Grassmann manifold, where a p × u_π Grassmann manifold is the set of all u_π-dimensional subspaces of a p-dimensional space. The above procedure can still be used to estimate the parameters, except that Step 1 and Step 3 involve Grassmann manifold optimization, which can be complicated and slow in sizable problems. For more information about Grassmann manifold optimization algorithms, please refer to [14,29].

Now we discuss the selection of u_π. Similar to other dimension reduction based methods (such as principal component analysis, partial least squares or reduced rank regression), model selection for the EER model (2.4) is essentially the selection of the dimension u_π. In existing envelope models and methods, the dimension u_π is usually chosen by AIC, BIC or likelihood ratio testing. Since AIC, BIC and likelihood ratio testing all require a likelihood function, they are not applicable in the context of the EER model. As a result, we adopt a nonparametric method, robust cross validation (RCV; [33]), for the selection of u_π. Following [18, page 244], we use the "one-standard-error" rule with RCV to choose the most parsimonious model which has about the same predictive accuracy as the best model. RCV is performed in the following steps:

Step 1: Randomly split the data into K folds. Usually K takes the value of 5 or 10. Successively use the kth fold for testing and the remaining folds for training, k = 1, . . . , K.

Step 2: For each candidate dimension ũ_π (0 ≤ ũ_π ≤ p), compute the mean expectile loss

RCV(ũ_π) = (1/n) Σ_{i=1}^n φ_π(Y_i − μ̂_{π,−k(i)} − β̂_{π,−k(i)}^T X_i),

where μ̂_{π,−k(i)} and β̂_{π,−k(i)} are the EER estimators using the data excluding the kth fold that contains the ith observation.
Step 3: Instead of choosing the ũ_π which achieves the smallest mean expectile loss RCV(ũ_π), we select the smallest û_π whose mean expectile loss is less than one standard error above the minimum of RCV(ũ_π).
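The one-standard-error selection rule in Step 3 can be sketched independently of the EER machinery. In the illustration below (our function, taking hypothetical per-observation cross-validated losses as input), we pick the smallest dimension whose mean loss is within one standard error of the best mean loss:

```python
import numpy as np

def one_se_select(losses_by_dim):
    """Pick the smallest candidate dimension whose mean cross-validated
    loss is within one standard error of the best mean loss.
    `losses_by_dim[u]` holds per-observation expectile losses for u."""
    dims = sorted(losses_by_dim)
    means = {u: float(np.mean(losses_by_dim[u])) for u in dims}
    best = min(dims, key=lambda u: means[u])
    n = len(losses_by_dim[best])
    se = np.std(losses_by_dim[best], ddof=1) / np.sqrt(n)
    for u in dims:                       # smallest dimension first
        if means[u] <= means[best] + se:
            return u
    return best
```

When a smaller dimension is statistically indistinguishable from the minimizer, the rule prefers the smaller (more parsimonious) model.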
We provide an implementation of the EER model in the R package expectEnv which is available at https://github.com/chentuo1993/expectEnv. Using the GMM approach, this package provides the EER estimator for parameters ζ and θ. It also implements RCV for the selection of the envelope dimension u π and computes bootstrap standard deviations for the EER estimators.

Theoretical results
In this section, we prove that the EER estimator θ̃ is asymptotically more efficient than, or as efficient as, the ER estimator θ̂. We first derive the asymptotic distribution of θ̂. [32] proved that the ER estimator β̂_π is consistent and asymptotically normal under some regularity conditions. We extend this result from β̂_π to θ̂ in Theorem 1. Next we establish the asymptotic distribution of θ̃ in Theorem 2, and show that the EER estimator is at least as efficient as the ER estimator by comparing the asymptotic covariance matrices. Because of the nonsmoothness in the estimating equations and because the EER model does not impose any assumptions on the distribution of Y, the traditional likelihood approach in the envelope literature cannot be applied to derive the asymptotic distribution of θ̃. Furthermore, there is over-parameterization in the estimating equations in the sense that the number of equations is greater than the number of parameters. The theoretical tools we use to overcome these issues are Theorem 2.1 in [31] and Proposition 4.1 in [39]. To simplify notation, we hereafter use avar(√n θ̂) to denote the asymptotic covariance matrix of θ̂ and avar(√n θ̃) to denote the asymptotic covariance matrix of θ̃.

Theorem 1. Assume the above regularity conditions (C1)-(C5) are satisfied, then we have
The proof of Theorem 1 is given in Section 1.2 of the supplementary material. Since both C and G are block diagonal, θ̂_1 = (μ̂_π, β̂_π^T)^T is asymptotically independent of θ̂_2 = (vech(Σ̂_X)^T, μ̂_X^T)^T. Theorem 1 provides the asymptotic distribution for all the parameters, and the asymptotic distribution of β̂_π agrees with the results in Theorem 3 of [32]. Now we establish the asymptotic distribution of the EER estimator.
The proof of Theorem 2 is given in Section 1.3 of the supplementary material. Theorem 2 indicates that the EER estimator is √n-consistent and asymptotically normal. Since we have the explicit forms of the asymptotic covariance matrices of the ER estimator and the EER estimator, we can compare the efficiency of the two estimators.

Corollary 1. Under the same conditions as in Theorem 2, avar(√n θ̃) ≤ avar(√n θ̂), where θ̂ and θ̃ denote the ER and the EER estimators, respectively.
The proof of Corollary 1 is given in Section 1.4 of the supplementary material. Corollary 1 asserts that the EER estimator is asymptotically more efficient than or as efficient as the ER estimator.

Simulations
In this section, we demonstrate the estimation efficiency gains of the EER model via numerical experiments. In the simulations and the examples in Section 6, we consider a set of expectile levels of the distribution, and estimation is performed for one level at a time instead of all levels simultaneously. Schnabel and Eilers [38] describe how to determine the complete distribution from a set of expectiles using penalized least squares.
We varied the sample size n from 50 to 800. For each sample size, 100 replications were generated. For each replication, we computed the EER estimator, the ER estimator, the boosting estimator (componentwise gradient boosting expectile regression assuming each predictor has a linear effect) as well as the sparse ER estimator ([19]) of β_π. The boosting estimator was computed by the R package expectreg [42] with its default settings, i.e., the maximum number of boosting iterations is 4000, and cross validation is used to determine the optimal number of boosting iterations between 1 and 4000. The sparse ER estimator was computed by the R package SALES [20]. We used the default choice of the tuning parameter, i.e., the largest value whose cross validation error is within one standard error of the minimum cross validation error. An alternative choice of the tuning parameter is the value that gives the minimum cross validation error. Results based on this alternative choice are included in Section 10 of the supplementary materials. For each element in β_π, we computed the sample standard deviation of the 100 EER estimators, 100 ER estimators, 100 boosting estimators and 100 sparse ER estimators. We took expectile levels 0.10, 0.25, 0.50, 0.75 and 0.90 as examples. The results of a randomly chosen element in β_π with π = 0.50 and π = 0.90 are summarized in Figure 1. They reflect the mean and the upper tail properties of the response. The results of other expectile levels are given in Section 7 of the supplementary materials. Figure 1 shows substantial efficiency gains achieved by the EER model in the estimation of β_π. For all error distributions and expectile levels π, the sample standard deviations of the EER estimators are much smaller than the sample standard deviations of the ER estimators, the boosting estimators and the sparse ER estimators for all sample sizes.
Take the first plot as an example (normal errors and π = 0.5): with sample size 200, the standard deviation of the EER estimator is already smaller than the asymptotic standard deviation of the ER estimator. The asymptotic standard deviation of the ER estimator in this case is 0.27 and that of the EER estimator is 0.12. We notice that the boosting estimator and the sparse ER estimator have about the same sample standard deviations as the ER estimator. This is because the variation from the immaterial part affects these two methods in a similar way as it affects ER. The sample standard deviations of the ER estimators and the EER estimators both approach their corresponding asymptotic standard deviations as n increases, which confirms the asymptotic distributions established in Theorems 1 and 2.
Moreover, we computed the bootstrap standard deviations of the four estimators for each component in β_π using the paired bootstrap method with 100 bootstrap repetitions. For demonstration purposes, we use the normal error as an example and include the results for π = 0.5 and π = 0.9 in Figure 2. The results indicate that the bootstrap standard deviation is a good approximation to the actual sample standard deviation. Therefore we use the bootstrap standard deviation as an estimator of the actual standard deviation in the data analysis in Section 6.

Next we compared the prediction performance of the EER model with the ER model, the boosting model and the sparse ER model. Under the same simulation settings, we fixed the sample size at n = 800 and generated 300 replications. For each replication, 400 samples were used for training and another 400 samples were used for testing. With each of the four models, we first fitted the model to the training samples. Then for each (X_i, Y_i) in the testing samples, we computed f̂_π(Y_i | X_i), the predicted πth conditional expectile of Y_i given X_i. The prediction performance is measured by the root mean square error (RMSE), defined as the square root of the average squared difference between the predicted and the true conditional expectiles over the testing samples. Table 1 and Figures 3-6 summarize the RMSEs under the four models with different error distributions. The EER estimator shows superior prediction performance in all the cases. The boosting estimator and the ER estimator have comparable performance. Among the four models, the prediction performance of the sparse ER model is the worst because it tends to fit an overly sparse model, which introduces bias into its estimator. Taking π = 0.5 as an example, EER reduces the average RMSE by 37.7% to 46.3% compared to the ER estimator, 37.7% to 46.5% compared to the boosting estimator, and 67.4% to 76.4% compared to the sparse ER estimator.
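For reference, the RMSE criterion used above can be written as a one-line helper (our naming; it assumes the true conditional expectiles are available, as in a simulation):

```python
import numpy as np

def expectile_rmse(f_hat, f_true):
    # RMSE between predicted and true conditional expectiles
    f_hat = np.asarray(f_hat, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    return float(np.sqrt(np.mean((f_hat - f_true) ** 2)))
```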
As shown in Section 8 of the supplementary material, the computational complexity of the EER estimator is a polynomial function of the number of predictors p. Here we outline a comparison of the run times of the ER estimator, the EER estimator, the sparse ER estimator and the boosting estimator under the same setting that produced Table 1 with N(0, 1) errors and π = 0.10. Table 2 displays the average run time for each estimator over the 300 replications. We notice that the ER estimator takes the least time to compute. Compared to the ER estimator, the EER estimator takes about three times longer to compute (assuming u_π is known), the sparse ER estimator takes about 30 times longer and the boosting estimator takes about 700 times longer. The trend is similar for other error distributions and expectile levels.
In addition, we examined the performance of RCV in the selection of u_π. In the same settings that generated Figure 1, we applied RCV to choose the envelope dimension u_π. We performed 100 replications for each sample size. The fraction of times that RCV selects the true dimension u_π = 2 is summarized in Table 3. RCV shows a stable performance in the selection of u_π. With a small sample size of 25, RCV selects the true dimension more than 75% of the time, and its accuracy increases to 90% when the sample size reaches 50. When RCV does not pick the correct dimension, it always overestimates the dimension. A bigger u_π yields a more conservative model: the resulting EER estimator loses some efficiency (compared to the EER model with the correct u_π), but it keeps all the material information and does not introduce bias. Therefore we use RCV to choose u_π in the data analysis examples. In this section, the data are generated from an EER model, and the EER estimator achieves efficiency gains in estimation and better prediction performance over the boosting model and the sparse ER model. However, the results can be different if the data were generated from a different underlying structure. Since the boosting model and the sparse ER model are variable selection methods, if the sparsity structure rather than the envelope structure is present, in other words, if some predictors are inactive and have zero coefficients, these two models can be more efficient than the EER model by making use of the sparsity structure. A simulation under such a setting is included in Section 2 of the supplementary materials. Therefore we cannot conclude whether the EER estimator is more efficient or less efficient than the boosting estimator and the sparse ER estimator in general; it depends on whether the envelope structure or the sparsity structure holds in the underlying relationship between the response and the predictors. If a particular predictor has coefficient zero, then the sparsity structure holds.
If β_π is contained in the subspace spanned by some eigenvectors of Σ_X, then the envelope structure holds. A potentially interesting scenario is that the data have both the envelope structure and the sparsity structure at the same time. A simulation under such a setting is included in Section 3 of the supplementary materials. For completeness, a simulation with no immaterial part is included in Section 4 of the supplement. In such a case, any non-degenerate EER model (u_π < p) does not hold. However, we can still expect a smaller mean squared error (MSE) from an approximate EER estimator in some cases due to the bias-variance tradeoff. Since quantiles and expectiles have a one-to-one mapping [53], we also computed the envelope quantile regression estimator [13] and compared it with the EER estimator using the same simulation setting in this section and the S&P 500 data. Details are included in Section 9 of the supplement.

state.x77
The dataset "state.x77" (contained in the datasets package in R) contains eight measurements, including population, average income, illiteracy, life expectancy, murder rate, high-school graduation percentage, land area and frost level, for the 50 states of the United States of America. The dataset was used in [36] as an example of multiple linear regression. Following [36], we took the murder rate as the response, and population, average income, illiteracy rate and frost level as the predictors. The density plot of the murder rate indicated a bimodal distribution. In addition, we fitted a standard linear regression model to the dataset and checked for homoscedasticity with the Breusch-Pagan test [2]. A p-value of 0.03 showed evidence of heteroscedasticity. Therefore, ER is more appropriate for this dataset than the standard linear regression. We fitted the data with both the ER model and the EER model.

Table 3: The fraction that RCV selects the true u_π with different error distributions.
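As a hedged illustration of the heteroscedasticity check (on synthetic data, since state.x77 ships with R rather than Python), the Breusch-Pagan statistic can be computed as n·R² from an auxiliary regression of the squared OLS residuals on the predictors, referred to a χ² distribution with p degrees of freedom. This is a from-scratch sketch, not the authors' code:

```python
import numpy as np
from scipy.stats import chi2

def breusch_pagan(X, y):
    """Breusch-Pagan LM test: regress squared OLS residuals on X;
    the statistic n * R^2 is asymptotically chi-square(p)."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    r2 = resid ** 2
    fitted = X1 @ np.linalg.lstsq(X1, r2, rcond=None)[0]
    ss_res = np.sum((r2 - fitted) ** 2)
    ss_tot = np.sum((r2 - r2.mean()) ** 2)
    lm = n * (1 - ss_res / ss_tot)
    return lm, chi2.sf(lm, p)

# toy heteroscedastic data: error standard deviation grows with x
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(300, 1))
y = 1 + 2 * x[:, 0] + (0.2 + x[:, 0]) * rng.normal(size=300)
lm, pval = breusch_pagan(x, y)
```

A small p-value, as in the state.x77 analysis, signals heteroscedasticity and motivates ER over ordinary least squares.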
From Table 4 we can see that the estimated regression coefficients given by the EER model and the ER model differ from each other. Take the predictor income (the per capita income) as an example: the ER model suggests that income has a negative effect on murder rate at lower expectile levels (π = 0.10 and π = 0.25) and a positive effect at higher expectile levels (π = 0.50, π = 0.75 and π = 0.90), while the EER model indicates that income has a negative effect on murder rate at all expectile levels. The results from the EER model seem more meaningful. Table 5 shows the efficiency gains of the EER estimator over the other estimators at all investigated expectile levels. Taking π = 0.90 for example, RCV selected u_π = 1. The ratios of the bootstrap standard deviations of the ER estimator versus the EER estimator range from 1.92 to 2.99 with an average of 2.31. To achieve the same efficiency under ER, we would need to increase the sample size to 2.31² ≈ 5.3 times the original sample size. Similar efficiency gains are noticed when comparing with the sparse ER estimator and the boosting estimator. The efficiency gains also lead to different interpretations of the data. For instance, with π = 0.5, population and illiteracy rate are significant predictors with positive coefficients under ER; income and frost level (number of days with minimum temperature below freezing) are not significant. The sparse ER model selects illiteracy rate as the only active predictor, with a positive coefficient. The boosting method selects population and illiteracy rate as active predictors with positive coefficients, and income and frost level as inactive predictors. Under the EER model, however, all predictors are significant: population and illiteracy rate have positive coefficients, while income and frost level have negative associations with murder rate. This indicates that the EER estimator can detect weaker signals in the data, thanks to its improved estimation efficiency.
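The sample-size equivalence quoted above follows from the √n rate of standard errors: if one estimator's standard deviation is smaller by a factor r, the other estimator needs roughly r² times as many observations to match it. A quick check of the figure used in this section:

```python
def equivalent_sample_factor(sd_ratio):
    # standard errors scale as 1/sqrt(n), so matching a factor-r
    # reduction in standard deviation requires r^2 times the data
    return sd_ratio ** 2

print(round(equivalent_sample_factor(2.31), 1))  # 5.3
```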
The predictive performance of the EER estimator is slightly worse than that of the ER estimator because RCV tends to select a parsimonious model, which may have a larger predicted expectile loss than the ER model. The details are included in Section 6 of the supplementary materials.

S&P 500 Index
Now we provide another example, using the S&P 500 Index data, to show that the efficiency gains from the EER model can lead to better prediction performance. The data [26] contain 351 quarterly economic observations from January 1927 to December 2014. The response is the equity premium, i.e., the return on the S&P 500 Index minus the return on the treasury bill. We used 11 quarterly predictors following [3,49,10,30,34,26]: dividend yield (the difference between the log of dividends and the log of lagged prices), earnings-price ratio (the difference between the log of earnings and the log of prices), book-to-market ratio (the ratio of book value to market value for the Dow Jones Industrial Average), net equity expansion (the ratio of 12-month moving sums of net issues by NYSE-listed stocks to the total market capitalization of NYSE stocks), stock variance (the sum of squared daily returns on the S&P 500), treasury bill rate (the 3-month rate), term spread (the difference between the long-term yield on government bonds and the treasury bill rate), long-term rate of return for government bonds, default yield spread (the difference between BAA- and AAA-rated corporate bond yields), default return spread (the difference between the return on long-term corporate bonds and the return on long-term government bonds) and inflation. Some of the predictors are stock characteristics and the others reflect operating conditions of the selected companies. All of them have significant impacts on the S&P 500 Index. Since an investigation of the dataset showed strong evidence of heteroscedasticity, the standard linear regression is not adequate to explore the relationship between the response and the predictors. Moreover, there were two extreme values in the response. Hence, instead of QR, we conducted ER on the dataset, which is more sensitive to the extreme values and makes more efficient use of the information in the dataset.
We fitted the data using the model f_π(Y_{t+1}|X_t) = μ_π + β_π^T X_t, where Y_{t+1} is the equity premium at time t + 1 and X_t is the predictor vector at time t. Both the ER and the EER model were applied to predict the conditional expectile of the response f_π(Y_{t+1}|X_t) in a moving window with a size of 80 quarters, i.e., for each t_0 = 81, ..., 351, the model was fitted on the 80 observations preceding time t_0. We used 80 as the size of the moving window because [26] showed that a 20-year estimation window delivers better results than alternative estimation windows. The predicted expectile loss is an important measure of prediction performance in ER. Once we have the estimates μ̂_π and β̂_π, the predicted expectile loss is computed as φ_π(Y_{t_0+1} − μ̂_π − β̂_π^T X_{t_0}). We took expectile levels π = 0.10, 0.25, 0.50, 0.75 and 0.90 as examples. The results of the predicted expectile losses are summarized in Table 6, which shows that the EER model has a smaller mean predicted expectile loss than the ER model at all expectile levels. In addition, the EER model also has smaller standard deviations of the predicted expectile losses, which indicates that the EER model gives more stable predictions. Figure 7 contains boxplots that graphically display the location and spread of the predicted expectile losses. Since there were some outliers, we trimmed the largest 5% of the predicted expectile losses under both the ER and the EER model for each expectile level to improve visibility. Both Table 6 and Figure 7 demonstrate a substantial advantage of the EER model in prediction performance over the ER model.
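To make the evaluation concrete, here is a small hedged sketch (not the authors' code) of the moving-window exercise: the asymmetric squared loss φ_π, an expectile fit by iteratively reweighted least squares, and the one-step-ahead predicted expectile losses accumulated over rolling windows.

```python
import numpy as np

def expectile_loss(u, pi):
    """Asymmetric squared loss phi_pi(u) = |pi - 1(u < 0)| * u^2."""
    u = np.asarray(u, dtype=float)
    return np.abs(pi - (u < 0)) * u ** 2

def fit_expectile(X, y, pi, n_iter=100):
    """Linear expectile regression via iteratively reweighted least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(X1, y, rcond=None)[0]  # OLS starting value
    for _ in range(n_iter):
        w = np.where(y - X1 @ b < 0, 1 - pi, pi)  # asymmetric weights
        Xw = X1 * w[:, None]
        b = np.linalg.solve(X1.T @ Xw, Xw.T @ y)
    return b  # b[0] is mu_hat, b[1:] is beta_hat

def rolling_expectile_losses(X, y, pi, window=80):
    """One-step-ahead predicted expectile losses over a moving window:
    fit on observations [t0 - window, t0) and score the prediction at t0."""
    losses = []
    for t0 in range(window, len(y)):
        b = fit_expectile(X[t0 - window:t0], y[t0 - window:t0], pi)
        losses.append(expectile_loss(y[t0] - b[0] - X[t0] @ b[1:], pi))
    return np.array(losses)
```

For π = 0.5, φ_π reduces to half the squared error and the window fits coincide with ordinary least squares.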
We should note that the response in this example is weakly autocorrelated, as revealed by its autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. Therefore, an EER model that accommodates time-dependent data would be more suitable for the analysis of this dataset. [25] extended the asymptotic results of [32] to allow for stationary and weakly dependent data in the ER model. In some applications, the time effects can also be formulated by a mixed regression, and mixed expectile models are studied in [47]. The development of an EER model for time-dependent data or for mixed expectile models is a potentially interesting future research direction and can have applications to financial, medical or meteorological datasets.

Extension to semiparametric settings
As an anonymous reviewer pointed out, one advantage of expectiles is the possibility of very flexible semiparametric predictors. In this section, we consider a semiparametric ER model where the response is related to a combination of linear predictors X ∈ R^{p1} and nonlinear predictors Z ∈ R^{p2}: f_π(Y|X, Z) = μ_π + β_π^T X + g(Z), where g is a smooth function, μ_π is the intercept and β_π contains the coefficients of the linear predictors. We assume that E(g(Z)) = 0 for identification. For the simulation, the data were generated from the heteroscedastic model Y = 3 + α_1^T X + h(Z) + (8 + α_2^T X)ε, where ε follows the standard normal distribution and f_π(ε) represents the πth expectile of the standard normal distribution. For the linear part, the slope coefficients are contained in β_π = α_1 + α_2 f_π(ε) and the intercept is μ_π = 3 + 8f_π(ε) + E(h(Z)). The parameters α_1, α_2 and the predictor vector X were generated in the same way as in Section 5, so the linear part follows an envelope structure. The nonlinear part is given by g(Z) = h(Z) − E(h(Z)). We varied the sample size n from 50 to 800. For each sample size, 100 replications were generated. For each replication, we computed the semiparametric ER estimator of β_π by the ER additive model ([42]), and the semiparametric EER estimator of β_π using the preceding algorithm. Then, for each component of β_π, we computed the sample standard deviations of the semiparametric ER estimator and the semiparametric EER estimator based on the 100 replications. We took expectile levels π = 0.50 and π = 0.90 as examples. The results for a randomly chosen component of β_π are summarized in Figure 8. The sample standard deviations of the semiparametric EER estimators are much smaller than those of the semiparametric ER estimators. For example, when n = 200 and π = 0.5, the standard deviation is 0.61 for the semiparametric ER estimator and 0.23 for the semiparametric EER estimator.
For π = 0.9, the standard deviations are 0.72 and 0.29 for the semiparametric ER estimator and the semiparametric EER estimator, respectively. This indicates that the envelope structure also yields substantial efficiency gains in the estimation of β_π in the semiparametric ER context.
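The quantity f_π(ε), the πth expectile of the standard normal distribution used above, has no closed form but is easy to compute numerically. A minimal sketch, under the standard Newey-Powell definition: the π-expectile t of a random variable X solves π E[(X − t)⁺] = (1 − π) E[(t − X)⁺], and for X ~ N(0,1) both partial moments have closed forms, so t can be found by root finding. The function name is our own.

```python
from math import erf, exp, pi as PI, sqrt
from scipy.optimize import brentq

def _phi(t):
    """Standard normal density."""
    return exp(-t * t / 2) / sqrt(2 * PI)

def _Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(t / sqrt(2)))

def normal_expectile(p):
    """pi-th expectile of N(0,1): root of
    p * E[(X - t)+] - (1 - p) * E[(t - X)+] = 0, using the closed-form
    partial moments E[(X - t)+] = phi(t) - t(1 - Phi(t)) and
    E[(t - X)+] = t * Phi(t) + phi(t)."""
    g = lambda t: (p * (_phi(t) - t * (1 - _Phi(t)))
                   - (1 - p) * (t * _Phi(t) + _phi(t)))
    return brentq(g, -10.0, 10.0)
```

For p = 0.5 the two partial moments balance at the mean, so the expectile is 0; by symmetry of N(0,1), f_{1−π}(ε) = −f_π(ε).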
The baseball salary data were studied in [48] to determine whether the salary of a baseball player is affected by his offensive performance. The dataset contains the salary information from the 1992 season for 337 Major League Baseball (MLB) non-pitchers who played at least one game during both the 1991 and 1992 seasons. It also provides 12 offensive statistics for each player from the 1991 season: batting average, on-base percentage, number of runs, hits, doubles, triples, home runs, runs batted in, walks, strike-outs, stolen bases and errors.
We took the salary as the response and the 12 offensive statistics as predictors. By exploring the scatter plots of the response versus each predictor, we identified five predictors whose association with the response is not sufficiently explained by a linear relationship: batting average, on-base percentage, number of triples, strike-outs and errors. These were used as nonlinear predictors, and the remaining predictors were used as linear predictors. Before the analysis, each predictor was scaled to have unit standard deviation. We fitted the semiparametric ER model and the semiparametric EER model with π = 0.5 and 0.9 to the data. RCV suggested u_π = 1 for π = 0.5. For each element of β_π, we calculated the bootstrap standard deviations of both the semiparametric ER estimator and the semiparametric EER estimator from 100 bootstrap samples. The ratios of the bootstrap standard deviations of the semiparametric ER estimator versus the semiparametric EER estimator range from 3.99 to 13.65 with an average of 9.34. For π = 0.9, u_π = 2 was selected by RCV for the EER model; the ratios of the bootstrap standard deviations range from 1.32 to 10.87 with an average of 5.44. The results indicate that the semiparametric EER model achieves substantial efficiency gains compared to the semiparametric ER model. To reach the same average estimation efficiency under the semiparametric ER model, we would need to increase the sample size to 9.34² ≈ 87 times the original sample size for π = 0.5 and 5.44² ≈ 30 times for π = 0.9.

Discussion and future work
In this paper, we develop the EER model as an efficient estimation method for ER. We estimate the parameters using GMM and establish the asymptotic distribution of the estimators. Efficiency gains are demonstrated both theoretically and numerically. A potentially interesting extension is to perform variable selection on the predictors and develop a sparse EER model. In many applications, such as the S&P 500 Index data, some predictors do not affect certain conditional expectiles of the response and have zero coefficients. It is of practical interest to identify those predictors, especially in high dimensional settings. Another direction is to derive the EER model with a censored response. Censored data are often encountered in medical studies and econometrics when the variable of interest is only observed under certain conditions, such as top coding in income surveys, e.g., "$500,000 and above" ([35]). An EER model that can accommodate a censored response will be applicable in such settings.