Model Selection in Semiparametric Expectile Regression

Ordinary least squares regression focuses on the expected response and strongly depends on the assumption of normally distributed errors for inferences. An approach to overcome these restrictions is expectile regression, where no distributional assumption is made but rather the whole distribution of the response is described in terms of covariates. This is similar to quantile regression, but expectiles provide a convenient generalization of the arithmetic mean while quantiles are a generalization of the median. To analyze more complex data structures where purely linear predictors are no longer sufficient, semiparametric regression methods have been introduced for both ordinary least squares and expectile regression. However, with increasing complexity of the data and the regression structure, the selection of the true covariates and their effects becomes even more important than in standard regression models. Therefore, we introduce several approaches based on selection criteria and shrinkage methods to perform model selection in semiparametric expectile regression. Moreover, we propose a joint approach for model selection based on several asymmetries simultaneously to deal with the special feature that expectile regression estimates the complete distribution of the response. Furthermore, to distinguish between linear and smooth predictors, we split nonlinear effects into the purely linear trend and the deviation from this trend. All selection methods are compared with the benchmark of functional gradient descent boosting in a simulation study and applied to determine the relevant covariates when studying childhood malnutrition in Peru.


Introduction
Expectiles as introduced by Newey and Powell (1987) can be motivated in two ways, either as a generalization of the ordinary mean or as an alternative to quantiles. For the ordinary mean, the aim is to find the value that minimizes the average squared distance between the observed data points and this value, such that the mean μ represents the center of gravity. In the following, we assume observations y_i, i = 1, ..., n, drawn from a random variable Y with finite expectation μ and finite variance. Then the ordinary mean of the data can be estimated via

\hat{\mu} = \arg\min_e \frac{1}{n} \sum_{i=1}^{n} (y_i - e)^2.

On the other hand, an α-quantile q_α is determined as the value where at least α · 100% of the data are located below and at least (1 − α) · 100% of the data are located above. This value can be estimated as

\hat{q}_\alpha = \arg\min_q \frac{1}{n} \sum_{i=1}^{n} w_\alpha(y_i) \, |y_i - q|

with weights w_α(y) depending on the data y and the asymmetry α via

w_\alpha(y) = \begin{cases} \alpha & \text{for } y \ge q \\ 1 - \alpha & \text{for } y < q. \end{cases}

Expectiles are then introduced as a mixture of the mean and quantiles. On the one hand, they are determined by an asymmetrically weighted deviations criterion, where the L_1-norm of quantiles is replaced by the L_2-norm. On the other hand, they represent a weighted mean with weights depending on the asymmetry τ and the observed values of the data. Consequently, an estimate for the τ-expectile e_τ is determined via

\hat{e}_\tau = \arg\min_e \frac{1}{n} \sum_{i=1}^{n} w_\tau(y_i) (y_i - e)^2   (1)

with weights w_τ(y) = τ for y ≥ e and w_τ(y) = 1 − τ for y < e. Since the expectile e_τ is the root of the first derivative of the loss function (1), the following characteristic equations can be derived:

(1 - \tau) \sum_{i \in I_1} (e_\tau - y_i) = \tau \sum_{i \in I_2} (y_i - e_\tau), \qquad \hat{e}_\tau = \frac{\sum_{i=1}^{n} w_\tau(y_i) y_i}{\sum_{i=1}^{n} w_\tau(y_i)},   (2)

\frac{\sum_{i \in I_1} |y_i - e_\tau|}{\sum_{i=1}^{n} |y_i - e_\tau|} = \tau,   (3)

with sets of indices I_1 = {i | y_i < e_τ} and I_2 = {i | y_i ≥ e_τ}. As a consequence, expectiles e_τ represent the weighted center of gravity (Yao and Tong, 1996).
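The weighted-mean characterization of expectiles suggests a simple fixed-point estimator: alternate between updating the asymmetric weights and recomputing the weighted mean. A minimal numpy sketch (the function name and convergence settings are illustrative, not from the paper):

```python
import numpy as np

def expectile(y, tau, tol=1e-10, max_iter=100):
    """Estimate the tau-expectile of a sample by iterating the
    weighted-mean fixed point e = sum(w_i * y_i) / sum(w_i),
    with w_i = tau if y_i >= e and (1 - tau) otherwise."""
    y = np.asarray(y, dtype=float)
    e = y.mean()  # tau = 0.5 reproduces the ordinary mean
    for _ in range(max_iter):
        w = np.where(y >= e, tau, 1.0 - tau)
        e_new = np.sum(w * y) / np.sum(w)
        if abs(e_new - e) < tol:
            break
        e = e_new
    return e
```

For τ = 0.5 all weights are equal and the iteration returns the arithmetic mean immediately; for τ ≠ 0.5 the estimate moves toward the corresponding tail.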
Furthermore, Equation (3) indicates that the fraction of the distances below the expectile relative to the total sum of distances is given by the corresponding asymmetry parameter τ. This statement replaces the corresponding statement on the number of data points for quantiles. Both expectiles and quantiles can be used to describe the complete distribution of a random variable when using a dense set of asymmetries / quantile levels τ. Even though they differ in their definitions and therefore in their properties, a bijective transformation between quantiles and expectiles exists (Yao and Tong, 1996; Schulze-Waltrup et al., 2015). Given the distribution function F of a continuous random variable Y, the bijective function h : (0, 1) → (0, 1) converting the α-quantile q_α into the h(α)-expectile e_{h(α)} is given by

h(\alpha) = \frac{\alpha q_\alpha - G(q_\alpha)}{2(\alpha q_\alpha - G(q_\alpha)) + \mu - q_\alpha}, \qquad G(t) = \int_{-\infty}^{t} y \, dF(y).

This function can easily be calculated for standard types of distributions with finite variance; see Schulze-Waltrup et al. (2015) for further details and examples.
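The level transformation can also be evaluated empirically by replacing F, μ and the partial moment with their sample counterparts. The following sketch uses the identity τ = (q F(q) − G(q)) / (2(q F(q) − G(q)) + μ − q) with G(t) = E[Y · 1(Y ≤ t)]; the function name and the use of `np.quantile` are illustrative assumptions:

```python
import numpy as np

def quantile_to_expectile_level(alpha, y_sample):
    """Empirical version of the bijection h: the asymmetry tau at which
    the tau-expectile coincides with the empirical alpha-quantile."""
    y = np.asarray(y_sample, dtype=float)
    q = np.quantile(y, alpha)
    F_q = np.mean(y <= q)              # empirical distribution function at q
    G_q = np.mean(y * (y <= q))        # empirical partial moment G(q)
    num = q * F_q - G_q
    return num / (2.0 * num + y.mean() - q)
```

For a symmetric sample, the median (α = 0.5) maps to the mean level τ = 0.5, as the bijection requires.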
Similar to how quantiles have been embedded in a regression framework (Koenker and Bassett, 1978), expectiles can also be used to estimate regression models (Newey and Powell, 1987), where the model specification is then given by

y_i = \eta_{i,\tau} + \varepsilon_{i,\tau}, \qquad i = 1, \ldots, n,

with y_i being a continuous response and η_{i,τ} specifying the corresponding regression predictor. For the error term, no specific type of distribution is assumed; rather, we assume that the error terms ε_{i,τ} are independent, have finite (but potentially different) variances and satisfy 0 = argmin_e E(w_τ(ε_{i,τ})(ε_{i,τ} − e)²), i.e. the τ-expectiles of the error terms are all zero. The resulting regression problem is typically solved using an iterative scheme called least asymmetrically weighted squares (LAWS), where estimation of the regression coefficients and the weights alternates (compare Sobotka and Kneib, 2012, for further details). The main advantages of expectile regression are that no specific assumptions have to be made concerning properties of the error distribution, such as homoscedasticity, and that by estimating many expectiles the whole distribution of the response can be analyzed. In addition, expectile regression can easily be extended beyond the purely linear model specification by using semiparametric predictors (compare Section 2). While the specification of regression models based on quantiles or expectiles has the clear advantage of reducing the required assumptions, it also makes inference and model choice more challenging since the models are no longer based on a likelihood. As a consequence, likelihood-based inference (e.g. likelihood ratio tests or Akaike's information criterion, AIC) is no longer immediately available. While resampling approaches such as the bootstrap or asymptotic normality results can be used to conduct inference in expectile regression models (see for example Sobotka et al., 2013), model choice questions have never been investigated in detail.
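For a purely linear predictor, the LAWS scheme alternates between a weighted least squares fit and an update of the asymmetric weights. This is a minimal sketch of that iteration, not the full semiparametric implementation of Sobotka and Kneib (2012):

```python
import numpy as np

def laws(X, y, tau, tol=1e-8, max_iter=200):
    """Least asymmetrically weighted squares (LAWS) for a linear
    expectile regression y ~ X @ beta at asymmetry tau."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting value
    for _ in range(max_iter):
        resid = y - X @ beta
        w = np.where(resid >= 0, tau, 1.0 - tau)  # asymmetric weights
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)  # weighted LS step
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta
```

Setting τ = 0.5 makes every weight equal, so the scheme reduces to ordinary least squares after the first step.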
This is even more relevant when considering semiparametric regression models where we do not only have to decide which covariates should be included but also whether effects should be purely linear or nonlinear. Furthermore, for expectile (and quantile) regression models, a common question is whether covariates should be selected for the different expectile levels separately or simultaneously for the complete response distribution.
We will mostly focus on two avenues for approaching model selection: (1) procedures based on information criteria, such as stepwise selection based on AIC-type criteria that are very popular in ordinary least squares regression (Burnham and Anderson, 2002), and (2) shrinkage approaches such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) that add a complexity penalty to the fit criterion. Since the L_1 penalty induced by the LASSO does not fit well with the L_2 geometry of expectiles, we will consider the non-negative garrote (Breiman, 1995) as an alternative.
An alternative to AIC-type criteria is provided by proper scoring rules derived from information-theoretic considerations (Gneiting and Raftery, 2007). Proper scoring rules can be applied for several regression types, including quantile regression and expectile regression (Gneiting, 2011). For quantile regression, further approaches to model selection have been introduced, starting with a goodness-of-fit criterion suggested in Koenker and Machado (1999). Alternatively, several authors, including Li and Zhu (2008), Zou and Yuan (2008a), Zou and Yuan (2008b), Wu and Liu (2009), Koenker (2011) and Jiang, Bondell and Wang (2014), introduced LASSO-type or SCAD-type penalties to select variables in quantile regression (for a detailed review see Wu and Ma, 2015). All of them, except Koenker (2011), restrict their approaches to linear predictors, while Koenker (2011) penalizes the variation of nonlinear effects but does not include the possibility to select or deselect them. Moreover, most of the former papers select the variables separately for each quantile level, while Zou and Yuan (2008b) and Jiang, Bondell and Wang (2014) introduce methods to select the covariates jointly for a given set of asymmetries (albeit with the restriction that coefficients should vary smoothly over the quantile levels).
Among others (see Gijbels, Verhasselt and Vrinssen, 2015, for a review of recent approaches) Huang, Horowitz and Wei (2010), Greven and Kneib (2010), Marra and Wood (2011) and Chouldechova and Hastie (2015) introduce approaches to perform model selection in regression specifications with semiparametric predictors but a pre-specified type of response distribution. Greven and Kneib (2010) suggest criterion-based selection, while the others use shrinkage approaches.
Further research has also been done to implement semiparametric predictors in quantile regression (see Koenker, Ng and Portnoy, 1994, He and Ng, 1999, Doksum and Koo, 2000). In this context, some ideas for deciding whether nonlinear effects are indeed necessary have been introduced. Most of them deal with varying coefficient models and try to decide whether the varying part is required (see for example Wang, Zhu and Zhou, 2009, Tang et al., 2012, Noh et al., 2012, Tang, Wang and Zhu, 2013). In order to select the models, they make use of asymptotic distributional results or shrinkage methods. Moreover, they select the covariates for each quantile level separately. Some approaches to overcome this restriction in varying coefficient models are introduced in Kai, Li and Zou (2011) and Guo, Yang and Lv (2015). Here the models of several quantile levels are selected jointly via a penalization approach. In addition, Guo et al. (2013), Lin et al. (2013) and Lv, Yang and Guo (2015) suggest methods to use truly additive models and apply model selection on the linear and the nonlinear predictors via penalization approaches.
A flexible alternative to criteria-or penalization-based approaches is functional gradient descent boosting (Bühlmann and Hothorn, 2007), where coefficient estimation and model selection is done in one estimation run. Boosting also has the advantage that selection of nonlinear and linear effects is readily available and can be combined with a variety of model specifications. Boosting has been introduced to semiparametric quantile regression by Fenske, Kneib and Hothorn (2011).
All told, we are not just adapting the above procedures to expectiles. Instead, we combine several approaches that have previously been used only separately and add new ideas. First, we select variables based on AIC-type criteria and shrinkage methods. Thereby, we introduce methods to select models for the whole distribution by combining multiple asymmetries. Whether model selection should be performed for the complete distribution or for single asymmetries very much depends on the specific context of the application. If we focus on specific parts of the distribution, such as the tails, it will in general be preferable to select only asymmetries associated with this part of the distribution and to perform separate model selection. If, on the other hand, the complete distribution of the response is to be studied, we expect benefits from performing simultaneous selection. In particular, this has the advantage of providing a consistent model selection for the complete distribution, allows us to borrow strength from neighboring expectiles, and facilitates interpretation of the results.
Second, to overcome the assumption that a covariate influences the response linearly, the specification of semiparametric regression is advantageous. Therefore, all our suggested methods can also select nonlinear predictors. However, model selection with semiparametric predictors is more complex. In this paper, we make use of the fact that P-splines can be separated into a linear trend and the wiggly/nonlinear deviation from this trend (Fahrmeir, Kneib and Lang, 2004) and select these components separately. This decomposition is orthogonal in the parameters such that the selection between the linear and the nonlinear part is not hampered by concurvity of the two effects.
The remainder of the article is organized as follows: In Section 2, we briefly review some basic concepts of semiparametric regression. Section 3 discusses model selection methods that allow to separate between linear and nonlinear predictor structures. Furthermore, the expectile-specific model selection approaches are introduced. The following Section 4 contains a simulation study on the empirical performance of the different approaches while Section 5 presents the application of our methods on the estimation of determinants for chronic undernutrition of children in Peru. The concluding remarks are summarized in Section 6.

Semiparametric regression models
In the remainder of this paper, we will consider semiparametric regression specifications where the response variable y_i is related to a combination of linear effects x_{il} β_l as well as smooth, nonlinear effects f_s(x_{is}) of continuous covariates such that

y_i = \beta_0 + \sum_{l} x_{il} \beta_l + \sum_{s} f_s(x_{is}) + \varepsilon_i, \qquad i = 1, \ldots, n.

More generically, bivariate splines for interaction surfaces, Markov random fields or Kriging terms for spatial effects and a variety of other effect types can be included in a similar way (see Sobotka and Kneib, 2012). In the following, we will focus on the special case of univariate nonlinear effects for the sake of illustration.
To further simplify the presentation, assume a model with only one covariate with nonlinear effect such that we obtain the specification

y_i = f(x_i) + \varepsilon_i

with smooth function f. This function is then approximated by a weighted sum of B-spline basis functions B_r^{(d)} of a fixed degree d such that

f(x) = \sum_{r=1}^{R} \gamma_r B_r^{(d)}(x),

where γ_r are the corresponding coefficients and R is the dimensionality of the basis, i.e. the number of basis elements. If the matrix Z contains the values of the basis functions evaluated at the observed covariate values and γ is the vector of coefficients, the model can then be rewritten in matrix-vector notation as

y = Z\gamma + \varepsilon.

To reduce the dependency on the dimension and the position of the basis functions, Eilers and Marx (1996) introduced P-splines, where a large number of basis functions is combined with a difference penalty on the basis coefficients such that the estimated coefficients γ do not vary too much between neighboring basis functions. This leads to the penalized least squares criterion

(y - Z\gamma)^T (y - Z\gamma) + \lambda \gamma^T K \gamma,   (4)

where K is a penalty matrix representing squared differences of a predetermined order p and λ ≥ 0 denotes the smoothing parameter that governs the trade-off between smoothness of the function estimate (λ → ∞) and fit to the data (λ → 0). To optimize λ, a first option is to use selection criteria like the AIC or generalized cross-validation. Alternatively, the Schall algorithm (Schall, 1991) can be applied since P-splines can be cast in a mixed model framework (see for example Fahrmeir, Kneib and Lang, 2004). The Schall algorithm as well as optimization of selection criteria can also be adapted to the combination of P-splines with expectile regression, see Sobotka and Kneib (2012) and Schnabel and Eilers (2009).
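The penalized least squares criterion (4) has a closed-form minimizer, γ̂ = (Z'Z + λK)^{-1} Z'y, with K built from a difference matrix. A short numpy sketch; constructing the B-spline basis Z itself is omitted, and the function name is illustrative:

```python
import numpy as np

def pspline_fit(Z, y, lam, order=2):
    """Penalized least squares for a basis matrix Z: minimize
    ||y - Z g||^2 + lam * g' K g with K = D'D and D the difference
    matrix of the given order (second differences by default)."""
    R = Z.shape[1]
    D = np.diff(np.eye(R), n=order, axis=0)  # p-th order difference matrix
    K = D.T @ D                              # penalty matrix
    return np.linalg.solve(Z.T @ Z + lam * K, Z.T @ y)
```

With λ = 0 this reduces to ordinary least squares; for second-order differences, coefficient vectors that are exactly linear lie in the null space of K and are therefore left unpenalized for any λ.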
When including nonlinear effects in a regression model, a typical model choice question relates to the decision whether a covariate should be included linearly or flexibly, i.e. based on a P-spline. To facilitate this decision, we decompose the coefficients of the P-spline into their penalized and unpenalized parts (following Lin and Zhang, 1999; Currie and Durban, 2002; Fahrmeir, Kneib and Lang, 2004) as

\gamma = V\beta + Wb

with matrices V and W spanning the null space and the orthogonal deviation from the null space of the penalty matrix K. Plugging this reparameterization into Equation (4) yields

(y - ZV\beta - ZWb)^T (y - ZV\beta - ZWb) + \lambda b^T W^T K W b.   (5)

With this representation, the first part contains the unpenalized coefficients, which represent a linear function for second order differences (p = 2) and, more generally, a polynomial of order p − 1. The second part contains the penalized coefficients, which can be interpreted as the smooth deviation from the linear effect. Both parts can then be treated as separate model components in the model selection task (see Section 3.1 for details). Note that the orthogonal decomposition provided by the matrices V and W improves the model choice performance since concurvity between the linear effect and the nonlinear deviation is avoided by construction.
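One way to obtain V and W in practice is via the eigendecomposition of the penalty matrix K: eigenvectors with (numerically) zero eigenvalues span the unpenalized null space, the remaining eigenvectors the penalized deviation. This is a sketch of one possible construction; the cited papers use related but not identical reparameterizations:

```python
import numpy as np

def decompose_penalty(K, tol=1e-8):
    """Split the coefficient space of a penalty matrix K into its null
    space (unpenalized, e.g. the linear trend for second differences)
    and the orthogonal complement (the penalized deviation)."""
    vals, vecs = np.linalg.eigh(K)
    V = vecs[:, vals < tol]    # null space of K: unpenalized part
    W = vecs[:, vals >= tol]   # orthogonal deviation: penalized part
    return V, W
```

For a second-order difference penalty, the null space has dimension two (intercept and slope), and V'W = 0 holds by construction.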

Model selection for P-splines
For continuous covariates, we consider four selection approaches:

• Linear vs. no effect: The covariate is either included as a linear function or not included at all. In this case, no nonlinear effects can be chosen.

• Nonlinear vs. no effect: The covariate is included as a P-spline (without decomposing it into penalized and unpenalized part) or not included at all. Note that the smoothing parameter still allows to reduce the effect to a polynomial of degree p − 1 (and therefore a linear effect for the standard case of p = 2) if λ → ∞.

• Nonlinear vs. linear vs. no effect: In this case, all three possibilities are compared, i.e. the covariate is included as a P-spline (without decomposing it into penalized and unpenalized part), as a linear effect, or it is dropped from the model. We will consider this possibility in combination with best subset and stepwise forward methods (see Section 3) and only allow for one of the competing alternatives, i.e. once the covariate is included as either linear or nonlinear effect, the other inclusion variant is not possible any more. Accordingly, we call this method restricted.

• Decomposition into linear and nonlinear effect: Here the decomposition of the P-spline into a linear part and the deviation from the linear effect is utilized as described in Equation (5). Hence the covariate can be included as a linear effect, as the deviation from the linear function, or as the combination of both effects, which then yields the original P-spline again. Based on the orthogonality of the effects, this corresponds to an independent treatment of penalized and unpenalized part such that both are included or excluded separately from the model. This method is called complete in the following and can be used in combination with all methods defined below, including the non-negative garrote.
Other smooth function types, such as Markov random fields, can also be selected with the approaches defined in Sections 3.2.1 to 3.3.3. However, no decomposition is used for them, so the model choice simplifies considerably. Besides the selection of semiparametric predictors, the main novelty of model selection for expectile regression is that selection can be done for each asymmetry separately or jointly for the whole distribution. In the next sections we introduce model selection methods for each asymmetry separately and discuss joint selection afterwards.

Best subset and stepwise selection
Classical model selection methods like stepwise or best subset selection rely on the definition of an appropriate selection criterion. We therefore introduce adaptations of selection criteria well known from mean regression.
The first criterion is obtained by defining a cross-validated measure corresponding to the least asymmetrically weighted squares (LAWS) criterion

\frac{1}{n} \sum_{i=1}^{n} w_\tau(y_i) (y_i - \hat{y}_{i,\tau})^2,

where y_i is the observed response and ŷ_{i,τ} is the predicted expectile for this observation. For the cross-validated LAWS criterion, we determine expectile regression estimates on a subset of the data and evaluate their predictive ability based on the predictive LAWS criterion. In the following, this predictive index is called mean weighted squared error (MWSE, in generalization of the standard mean squared error, MSE).
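Evaluated on held-out observations, the criterion above is the MWSE. A minimal sketch (function name illustrative):

```python
import numpy as np

def mwse(y, y_pred, tau):
    """Mean weighted squared error: the predictive LAWS criterion,
    with weight tau for under-predictions and 1 - tau otherwise."""
    resid = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    w = np.where(resid >= 0, tau, 1.0 - tau)
    return np.mean(w * resid ** 2)
```

For τ = 0.5 this is half the ordinary mean squared error, so cross-validation with the MWSE reduces to the familiar MSE-based procedure up to a constant factor.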
As an alternative, we consider a generalization of Akaike's information criterion (AIC) (Akaike, 1974) defined as

AIC(\tau) = \sum_{i=1}^{n} w_\tau(y_i) (y_i - \hat{y}_{i,\tau})^2 + 2 \, df,

where the weighted sum of squared errors replaces the negative likelihood, while the degrees of freedom df = trace(H) are determined based on the hat matrix

H = X (X^T W_\tau X + K)^{-1} X^T W_\tau,

where X is the complete design matrix of all effects, K is the (block diagonal) penalty matrix consisting of all individual penalty matrices and their associated smoothing parameters, and W_τ is the diagonal matrix of the weights per observation for the current asymmetry τ. Similarly, an expectile version of the Bayesian information criterion (BIC) (Schwarz, 1978) can be defined by adjusting the weight on the degrees of freedom. Note that in expectile regression we do not make an explicit assumption on the distribution of the error term; therefore neither AIC nor BIC is strictly justified by standard arguments as in Burnham and Anderson (2002), but they are rather defined as ad hoc analogues to these criteria.
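The AIC analogue can be sketched directly from these ingredients. The additive form WSSE + 2·df below is one plausible reading of the criterion (the weighted sum of squared errors standing in for the negative likelihood); the names X, K and beta mirror the text:

```python
import numpy as np

def expectile_aic(X, y, beta, K, tau):
    """Ad hoc AIC analogue for expectile regression: weighted sum of
    squared errors plus 2 * df, with df = trace of the hat matrix
    H = X (X'WX + K)^{-1} X'W; K collects all individual penalty
    matrices scaled by their smoothing parameters."""
    resid = y - X @ beta
    w = np.where(resid >= 0, tau, 1.0 - tau)  # expectile weights
    wsse = np.sum(w * resid ** 2)
    WX = X * w[:, None]
    H = X @ np.linalg.solve(X.T @ WX + K, WX.T)
    df = np.trace(H)
    return wsse + 2.0 * df
```

A BIC analogue follows by replacing the factor 2 on df with log(n).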

Non-negative garrote
An alternative to best subset or stepwise selection procedures are shrinkage methods that avoid the necessity to estimate multiple models for a comparison. They rather combine estimation and model selection in one step. The most prominent example is the LASSO by Tibshirani (1996), which adds an L_1 penalty term to the fit criterion. Due to the specific geometry of the L_1 penalty of the LASSO, covariates are excluded from the model if their influence on the response is too small. Since combining the LASSO with penalized predictors is not directly possible due to the ties between the coefficients, several attempts to build shrinkage methods for semiparametric predictors have been developed (see Marra and Wood, 2011, for an overview). In this article, we focus on the non-negative garrote introduced by Breiman (1995), which uses a two-step approach for shrinking. In a first step, the saturated model with all effects included is estimated. In a second step, prediction accuracy is optimized by multiplying every estimated predictor component with an extra weight δ and checking which predicted component has a relevant influence on the response. We introduce the non-negative garrote for semiparametric expectile regression following the notation of Marra and Wood (2011). To obtain the special case of ordinary least squares, one simply sets τ = 0.5, such that τ can be ignored in the formulae.
In the following, we consider the non-negative garrote for models with semiparametric predictors η_τ = Σ_k f_{k,τ}(x_k) where, for the sake of simplicity, linear effects are also represented as functions f_{k,τ}(x_k) = x_k β_{k,τ}. In principle, the aim of the non-negative garrote is to optimize the coefficients not only with respect to the model fit but also with respect to prediction accuracy. Therefore the MWSE is transformed into the optimization criterion

\frac{1}{n} \sum_{i=1}^{n} w_\tau(y_i) \Big( y_i - \hat{\beta}_{0,\tau} - \sum_{k=1}^{K} \delta_{k,\tau} \hat{f}_{k,\tau}(x_{ik}) \Big)^2,   (7)

where δ_{k,τ} ≥ 0 is an extra weight to optimize the predictive model fit. These weights are limited in size by the tuning parameter ξ_τ via Σ_{k=1}^{K} δ_{k,τ} ≤ ξ_τ. To explain the procedure in more detail, we first show how to estimate the weights δ_τ = (δ_{1,τ}, ..., δ_{K,τ})^T for a given tuning parameter ξ_τ. Next, we explain how to estimate the optimal tuning parameter ξ_τ. Finally, we provide an algorithm that combines these steps and estimates the final weights δ_τ for a specific asymmetry τ.
For the estimation of δ_τ, we assume a specific tuning parameter ξ_τ and estimate the saturated expectile regression with all effects included in the first step. Hence, the predictors f̂_{k,τ} and the weights w_τ(y_i) are then available from this full model. Since we only consider cases with K < n, no specific form of the estimates f̂_{k,τ} needs to be assumed. As the intercept is not subject to selection, the response y = (y_1, ..., y_n)^T is transformed to ỹ = y − β̂_{0,τ} for simplification. With this, Equation (7) can be rewritten in matrix notation as

\hat{\delta}_\tau = \operatorname{argmin}_{\delta_\tau} (\tilde{y} - \hat{F}_\tau \delta_\tau)^T W (\tilde{y} - \hat{F}_\tau \delta_\tau) \quad \text{subject to} \quad \delta_{k,\tau} \ge 0, \ \sum_{k=1}^{K} \delta_{k,\tau} \le \xi_\tau,   (8)

where F̂_τ = (f̂_{1,τ}; ...; f̂_{K,τ}) is the matrix of predicted values and W = diag(w_τ(y_i)) is the diagonal matrix of expectile weights. Moreover, it is possible to apply methods for solving quadratic programming problems to Equation (8). This yields an estimate δ̂_τ in the second step, and the optimized model comprises the effects f̂_{k,τ} δ̂_{k,τ}.

If the saturated model is indeed the true model, then ξ_τ = K and δ_{k,τ} = 1 for all k = 1, ..., K, and the non-negative garrote changes nothing. However, if there is a covariate x_k which is irrelevant for the prediction, then the prediction will be better if its weight δ_{k,τ} is close to or even equal to zero, such that the effect of x_k is excluded from the model. Furthermore, the weights δ_{k,τ} can be larger than 1. This is, however, unlikely, as the regression also minimizes the divergence between the observed values and the predictions. Moreover, the value of ξ_τ is essential for the estimation, but it has to be specified in advance. In order to find the model with the best prediction accuracy, the tuning parameter ξ_τ is determined via cross-validation out of a grid Ξ of possible tuning parameters ξ_τ. This leads to the following algorithm:

1. Build the grid Ξ of possible tuning parameters ξ_τ.
2. Split the data into G pairs of training and validation data sets.
3. Iterate over all pairs g ∈ G and
   a. Estimate the full expectile regression model for the training data set to obtain f̂_{k,τ} and w_τ(y_i).
   b. Iterate over all ξ_τ ∈ Ξ and
      i. Estimate the parameter δ̂_τ based on the training data set to obtain the effect estimates f̂_{k,τ}.
      ii. Compute the updated coefficients f̂^{new}_{k,τ} = f̂_{k,τ} δ̂_{k,τ}.
      iii. Use these new coefficients to predict the expectiles for the corresponding validation data set.
      iv. Estimate the MWSE for this validation data set g and tuning parameter ξ_τ to obtain MWSE_{g,ξ_τ}.
4. Build the cross-validation score for each ξ_τ separately, i.e. score(ξ_τ) = (1/G) Σ_g MWSE_{g,ξ_τ}.
5. Find the minimal score and use the corresponding ξ_τ as the optimal one (ξ^{opt}_τ).

With the algorithm described above, we can now define the complete algorithm for using the non-negative garrote:

I. Find the optimal tuning parameter ξ^{opt}_τ based on cross-validation.
II. Use this tuning parameter ξ^{opt}_τ to estimate the final weights δ̂_τ on the complete data set.
III. Compute the new coefficients f̂^{new}_{k,τ} = f̂_{k,τ} δ̂_{k,τ} and treat the result as the final model.
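The inner estimation of the garrote weights is a constrained quadratic program. The paper solves it with quadratic programming routines; the sketch below substitutes projected gradient descent (with an explicit projection onto {δ ≥ 0, Σδ ≤ ξ}) so that it runs with numpy alone. Function names and step-size choice are illustrative:

```python
import numpy as np

def project(d, xi):
    """Euclidean projection onto {d >= 0, sum(d) <= xi}."""
    d = np.maximum(d, 0.0)
    if d.sum() <= xi:
        return d
    # project onto the simplex sum(d) = xi (standard sorting algorithm)
    u = np.sort(d)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - xi) / (np.arange(len(d)) + 1) > 0)[0][-1]
    theta = (css[rho] - xi) / (rho + 1.0)
    return np.maximum(d - theta, 0.0)

def garrote_weights(F, ytil, w, xi, n_iter=5000):
    """Second garrote step: minimize sum_i w_i (ytil_i - F_i @ d)^2
    subject to d >= 0 and sum(d) <= xi, via projected gradient descent.
    F holds the fitted component predictions (n x K)."""
    K = F.shape[1]
    WF = F * w[:, None]
    A = F.T @ WF                 # quadratic part of the objective
    b = WF.T @ ytil
    lr = 1.0 / np.linalg.eigvalsh(A).max()  # conservative step size
    d = np.full(K, min(1.0, xi / K))
    for _ in range(n_iter):
        d = project(d - lr * (A @ d - b), xi)
    return d
```

A component whose fitted contribution does not improve prediction is driven to δ_k = 0 and thus dropped, exactly the selection mechanism described in the text.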
We prefer the non-negative garrote over the LASSO for several reasons. First, it can easily be adapted to semiparametric models (see Marra and Wood, 2011). Second, the non-negative garrote does not limit the value of the coefficients, as the LASSO does, but rather measures the influence of the predicted values. This is appropriate for expectile regression, as we allow the coefficients to vary freely across the different asymmetries τ. Last but not least, this method can be extended to a joint estimation of the weights for all asymmetry parameters simultaneously (see Section 3.3.3).

Selection methods for the complete distribution
The methods considered in the previous sections select the optimal model for each asymmetry parameter τ separately. This is advantageous in order to analyze which covariate has an influence on specific parts of the distribution of the response. Furthermore, the resulting models incorporate only information relevant for this specific asymmetry parameter. However, it might be easier to compare the effects of the covariates between the asymmetries if the same covariates are included for all asymmetries. Moreover, the probability of crossing expectiles is reduced with this joint approach. Our simulation studies show that joint selection excludes superfluous covariates more often than separate selection. Still, a covariate with a specific influence on one tail can build up enough leverage to remain in a joint model.

Mean AIC
The first method we propose to select the optimal model for all asymmetry parameters jointly is the mean AIC or area under the criteria curve. This method is motivated by the fact that the whole distribution of the response can be estimated with expectile regression. Then the selection criterion for a given model, treated as a function of the asymmetry parameter τ, can be compared with the corresponding function of a competing model specification. To reduce the information to one single value (and therefore to facilitate the decision which of the models is "better"), the area under the criteria curve is determined and the model with the smaller area is preferred (for a negative orientation of the criterion, as for example in case of the AIC). This principle is illustrated in Figure 1, where the model including the P-spline of x_2 performs better on the left side of the distribution while the simpler model would be preferred on the right side. When comparing the area under the criteria curve, the more complex model is found to perform better. Of course the integrated criterion can easily be weighted such that specific parts of the distribution are more relevant than others when doing the comparison.
In practice, the whole distribution will be approximated by a grid of asymmetry parameters τ_j with j = 1, ..., J. For these parameters, the expectile regression is estimated separately, and the corresponding selection criterion AIC(τ_j) can be computed. The area under the criteria curve is then approximated by the simple Riemann sum

\sum_{j=1}^{J-1} AIC(\tau_j) (\tau_{j+1} - \tau_j).

If the asymmetry parameters τ_j are chosen on a homogeneous grid, the area criterion is proportional to the arithmetic mean of the individual contributions,

\frac{1}{J} \sum_{j=1}^{J} AIC(\tau_j).

Again, this index can also be replaced by a weighted mean with weights favoring the outer or inner part of the distribution. Eventually this index can be used for stepwise model selection, for example, and the resulting model will contain the same covariates for all asymmetry parameters.

[Figure 1: Differences between models as the area between the criteria curves. This graphic is based on the exponential simulation design of Section 4, but with more extreme parameters and fewer observations.]
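The area criterion can be computed in a few lines; this sketch uses a trapezoid-style sum normalized by the grid range, so that on a homogeneous grid it coincides with the arithmetic mean of the per-asymmetry values (the function name is illustrative):

```python
import numpy as np

def mean_aic(aic_values, taus):
    """Area under the criteria curve over the asymmetry grid,
    normalized by the grid range; proportional to the plain mean
    of AIC(tau_j) on a homogeneous grid."""
    a = np.asarray(aic_values, dtype=float)
    t = np.asarray(taus, dtype=float)
    area = np.sum(0.5 * (a[1:] + a[:-1]) * np.diff(t))  # trapezoid sum
    return area / (t[-1] - t[0])
```

A weighted version emphasizing the tails would simply multiply each AIC(τ_j) by a weight before summing.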

Proper scoring rules
The purpose of model selection is often to identify models with good predictive performance. In the case of optimizing the model separately for each asymmetry, we therefore considered cross-validation via the MWSE criterion. For joint selection, we adopt the notion of scoring rules S (Gneiting and Raftery, 2007; Gneiting, 2011) that evaluate the fit between predictive distributions and actually observed realizations. More precisely, we assume that a forecast is probabilistic and we obtain a predictive distribution P as forecast for a given realization y. Then S : (P, y) → s ∈ R is a scoring function and we write S(P, Q) for the expected value of S(P, ·) under Q. If Q is the true distribution, we say that S is a strictly proper (negatively oriented) scoring rule if

S(Q, Q) \le S(P, Q) \quad \text{for all } P,

with equality if and only if P = Q. For expectiles, we assume asymmetries τ_1, ..., τ_J ∈ (0, 1) with corresponding forecasts e_{τ_1}, ..., e_{τ_J} for a given observation y and a probability measure P. The expected score is now given as

S(e_{\tau_1}, \ldots, e_{\tau_J}; P) = \int S(e_{\tau_1}, \ldots, e_{\tau_J}; y) \, dP(y),   (9)

similar to the quantile representation in Gneiting and Raftery (2007). Furthermore, we know from Gneiting (2011) that a strictly proper scoring rule for expectiles is given by

S(e_\tau; y) = |1(y < e_\tau) - \tau| \left( \psi(y) - \psi(e_\tau) - \psi'(e_\tau)(y - e_\tau) \right),

with ψ being a convex function and ψ' being its subgradient. The most prominent example is constructed with ψ(x) = x². Then the proper scoring rule for expectiles coincides with our asymmetrically weighted squared error:

S(e_\tau; y) = w_\tau(y) (y - e_\tau)^2.

To approximate (9), we utilize a set of asymmetry parameters to capture the total distribution from the corresponding expectiles, i.e.
S(e_{\tau_1}, \ldots, e_{\tau_J}; y) = \sum_{j=1}^{J} w_{\tau_j}(y) (y - e_{\tau_j})^2,   (10)

similar to the quantile case (see Gneiting and Raftery, 2007, for further details on the quantile setting). To apply this in the model selection framework, we build the cross-validated score for a complete data set as

\text{score} = \frac{1}{G} \sum_{g=1}^{G} \frac{1}{n_g} \sum_{i \in g} \sum_{j=1}^{J} w_{\tau_j}(y_i) (y_i - \hat{e}_{i,\tau_j})^2,

where y_i is the data point that materialized and ê_{i,τ_j} = ŷ_{i,τ_j} is the predicted value for this point, G is the number of cross-validation folds and n_g is the size of the validation data set in fold g. The cross-validated score can then again be used for stepwise model selection or for comparing a pre-selected set of models. In applications, scoring can also be considered as a mixture of cross-validation and mean AIC, i.e. it is the prediction error integrated over the asymmetries (as shown for quantiles in Gneiting and Ranjan, 2011). As a consequence, we can not only look at the complete score but can also decompose it along the asymmetries, similar to the AIC decomposition in Figure 1. Again, a weight vector emphasizing specific parts of the distribution can easily be included.
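The multi-expectile score (10) for a set of observations can be sketched directly; E holds one row of predicted expectiles per observation, one column per asymmetry (names illustrative):

```python
import numpy as np

def expectile_score(y, E, taus):
    """Score (10) averaged over observations: for each observation,
    sum w_{tau_j}(y) * (y - e_{tau_j})^2 over the asymmetry grid.
    E is an (n x J) matrix of predicted expectiles."""
    y = np.asarray(y, dtype=float)[:, None]
    E = np.asarray(E, dtype=float)
    taus = np.asarray(taus, dtype=float)[None, :]
    w = np.where(y >= E, taus, 1.0 - taus)  # asymmetric weights per cell
    return np.mean(np.sum(w * (y - E) ** 2, axis=1))
```

In a cross-validation loop, this quantity is computed on each validation fold and averaged, yielding the score defined above.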

Non-negative garrote on a grid
To generalize the non-negative garrote from single asymmetries to the complete distribution, a grid of asymmetry parameters \tau_j is used and expectile regressions for all \tau_j are estimated, such that \hat{f}_{k,\tau_j} and w_{\tau_j} are given. With these estimates, we solve

\sum_{j=1}^{J} \sum_{i=1}^{n} w_{\tau_j}(y_i) \Bigl( y_i - \hat{\beta}_{0,\tau_j} - \sum_{k=1}^{K} \delta_k \hat{f}_{k,\tau_j}(x_i) \Bigr)^2 \to \min_\delta   (11)

such that the weights \delta = (\delta_1, \ldots, \delta_K)^T are the same for all asymmetries, while the same constraints as in Section 3.2.2 apply (\delta_k \ge 0 and \sum_k \delta_k = \xi). To solve this minimization problem with quadratic programming routines, it is necessary to rewrite (11) in matrix notation. Define

\tilde{y}_{i,\tau_j} = y_i - \hat{\beta}_{0,\tau_j},
\tilde{y}_{\tau_j} = (\tilde{y}_{1,\tau_j}, \ldots, \tilde{y}_{n,\tau_j})^T,
W_{\tau_j} = \mathrm{diag}(w_{\tau_j}(y_1), \ldots, w_{\tau_j}(y_n)),
\hat{F}_{\tau_j} = (\hat{f}_{1,\tau_j}, \ldots, \hat{f}_{K,\tau_j}).

The values of the separate asymmetries are combined in a row-wise fashion such that \tilde{y} is an ((n \cdot J) \times 1) vector, \hat{F} is an ((n \cdot J) \times K) matrix and W is an ((n \cdot J) \times (n \cdot J)) diagonal matrix. With these definitions, Equation (11) can be transformed into

(\tilde{y} - \hat{F}\delta)^T W (\tilde{y} - \hat{F}\delta) \to \min_\delta,   (12)

and (12) can be solved with standard tools for quadratic problems. As in the separate case, the optimal \xi is computed via cross-validation.
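To make the stacked formulation concrete, the following toy sketch (hypothetical data, K = 2 components, three asymmetries) builds the row-wise stacked quantities as above. Since with K = 2 the constraints \delta_k \ge 0 and \delta_1 + \delta_2 = \xi leave a single free parameter, a one-dimensional grid search stands in for the quadratic-programming routine; all variable names are ours.

```python
import random

def w(tau, resid):
    # expectile weight: tau above the fit, 1 - tau below
    return tau if resid >= 0 else 1.0 - tau

def garrote_objective(delta, ytil, F, W):
    # (ytil - F delta)' W (ytil - F delta), with diagonal W stored as a list
    total = 0.0
    for yi, Fi, wi in zip(ytil, F, W):
        fit = sum(d * f for d, f in zip(delta, Fi))
        total += wi * (yi - fit) ** 2
    return total

random.seed(2)
n, taus, xi = 400, [0.2, 0.5, 0.8], 1.0
x1 = [random.uniform(1, 2) for _ in range(n)]
x2 = [random.uniform(1, 2) for _ in range(n)]
y = [2.0 * a + random.gauss(0, 0.3) for a in x1]  # x2 is pure noise

# stack the per-asymmetry pieces row-wise: ytil is (n*J) x 1, F is (n*J) x K
ytil, F, W = [], [], []
for tau in taus:
    for i in range(n):
        ytil.append(y[i])                      # centred response (beta_{0,tau} = 0 here)
        F.append([2.0 * x1[i], x2[i] - 1.5])   # stand-ins for fhat_{1,tau}, fhat_{2,tau}
        W.append(w(tau, y[i] - 2.0 * x1[i]))   # w_tau(y_i) from the initial fit

# K = 2: the simplex constraint leaves one free parameter, so a grid
# search over delta_1 replaces the quadratic-programming solver
grid = [xi * k / 200 for k in range(201)]
delta1 = min(grid, key=lambda d: garrote_objective([d, xi - d], ytil, F, W))
delta = [delta1, xi - delta1]  # the informative component keeps nearly all weight
```

With \xi = 1, essentially all weight is assigned to the informative component while the noise component is shrunk towards zero, which is exactly the selection behavior the garrote is designed to deliver.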

Boosting
As a competitor for the approaches introduced in this paper, we consider functional gradient descent boosting (Bühlmann and Hothorn, 2007; Hofner et al., 2014), which provides a generic way of minimizing the empirical loss \sum_i \rho(y_i, f(x_i)) with a pre-specified loss function \rho and a regression specification f(x_i). Boosting for each asymmetry parameter separately has been introduced to expectile regression by Sobotka and Kneib (2012), utilizing the asymmetrically weighted squared differences

\rho_\tau(y_i, f(x_i)) = w_\tau(y_i)(y_i - f(x_i))^2

as loss function (for further details see Sobotka and Kneib, 2012). Similarly, boosting was introduced to quantile regression by Fenske, Kneib and Hothorn (2011). They defined the loss function \rho as the check function

\rho_\tau(y_i, f(x_i)) = |\tau - 1(y_i < f(x_i))| \, |y_i - f(x_i)|.

Boosting also allows for semiparametric predictors and is therefore used as a benchmark in our simulation studies.
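A minimal sketch of the functional gradient descent idea with the expectile loss, using componentwise linear base-learners on simulated data, may clarify the mechanics. This is our own illustrative stand-in, not the mboost machinery used as benchmark; it is shown at \tau = 0.5, and a full implementation would also offer an intercept base-learner, smooth base-learners and cross-validated m_stop.

```python
import random

def w(tau, y, f):
    # expectile weight for the asymmetric squared loss
    return tau if y >= f else 1.0 - tau

def boost_expectile(X, y, tau, nu=0.1, m_stop=300):
    # componentwise linear functional gradient descent boosting for
    # the loss rho_tau(y, f) = w_tau(y) * (y - f)^2
    n, p = len(y), len(X[0])
    offset = sum(y) / n                 # start from the mean
    coef = [0.0] * p
    f = [offset] * n
    for _ in range(m_stop):
        # negative gradient of the empirical loss at the current fit
        u = [2.0 * w(tau, yi, fi) * (yi - fi) for yi, fi in zip(y, f)]
        # fit every covariate to u by simple least squares, keep the best
        best_sse, best_j, best_b = None, 0, 0.0
        for j in range(p):
            xj = [row[j] for row in X]
            b = sum(xv * uv for xv, uv in zip(xj, u)) / sum(xv * xv for xv in xj)
            sse = sum((uv - b * xv) ** 2 for uv, xv in zip(u, xj))
            if best_sse is None or sse < best_sse:
                best_sse, best_j, best_b = sse, j, b
        coef[best_j] += nu * best_b     # damped update of the selected component
        f = [offset + sum(c * xv for c, xv in zip(coef, row)) for row in X]
    return offset, coef

random.seed(3)
n = 500
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [2.0 * row[0] + random.gauss(0, 1) for row in X]  # second covariate is noise
offset, coef = boost_expectile(X, y, tau=0.5)
# the informative covariate dominates; the noise coefficient stays near zero
```

Because the update in each iteration only touches the single best-fitting component, early stopping of m_stop performs variable selection as a by-product, which is why boosting serves as a natural benchmark here.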

Simulation study
We conduct a simulation study that evaluates both the ability to identify relevant covariates and the ability to discriminate between linear and nonlinear effects of continuous covariates, in order to characterize the behavior of the different selection methods.

Design
For all scenarios considered in the following, we rely on the additive predictor

\eta_\tau = \beta_{0,\tau} + f_{1,\tau}(x_1) + f_{2,\tau}(x_2) + f_{3,\tau}(x_3) + f_{4,\tau}(x_4),

where all covariates x_1, x_2, x_3, x_4 are randomly drawn from a uniform distribution U(1, 2) and x_4 is always a noise covariate, i.e. f_{4,\tau}(x_4) \equiv 0. We then designed three different settings (see also Figure 2):
1. Parallel design: The effects are equal for all asymmetry parameters \tau (see Figure 2a).
2. Linear design: The effect size increases linearly with increasing asymmetry parameter \tau (see Figure 2b); the effects are constructed from q_{\tilde{\tau},1.5,0.5}, where q_{\tilde{\tau},mean,sd} is the \tilde{\tau}-quantile of N(mean, sd) and \tilde{\tau}_i = h(\tau_i), with h being the bijective function converting the \tau-quantile q_\tau to the h(\tau)-expectile e_{h(\tau)}.
3. Exponential design: With increasing asymmetry parameter \tau, the effect size increases, again constructed from q_{\tilde{\tau},1.5,0.5}. The effect is linear for x_1 and x_3, while it is exponential for x_2 (see Figure 2c).
Note that there is a fundamental difference in the way the data are generated in the parallel design as compared to the linear and the exponential design. For the construction of data sets with a pre-specified structure for the expectiles, we rely on a generalization of importance sampling where the quantile function is replaced by the expectile function. More precisely, we draw quantile levels randomly from the uniform distribution, i.e. \tau_i \sim U(0, 1) for observation i. Then the quantile level is transformed to the expectile asymmetry using the transfer function h such that \tilde{\tau}_i = h(\tau_i). The asymmetry is then plugged into the quantile function of the N(mean, sd) distribution, yielding q_{\tilde{\tau},mean,sd}. The mean of the normal distribution controls the overall size of the effect, while the standard deviation controls the overall variation of the data. The main advantage of this complex design is that we get varying but predictable effects for each asymmetry parameter. With sd = 0, this approach reduces to the parallel design, where an additional error term is still necessary.
All simulations are based on n = 2000 observations (results with n = 500 observations are available in the supplementary material at https://www.unigoettingen.de/en//511092.html, but the basic conclusions did not change) and the asymmetry parameters {0.02, 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.98}. For each scenario, we consider 100 replications. Cross-validation in all selection methods is based on 10 folds of equal size. For determining the area under the criterion curve and the average scores, we use a grid of 49 equidistant asymmetry parameters between 0 and 1. For the non-negative garrote, we used 49 grid points for \xi. Boosting is done with step size \nu = 0.1 and 10-fold in-bag cross-validation to find the optimal stopping iteration m_stop of the algorithm, with a maximum value of 4000. For the parallel design, we compare our methods with quantile boosting. Therefore, we adjust the asymmetry parameters with the transformation function from expectiles to quantiles based on the N(0, 4) distribution. The resulting quantile levels for quantile boosting are {0.070, 0.127, 0.194, 0.291, 0.500, 0.709, 0.806, 0.873, 0.930}. All covariates were included as cubic P-splines with 20 inner knots and a second-order difference penalty. The smoothing parameter was estimated via the Schall algorithm (see Sobotka and Kneib (2012) and Schnabel and Eilers (2009) for further details). For the distinction between linear predictors and smooth functions, we consider the different possibilities discussed in Section 3.1. Since the results for "restricted" and "complete" discrimination between linear and semiparametric predictors are similar, we will only discuss the "complete" separation in detail; the other results are presented in the supplementary material. The "complete" separation of the linear effect and the nonlinear deviation from the linear trend has the advantage that both can be selected independently by the algorithms.
Thus either both parts may be included in the best model, or only one of them, or neither. Therefore the sum of the selection frequencies of the linear and the nonlinear part of a covariate effect varies between 0 and 2.
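The quantile levels quoted above for the N(0, 4) comparison can be reproduced numerically: for a normal distribution, the map between expectile asymmetry and quantile level follows from the first-order condition of the expectile and is invariant to location and scale, so the standard normal suffices. A small sketch using only the Python standard library (function names are ours):

```python
from statistics import NormalDist

N01 = NormalDist()

def expectile_level_at(z):
    # level tau such that z is the tau-expectile of the standard normal:
    # tau * E[(Y - z)+] = (1 - tau) * E[(z - Y)+]
    lower = z * N01.cdf(z) + N01.pdf(z)           # E[(z - Y) 1{Y <= z}]
    upper = N01.pdf(z) - z * (1.0 - N01.cdf(z))   # E[(Y - z) 1{Y > z}]
    return lower / (lower + upper)

def quantile_level_of_expectile(tau):
    # quantile level p with q_p = e_tau; expectile_level_at is increasing,
    # so plain bisection recovers the matching z
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if expectile_level_at(mid) < tau:
            lo = mid
        else:
            hi = mid
    return N01.cdf(0.5 * (lo + hi))
```

For example, quantile_level_of_expectile(0.02) is approximately 0.070 and quantile_level_of_expectile(0.05) approximately 0.127, matching the quantile levels used for quantile boosting above.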
All estimations and selection methods were implemented in R (R Core Team, 2017) using the R package expectreg (Sobotka et al., 2016). The applied version is available in the supplementary material.

Results
In the parallel design, the basic sensitivity of the selection methods is analyzed. In Figure 3, the frequencies of selection for the different coefficients are plotted depending on the asymmetry parameter.
The different characteristics of the selection approaches are discussed in the following, starting with the selection approaches for each asymmetry parameter separately.
• Separate selection:
center vs. tails of the distribution: Overall, AIC-based selection strongly depends on the current asymmetry parameter τ, while the non-negative garrote and boosting are less dependent on τ, at least for the strong effects, and CV is nearly independent of τ.
informative vs. noise covariate: The non-informative linear covariate x_4 is constantly excluded in 80%-90% of the cases by CV, the non-negative garrote and boosting, while AIC excludes it in only 60%-70% of the replications. All approaches depend on the intensity of the informative covariates: the strong covariate x_1 is almost always included, while for the less influential covariate x_3 the general pattern of the dependence on the asymmetry parameter is most visible, even though the size of the parameter should be constant over all asymmetry parameters.
linear vs. nonlinear covariate: The data are simulated without any nonlinear part, so this part should be excluded for all intensities of the covariates and all asymmetries. However, the approaches behave differently concerning these nonlinear parts. While CV excludes the nonlinear part in most cases, boosting includes it relatively often for all asymmetry parameters. In contrast, the non-negative garrote is more likely to exclude the penalized part in the center of the distribution than in the tails. This is similar for AIC selection, but there the frequency of excluding the noisy nonlinear part is rather low. We can therefore conclude that the known problem of nonlinear selection with the AIC (see for example Saefken et al., 2014) is even more pronounced for expectiles beyond the mean.
• Joint selection: The joint approaches stabilize the selection results, but the overall sensitivities of the separate selection methods stay the same. This means that joint CV, i.e. scoring, includes few noise covariates and is most restrictive concerning nonlinear parts. The area under the AIC curve includes linear covariates at a rate comparable to scoring, but it almost never excludes a nonlinear covariate. The non-negative garrote on the grid is also more stable, i.e. the exclusion rate of noisy linear covariates is on the level of scoring, while the rate for the wiggly deviation is slightly higher than for scoring, but much lower than for the AIC.
• quantile boosting: In this simulation study, the results of expectile boosting and quantile boosting with transformed asymmetry parameters are approximately the same.
The other design discussed here is the exponential design. Since the selection frequencies of the linear design behave similarly to those of the third (exponential) design, the corresponding frequency plots can be found in the supplementary material (see Figure A.2). Unlike in the parallel design, we cannot calculate the corresponding quantile levels for the heteroscedastic designs. Thus the results cannot be directly compared to quantile boosting.
As x_1 has a strong effect in the parallel case as well as in the one-sided case, there are only minor changes in the behavior of the selection methods. The same is valid for the noise effect. Since the selection methods behave similarly as in the parallel setting, those frequencies are not shown in Figure 4, which illustrates the selection frequencies of the linear and nonlinear parts of x_2 and x_3.
• linear effect: The strength of the small linear effect x_3 β_{3,τ} varies with the asymmetry parameter. Thus the selection frequency of the unpenalized part should increase with increasing asymmetry parameter (for the separate selection per asymmetry parameter). This behavior can indeed be detected in the plots. However, for the non-negative garrote, boosting and AIC, we observe a decrease in the selection frequencies for very large asymmetry parameters. This is caused by the general restrictiveness of these methods in the tail of the distribution, similarly as detected in the parallel design.
• nonlinear effects: The nonlinear part of x_2 is also of major interest in this scenario. Using the non-negative garrote and CV, the selection frequency for very small asymmetry parameters is not as high as expected. This is related to the fact that the selection frequency of the linear part is higher than expected. Thus the nonlinear trend is approximated with a linear function in 10-20% of the cases. With increasing asymmetry parameter, the selection frequency of the nonlinear part increases up to the 50% expectile. Beyond that, it decreases as expected for the non-negative garrote and CV. For the AIC, the unpenalized part is selected into the best model more frequently in the upper part of the distribution, but this could be expected from the parallel design. The unusual shape of the selection frequency of the linear part of x_2 can be explained by the design of the data sets. For the upper part of the distribution, the linear parts of x_2 and x_3 should be similar, which is confirmed by the simulations. For the bottom part, a higher selection frequency than expected is visible due to the use of a linear trend in the estimation of the nonlinear part.
We conclude that for the selection of the linear part, CV, boosting and the non-negative garrote behave well, while for the nonlinear part only CV is approximately independent of the current asymmetry parameter. However, the non-negative garrote is also reliable in the center of the distribution. The joint selection approach over all asymmetry parameters reduces the selection of linear parts of noise covariates for all selection types. Furthermore, it is advantageous for the selection of nonlinear parts with CV and the non-negative garrote.

Data structure
In the following, we apply the selection methods to a data set on childhood malnutrition in Peru. As an indicator for the nutritional status, we rely on the stunting score

zscore_i = (height_i - median) / \sigma,

where the standardized height of a child is compared to the median height of children in a healthy comparison group given the child's age and gender, and \sigma is a robust estimate for the standard deviation of the children's height in the comparison group. Stunting reflects chronic undernutrition, which results in a lack of growth, and a score of less than -2 indicates severe chronic undernutrition (WHO Expert Committee on Physical Status, 1995). As can be seen from Table 1, severe undernutrition occurs regularly, but the mean nutritional status is above the cut-off value of -2. When analyzing childhood malnutrition, it is particularly interesting to obtain covariates associated with the lower part of the stunting distribution, since these are especially important determinants of chronic malnutrition. At the same time, discriminating between informative and uninformative covariates considerably assists in identifying sparse and interpretable models. Our analyses rely on the Demographic and Health Surveys (DHS) data set of Peru from 2012 (Instituto Nacional de Estadistica e Informatica (INEI) Lima Peru and ICF International Calverton Maryland USA [Producer], 2012). Using only complete cases, we obtain a data set of n = 8391 children from the original 9620 observations.
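For concreteness, the stunting score is simply a standardized height. A tiny sketch with hypothetical reference values (the actual DHS reference tables depend on the child's age and gender):

```python
def stunting_zscore(height, ref_median, ref_sigma):
    # standardized height relative to the healthy comparison group
    return (height - ref_median) / ref_sigma

def severely_undernourished(zscore):
    # cut-off of -2 for severe chronic undernutrition (WHO, 1995)
    return zscore < -2.0

# hypothetical example: an 80 cm child whose reference group has
# median height 87.1 cm and robust standard deviation 3.0 cm
z = stunting_zscore(80.0, 87.1, 3.0)  # about -2.37, i.e. severely stunted
```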
From the original data base, we determined 21 potential covariates including (i) characteristics of the child such as age, gender, region of living (25 districts), duration of breastfeeding and birth order, (ii) characteristics of the mother such as age at birth, education, height and body mass index, (iii) household information such as the household size, the education of the partner, if there had been dead children in the family and (iv) variables related to the wealth of the family, i.e. several indicators for the presence/absence of specific assets.
Besides the estimation of effects beyond the mean, we would like to use flexible components such as P-splines or Gaussian Markov random fields to include nonlinear effects of continuous covariates and spatial effects. Those could also be estimated in quantile regression, but would be computationally demanding there. Therefore we apply expectile regression, where we can directly make use of the tools of standard least squares regression and are thus computationally faster. For comparison, we also estimated a model based on quantile boosting, but the results cannot be directly compared since the asymmetry levels of quantiles and expectiles differ. In the simulation study we knew the underlying distribution and could therefore correct the asymmetry parameters accordingly, but this is not possible in the empirical example. Nevertheless, we applied quantile boosting to the Peru data with asymmetry levels obtained when assuming Gaussian distributed data. The results are in general similar to those of expectile boosting and are therefore only included in the supplementary material.
As in the simulations, estimation is performed based on the LAWS criterion in combination with the Schall algorithm (see Sobotka and Kneib (2012) for further details). Spatial information was included as a Gaussian Markov random field and the continuous covariates were included as cubic P-splines with 20 inner knots and a second-order difference penalty. In the selection, the "complete" discrimination between the linear trend and the nonlinear deviation from this trend was used.

Selected models
We only discuss the results of stepwise forward 10-fold cross-validation (CV) and 10-fold scoring with a grid of 49 asymmetry parameters. The other results (stepwise forward selection with AIC, area under the AIC curve, non-negative garrote, non-negative garrote on the grid and boosting) are summarized as tables in the supplementary material. Table 2 summarizes the selected covariates for CV and scoring. A black box indicates that the covariate is selected in the best model, while a blank signals that the covariate was not included. Obviously, several covariates are never or almost never included in the best model for the CV approach (e.g. bicycle, caesarian, electricity). These covariates are also not included in the best scoring model. Thus we can conclude that they do not have a relevant influence on the nutritional status. On the other hand, there are several covariates which are (almost) always included in the best CV model and also included in the scoring model (e.g. mother's education, refrigerator, television, region). Those covariates do have a relevant influence on the complete distribution of the nutritional status. Furthermore, there are several covariates that are selected in the scoring model, but only in some separate CV models (e.g. sex, partner's education). Those coefficients should then only be interpreted in the part of the distribution where they are selected. Thus the sex of a child only has an influence on the bottom part of the distribution, i.e. on the undernourished children.
Additionally, there are covariates where no global trend is available, but a relevant deviation from the constant zero is detected, e.g. duration of breastfeeding. This means the influence of breastfeeding fluctuates around zero and has only a relevant size for the upper part of the distribution. Furthermore, the availability of a TV, a refrigerator, a motorcycle and a telephone are decisive indicators for the wealth of the family. The occurrence of a bicycle, a radio or electricity do not seem to explain the wealth of the family and thus the nutritional status of the child.
An advantage of interpreting the scoring approach as the area under the MWSE curve is that weights on more interesting parts of the distribution can be applied. In a second analysis, we applied a weight of 10 to all asymmetry parameters smaller than 0.11 to emphasize the importance of the lower part of the distribution. The resulting selection differs only marginally from the unweighted one (see Table A.7 of the supplementary material); here, the mother's age has a nonlinear effect in the best model. Due to the marginal differences, our unweighted optimal model also seems to be appropriate for the undernourished children.
An alternative to comparing the best scoring model with the separate CV model of all covariates would be to use the result of the scoring model and apply separate stepwise backward CV selections to the selected covariates. Then the selected covariates of the backward selection are treated as relevant for this part of the distribution. The result of this approach can be seen in Table A.8 in the supplementary material. However, the asymmetry parameters where the covariates are excluded are very similar to those of the standard CV approach.
For a final comparison we also estimated bootstrap confidence intervals for the saturated model (Sobotka et al., 2013) and determined simultaneous confidence bands (SCB) following a method proposed in Krivobokova, Kneib and Claeskens (2010), which was originally designed to obtain simultaneous Bayesian credible bands and is implemented in the R package acid (Sohn, 2016). Although the estimated confidence intervals provide more information about the variance of the estimated effects and are not designed for model selection (compare Burnham and Anderson, 2002), model selection was implemented by checking which confidence bands completely cover the zero line. The resulting "selection" table (Table A.6 in the supplementary material) is in general similar to that of stepwise forward selection with 10-fold cross-validation (Table 2). However, minor differences appear, and only the nonlinear part of the child's age is significant using the simultaneous confidence bands, while the other nonlinear functions are not.

Effects
In our analysis we first determined which covariates have a relevant influence on the response by model selection. Besides this abstract examination, the estimated effects also tell us the direction and size of the influence on the response. In Figures 5, 6 and 7 the estimated effects of the model optimized via scoring are plotted. The categorical covariate effects are visualized in Figure 5. Here we see that being female is associated with an increase in the nutritional status in the bottom part of the distribution. Having a TV is also associated with an increase in the nutritional status, more strongly in the bottom part of the distribution. On the other hand, having a telephone has a larger influence on the upper part of the distribution. Moreover, a higher education of the mother always increases the zscore, but the effect is stronger for undernourished children. Aside from that, children in families with many members tend to have a worse zscore, which decreases even more if they are not the first child of the mother. For the effects of continuous covariates shown in Figure 6, we find that the mother's age, bmi and height have a linear influence on the response. In addition, the estimated effects of these three covariates are almost parallel when comparing different asymmetry parameters, such that their influence is approximately the same for all nutritional statuses. With increasing age, height and bmi, the zscore of the child tends to be higher. Here the effect of bmi indicates the current nutritional status of the whole family, while the mother's height influences the zscore in two ways: first genetically, since tall mothers tend to have tall children and the zscore is a linear function of the child's height; and secondly, if the mother did not have enough food as a child, she stayed smaller, and since poor families often stay poor, the child may also not get enough food. The child's age has a strong influence on the nutritional status.
Here we see a steep decrease between 10 and 18 months; afterwards the child's zscore is not affected by age anymore. This steep decrease is visible for all parts of the distribution, while it is considerably more pronounced in the lower part of the stunting distribution. Finally, the regional effects of the 5%, 50% and 95% expectiles are shown in Figure 7 (for the other asymmetry parameters see Figure A.25 in the supplementary material). For all asymmetry parameters, the regional distribution is nearly the same. Regions next to the Chilean border in the south and next to the capital Lima, in the middle of the Pacific coastline, are associated with a higher nutritional status. Furthermore, the zscore of children living in the Andes is generally found to be lower.

Conclusions
In this paper, we considered several approaches for model selection in semiparametric expectile regression, differentiating between the separate selection for specific asymmetries and the selection of covariates for the complete response distribution. Moreover, we allowed for the separation between linear and nonlinear effects of continuous covariates. Generally, we showed that model selection strongly depends on the asymmetry parameter. Thus joint model selection for the whole distribution may be advantageous if the complete distribution of the response is indeed to be analyzed. This approach also reduces the number of included noise covariates.
Overall, the different selection methods all have their advantages. Stepwise AIC selection is the most intuitive method and computationally moderately fast. It is reliable for the selection of linear predictors, but it selects smooth predictors rather poorly. In contrast, CV is appropriately restrictive in selecting both linear and smooth predictors, but it is computationally the most demanding approach.
Here the non-negative garrote has its big advantage: it is computationally much faster. Its selection properties are not as good as those of CV, but it is a reliable alternative. Finally, boosting is a good way of selecting linear predictors, but it is not restrictive enough for smooth predictors.
In sum, it is possible to decide whether a spline is necessary by splitting the spline into its linear trend and the wiggly deviation from this trend. Those terms can then be selected separately with CV or the non-negative garrote. Again, the joint selection approaches stabilize the performance, but the area under the AIC curve tends toward the upper bound of the separate selection frequencies, while scoring and the non-negative garrote on a grid tend toward the lower one.
Our example shows that our model selection approaches work for analyses beyond the mean with a moderate number of covariates.
Besides the approach based on the mixed model decomposition of P-splines, further methods to check whether the smooth part is necessary are conceivable. For instance, a covariate could be expanded in a sequence of basis functions, to which an L1 penalty could be applied to select the relevant ones.
A sensible model selection technique is necessary to control not only the number of included covariates, but also the overall model complexity. So far we can decide how to include a continuous covariate, either restricted to a linear function or flexibly modeled by a P-spline basis. However, spatio-temporal regression models are becoming more and more common. While an interaction between spatial or regional information and additional information for multiple points in time offers a very flexible model, it also strongly increases model complexity. Especially if we aim to estimate more than the mean, it will be very helpful to have a reliable measure to decide on additional model complexity.