Reducing bias and mitigating the influence of excess of zeros in regression covariates with multi-outcome adaptive LAD-lasso

Abstract Zero-inflated explanatory variables, as opposed to outcome variables, are common, for example, in the environmental sciences. In this article, we address the problem of an excess of zero values in some continuous explanatory variables that are subject to multi-outcome lasso-regularized variable selection. In short, the problem results from the failure of lasso-type shrinkage methods to recognize any difference between a zero value occurring in the regression coefficient and one occurring in the corresponding value of the explanatory variable. This kind of confounding will obviously increase the number of false positives: not all non-zero regression coefficients necessarily represent true outcome effects. We present here the adaptive LAD-lasso for multiple outcomes, which extends the earlier work on the multi-outcome LAD-lasso with adaptive penalization. In addition to the well-known property of yielding less biased regression coefficients, we show that the adaptivity also improves the method's ability to recover from the influence of an excess of zero values in continuous covariates.


Introduction
In high-dimensional regression problems, the number of parameters is often larger than the number of individuals in the sample (p ≫ n). Ordinary least squares and other conventional estimation methods do not work in such situations. Simultaneous estimation and variable selection using the lasso (least absolute shrinkage and selection operator) (Tibshirani 1996; Li and Sillanpää 2012) is a popular shrinkage-estimation approach for obtaining sparse estimates of regression coefficients in high-dimensional regression problems with desirable statistical properties (Fan and Li 2001; Fan and Peng 2004). However, a generally known drawback of shrinkage-inducing methods, including the lasso, is that they improve estimation accuracy but also introduce downward bias into the estimates (due to the bias-variance tradeoff). To alleviate this problem, a re-estimation procedure has been proposed, where the effects of selected predictors are re-estimated without penalty (Efron et al. 2004; Meinshausen 2007). The adaptive lasso has also been presented (Zou 2006), in which the selected predictors are subject to a reduced penalty and non-selected positions obtain a heavier penalty. This is performed using either a two-stage or an iterative estimation strategy. See also related work on the thresholded lasso (Zhou 2010; Van De Geer, Bühlmann, and Zhou 2011). One would expect to see larger variance as a consequence of reduced bias in the adaptive lasso. However, it has been observed that the opposite can happen in the adaptive lasso context, meaning that we may see improved accuracy together with reduced bias, especially in the p ≫ n situation (Huang, Ma, and Zhang 2008).
When the data set contains some outlying observations, it is possible to apply robust least absolute deviation (LAD) regression instead of ordinary regression. In the lasso context, the LAD-lasso has been proposed for univariate (Wang, Li, and Jiang 2007) and multi-outcome cases (Möttönen and Sillanpää 2015; Li, Möttönen, and Sillanpää 2015). The multi-outcome LAD-lasso is related to the group lasso (Yuan and Lin 2006) because the objective function of the multi-outcome LAD-lasso also contains a group lasso penalty. Like common shrinkage-inducing methods, LAD-lasso methods also suffer from downward bias in the estimates. To alleviate this, an adaptive LAD-lasso has been proposed for the univariate LAD-lasso (Arslan 2012).
In this article, we study the multi-outcome adaptive LAD-lasso and make comparisons to the non-adaptive or standard version, which we refer to as the ordinary LAD-lasso. Note that the adaptive group lasso has been presented by Wang and Leng (2008), and it has already been used in an empirical application modeling old-age retirement (Lähderanta et al. 2022).
In our earlier work with empirical data, we located a potential problem in lasso methods in general when there is an excess of real zero values in some covariates (Lähderanta et al. 2022). We describe this problem here in more detail for the LAD-lasso and provide some potential solutions using a controlled simulation and a real data example. In short, the problem occurs because lasso-type shrinkage methods do not recognize any difference between a zero value occurring in the regression coefficient and one occurring in the corresponding value of the covariate. As a result, there is a danger of misinterpretation because not all non-zero regression coefficients necessarily represent true effects on the outcomes. As a potential solution, we show empirically how the use of the adaptive LAD-lasso minimizes this kind of confounding in practice.
In this article, we show the potential benefits of the multi-outcome adaptive LAD-lasso methods with three examples. In the first example, we utilize a public genotype data set (Crooks et al. 2009) to simulate new phenotypes with tri-variate traits. In the second example, we simulate several general data sets with zero-inflated (excess of zeros) values in continuous explanatory variables, which is typical in many fields of science. Examples of zero-inflated data can be found in the environmental sciences (Maldonado, Aguilera, and Salmerón 2016; Kamarianakis et al. 2008). We demonstrate that this problem can be mitigated with the adaptive version of the lasso. In the third example, we use a real-life data set to study the main socioeconomic factors that affect the pensions and lengths of working lives of disability pension retirees, both of which have highly skewed outcome distributions.

Methodology
In this section, we present the multi-outcome ordinary LAD-lasso (Möttönen and Sillanpää 2015;Li, Möttönen, and Sillanpää 2015) and the novel adaptive version of the method.

Consider the multiple regression model with multiple outcomes

$$ Y = XB + E, $$

where Y = (y_1, ..., y_n)' is an n × q matrix of n observed values of q outcome variables, X = (x_1, ..., x_n)' is an n × p matrix of n observed values of p explanatory variables, B = (β_1, β_2, ..., β_p)' is a p × q matrix of regression coefficients, and E = (ε_1, ..., ε_n)' is an n × q matrix of residuals. We further assume that ε_1, ..., ε_n is a random sample of size n from a q-variate distribution centered at the origin.
The multi-outcome lasso estimation method is based on the penalized objective function (e.g., Turlach et al. 2005; Yuan and Lin 2006; Yuan et al. 2007)

$$ D(\lambda, B) = \sum_{i=1}^{n} \|y_i - B'x_i\|^2 + n\lambda \sum_{j=2}^{p} \|\beta_j\|. \qquad (1) $$

The minimizer of the objective function (1) gives the multi-outcome lasso estimate of the regression coefficient matrix B.
Note that Equation (1) is also an objective function of the group lasso, where each of the p − 1 explanatory variables (the intercept terms are omitted) has q regression coefficients (one for each outcome variable), which form a group. The multi-outcome lasso method gives sparse solutions, but it is obviously not very robust. A more robust version of the multi-outcome lasso, the multi-outcome LAD-lasso (Möttönen and Sillanpää 2015; Li, Möttönen, and Sillanpää 2015), can be obtained by replacing the squared norm with an L1 norm in the objective function (1):

$$ D(\lambda, B) = \sum_{i=1}^{n} \|y_i - B'x_i\| + n\lambda \sum_{j=2}^{p} \|\beta_j\|. \qquad (2) $$

It is easily seen that if we define the augmented observations

$$ x_{n+i-1} = n\lambda\, e_i \quad \text{and} \quad y_{n+i-1} = 0, \qquad i = 2, \ldots, p, $$

where e_i is a p × 1 unit vector with the ith element equal to one and all other elements zero, and 0 is a q × 1 vector of zeros, the objective function (2) reduces to the LAD estimation objective function (Oja 2010) of the form

$$ D(\lambda, B) = \sum_{i=1}^{n+p-1} \|y_i - B'x_i\|, $$

which shows that we can use any multi-outcome LAD regression estimation routine to find the multi-outcome LAD-lasso estimates. One can, for example, use the function mv.l1lm of the R package MNM (Nordhausen, Möttönen, and Oja 2016; Nordhausen and Oja 2011).
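The augmented-data reduction can be sketched in a few lines of code. The following is an illustrative NumPy sketch (not the authors' R implementation; the function names are our own): it evaluates the penalized objective and builds the augmented data whose plain LAD objective equals it.

```python
import numpy as np

def lad_lasso_objective(B, X, Y, lam):
    """Multi-outcome LAD-lasso objective: sum of row-wise residual
    norms plus a group-lasso penalty on the coefficient rows
    (the first row, the intercept, is left unpenalized)."""
    n = X.shape[0]
    fit = np.linalg.norm(Y - X @ B, axis=1).sum()
    penalty = n * lam * np.linalg.norm(B[1:], axis=1).sum()
    return fit + penalty

def augment(X, Y, lam):
    """Append the pseudo-observations (n*lam*e_i, 0) for each
    penalized coefficient row, so that the penalized objective
    becomes a plain multi-outcome LAD objective."""
    n, p = X.shape
    q = Y.shape[1]
    X_extra = n * lam * np.eye(p)[1:]   # one scaled unit row per penalized row of B
    Y_extra = np.zeros((p - 1, q))
    return np.vstack([X, X_extra]), np.vstack([Y, Y_extra])
```

On the augmented data, any multi-outcome LAD routine (such as mv.l1lm in the MNM R package) then minimizes exactly the penalized objective.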

Multi-outcome adaptive LAD-lasso
The multi-outcome LAD-lasso estimate for a fixed λ can be defined as

$$ \hat{B}(\lambda) = \arg\min_{B} D(\lambda, B), $$

where argmin stands for "the value minimizing". This brings up the question of how to choose the tuning parameter λ. If we are mainly concerned about recovering the right model, then we can use, for example, Akaike's information criterion (AIC) or the Bayesian information criterion (BIC). On the other hand, if we are mainly concerned about prediction accuracy, then the cross-validation (CV) technique is often a good choice. For a general survey of CV methods, see Arlot and Celisse (2010). We continue this study with a CV criterion, as robust AIC or BIC are not readily available in the multi-outcome LAD-lasso context. We have used fivefold cross-validation, where the original sample of n observations is randomly partitioned into five approximately equal-sized subsamples. Let C_j denote the set of indices of the observations that belong to the jth subsample. The CV criterion is then defined as the mean absolute error over the five subsamples,

$$ \mathrm{CV}(\hat{B}(\lambda)) = \frac{1}{5} \sum_{j=1}^{5} \frac{1}{n_j} \sum_{i \in C_j} \| y_i - \hat{B}(-j)' x_i \|, $$

where n_j is the number of observations in the jth subsample and B̂(−j) is the multi-outcome LAD-lasso estimate computed using all but the observations in the jth subsample. For a BIC-like criterion in the multi-outcome LAD-lasso context, see Möttönen and Sillanpää (2015).
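The fivefold CV criterion can be sketched as follows (an illustrative NumPy sketch; `fit` is a hypothetical stand-in for any multi-outcome LAD-lasso solver, not part of the paper).

```python
import numpy as np

def cv_criterion(fit, X, Y, lam, n_folds=5, seed=1):
    """Mean absolute prediction error over held-out folds.
    `fit(X, Y, lam)` must return a p x q coefficient matrix."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)  # the index sets C_j
    errors = []
    for idx in folds:
        mask = np.ones(n, dtype=bool)
        mask[idx] = False                    # drop the jth subsample
        B_hat = fit(X[mask], Y[mask], lam)   # estimate B(-j)
        resid = Y[idx] - X[idx] @ B_hat
        errors.append(np.linalg.norm(resid, axis=1).mean())
    return float(np.mean(errors))
```

The tuning parameter is then chosen by minimizing this criterion over a grid of λ values.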
Let

$$ \hat{\lambda} = \arg\min_{\lambda} \mathrm{CV}(\hat{B}(\lambda)) $$

be the value of the tuning parameter λ which minimizes the cross-validation criterion function for the multi-outcome LAD-lasso. The multi-outcome LAD-lasso estimate can then be defined as B̂ = B̂(λ̂).
It has been shown that lasso estimation tends to underestimate the regression coefficients, and the same is true for the multi-outcome LAD-lasso estimate B̂. Zou (2006) proposed the adaptive lasso method, which gives an estimate whose bias is smaller than that of the ordinary lasso. Similarly, the adaptive method can also be used in the LAD-lasso (Arslan 2012). We extend this method here to the multi-outcome LAD-lasso case. Since the performance of the adaptive LAD-lasso might be sensitive to the initial weights, we propose an iterative k-step adaptive estimation method. The sth (s = 1, ..., k) step multi-outcome adaptive LAD-lasso estimate for fixed λ is defined as

$$ \hat{B}^{(s)}(\lambda) = \arg\min_{B} \left\{ \sum_{i=1}^{n} \|y_i - B'x_i\| + n\lambda \sum_{j=2}^{p} \frac{\|\beta_j\|}{\|\hat{\beta}_j^{(s-1)}\|} \right\}, $$

where β̂_j^(s−1) is the jth row of the (s − 1)th step multi-outcome LAD-lasso estimate B̂^(s−1) = B̂^(s−1)(λ̂^(s−1)) and λ̂^(s−1) = argmin_λ CV(B̂^(s−1)(λ)). The multi-outcome LAD-lasso estimate B̂ is used as the initial value B̂^(0). It can easily be shown that the sth step multi-outcome adaptive LAD-lasso estimate for fixed λ can be written in the multi-outcome LAD regression form

$$ \hat{B}^{(s)}(\lambda) = \arg\min_{B} \sum_{i=1}^{n+p-1} \|y_i - B'x_i\|, $$

where the augmented observations are now

$$ x_{n+i-1} = \frac{n\lambda}{\|\hat{\beta}_i^{(s-1)}\|}\, e_i \quad \text{and} \quad y_{n+i-1} = 0, \qquad i = 2, \ldots, p. $$

The estimate B̂^(s)(λ) can then be found using the function mv.l1lm of the R package MNM (Nordhausen, Möttönen, and Oja 2016; Nordhausen and Oja 2011).
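One adaptive step can be sketched in code as follows. This is an illustrative NumPy sketch under our own naming: `lad_fit` stands in for a multi-outcome LAD solver such as mv.l1lm, and the small `eps` guard (our addition) protects against division by coefficient rows that are exactly zero.

```python
import numpy as np

def adaptive_step(B_prev, X, Y, lam, lad_fit, eps=1e-8):
    """One step of the k-step adaptive scheme: penalize row j with
    weight n*lam / ||beta_j^(s-1)||, then refit on augmented data."""
    n, p = X.shape
    q = Y.shape[1]
    # large previous coefficients -> small penalty, and vice versa
    w = n * lam / (np.linalg.norm(B_prev[1:], axis=1) + eps)
    X_extra = np.eye(p)[1:] * w[:, None]   # row i is w_i * e_i
    Y_extra = np.zeros((p - 1, q))
    return lad_fit(np.vstack([X, X_extra]), np.vstack([Y, Y_extra]))
```

Iterating this step k times, refitting λ by CV between steps, yields the k-step adaptive estimate.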

Simulation studies
In this section, we present two simulation experiments with simulated data sets to provide insights into the multi-outcome adaptive LAD-lasso in certain scenarios. R code for the simulations can be found on GitHub (Lähderanta and Möttönen 2021).

Bias reduction
The first experiment stems from the medical sciences, where single- and multi-trait quantitative trait locus (QTL) mapping is commonly applied. In QTL mapping, for each individual in the population sample, we have quantitative trait phenotypes and genotype patterns measured at a high number of marker loci distributed over the genome. Based on these data, the target is to find small subsets of trait-associated loci out of many candidates. The trait-associated findings, as well as the true causative loci, are both interchangeably called QTLs. More details on QTL mapping can be found in Balding (2006) and Li and Sillanpää (2012).
As a first step, we simulated new phenotypes with tri-variate traits using the public genotype data set from the 12th QTL-MAS workshop held in Uppsala, Sweden, in 2008 (Crooks et al. 2009). The original genotype data set contains 5865 individuals and 6000 markers. We took a random sample of 300 individuals and chose every 30th marker, resulting in a total of 200 markers. We then simulated tri-variate traits using the multi-outcome multiple regression model Y = XB + E, where Y is a 300 × 3 matrix of tri-variate traits, X is a 300 × 200 matrix of genotype codes, B is a 200 × 3 matrix with four QTLs indicated as the non-zero rows β_50 = (1.5, 1.5, 1.5), β_75 = (0, 0.5, 1.5), β_100 = (0.5, 0.5, 0.5) and β_150 = (1, 0.5, 0.5), and E is a 300 × 3 matrix with i.i.d. normally distributed rows. We estimated the tuning parameter λ using fivefold cross-validation.

In Figure 1a, we show the marker effects β̂_ij corresponding to the ordinary LAD-lasso. We see that the LAD-lasso method correctly finds all four QTLs. In the next step, we studied the biases of the LAD-lasso estimates of the non-zero coefficient vectors β_50, β_75, β_100 and β_150; it is clear that the ordinary LAD-lasso estimates are biased in most cases. In the final step, we used the multi-outcome adaptive LAD-lasso estimation method. Figure 2a shows that the biases decrease and the regression coefficients stabilize after some adaptive steps. We can see from Figure 2b that small biases remain after five adaptive steps, but they are smaller than the biases corresponding to the ordinary LAD-lasso. Note that with the adaptive LAD-lasso the accuracy of the estimates has improved on average. In Figure 1b, we show the marker effects β̂_ij corresponding to the adaptive LAD-lasso with five steps. We see that the multi-outcome adaptive LAD-lasso method correctly finds the true QTLs. Note that the true zeros indicated by the adaptive technique are not shown in Figure 1b.
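The simulation design above can be sketched as follows. Since the original QTL-MAS genotypes are not bundled here, the genotype matrix is drawn at random as illustrative 0/1/2 allele counts, and an identity residual covariance is assumed; both are our assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, q = 300, 200, 3

# stand-in genotype matrix (the paper subsamples real QTL-MAS data)
X = rng.integers(0, 3, size=(n, p)).astype(float)

# four causal markers (QTLs) as non-zero coefficient rows
B = np.zeros((p, q))
B[49] = [1.5, 1.5, 1.5]    # beta_50
B[74] = [0.0, 0.5, 1.5]    # beta_75
B[99] = [0.5, 0.5, 0.5]    # beta_100
B[149] = [1.0, 0.5, 0.5]   # beta_150

# i.i.d. trivariate normal residuals and the simulated traits
E = rng.normal(size=(n, q))
Y = X @ B + E
```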

Excess of zeros in covariates
In the next experiment, we simulate another kind of data set to demonstrate the excess-of-zeros-in-covariates scenario with the multi-outcome adaptive LAD-lasso and the multi-outcome ordinary LAD-lasso. Results from multiple data sets are shown to illustrate the performance of the algorithms in a wide variety of situations. In practice, we generate two types of data sets: one with a high number of observations compared to covariates (n = 100, p = 10), and conversely one with a high number of covariates compared to observations (n = 25, p = 50). Moreover, we vary the proportion of zeros (π_zeros) in the covariates from 0.1 to 0.4 and apply two different error distributions for E: uniform and asymmetric Laplace. When all combinations of these properties are considered, 16 different types of scenarios are examined in total. In each scenario, we simulate 100 data sets to further evaluate the performance of the methods. The data sets and the varying proportion of zeros are shown in Figure 3.
In practice, the data sets are generated with a multi-outcome regression model with a covariate matrix X of size n × p and a response matrix Y of size n × 2. The observed values x_ij are simulated from a zero-inflated normal distribution,

$$ x_{ij} = \begin{cases} 0, & \text{if } u_{ij} < \pi_{\text{zeros}}, \\ z_{ij}, & \text{otherwise}, \end{cases} \qquad z_{ij} \sim N(0, 1), $$

where u_ij ∼ Unif(0, 1) and π_zeros is the proportion of zeros. B is a p × 2 matrix where b_jk ∼ N(0, 1) when j = 1, 2, 3, and b_jk = 0 otherwise. E is an n × 2 matrix whose elements are drawn either from a uniform distribution or from an asymmetric Laplace distribution, depending on the choice of error distribution. Above, Unif refers to the uniform distribution and ALaplace stands for the asymmetric Laplace distribution.
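A minimal NumPy sketch of this data-generating mechanism, using the uniform-error scenario (the distribution parameters and variable names are our illustrative choices):

```python
import numpy as np

def simulate_zero_inflated(n, p, pi_zeros, rng):
    """Each x_ij is zero with probability pi_zeros (via u_ij ~ Unif(0,1))
    and drawn from N(0, 1) otherwise."""
    u = rng.uniform(size=(n, p))
    z = rng.normal(size=(n, p))
    return np.where(u < pi_zeros, 0.0, z)

rng = np.random.default_rng(7)
n, p = 100, 10
X = simulate_zero_inflated(n, p, 0.4, rng)

# only the first three covariates carry true effects on the two outcomes
B = np.zeros((p, 2))
B[:3] = rng.normal(size=(3, 2))

E = rng.uniform(-1.0, 1.0, size=(n, 2))   # the uniform-error scenario
Y = X @ B + E
```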
From the simulations, we can see that the adaptive LAD-lasso is superior to the ordinary LAD-lasso in every scenario when we compare the proportions of correctly found zero coefficients (Figure 4). In the p ≫ n scenario, the difference between the methods is somewhat smaller.

Empirical data example: retirement analysis
In this section, we apply the presented adaptive method to model retirement behavior. The data consist of disability pension retirees in the Finnish statutory earnings-related pension system (Table 1). The distributions of both outcomes have long tails, both left and right, and no zeros. The explanatory variables include measures (spells) of social security benefits available to all employees (unemployment, occupational rehabilitation, long-term sickness and occupational accident compensation, in days per year) and wages (Eur/year). The distributions of the explanatory variables have long right tails and an excess number of zeros, as some of the aforementioned benefits are rarely used. Note that in practice zero yearly wages follow from absence from work for different reasons, such as working in the private sector or as self-employed, or drawing social security benefits. Wages and pensions are transformed using the asinh transformation (Bellemare and Wichman 2020).
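The asinh (inverse hyperbolic sine) transformation is defined at zero, unlike the logarithm, while behaving like log for large values, which suits wage variables with many exact zeros. A quick illustration (the example values are our own):

```python
import numpy as np

# asinh(x) = log(x + sqrt(x**2 + 1)): zero maps to zero, and for
# large x it is approximately log(x) + log(2)
wages = np.array([0.0, 100.0, 25000.0])
transformed = np.arcsinh(wages)
```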
The results indicate that the adaptive technique yields more non-selected variables for several social security benefits compared to the ordinary LAD-lasso (see Figures 5 and 6 and Table A1). From a practical point of view and a variable selection perspective, it is important to separate which covariates become truly zero and which are only nearly so. From Figure 6, we can also see that benefits drawn near retirement (2017) tend to have mild (in terms of the values of the estimates) yet meaningful effects on both outcomes. Occupational rehabilitation and unemployment benefits have a positive effect on work ability and working life, whereas long-term sickness indicates weakening work ability and a risk to working life. On the other hand, drawing social security benefits in general indicates a negative effect on the final pension. Wages have a positive effect on pension, which is as expected in an earnings-related pension system. The effects of yearly wages on the length of working life are somewhat mixed.
The spike in the estimate for the effect of 2005 wages on pension results from the high correlation between consecutive years' wages (see Figure A1). Such multicollinearity in regression analysis can generally lead to instability in the estimates, and to a problem where the explained variance is allocated arbitrarily among the correlated variables (Farrar and Glauber 1967).

Concluding remarks
We have shown that the multi-outcome adaptive LAD-lasso is a versatile and robust tool, capable of reducing bias in effect estimates as well as alleviating the influence of an excess of zero values in continuous covariates. The competitive performance of the adaptive version compared to the ordinary LAD-lasso has been illustrated with simulated and real-life examples. The empirical data example illustrates the practical importance of a multi-outcome modeling strategy with highly skewed outcome distributions (including outliers) and a high share of zeros in covariates, which is a common situation, for example, in the social sciences when analyzing complex socioeconomic phenomena. Researchers in other fields of science with high-dimensional study designs (n ≫ p, p ≫ n or q > 1) could certainly also benefit from the robust estimates provided by the multi-outcome adaptive LAD-lasso regression technique.
As future steps, we have considered multiple ways to improve the LAD-lasso modeling approach. First, it would be beneficial to develop a robust BIC criterion for the multi-outcome LAD-lasso context in order to take into account linear dependencies between outcomes. Second, as we demonstrated with the empirical data, multicollinearity in the explanatory variables can lead to some unintuitive estimates. To tackle this problem, some interesting approaches have been developed, such as the fused lasso (Tibshirani et al. 2005) and the clustering algorithm in regression via data-driven segmentation (CARDS) (Ke, Fan, and Wu 2015). Implementing these concepts in the LAD-lasso framework might be beneficial for applications with correlated explanatory variables. Third, the inclusion of categorized variables in the adaptive LAD-lasso modeling approach would be intriguing in many applications, especially in life-course studies, where several categorical factors exist in the data sets. This requires a careful implementation of the coding scheme such that the re-coded variables are managed correctly in the adaptive algorithm (O'Grady and Medoff 1988). Last, the size of the data sets in many fields brings some challenges to the estimation; a more scalable algorithm is often needed to perform the estimation in a reasonable time.

Figure 2 .
Figure 2. Biases of the marker effects bias(β̂_ij) = β̂_ij − β_ij, i = 50, 75, 100, 150, j = 1, 2, 3. (a) Biases of the marker effects with different numbers of adaptive steps. Zero adaptive steps corresponds to the ordinary LAD-lasso estimate. (b) Box plots of the biases of the marker effects for the ordinary LAD-lasso and the adaptive LAD-lasso with five steps.

Figure 3 .
Figure 3. Simulated data replicates in the excess-of-zeros study with different proportions of zeros (π_zeros).

Figure 4 .
Figure 4. Percentage of correctly found zero coefficients with the adaptive LAD-lasso and the ordinary LAD-lasso in the multi-outcome context.

Figure 5 .
Figure 5. Multi-outcome ordinary LAD-lasso coefficient estimates for earnings-related disability pension (left) and length of working life (right). Filled circles refer to the value of the coefficient, while triangles refer to zero-valued coefficients.

Figure 6 .
Figure 6. Multi-outcome adaptive LAD-lasso coefficient estimates for earnings-related disability pension (left) and length of working life (right). Filled circles refer to the value of the coefficient, while triangles refer to zero-valued coefficients.

Table 1 .
Yearly explanatory variables used in the study with the retirement data.