Robust Variable Selection and Estimation Via Adaptive Elastic Net S-Estimators for Linear Regression

Heavy-tailed error distributions and predictors with anomalous values are ubiquitous in high-dimensional regression problems and can seriously jeopardize the validity of statistical analyses if not properly addressed. For more reliable estimation under these adverse conditions, we propose a new robust regularized estimator for simultaneous variable selection and coefficient estimation. This estimator, called adaptive PENSE, possesses the oracle property without prior knowledge of the scale of the residuals and without any moment conditions on the error distribution. The proposed estimator gives reliable results even under very heavy-tailed error distributions and aberrant contamination in the predictors or residuals. Importantly, even in these challenging settings variable selection by adaptive PENSE remains stable. Numerical studies on simulated and real data sets highlight superior finite-sample performance in a vast range of settings compared to other robust regularized estimators in the case of contaminated samples and competitiveness compared to classical regularized estimators in clean samples.


Introduction
The simplicity of the linear regression model ensures its continued importance in many scientific and industrial applications, especially if a small sample size prohibits the use of more complex models. This paper considers prediction and variable selection in the linear regression model with p-dimensional random predictors X independent of the random error term U and fixed parameters µ 0 ∈ R, β 0 ∈ R p . Based on a sample of n independent realizations of Y and X, collected as pairs (y i , x i ), i = 1, . . . , n, the statistical goal is to estimate the intercept µ 0 ∈ R and slope coefficients β 0 ∈ R p . Emphasis is on high prediction accuracy and identification of relevant predictors, i.e., those which have non-zero entries in β 0 .
With the ever growing abundance of data, often combined from various sources, it is increasingly challenging to make assumptions on the distribution of the error term U or assume that the data, particularly the predictors, are free of anomalous values or gross outliers. In proteomics or genomics studies, for instance, undetected equipment failure, problems with sample preparation, or patients with rare phenotypic profiles are just a few sources of contamination that can severely damage the effectiveness of most statistical methods commonly used to estimate regression models. Many such methods are regularized estimators which minimize an objective function of the form

O(y, µ + Xβ) + Φ(β),    (2)

where O(y, µ + Xβ) is a regression loss (e.g., the sum of squared residuals) and Φ(β) is a penalty function (e.g., the L1 norm). The LASSO (Tibshirani 1996), for example, combines the LS-loss with the L1 penalty. A more recent strand of the literature (Fan et al. 2018; Loh 2018; Pan et al. 2020) promotes convex loss functions which increase more slowly than the LS-loss for larger residuals. These so-called "unbounded M-loss" functions (e.g., the sum of absolute deviations or the Huber loss) are designed to shield against heavy-tailed error distributions in high-dimensional settings. While the convexity of unbounded M-loss functions enables derivations of strong theoretical guarantees, these estimators are still exposed to the potentially devastating effects of contamination in the numerous predictors. Commonly suggested remedies, e.g., down-weighting observations with "unusual" predictor values or univariate winsorizing (Loh 2017; Sun et al. 2019), are ill-suited for high-dimensional problems. In the sparse estimation regime, for example, down-weighting observations due to outlying values in irrelevant predictors (which are unknown in advance) may sacrifice precious information.
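The distinction between the LS-loss and an unbounded M-loss is easy to visualize numerically. The following sketch (using the conventional Huber tuning constant k = 1.345, an assumption not specified in this paper) shows how a single gross residual dominates the sum of squared residuals but contributes only linearly to the Huber loss:

```python
import numpy as np

def ls_loss(r):
    """Least-squares loss: grows quadratically in the residual."""
    return r ** 2

def huber_loss(r, k=1.345):
    """Huber loss, an unbounded M-loss: quadratic for |r| <= k, linear beyond."""
    a = np.abs(r)
    return np.where(a <= k, 0.5 * a ** 2, k * a - 0.5 * k ** 2)

# A single gross residual dominates the LS-loss but enters the Huber loss only linearly.
residuals = np.array([0.1, -0.3, 0.2, 50.0])
print(ls_loss(residuals).sum())     # ~2500, dominated by the outlier
print(huber_loss(residuals).sum())  # ~66.4, the outlier contributes only linearly
```

Note, however, that neither loss bounds the influence of a leverage point in the predictors, which motivates the bounded loss functions discussed next.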
Regularized M-estimators using a bounded, and hence non-convex, M-loss function yield the desired protection against contaminated predictors. Asymptotic properties of these regularized bounded M-estimators have been recently studied. Smucler and Yohai (2017), for instance, propose the MM-LASSO estimator; a regularized M-estimator relying on an auxiliary Mestimate of the residual scale (hence "MM"). The authors derive the oracle property for their MM-LASSO estimator under fixed dimensionality but otherwise very general conditions. Loh (2017), on the other hand, proves oracle bounds for the estimation error for M-estimators regularized by folded-concave penalties (e.g., SCAD), by restricting the problem (2) to a neighborhood around the origin which must contain the true parameters. For these oracle bounds to hold, however, the loss function, given the sample, must satisfy strict conditions, including restricted strong convexity in a neighborhood of the true parameters. The main challenge for using regularized M-estimators in practice, however, is the requirement for a robust estimate of the scale of the residuals. Obtaining such a robust estimate with sufficiently small finite-sample bias is a difficult task in its own right; almost insurmountable in high dimensional problems with contamination and heavy-tailed errors.
Recently renewed attention has been given to the mean-shift outlier model (MSOM), which introduces an additive nuisance parameter for each observation to quantify its outlyingness. To identify these outliers, She et al. (2021) add a constraint on the number of non-zero nuisance parameters, i.e., on the number of outlying observations. With this formulation, they develop a theoretical foundation for a broad class of robust estimators defined algorithmically, e.g., the popular SparseLTS estimator (Alfons et al. 2013), under a very general framework allowing for different loss and penalty functions. Using the L0-pseudo-norm for either variable selection or outlier detection, however, tends to suffer from high variability. For variable selection and prediction, for example, other penalties may deliver better performance (Hastie et al. 2020; Insolia et al. 2020). In addition, for computational tractability and stability, it is necessary to restrict the optimization to a tight neighborhood around the true regression parameters, as well as to have a good understanding of the number of relevant predictors and the number of outliers. Insolia et al. (2020), for example, suggest using an ensemble of other robust regression estimates to obtain preliminary estimates of the necessary bounds. In this paper we propose a method which requires no prior knowledge of the number of truly relevant predictors and only a rough upper bound on the number of outliers. Our method is therefore also a good candidate to initialize the methods proposed for the MSOM.
This paper introduces a method which achieves good asymptotic properties as well as strong empirical performance without requiring any prior knowledge about the true parameters, the scale of the residuals, or the number of relevant predictors. We tackle this problem by building upon the "S-loss" function, a loss function based on a robust measure of the scale (hence S-loss) of the fitted residuals, circumventing the need for an auxiliary estimate of the residual scale. In the unpenalized case, the S-estimator is highly robust towards heavy-tailed errors and arbitrary contamination in the predictors (Rousseeuw and Yohai 1984). The remainder of the paper is structured as follows: Section 2 introduces the adaptive PENSE estimator and Section 3 discusses its computation. Section 4 presents the main theoretical results pertaining to the robustness and oracle properties of the adaptive PENSE estimator, along with a discussion of the imposed assumptions.
Section 5 outlines the strong empirical performance of adaptive PENSE in simulation studies and real-world applications. Supporting lemmas and proofs of the theorems, as well as additional simulation results, are provided in the Supplementary Materials.

Notation
To simplify the following exposition, some notation is fixed throughout. The concatenated parameter vector of intercept and slope coefficients in the linear regression model (1) is denoted by θ = (µ, β) ∈ R^(p+1), with true value θ_0 = (µ_0, β_0).

Adaptive PENSE
We propose estimating the sparse regression parameter θ_0 in the linear regression model (1) by penalizing the robust S-loss with an adaptive elastic net penalty. In the presence of gross errors in the response variable (outliers) and unusual values in the predictors (leverage points), the highly robust S-loss is an appropriate surrogate for the LS-loss. The S-loss is given by the squared M-scale of the fitted residuals,

O_S(µ, β) = [σ̂_M(y − µ − Xβ)]²,

where the M-scale σ̂_M depends on a function ρ, which is bounded and hence non-convex, and on a fixed parameter δ ∈ (0, 0.5] governing robustness properties, as will be shown later. Instead of the classical variance of the fitted residuals, the S-loss thus minimizes the square of the robust M-scale of the fitted residuals, σ̂_M(y − ŷ). If the number of exactly fitted observations #{i : y_i = ŷ_i} < n(1 − δ), the M-scale estimate is greater than 0 and is given implicitly by the equation

(1/n) ∑_{i=1}^{n} ρ((y_i − ŷ_i) / σ̂_M(y − ŷ)) = δ.    (3)

To ease notation, we define the M-scale of the residuals of an estimate θ̂ by σ̂_M(θ̂) = σ̂_M(y − µ̂ − Xβ̂).
The robustness of the M-scale depends on two components: (i) the choice of the ρ function and (ii) the fixed quantity δ.
The ρ function in the definition of the M-scale in (3) measures the "size" of the standardized residuals y_i − ŷ_i. The classical sample variance, up to a scaling by δ, can be obtained by setting ρ(t) = t². To get a robust estimate of scale, however, the ρ function must be bounded (Yohai 1987), i.e., all standardized residuals larger than a certain threshold are assigned the same "size".
Since the objective is to get a robust scale estimate, from here on we always assume that the ρ function satisfies the condition [A1]: ρ : R → [0, 1] is an even, bounded, and twice continuously differentiable function with ρ(0) = 0 that is non-decreasing in |t| and satisfies ρ(t) = 1 for all |t| ≥ c > 0.
The boundedness implies that the derivative of the ρ function is 0 for |t| ≥ c. Therefore, residuals greater than c have no effect on the minimization of (3). As becomes evident in Theorem 2 for adaptive PENSE and as shown in Davies (1990) for the unregularized S-estimator, the choice of the ρ function directly affects the variance of the estimator. Hössjer (1992) derives an "optimal" ρ function for the unregularized S-estimator, in the sense that it minimizes the asymptotic variance for Normal errors. However, the author also shows that the gain in efficiency is minor for Normal errors when compared to the simpler Tukey's bisquare ρ function, given by

ρ(t; c) = 1 − (1 − (t/c)²)³ for |t| ≤ c, and ρ(t; c) = 1 for |t| > c.

It should be noted that the cutoff c for Tukey's bisquare function does not affect the resulting S-estimator or the variance of the M-scale estimator; it is merely a multiplicative factor for the scale estimate and does not change the estimate of the regression parameters (this is true for any ρ function with cutoff c satisfying ρ(t; c) = ρ(t/c; 1)). We therefore fix c = 1 for the remainder of this paper when referring to the S-loss. If an M-scale estimate of the scale of the residuals is desired, however, we use a cutoff c which leads to a consistent estimate under Normal errors. This cutoff depends on δ.
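As a concrete illustration, the M-scale defined implicitly in (3) can be computed by a simple fixed-point iteration. The sketch below uses Tukey's bisquare ρ with c = 1; it is a minimal illustration under that assumption, and the pense package implements more refined algorithms:

```python
import numpy as np

def rho_bisquare(t):
    """Tukey's bisquare rho with cutoff c = 1, bounded at 1 (condition [A1])."""
    t = np.minimum(np.abs(t), 1.0)
    return 1.0 - (1.0 - t ** 2) ** 3

def m_scale(r, delta=0.5, tol=1e-10, max_iter=200):
    """M-scale of residuals r: solves (1/n) sum rho(r_i / sigma) = delta
    by a fixed-point iteration started at the median absolute residual."""
    r = np.asarray(r, dtype=float)
    sigma = np.median(np.abs(r))
    if sigma == 0.0:  # at least half the residuals are fitted exactly
        return 0.0
    for _ in range(max_iter):
        # If the average rho exceeds delta the scale is too small, and vice versa.
        step = np.sqrt(np.mean(rho_bisquare(r / sigma)) / delta)
        sigma *= step
        if abs(step - 1.0) < tol:
            break
    return sigma
```

Replacing a few residuals by gross outliers changes this scale estimate only mildly, in contrast to the sample standard deviation, which illustrates why the S-loss is a robust surrogate for the LS-loss.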
The second component that determines the robustness of an S-estimator is the constant δ in (3), which controls the breakdown point of the M-scale and hence of the S-estimator, as detailed in Section 4.

The elastic net (EN) penalty, Φ_EN, is a convex combination of the L1 and the squared L2 norm, given by

Φ_EN(β; λ̃, α̃) = λ̃ ∑_{j=1}^{p} (α̃ |β_j| + (1 − α̃)/2 β_j²).

The hyper-parameter α̃ ∈ [0, 1] controls the balance between the L1 and the L2 penalty and λ̃ controls the strength of the penalization. The Ridge penalty is recovered by setting α̃ = 0, although it does not lead to variable selection. For α̃ = 1, the EN penalty coincides with the LASSO, but if α̃ < 1, the elastic net results in a more stable variable selection than the LASSO penalty when predictors are correlated (Zou and Hastie 2005).
The elastic net penalty, like the LASSO, introduces non-negligible bias and thus cannot lead to a variable selection consistent estimator. We therefore propose to combine the robust S-loss with the following adaptive elastic net penalty, a slight variation of the penalty introduced by Zou and Zhang (2009):

Φ_AEN(β; λ, α) = λ ∑_{j=1}^{p} |β̃_j|^{−ζ} (α |β_j| + (1 − α)/2 β_j²),    (6)

where β̃ is a preliminary estimate of the regression coefficients. The adaptive EN combines the advantages of the adaptive LASSO penalty (Zou 2006) and the elastic net penalty (Zou and Zhang 2009). Contrary to the original definition in Zou and Zhang (2009), (6) applies the penalty loadings |β̃_j|^{−ζ} to both the L1 and L2 penalties.
The adaptive EN leverages information from a preliminary regression estimate, β̃, to penalize predictors with initially "small" coefficient values more heavily than predictors with initially "large" coefficients. This has two major advantages over the non-adaptive EN penalty: truly relevant predictors with large preliminary coefficients incur less bias, and irrelevant predictors with small preliminary coefficients are penalized more heavily and hence more reliably screened out.

Adaptive PENSE is a two-step procedure leveraging a PENSE estimate with α̃ = 0, i.e., using a Ridge penalty. In the first step, a PENSE-Ridge estimate is computed as

θ̃ = argmin_{µ,β} [σ̂_M(θ)]² + λ̃ ∑_{j=1}^{p} β_j²/2.    (7)

In the second step, the PENSE-Ridge estimate is used as the preliminary estimate and adaptive PENSE is computed as

θ̂ = argmin_{µ,β} [σ̂_M(θ)]² + λ ∑_{j=1}^{p} |β̃_j|^{−ζ} (α |β_j| + (1 − α)/2 β_j²).    (8)

Fixing the preliminary estimate to a PENSE-Ridge has two important advantages: (i) computation is fast because the Ridge penalty is smooth (hence amenable to more efficient algorithms) and because we do not need to choose from several α̃ values, and (ii) no predictors are discarded prematurely. While discarding some predictors in the preliminary stage may be computationally beneficial for very high-dimensional problems, empirical studies suggest the variable selection performance of adaptive PENSE is better in most scenarios if the preliminary stage does not perform variable selection.
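To illustrate the role of the penalty loadings, the following sketch evaluates an adaptive EN penalty of the form (6). The exact parametrization of the L1/L2 mix is an assumption here, and the eps guard is purely numerical, since a PENSE-Ridge preliminary estimate has no exactly-zero coefficients:

```python
import numpy as np

def adaptive_en_penalty(beta, beta_prelim, lam, alpha, zeta=1.0, eps=1e-12):
    """Adaptive elastic net penalty in the spirit of (6): the loading
    |beta_prelim_j|^-zeta multiplies both the L1 and the squared-L2 term."""
    beta = np.asarray(beta, dtype=float)
    # Small preliminary coefficients yield large loadings, i.e., heavy penalization.
    loadings = np.abs(np.asarray(beta_prelim, dtype=float)).clip(min=eps) ** (-zeta)
    l1 = np.sum(loadings * np.abs(beta))      # weighted L1 part
    l2 = 0.5 * np.sum(loadings * beta ** 2)   # weighted squared-L2 part
    return lam * (alpha * l1 + (1.0 - alpha) * l2)
```

With ζ = 1, a predictor whose preliminary coefficient is 0.01 receives a loading 100 times larger than one whose preliminary coefficient is 1, so equal second-stage coefficients are penalized very differently.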
Even with the first-stage penalty fixed at α̃ = 0, computing adaptive PENSE estimates involves choosing several hyper-parameters: (i) α, the balance of L1/L2 regularization for adaptive PENSE, (ii) λ̃, the level of regularization for PENSE, (iii) λ, the level of penalization for adaptive PENSE, and (iv) ζ, the exponent in the predictor-specific penalty loadings.
Interpreting the exponent ζ is less intuitive than the other regularization hyper-parameters.
In general, the larger ζ the stricter the differentiation between "small" and "large" coefficient values. In other words, if ζ is large, all but a few predictors with initially very large coefficient values will be heavily penalized and thus likely not included in the set of relevant predictors.
In addition to the large number of hyper-parameters that need to be selected, both optimization problems (7) and (8) are highly non-convex in µ and β. Finding global minima through numerical optimization is therefore contingent on a starting value that is close to a global minimum. Section 3 describes a strategy for obtaining starting values used for adaptive PENSE.

More robust variable selection
The adaptive EN penalty brings the additional advantage of more robust variable selection compared to non-adaptive penalties. So-called "good" leverage points in non-relevant predictors (i.e., observations with extreme values in one or more predictors with a coefficient value of 0 but without gross error in the response), as shown in Figure 1, can lead to an arbitrary number of false positives in robust estimates when using non-adaptive penalties. Interestingly, these predictors are often the first to enter the model. This is caused by a combination of how large values in predictors affect the sub-gradient of the objective function and robust scaling of the predictors. The phenomenon is best seen from the sub-gradient of the PENSE objective function at β = 0_p, whose S-loss component is a weighted sum of the predictor values,

−∑_{i=1}^{n} w_i x_i,

where the weights w_i are determined by the S-loss evaluated at the intercept-only model. These weights are > 0 if and only if the residual from the intercept-only model is not too large (relative to all other residuals) and different from 0 (i.e., not fitted exactly).
Consider now that predictor j is truly inactive and contains an extremely large value for observation i, but the residual for observation i in the intercept-only model is small and non-zero, i.e., the i-th observation is a good leverage point. An example of this scenario is shown in Figure 1. Robust scaling of the predictor likely does not substantially shrink this extremely large value, and hence the j-th predictor dominates the sub-gradient at β = 0_p; therefore, it enters the model first. In other words, this single aberrant value leads to the false impression that the j-th predictor is relevant. However, because the leverage is caused by an extreme value in a non-relevant predictor, the estimated coefficient for this predictor is likely very small in magnitude compared to coefficients of truly relevant predictors. Importantly, the higher the leverage of this observation, the smaller the estimated coefficient. This allows adaptive PENSE to screen out the wrongly included predictor, making it more robust against this form of contamination.
Interestingly, good leverage points in irrelevant predictors can, but usually do not, have this effect on non-robust estimators. The prevalent scaling of the predictors using the non-robust sample standard deviation shrinks such extreme values, thereby reducing their contribution to the sub-gradient and hence their influence. As depicted in Figure 2, however, the variable selection performance of non-robust estimators is severely damaged by bad leverage points and hence in general is unreliable under contamination.
This form of good leverage points may occur in many practical problems, particularly in very sparse settings. In protein expression data, for example, a group of proteins could be highly expressed in a small fraction of subjects while only trace amounts of the protein are detected in the vast majority of subjects. Even if the group of proteins is not relevant for the outcome of interest, robust methods with non-adaptive penalties are prone to selecting these proteins. An example of this behavior using synthetic data is shown in Figure 2. Here, 5 out of 28 irrelevant proteins have higher expression levels in 10 out of 100 observations.
Additionally, the response variable contains outliers alongside high-leverage points in 2 out of 5 relevant proteins in another 5 observations. The effect of such high-leverage points is more pronounced the higher the leverage of the contaminated observations, but the estimation methods are affected differently. Non-robust EN estimators tend to select only the 2 contaminated truly relevant predictors, but given the outlyingness of the observations, the actual parameter estimates are highly biased. Similarly affected, I-LAMM does not select any proteins if the leverage caused by the affected proteins is too high. The PENSE estimator, on the other hand, selects almost all of the affected irrelevant proteins but, as the leverage increases, tends towards not selecting any proteins at all. Only adaptive PENSE is mostly unaffected by the contamination in relevant and irrelevant predictors, identifying on average 4 out of 5 truly relevant predictors while screening out 24 out of 28 irrelevant proteins.
Before presenting more empirical evidence of the robustness of variable selection by adaptive PENSE in Section 5, we discuss the intricate computational challenges and the theoretical properties of the estimator.

Figure 2: Effect of high-leverage points on the sensitivity and specificity of various variable selection methods for synthetic data. Average performance over 50 replications is reported separately for proteins with contaminated observations and proteins free from any contamination. The generated data comprises n = 100 observations of p = 32 protein expression levels, 5 of which are relevant. In 5% of observations the response and 2 relevant proteins are contaminated, and in 10% of observations 5 irrelevant proteins contain contaminated expression levels.

Computing adaptive PENSE estimates
The highly non-convex objective function paired with the need to select several hyper-parameters requires specialized algorithms and strategies for computing adaptive PENSE estimates. The problem is separated into two stages: (1) computing PENSE estimates and selecting the appropriate penalization level for the preliminary PENSE estimate and (2) computing adaptive PENSE estimates based on the (fixed) preliminary estimate. These two stages are done sequentially and the only information passed from stage 1 to 2 is the preliminary parameter estimate.

Selecting hyper-parameters
In both stages the penalization level is selected via independent repeated K-fold cross-validation (CV) over a range of possible values. In the first stage, only the penalization level is selected, as the elastic net parameter α̃ is fixed at 0. For the second stage, the two hyper-parameters α and ζ are selected from a small set of pairs. For every desired pair, the penalization level is chosen via separate CVs (but using the same splits), and the combination resulting in the best prediction performance is selected. The range of penalization levels differs between the two stages, as well as between the pairs of α and ζ considered in the second stage.
Estimating the prediction error of robust estimators via cross-validation suffers from high variability due to potential contamination in the data and non-convexity of the objective function. The prediction error estimated from a single CV is highly dependent on the random split and hence unreliable for selecting the hyper-parameters. We work around this high volatility by repeating CV several times to get a more reliable assessment of the estimator's prediction performance for given values of the hyper-parameters. In addition to repeating CV, the measure of the prediction error needs to be stable in the presence of gross errors in the observed response values. For PENSE and adaptive PENSE we use the highly robust τ-scale of the uncentered prediction errors (Maronna and Zamar 2002) to estimate the prediction accuracy in each individual CV run:

τ̂²(y − ŷ) = s_n² (1/n) ∑_{i=1}^{n} min((y_i − ŷ_i)²/s_n², c_τ²), with s_n = med_i |y_i − ŷ_i|.

Here, ŷ_i is the predicted response value from the CV split where the i-th observation is in the test set. The parameter c_τ > 0 specifies what constitutes outlying values in terms of multiples of the median absolute deviation and hence governs the tradeoff between efficiency and robustness of the scale estimate.
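A minimal sketch of a τ-scale of uncentered prediction errors in the spirit of Maronna and Zamar (2002) is given below; the exact tuning and weighting used for adaptive PENSE may differ, so this illustrates the principle rather than the implementation in pense:

```python
import numpy as np

def tau_scale(pred_errors, c_tau=3.0):
    """Robust tau-scale of (uncentered) prediction errors: an initial highly
    robust scale (the median absolute error) combined with a truncated
    second moment; errors beyond c_tau * s0 all contribute the same amount."""
    r = np.asarray(pred_errors, dtype=float)
    s0 = np.median(np.abs(r))               # initial highly robust scale
    if s0 == 0.0:
        return 0.0
    u = r / s0
    rho = np.minimum(u ** 2, c_tau ** 2)    # truncate standardized errors at c_tau
    return s0 * np.sqrt(np.mean(rho))
```

A single gross prediction error inflates this scale only modestly, whereas the root mean squared prediction error would be dominated by it.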
Repeated CV leads to several estimates of the prediction accuracy. Since the presence of gross errors in the predictions is already handled by the robust τ -scale, we average the prediction errors using the sample mean. This gives an overall measure of prediction performance for a fixed set of hyper-parameters. Repeating cross-validation furthermore gives insights into the variability of the prediction performance and affords more sensible selection of the penalization level, e.g., using the "one-standard-error rule" (Hastie et al. 2009).
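The selection logic over repeated CV replications can be sketched as follows, assuming a matrix of τ-scale prediction errors with one row per CV replication and one column per penalization level; the helper below is a hypothetical implementation of the one-standard-error rule:

```python
import numpy as np

def select_lambda_one_se(lambdas, cv_errors):
    """One-standard-error rule over repeated CV (Hastie et al. 2009):
    cv_errors[k, j] is the prediction error of replication k at penalty level
    lambdas[j].  Pick the sparsest (largest) lambda whose mean error is
    within one standard error of the best mean error."""
    cv_errors = np.asarray(cv_errors, dtype=float)
    mean_err = cv_errors.mean(axis=0)   # average over CV replications
    se_err = cv_errors.std(axis=0, ddof=1) / np.sqrt(cv_errors.shape[0])
    best = np.argmin(mean_err)
    eligible = np.where(mean_err <= mean_err[best] + se_err[best])[0]
    return lambdas[eligible].max()
```

Averaging over replications stabilizes the error curve, and preferring the largest eligible λ trades a small amount of prediction accuracy for a sparser, more stable model.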
Another important remedy to reduce the variability incurred by CV is scaling the input data to make penalization levels more comparable across CV splits. We first standardize the original data set by centering the response and each predictor using univariate M-estimates of location. Then we scale each predictor to have unit M-scale and refer to this data set as the standardized input data. In each CV split, the training data is re-standardized in the same way as the original data set. Therefore, a fixed penalization level λ induces a level of sparsity in the parameter estimate computed on the training data that is comparable to the sparsity obtained on the standardized input data.
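A minimal sketch of this robust standardization, substituting the median and the (Normal-consistent) MAD for the univariate M-estimates of location and scale used in the paper:

```python
import numpy as np

def robust_standardize(X, y):
    """Center the response and each predictor robustly and scale each
    predictor to unit robust scale.  The median and MAD stand in for the
    M-estimates of location and scale; both are simple, highly robust
    surrogates for this sketch."""
    y_c = y - np.median(y)
    centers = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent for the standard deviation at the Normal.
    scales = 1.4826 * np.median(np.abs(X - centers), axis=0)
    return (X - centers) / scales, y_c, centers, scales
```

Because outlying cells barely move the median and MAD, a contaminated training split is standardized almost identically to a clean one, which keeps a fixed λ comparable across splits.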

Algorithms for adaptive PENSE
The algorithm to compute (adaptive) PENSE estimates is optimized for computing estimates over a fine grid of penalization levels. In each individual CV run, the hyper-parameters α and ζ are fixed. The biggest challenge when computing adaptive PENSE estimates is the non-convexity of the objective function. Many local minima of the objective function, however, are artifacts of contaminated observations and undesirable. This insight is used in Cohen Freue et al. (2019) to find "initial estimates" for PENSE, i.e., approximate solutions which are closer to "good" local minima than to local minima caused by contamination. We adapt their elastic net Peña-Yohai (EN-PY) procedure to the adaptive EN penalty to compute initial estimates for adaptive PENSE. We call our procedure adaptive EN-PY.
Additionally, we improve upon the "warm-start" heuristics described in Cohen Freue et al. (2019) to substantially increase exploration of the search space while maintaining computational feasibility. For a small subset of the grid of penalization levels (e.g., every tenth value), we compute a set of initial estimates using the adaptive EN-PY procedure. Even for a fixed penalty level, the effect of penalization on adaptive EN-PY may be vastly different than on adaptive PENSE. Therefore, we collect all of these initial estimates (i.e., from all penalization levels) into one set of initial estimates. Beginning at the largest value in the fine grid of penalization levels, we use every approximate solution in the set of initial estimates to start an iterative minimization of the objective function. The computational solutions discussed here are readily available in the R package pense, available on CRAN (https://cran.r-project.org/package=pense).

Asymptotic theory
To establish theoretical guarantees for adaptive PENSE, we formalize the model (1). We assume that the random predictors X with distribution function G_0 are independent of the error term U, which follows the distribution F_0. The joint distribution H_0 of (Y, X) is assumed to satisfy the linear model (1), i.e., Y = µ_0 + X'β_0 + U. In the remainder of this section we omit the intercept term to make the statements more concise and easier to follow. All of the following statements regarding the slope also apply to the intercept term.
The robustness properties of PENSE and adaptive PENSE are tightly connected to the bounded ρ function and we therefore assume [A1]. To establish statistical guarantees for adaptive PENSE we additionally require that the derivative of the ρ function, denoted by ψ, satisfies

[A2] tψ(t) is unimodal in |t|. Specifically, there exists a c* with 0 < c* < c such that tψ(t) is strictly increasing for 0 < t < c* and strictly decreasing for c* < t < c.

Assumption [A1] is standard in the robust literature for redescending M-estimators and is identical to the definition of a bounded ρ function in Maronna et al. (2019). Assumption [A2] is a slight variation of more common assumptions on the mapping t → tψ(t), but it is nevertheless satisfied by most bounded ρ functions used in robust estimation, including Tukey's bisquare function.
Finally, to establish the root-n consistency of the PENSE estimator and the oracle property of the adaptive PENSE estimator we need the same regularity conditions as in Smucler and Yohai (2017): [A3] P(x β = 0) < 1 − δ for all non-zero β ∈ R p and δ as defined in (3).
[A4] The distribution F 0 of the residuals U has an even density f 0 (u) which is monotone decreasing in |u| and strictly decreasing in a neighborhood of 0.
[A5] The second moment of G_0 is finite and the matrix E[XX'] is positive definite.

It is noteworthy that the assumption on the residuals [A4] does not impose any moment conditions on the distribution, which makes our results applicable to extremely heavy-tailed errors. Furthermore, unlike many results concerning regularized M-estimators, we only require a finite second moment of the predictors.
Under these assumptions we first establish the finite-sample robustness of adaptive PENSE.
We quantify the finite-sample robustness by the replacement finite-sample breakdown point

ε*(θ̂; Z) = max{m/n : sup_{Z̃ ∈ Z_m} ‖θ̂(Z̃)‖ < ∞},

where the set Z_m contains all possible samples Z̃ with 0 ≤ m < n of the original n observations in Z replaced by arbitrary values (Donoho and Huber 1982).

Theorem 1. For a sample Z = {(y_i, x_i) : i = 1, . . . , n} of size n, let m(δ) ∈ N be the largest integer smaller than n min(δ, 1 − δ), where δ is as defined in (3). Then, for any fixed λ > 0 and α, the adaptive PENSE estimator, θ̂, retains the breakdown point of the preliminary PENSE estimator, θ̃:

ε*(θ̂; Z) = ε*(θ̃; Z) ≥ m(δ)/n.

Noting that |β̃_j|^{−ζ} > 0 for all j = 1, . . . , p, the proof of this theorem is essentially the same as for PENSE, which can be found in Cohen Freue et al. (2019). The main message of Theorem 1 is that the adaptive PENSE estimator is bounded away from the boundary of the parameter space as long as contamination is restricted to fewer than m(δ) observations. It does not, however, mean that the estimated parameter is close to the true parameter under contamination. Without assumptions on the type of contamination, the performance of the estimator can only be assessed with numerical experiments, and we show some results in Section 5.1. If the regularization parameters are chosen by a data-driven strategy, it is important to stress that the strategy itself must provision for contamination so as not to break the robustness of adaptive PENSE. In the strategy proposed in Section 3.1, this is ensured by repeating CV several times and using the robust τ-scale for measuring the prediction accuracy.
We now turn to the asymptotic properties of adaptive PENSE as n → ∞ and the dimensionality p remains fixed. Propositions 1 and 2 in the supplementary materials establish strong consistency and root-n consistency of both PENSE and adaptive PENSE estimators. In our definition of adaptive PENSE in (8) we take a PENSE estimate to define the penalty loadings.
With the root-n consistency of the PENSE estimator, variable selection consistency and a limiting distribution of the adaptive PENSE estimator can be derived. These properties hold when using the PENSE-Ridge as preliminary estimate, but also for any other α̃ in the computation of the preliminary PENSE estimate.

Theorem 2. Under assumptions [A1]–[A5] and suitable conditions on the sequence of penalization levels λ_n, the adaptive PENSE estimator possesses

(i) variable selection consistency, i.e., the estimator of the truly non-relevant coefficients β̂_II is zero with high probability:

lim_{n→∞} P(β̂_II = 0) = 1;

(ii) if in addition √n λ_n → 0, the estimator of the truly relevant coefficients β̂_I is asymptotically Normal:

√n (β̂_I − β_{0,I}) →_d N(0, σ_0² (a(ψ, F_0)/b(ψ, F_0)²) Σ_I^{−1}),

where σ_0 is the asymptotic M-scale of the residuals, Σ_I is the second-moment matrix of the truly relevant predictors, and the constants a(ψ, F_0) and b(ψ, F_0) depend only on the ψ function and the error distribution F_0.

The theoretical results depend on an appropriate choice of the regularization parameters λ̃ and λ for PENSE and adaptive PENSE, respectively. In practice, the choice is not obvious and the regularization parameters are usually determined through a data-driven procedure.
Depending on the procedure, however, the required conditions are difficult if not impossible to verify. To substantiate the theoretical properties and verify that they translate beneficially to practical problems, we conduct a numerical study and carefully investigate adaptive PENSE's properties in finite samples with data-driven hyper-parameter selection as detailed in Section 3.

Numerical studies
The following simulation study covers scenarios where adaptive PENSE could theoretically possess the oracle property (if not for data-driven hyper-parameter selection), but also situations where the required assumptions for the oracle property are not met (e.g., when p > n).
Similarly, the real-world application showcases the potential of the highly robust adaptive PENSE estimator and the proposed hyper-parameter search in challenging settings where other estimators may be highly affected by contamination.

Simulation study
The aim of the simulation study is to assess the prediction and model selection performance of adaptive PENSE and how well the proposed hyper-parameter selection procedure approximates the optimum. We assess the reliability of the estimators by considering a large number of different contamination structures and heavy-tailed error distributions. Below we present the results from a scenario with n = 200 observations and p = 32 to p = 512 predictors. Of those, s = log2(p) predictors have non-zero coefficient values. We present the results for an additional scenario with n = 100 and more severe contamination in the supplementary materials.
We consider multiple error distributions and either no contamination or contamination in 10% of observations. For each combination of error distribution and presence/absence of contamination, we randomly generate 50 data sets according to the following recipe. The predictor values are generated from K = 1 + ⌈√p/2⌉ latent variables Z_i1, . . . , Z_iK following a multivariate t-distribution with 4 degrees of freedom and pairwise correlation of 0.1. The predictor values are then generated as x_ij = Z_{i, 1+⌊(j−1)/K⌋+I[j>s]} + ξ_ij, with ξ_ij i.i.d. Normal with standard deviation 0.2. This yields K groups of predictors which are highly correlated within groups (correlation about 0.96) and mildly correlated between groups.
The response value is generated according to the true linear model y_i = µ_0 + x_i⊤β_0 + u_i, where the true regression coefficient is 1 for the first s predictors and 0 for the remaining p − s predictors. In the same 10% of observations, the response variable is contaminated with gross outliers.
For these observations, the response is generated by a contaminated model in which the affected slope coefficients are replaced by −1. The prediction performance is shown in Figure 3. Adaptive PENSE is less affected by these adverse conditions; while adaptive MM performs slightly better under Normal errors without contamination, in the presence of contamination and/or heavy-tailed errors adaptive PENSE is generally more reliable than adaptive MM.
Without adverse contamination, I-LAMM performs best for light-tailed error distributions, and even with heavy-tailed error distributions it performs as well as or better than the (non-adaptive) PENSE and MM estimators. However, I-LAMM has the same issues with good leverage points as PENSE and is much more affected by bad leverage points and gross errors in the residuals than the highly robust estimators. The classical EN and adaptive EN estimators are not shown in these plots as they perform very poorly for heavy-tailed error distributions.
Additional plots are provided in the supplementary materials and include the least-squares EN estimators. It is noteworthy that the considered contamination model is not very detrimental to the prediction accuracy of EN and adaptive EN due to the relatively small magnitude of the contaminated slope coefficients. Additional results for the moderately heavy-tailed stable distribution are presented in the supplementary materials.
It is evident that highly robust estimators lead to better variable selection than other estimators, especially in regimes with heavy-tailed error distributions. Moreover, adaptive penalties clearly improve variable selection over non-adaptive penalties, and even without adverse contamination, good leverage points appear to distort variable selection for non-robust estimators.

The simulation study highlights that adaptive PENSE and the strategy for choosing hyper-parameters are highly resilient towards many different forms of contamination in the predictors, even if they occur in tandem with heavy-tailed errors and gross outliers in the residuals.
Both the prediction accuracy and the variable selection performance are at least on par with, but most often better than, those of other robust and non-robust regularized estimators for high-dimensional regression. While adaptive MM is often comparable in performance to adaptive PENSE, and better for Normal residuals, in certain situations it is substantially more affected by contamination. This is particularly noticeable in some settings presented in the supplementary materials, where adaptive MM sometimes has more than 30% higher prediction error than adaptive PENSE.

Real-data example
We apply adaptive PENSE and the other methods considered in the simulation study to the analysis of the chemical composition of 180 archaeological glass vessels from the 15th-17th century (Janssens et al. 1998). The analysis is performed on electron-probe X-ray micro-analysis spectra comprising 1920 frequencies. This data set has been analyzed in several other papers on robust high-dimensional regression (e.g., Smucler and Yohai 2017; Loh 2018), as it is known that the data set contains contaminated observations both in the response variable and the frequency spectrum (Maronna 2011).

It is interesting that the classical EN estimator performs only slightly worse than some highly robust estimators. In addition, non-robust estimators and adaptive MM show a very high amount of variation in the prediction accuracy. This is also reflected in the number of relevant predictors estimated by these methods, which varies widely between the 50 different CV splits for some estimators. EN, for example, selects anywhere between 4 and 34 frequencies, while adaptive MM selects between 8 and 45. While adaptive PENSE selects more frequencies than other estimators (between 42 and 49), its selection is much more stable. This suggests that several frequencies are contaminated by extreme or unusual values with deleterious effects on most methods except for adaptive PENSE.
The adaptive PENSE fit computed on all 180 glass vessels, shown in Figure 5(b), suggests about 14 vessels have unusually large residuals. The adaptive PENSE fit also suggests a moderately heavy-tailed error distribution, further explaining why adaptive PENSE performs better than other estimators in this application. As demonstrated in the numerical experiments above, in such a setting with a heavy-tailed error distribution the contamination in the frequency spectrum can have very detrimental effects on non-adaptive and non-robust estimators. This is also a likely explanation why I-LAMM performs very poorly in this application.

Figure 5: (a) Prediction accuracy estimated by the uncentered τ-scales of the prediction errors from 50 replications of 6-fold cross-validation. Hyper-parameters are selected independently in each of these CVs via an "inner" 6-fold CV using 10 replications. (b) The 14 data points in the shaded areas have unusually low or high log-concentrations of P2O5 according to adaptive PENSE, with residuals larger than 3 times the estimated residual scale.

While PENSE achieves better prediction performance than I-LAMM in this application, its variable selection also seems highly affected by the contamination, with only 4 frequencies being selected. In comparison, adaptive PENSE has been shown to retain reliable variable selection performance in such a challenging scenario and selects 43 frequencies belonging to several groups of adjacent and hence highly correlated frequencies. Adaptive MM selects a similar but slightly larger set of 48 frequencies, and these slight differences likely translate to less accurate predictions of the concentration of P2O5.
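The replicated cross-validation used to estimate prediction accuracy above can be mimicked with the following generic skeleton. This is a sketch only: an ordinary least-squares fit stands in for the regularized robust estimators, and the MAD stands in for the uncentered τ-scale of the prediction errors used in the paper.

```python
import numpy as np

def mad_scale(r):
    # Stand-in robust scale of prediction errors; the paper uses the
    # uncentered tau-scale instead.
    return 1.4826 * np.median(np.abs(r))

def replicated_cv_scales(x, y, fit, predict, k=6, reps=10, scale=mad_scale, seed=0):
    """Replicated k-fold CV: one robust prediction-error scale per replication."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scales = []
    for _ in range(reps):
        idx = rng.permutation(n)
        resid = np.empty(n)
        for fold in np.array_split(idx, k):
            # Fit on the complement of the fold, predict the held-out fold.
            train = np.setdiff1d(idx, fold)
            model = fit(x[train], y[train])
            resid[fold] = y[fold] - predict(model, x[fold])
        scales.append(scale(resid))
    return np.array(scales)

# Ordinary least squares stands in for the regularized robust estimators.
ols_fit = lambda x, y: np.linalg.lstsq(x, y, rcond=None)[0]
ols_predict = lambda beta, x: x @ beta

rng = np.random.default_rng(1)
x = rng.normal(size=(60, 5))
y = x @ np.ones(5) + rng.normal(size=60)
errs = replicated_cv_scales(x, y, ols_fit, ols_predict, k=6, reps=5)
```

Repeating the k-fold split many times, as in the analysis above, exposes the variability of the prediction-error scale across random splits, which is exactly what distinguishes the stable estimators from the unstable ones in this application.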

Conclusions
Adaptive PENSE is more resilient in challenging scenarios than the other estimators considered here. This is underscored by adaptive PENSE's superior performance in predicting the concentration of the compound P2O5 in ancient glass vessels from their spectra. Adaptive PENSE not only achieves better prediction accuracy, it does so with a more parsimonious model than other robust methods and more reliably than non-robust methods.
In addition to the strong empirical performance, we have established theoretical guarantees for the estimator. Adaptive PENSE is asymptotically able to uncover the true set of relevant predictors, and the estimator of the corresponding coefficients is asymptotically Normal.
Importantly, these guarantees hold regardless of the tails of the error distribution and under overall very mild assumptions on the distribution of the predictors and the residuals.
Computing adaptive PENSE estimates is challenging, but the proposed computational solutions ensure a wide range of analyses and problem sizes are amenable to adaptive PENSE.
The algorithms and the proposed hyper-parameter search are made available in the R package pense (https://cran.r-project.org/package=pense). The high reliability even under adverse contamination in predictors and responses alike, combined with the readily available computational tools, make adaptive PENSE a feasible alternative in many statistical applications.

Acknowledgement
The author would like to thank Gabriela V. Cohen Freue for her feedback on early drafts of this manuscript.

References

Van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. New York, NY: Springer.

A Proofs
Below are the proofs of asymptotic properties of adaptive PENSE as presented in Section 4.
For notational simplicity, the intercept term is dropped from the model, i.e., the linear model (1) is simplified to Y = X⊤β_0 + U, and the joint distribution H_0 of (Y, X) is written in terms of the error distribution. All the proofs also hold for the model with an intercept term included. Another notational shortcut in the following proofs is to write the M-scale of the residuals in terms of the regression coefficients.

Proof of Lemma 1. We will show step by step that the space F = {η_{v,s} : v ∈ R^p, s ∈ (0, ∞)} is a bounded Vapnik-Chervonenkis (VC) class of functions and hence Glivenko-Cantelli. The space F is bounded because ϕ(t) is bounded by the assumptions on ρ. Define the mapping g_{v,s} and the corresponding function space G = {g_{v,s} : v ∈ R^p, s ∈ (0, ∞)}, which is a subset of a finite-dimensional vector space with dimension dim(G) = p + 1. Therefore, G is VC with VC index V(G) ≤ p + 3 according to Lemma 2.6.15 in van der Vaart and Wellner (1996). Due to the assumptions on ρ, the function ϕ(t) can be decomposed into ϕ(t) = ϕ_1(t) − ϕ_2(t), with ϕ_1 and ϕ_2 monotone functions. Thus, Φ_{1,2} = {ϕ_{1,2}(g(·)) : g ∈ G} and Φ^(−)_{1,2} = {ϕ_{1,2}(−g(·)) : g ∈ G} are also VC due to Lemma 2.6.18 (iv) and (viii) in van der Vaart and Wellner (1996). Applying Lemma 2.6.18 (i) in van der Vaart and Wellner (1996), F is VC as well, and hence the empirical process converges uniformly a.s.
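The construction in this proof can be summarized in display form. The concrete expression for g_{v,s} below is an assumption, chosen to be consistent with the stated dimension count dim(G) = p + 1 (the span of the coordinate maps of (y, x)):

```latex
% Assumed form of the mapping, consistent with dim(G) = p + 1:
g_{v,s}(y, x) = \frac{y - x^\top v}{s}, \qquad
\mathcal{G} = \bigl\{ g_{v,s} : v \in \mathbb{R}^p,\ s \in (0, \infty) \bigr\}
  \subset \operatorname{span}\bigl\{ (y,x) \mapsto y,\ (y,x) \mapsto x_1, \dots, (y,x) \mapsto x_p \bigr\}.

% Decomposition of the bounded function into monotone parts:
\varphi(t) = \varphi_1(t) - \varphi_2(t), \qquad
\Phi_{1,2} = \{ \varphi_{1,2}(g(\cdot)) : g \in \mathcal{G} \}, \quad
\Phi^{(-)}_{1,2} = \{ \varphi_{1,2}(-g(\cdot)) : g \in \mathcal{G} \}.
```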
Proof of Lemma 2. The first result (a) is a direct consequence of the conditions of the lemma (U − X⊤v_n → U a.s.) and Theorem 3.1 in Yohai (1987).
For part (b), it is known from Lemma 1 that the empirical process converges uniformly almost surely. Since σ_M(β_0) > 0, the continuous mapping theorem gives the desired convergence almost surely. Finally, the claimed limit follows from the continuity and boundedness of ϕ, which concludes the proof.
In particular, Lemma 3 states that (17) holds with arbitrarily high probability if n is sufficiently large.
Proof of Lemma 3. The proof of (16) relies on Lemma 4.5 from Yohai and Zamar (1986), which states that, under the same conditions as for this lemma, the corresponding uniform convergence holds. Therefore, the missing step is to show that sup_{v∈K} |σ_M(β*_n) − σ_M(β_0)| → 0 almost surely as n → ∞. This is done by contradiction.
Assume there exists an ε > 0 and a subsequence (n_k)_{k>0} such that sup_{v∈K} |σ_M(β*_{n_k}) − σ_M(β_0)| > ε for all k. Since v ∈ K with K a compact set, every sequence v_n has a convergent subsequence, and hence one of two limit cases, (i) or (ii), must hold. In case (i), the boundedness of ρ and the dominated convergence theorem yield a limit which contradicts the definition of σ_M(β_0 + v_{n_k}/√n_k). In case (ii), similar steps yield a contradiction for all n_k > N with N large enough. Therefore, the assumption sup_{v∈K} |σ_M(β*_{n_k}) − σ_M(β_0)| > ε cannot be valid, and hence sup_{v∈K} |σ_M(β*_n) − σ_M(β_0)| → 0. This concludes the proof of (16).
Before proving (17), note that the limiting quantity is well defined because it is strictly positive as per Lemma 6 in Smucler (2019). To prove (17), we first bound the denominator uniformly over v ∈ K. From Lemma 1 it is known that the empirical processes converge almost surely, uniformly over v ∈ K and s > 0. As a next step, we show the deterministic uniform convergence (18) of E_{H_0}[f_n(U, X, v, s)] to E_{F_0}[ϕ(U/s)]. The functions f_n(U, X, v, s) are bounded and converge pointwise to ϕ(U/s), entailing pointwise convergence of E_{H_0}[f_n(U, X, v, s)] → E_{F_0}[ϕ(U/s)] as n → ∞ by the dominated convergence theorem. Because ρ has a bounded second derivative, the derivative of f_n(U, X, v, s) with respect to v ∈ K and s ∈ [σ_M(β_0) − ε_1, σ_M(β_0) + ε_1] is also bounded, meaning f_n(U, X, v, s) is equicontinuous on this domain. Pointwise convergence together with equicontinuity make the Arzelà-Ascoli theorem applicable and hence conclude that (18) holds.
From (16) it follows that for any δ_2 > 0 there is an N_{δ_2} such that the corresponding bound holds for all v ∈ K and all n > N_{δ_2}. Combined with (18), this yields that for every δ_2 > 0 and ε_2 > 0 there is an N_{δ_2,ε_2} such that the bound holds for all n > N_{δ_2,ε_2} and every v ∈ K with probability greater than 1 − δ_2. Since both expected values are positive, this can also be written as in (19). The final piece for the denominator to be bounded is to show that (20) holds. Set Ω_1 = {ω : σ̂_M(β*_n; ω) → σ_M(β_0)}, which has P(Ω_1) = 1 due to the first part of this lemma. Similarly, set Ω_2 = {ω : equation (20) holds}. Assume now that P(Ω_1 ∩ Ω_2^c) > 0. This assumption entails that there exists an ω ∈ Ω_1 ∩ Ω_2^c, an ε_3 > 0, and a subsequence (n_k)_{k>0} violating (21) in the limit. However, since v_{n_k} is in the compact set K, the sequence β_0 + v_{n_k}/√n_k converges to β_0 as n → ∞. Additionally, ϕ is bounded, and together with the dominated convergence theorem this leads to a limit contradicting the claim in (21). Therefore, P(Ω_1 ∩ Ω_2^c) = 0, proving (20). Combining (19) and (20) leads to the conclusion in (22), which holds with arbitrarily high probability for large enough n and every v ∈ K.
From the first part of this lemma, σ̂_M(β*_n) → σ_M(β_0) a.s., and due to (22), for every δ > 0 there exists an N_{δ,ε} such that for all v ∈ K and n ≥ N_{δ,ε} equation (17) holds.

Proof of Proposition 1. For part (i), the proof is identical to the proof of strong consistency for the S-Ridge estimator (Proposition 1.i) in Smucler and Yohai (2017) and hence omitted. Although the penalty functions used for the S-Ridge and PENSE are different, the growth condition on λ_n has the same effect on PENSE as on the S-Ridge, making the penalty term negligible for large enough n.

Similarly, for part (ii), noting that the level of L2 penalization given by λ_n(1 − α)/2 converges deterministically to 0, the proof of strong consistency of adaptive PENSE is otherwise identical to the proof of strong consistency of the adaptive MM-LASSO given in Smucler and Yohai (2017).

A.2 Root-n consistency

The root-n consistency result states that (i) the PENSE estimator, as defined in (7), satisfies the root-n rate of convergence, and (ii) if λ_n → 0 and λ_n = O(1/√n), the adaptive PENSE estimator θ̂, as defined in (8), satisfies the root-n rate of convergence.

Proof. We will prove part (ii), as part (i) is essentially the same. The only difference is that the penalty loadings are deterministic and fixed at (1, . . . , 1), leading to a different constant D below.
The first step in the proof is a Taylor expansion of the objective function around the true parameter β_0, where v_n = τ(β − β_0) and β*_n = β_0 + v_n for some 0 < τ < 1. Due to the strong consistency of β̂ from Proposition 1, v_n → 0 a.s., and hence Lemma 2 and the continuous mapping theorem apply. The term Z_n is handled by a Taylor expansion of ψ for some v*_n = τ* v_n with τ* ∈ (0, 1). The rest of the proof follows closely the proof of Proposition 2 in Smucler and Yohai (2017).
Since σ̂_M(β̂) → σ_M(β_0) a.s., the results in Smucler and Yohai (2017) (which are derived from results in Yohai (1985)) apply to the terms involving x_i, and hence, with arbitrarily high probability for n sufficiently large, there is a constant B such that the bound in (23) holds. Similarly, the results in Smucler and Yohai (2017) can be used to show the bound in (24), with C_n converging a.s.
Next is the difference in the penalty terms, D_n := Φ(β̂) − Φ(β_0), which can be reduced to the truly non-zero coefficients. Observing that β̂ is a strongly consistent estimator, |β̂_j − β0_j| < ε_j < |β0_j| for all j = 1, . . . , s and any ε_j ∈ (0, |β0_j|) with arbitrarily high probability for sufficiently large n. This entails that, for all 0 ≤ t ≤ 1 and j = 1, . . . , s, the sign of the convex combination satisfies sgn(γ_j(t)) = sgn(β0_j) ≠ 0, and thus |γ_j(t)| is differentiable. This allows application of the mean value theorem to the quadratic and the absolute terms in D_n for some τ_j ∈ (0, 1), j = 1, . . . , s, with arbitrarily high probability for large enough n. Because both β̂ and the preliminary estimator are strongly consistent for β_0 and λ_n = O(1/√n), there exists a constant D such that the bound in (25) holds with arbitrarily high probability for sufficiently large n.
Since β̂ minimizes the adaptive PENSE objective function O_AS, the bounds derived in (23), (24), and (25) in turn yield an inequality that holds with arbitrarily high probability for large enough n. Rearranging the terms leads to the bound in (27). Combining (26) and (27) yields (28), uniformly over v_1 and v_2, with arbitrarily high probability for sufficiently large n. By assumption, αλ_n n^{ζ/2} → ∞, and hence the right-hand side in (28) will eventually be positive, concluding the proof.

A.4 Asymptotic Normal distribution
Proof of Theorem 2, part (ii). For this proof we denote the values of the active predictors and the active predictors in the i-th observation by X_I and x_{i,I}, respectively. Because β̂ is strongly consistent for β_0, the coefficient values for the truly active predictors are almost surely bounded away from zero if n is large enough. This entails that the partial derivatives of the penalty function exist for the truly active predictors, and the gradient of the objective function at the estimate β̂ vanishes. The truly active coefficients can then be separated from the truly inactive coefficients using the properties of ψ. Equation (29) can now be rewritten and, using the mean value theorem, there are τ_i ∈ [0, 1] and hence a matrix such that the equation can be further rewritten. Separating the term √n(β*_I − β0_I) then gives (30). The strong consistency of β̂ for β_0 and Lemma 2 lead to σ̂_M(β̂) → σ_M(β_0) a.s. Combined with the assumption that √n λ_n → 0, this leads to the last two lines in (30) converging to 0_s in probability. Finally, the remaining term converges to a Normal distribution by Lemma 5.1 in Yohai (1985) and the CLT, which, after applying Slutsky's Theorem, completes the proof.

B Additional simulation results
Here we present additional plots for the scenario detailed in Section 5.1 as well as an alternative scenario with more severe contamination.
For the first scenario, Figure 6 shows plots for prediction accuracy including the least-squares EN and adaptive least-squares EN estimators for comparison. Moreover, in Figure 7 we show the prediction accuracy relative to the prediction accuracy achieved by adaptive PENSE, i.e., RPP_m as defined in Equation (31). Here m is the estimation method (e.g., adaptive MM), and τ̂_m is the scale of the prediction error achieved by method m. Hence, a value of RPP_m = 0.1 means that the scale of the prediction error of method m is 10% larger than that of adaptive PENSE, and a value of RPP_m = −0.1 means that the scale of the prediction error of adaptive PENSE is 10% higher than that of method m. For Normal errors, τ̂ is the root mean squared prediction error, while for all other error distributions the τ-size of the prediction errors is used.
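Equation (31) itself is not reproduced here; the following sketch implements the symmetric relative measure implied by the interpretation in the text (positive when method m has the larger prediction-error scale, negative when adaptive PENSE does), which is not necessarily identical to the paper's exact definition.

```python
def rpp(tau_m, tau_ref):
    """Relative prediction performance of method m vs. a reference
    (adaptive PENSE).  Sketch based on the stated interpretation of
    Equation (31), not necessarily its exact form."""
    if tau_m >= tau_ref:
        return tau_m / tau_ref - 1.0   # method m is RPP*100% worse
    return -(tau_ref / tau_m - 1.0)    # reference is |RPP|*100% worse

# A method whose error scale is 10% larger than adaptive PENSE's:
assert abs(rpp(1.1, 1.0) - 0.1) < 1e-12
# Adaptive PENSE's error scale 10% larger than the method's:
assert abs(rpp(1.0, 1.1) + 0.1) < 1e-12
```

This symmetric form treats the two directions identically, so a value of ±0.1 always reads as "10% worse" for one of the two estimators.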
For a more fine-grained picture of variable selection performance, we also show the sensitivity and specificity in Figure 8.

Figure 6: Prediction performance of robust and non-robust estimators, measured by the uncentered τ-scale of the prediction errors relative to the true τ-scale of the error distribution (lower is better). The median out of 50 replications is depicted by the points and the lines show the interquartile range.

Figure 7: Prediction accuracy of robust and non-robust estimators, relative to the prediction accuracy of adaptive PENSE. The relative accuracy is defined in Equation (31).

Figure 8: Variable selection performance of robust and non-robust estimators. Higher is better for both measures. Sensitivity (left) is the number of selected predictors which are truly active relative to the total number of truly active predictors. Specificity (right) is the number of predictors which are truly inactive and not selected relative to the total number of inactive predictors. The median out of 50 replications is depicted by the points and the lines show the interquartile range.

B.1 Alternative scenario
Here we present additional results for a more challenging alternative scenario with n = 100, p = 32 to p = 128, and s = log_2(p). The scenario introduces a higher proportion of contamination (25%) and more severe outliers (k_lev = 256, and the contamination model uses a coefficient value of −3 instead of −1). In addition to the log_2(p) irrelevant predictors, bad leverage points also affect 2 truly relevant predictors. Good leverage points have extreme values in at most log_2(p) non-relevant predictors. The predictors follow a multivariate t-distribution with 4 degrees of freedom and AR1-type correlation structure of Cor(X_j, X_{j'}) = 0.5^|j−j'|, j, j' = 1, . . . , p.
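The predictor distribution of this alternative scenario can be generated as follows; a sketch only, with the multivariate-t construction (Normal divided by an independent scaled chi-square) assumed as in the first scenario.

```python
import numpy as np

def ar1_t_predictors(n, p, df=4, rho=0.5, seed=0):
    """Multivariate t predictors with AR(1)-type correlation
    Cor(X_j, X_j') = rho**|j - j'|, as in the alternative scenario."""
    rng = np.random.default_rng(seed)
    j = np.arange(p)
    # AR(1) scale matrix: correlation decays geometrically with lag.
    sigma = rho ** np.abs(j[:, None] - j[None, :])
    z = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    w = rng.chisquare(df, size=(n, 1)) / df
    return z / np.sqrt(w)  # multivariate t with df degrees of freedom

x = ar1_t_predictors(n=100, p=32)
```

The geometric decay of the AR(1) structure makes neighbouring predictors moderately correlated while distant predictors are nearly independent, a markedly different regime from the grouped structure of the first scenario.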
Figures 9 and 10 show the prediction accuracy relative to the true error scale and relative to the prediction accuracy achieved by adaptive PENSE, respectively. Overall variable selection performance in terms of the MCC is shown in Figure 11, whereas Figure 12 gives more detailed insights into the sensitivity and specificity of variable selection performance.
Overall, the conclusions are very similar to those for the scenario presented in Section 5.1, showcasing that adaptive PENSE is highly robust and its performance is similar to adaptive MM. In this more challenging scenario, however, we can observe that in some situations adaptive MM is affected considerably more by the contamination than adaptive PENSE, sometimes having more than 30% higher prediction error, as highlighted in Figure 10.

Figure 9: Prediction performance of robust and non-robust estimators in the alternative scenario, measured by the uncentered τ-scale of the prediction errors relative to the true τ-scale of the error distribution (lower is better). The median out of 50 replications is depicted by the points and the lines show the interquartile range.

Figure 10: Prediction accuracy of robust and non-robust estimators in the alternative scenario, relative to the prediction accuracy of adaptive PENSE. The relative accuracy is defined in Equation (31), with positive values indicating a larger scale of the prediction errors than achieved by adaptive PENSE and negative values indicating better prediction accuracy than adaptive PENSE. The median out of 50 replications is depicted by the points and the lines show the interquartile range.

Figure 11: Overall variable selection performance of robust and non-robust estimators in the alternative scenario. Performance is measured by the Matthews correlation coefficient (MCC; higher is better), defined in (13). The median out of 50 replications is depicted by the points and the lines show the interquartile range.

Figure 12: Detailed variable selection performance of robust and non-robust estimators in the alternative scenario. Sensitivity (left) is the number of selected predictors which are truly active relative to the total number of truly active predictors. Specificity (right) is the number of predictors which are truly inactive and not selected relative to the total number of inactive predictors.
The median out of 50 replications is depicted by the points and the lines show the interquartile range.