CR-Lasso: Robust cellwise regularized sparse regression

Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may be neither feasible nor efficient in dealing with such contaminated datasets. We propose CR-Lasso, a robust Lasso-type cellwise regularization procedure that performs feature selection in the presence of cellwise outliers by simultaneously minimising a regression loss and a measure of cellwise deviation. To evaluate the approach, we conduct empirical studies comparing its selection and prediction performance with several sparse regression methods. We show that CR-Lasso is competitive under the settings considered. We illustrate the effectiveness of the proposed method on real data through an analysis of a bone mineral density dataset.


Introduction
Identifying the most important features in an n × p design matrix X to predict an outcome vector y is a fundamental problem in statistics. This problem becomes challenging when there is contamination in the data, that is, when some elements of the full data matrix [y, X] are corrupted. It is commonly believed that most raw real-world datasets, prior to data cleaning, contain a small proportion of outliers (Hampel et al., 1986).
Potential issues caused by these outliers are often ignored (Rousseeuw and Leroy, 2005) even though they may negatively impact estimation and variable selection (Weisberg, 2005).
There are two typical categories of outliers: rowwise outliers and cellwise outliers. Rowwise outliers refer to observation vectors whose components are entirely contaminated, such as visualised in row 2 and row 15 in Figure 1. One way to overcome this challenge is through robust sparse estimators that can deal with rowwise outliers under sparse settings by combining traditional robust estimators with Lasso-type regularization. For example, Alfons et al. (2013) proposed to incorporate least trimmed squares with the Lasso (Tibshirani, 1996), while Smucler and Yohai (2017) and Chang et al. (2018) proposed combining MM-estimation (Yohai, 1987) with the Lasso and the adaptive Lasso (Zou, 2006), respectively. These estimators work by downweighting outlying observations.
Cellwise outliers refer to cells that are contaminated in a way that makes them different to the true data generating process. It is typically assumed that these outliers occur independently across the design matrix X. Observations may experience some form of corruption in one or more of the predictors, such as visualized by the scattered white cells in Figure 1. Cellwise outliers are more challenging than rowwise outliers because of the propagation of these outliers to many of the rows, where even a small proportion of cellwise outliers in each predictor could cause a large proportion of outlying observation vectors (Alqallaf et al., 2009).
Outlier detection is often the first step taken when dealing with cellwise outliers in a dataset. One common approach, as suggested by Rousseeuw and Van Den Bossche (2018), is to predict the value of a cell and flag it as an outlier if the difference between its predicted value and observed value exceeds a certain threshold.
Another technique, described by Debruyne et al. (2019), treats all outliers as rowwise outliers and identifies the cells that contribute most to the outlying rows. Raymaekers and Rousseeuw (2021) presented "cellflagger", a method that detects outlying cells for each row by combining Lasso regularization with a stepwise application of cutoff values.
In the field of robust regression under cellwise contamination, several methods have been proposed to address the issue of outliers. For example, Öllerer et al. (2016) introduced the Shooting S-estimator, which combines the Shooting algorithm and the S-estimator to perform robust regression. Leung et al. (2016) proposed a three-step regression method that involves detecting rowwise and cellwise outliers, followed by covariance matrix estimation and regression.

In this paper, we provide a new perspective on dealing with cellwise outliers in regression models. An active predictor with a cellwise outlier generally increases both the magnitude of the regression residual and the deviation of the cell. Building on this idea and the work in Chen et al. (2013), we propose the cellwise regularized Lasso (CR-Lasso), which incorporates a regression loss and a cell deviation measure into a single loss function. We then apply a cellwise regularized and sparse regression procedure that helps identify active predictors and possible outliers.
The structure of this paper is as follows. Section 2 describes the proposed method and algorithm details. Section 3 illustrates the simulation results in low and high-dimensional settings. A real data application is presented in Section 4. Finally, we state some conclusions in Section 5. R functions and sample code that implement the proposed approach are available on the GitHub page of the first author (https://github.com/PengSU517/regcell).

Cellwise robust sparse regression
We first explore the limitations of traditional regularization techniques when dealing with cellwise contamination. We then introduce a novel approach that addresses these issues by providing a constrained loss function, which will be discussed in detail below.
Consider an observed response y_i, a set of p predictors x_i, and a corresponding coefficient vector β in a linear regression model framework: y_i = x_i⊤β + ε_i, i = 1, …, n, (1) where in classical settings the error term ε_i is assumed to be independent N(0, σ²) distributed. We can write this as y = Xβ + ε, where y is an n-dimensional response vector, X is an n × p design matrix, both of which may include some outliers, and ε is the error vector. Without loss of generality, unless otherwise specified, we assume all predictors have zero mean and unit variance and that the variance of the error term ε equals one. If not otherwise mentioned, we do not include an intercept term in the linear regression model as we work with centered data.

Modification of the regression loss
The iterative procedure for outlier detection (IPOD) method (She and Owen, 2011) is a modification of the Lasso and handles outlying rows by adding an extra term ζ: argmin_{β, ζ} ||y − Xβ − ζ||_2^2 + θ||ζ||_1, (2) where ζ indicates possible outlying parts in the response y and θ is a tuning parameter for ζ. She and Owen (2011) showed the equivalence between IPOD and Huber's M-estimate β_H = argmin_β Σ_i ρ_θ(y_i − x_i⊤β) (Huber, 2004), where Huber's loss function is ρ_θ(r) = r²/2 for |r| ≤ θ and ρ_θ(r) = θ|r| − θ²/2 otherwise. (3) With some non-convex penalties, IPOD is equivalent to other M-estimates.
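Written out, Huber's loss is quadratic in the centre and linear in the tails. A minimal numerical sketch (our own illustration, using the common 1/2-scaled convention):

```python
import numpy as np

def huber_loss(r, theta):
    """Huber's rho_theta: quadratic for |r| <= theta, linear beyond.
    A sketch of the loss referenced in the text; the 1/2 scaling is one
    common convention."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= theta,
                    0.5 * r**2,
                    theta * np.abs(r) - 0.5 * theta**2)
```

Residuals within θ are treated quadratically (as in least squares), while larger residuals grow only linearly, which is what bounds the influence of rowwise outliers in the response.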
The non-convex penalties in this scenario can be likened to redescending loss functions, known for their robustness against rowwise outliers, including leverage points. However, like M-estimates, IPOD is only robust against rowwise outliers.
To be robust against cellwise outliers, Chen et al. (2013) suggested the modification argmin_{β, ∆} ||y − (X − ∆)β||_2^2 + η||∆||_1, (4) where ∆ is an n × p matrix indicating possible outlying parts in the design matrix X, and η is a tuning parameter for ∆. However, the solution of (4) is non-convex and intractable because of the bilinear term ∆β (Chen et al., 2013).
In earlier work, Zhu et al. (2011) considered using the squared Frobenius norm ||∆||_F^2 as a penalty: argmin_{β, ∆} ||y − (X − ∆)β||_2^2 + η||∆||_F^2, which is more suitable for dealing with measurement errors (where ∆ is dense and bounded) than for cellwise outliers (where ∆ is assumed to be sparse).
Addressing cellwise outliers involves individually regularizing cells within the design matrix X. This leads to the formulation of the following loss function: argmin_∆ ||X − ∆||_F^2 + η||∆||_1, (5) which is solved cell by cell by soft-thresholding, ∆_ij = sign(x_ij) max(|x_ij| − η/2, 0). Then a regularized design matrix X̃ = X − ∆ is obtained, whose cells satisfy x̃_ij = sign(x_ij) min(|x_ij|, η/2). This solution is equivalent to Winsorization (Dixon, 1960). The similarity between Winsorization and Huber's loss was noted in Huber (1964). Although this connection has not been extensively investigated, a link has previously been established between Huberization and the L1 penalty (Sardy et al., 2001).
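This cellwise soft-thresholding step is easy to state in code. A minimal sketch (our own illustration; for the loss above, the cellwise solution is a soft threshold at η/2, so the regularized cells are clipped to [−η/2, η/2]):

```python
import numpy as np

def winsorize_cells(X, eta):
    """Cellwise soft-thresholding: Delta_ij = sign(x_ij)*max(|x_ij| - eta/2, 0),
    so the regularized matrix X - Delta clips every cell to [-eta/2, eta/2],
    i.e. Winsorization. A sketch consistent with the marginal loss in the text."""
    X = np.asarray(X, dtype=float)
    Delta = np.sign(X) * np.maximum(np.abs(X) - eta / 2.0, 0.0)
    return X - Delta
```

For standardized predictors, cells beyond the threshold are pulled back to the boundary while all other cells are left untouched.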
However, using formula (5), we can only marginally shrink cells in the design matrix. To deal with cellwise outliers in a regression model, we propose to combine loss functions (4) and (5). The underlying idea is that, in an ideal scenario, a cellwise outlier within an active predictor should have two noticeable effects. Firstly, the regression residual's magnitude should increase, indicating a deviation from the expected relationship. Secondly, it should also contribute to an increased deviation of this cell, highlighting its distinctiveness compared to the other cells in the design matrix.
From this view, we modify the loss function as the sum of the regression loss and the deviation loss: argmin_{β, ∆} ||y − (X − ∆)β||_2^2 + ||X − ∆||_F^2 + η||∆||_1, (6) where ||y − (X − ∆)β||_2^2 measures the regression loss and ||X − ∆||_F^2 measures the deviation loss. The goal of the minimisation problem in (6) is to decrease the sum of the regression loss and the deviation loss by shrinking only a few outlying cells in the design matrix.
However, shrinking cells in predictors can only deal with outliers in X.
Thus, we add a ζ term that addresses possible outliers in y: argmin_{β, ∆, ζ} ||y − ζ − (X − ∆)β||_2^2 + ||X − ∆||_F^2 + η||∆||_1 + θ||ζ||_1, (7) where ζ indicates possible outlying parts in the response y and θ is a tuning parameter to control the sparsity in ζ.
To illustrate the mechanism, Figure 2 shows a simple linear regression model through the origin, y = βx + ε. The majority of the clean observations are distributed in the ellipse, such as point A. The grey dashes represent the boundaries of cell regularization in x. Any cells outside the grey boundaries will be regarded as outliers in x and will undergo η-regularization. This regularization process involves horizontally shifting the outliers towards the grey boundary until they reach it. Regularizing point B will simultaneously decrease both the regression residual and the deviation magnitude. Conversely, regularizing point C would reduce the deviation loss but increase the regression loss. In this case, point C will be regarded as a good high-leverage point and will not be regularized, despite having a large deviation magnitude. Blue dashes indicate the boundaries of cellwise regularization in the response y. Point D has a large regression residual but a small deviation. In this case, point D will be regarded as an outlier in y and will be θ-regularized, which means it will be moved vertically until it reaches the blue boundary.
We observe that the objective loss function (7) is convex in each of ζ, ∆, and β separately. However, because of the presence of the bilinear term ∆β, it is not jointly convex in all the parameters considered. Fortunately, to solve the objective loss function (7), an iterative block coordinate descent algorithm can be used, yielding successive estimates of ζ, ∆, and β.
The algorithm can be divided into two stages: cellwise regularization (as in Algorithm 1) and regression (as in Algorithm 2). In the first stage, given an estimate β, the loss function (7) becomes a minimisation problem with respect to ∆ and ζ: argmin_{∆, ζ} ||y − ζ − (X − ∆)β||_2^2 + ||X − ∆||_F^2 + η||∆||_1 + θ||ζ||_1. (8) Note that (8) combines an L2 loss with L1 regularization, and can be solved directly using a gradient descent algorithm. The cellwise regularization proceeds as in Algorithm 1.
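As a concrete illustration of this stage, the following is a minimal proximal-gradient (ISTA-style) sketch, not the authors' implementation: the step-size rule and iteration count are our own illustrative choices, and the loss is taken as ||y − ζ − (X − ∆)β||_2^2 + ||X − ∆||_F^2 + η||∆||_1 + θ||ζ||_1.

```python
import numpy as np

def soft(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cellwise_regularize(X, y, beta, eta, theta, n_iter=300):
    """Sketch of the cellwise regularization stage: given beta, minimize
    ||y - zeta - (X - Delta) beta||_2^2 + ||X - Delta||_F^2
      + eta * ||Delta||_1 + theta * ||zeta||_1
    over (Delta, zeta) by proximal gradient steps."""
    n, p = X.shape
    Delta = np.zeros((n, p))
    zeta = np.zeros(n)
    step = 0.5 / (2.0 + beta @ beta)   # conservative step for the smooth part
    for _ in range(n_iter):
        r = y - zeta - (X - Delta) @ beta
        grad_Delta = 2.0 * np.outer(r, beta) - 2.0 * (X - Delta)
        grad_zeta = -2.0 * r
        Delta = soft(Delta - step * grad_Delta, step * eta)
        zeta = soft(zeta - step * grad_zeta, step * theta)
    return Delta, zeta
```

Because the loss decomposes row by row given β, clean rows whose cells and residuals are small are left untouched, while a grossly outlying cell accumulates a nonzero entry in ∆.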
In the second stage, after getting ∆ and ζ, a regularized design matrix X̃ = X − ∆ is obtained as well as a regularized response ỹ = y − ζ. The loss function then reduces to ||ỹ − X̃β||_2^2, (9) which is a regression loss that can be solved directly using ordinary least squares (OLS). Combining the cellwise regularization technique with OLS, we have the Cellwise Regularized Least Squares (CR-LS) to run regression and detect outliers simultaneously.
This alternating procedure is similar to the EM algorithm. In the first step, the algorithm detects and imputes outliers based on the observed data and the current parameter estimates. In the second step, the algorithm updates the model parameters to maximize the likelihood of the observed data, incorporating the information gained from the first step. Details of the CR-LS algorithm are given in Algorithm 2.

Cellwise regularized sparse regression
For high-dimensional data and under sparse settings, only a small subset of predictors is active, which means β is a sparse vector with only a few nonzero coefficients, typically far fewer than min(n, p). Many techniques have been proposed in the last thirty years to recover the sparse β. For instance, the popular Lasso (Tibshirani, 1996) solves an L1-regularized objective loss: argmin_β ||y − Xβ||_2^2 + λ||β||_1, (10) where λ is a tuning parameter. When X is well-conditioned, Lasso-type estimators can guarantee a high recovery rate for β when the chosen λ is appropriate, meaning that the estimate of β depends on λ.
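For intuition, the Lasso objective above can be minimised by cyclic coordinate descent, where each coefficient update is a univariate soft-thresholding step. A bare-bones sketch (our own illustration, not the glmnet implementation; columns are assumed standardized and nonzero):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for argmin_b ||y - X b||_2^2 + lam * ||b||_1.
    Each coordinate update is a soft threshold at lam/2 (the factor 2 comes
    from the squared-error gradient). Assumes nonzero columns."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0)
    r = y.astype(float).copy()          # current residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]      # add back coordinate j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_norm2[j]
            r -= X[:, j] * beta[j]
    return beta
```

Setting lam = 0 recovers ordinary least squares on well-conditioned designs; increasing lam drives small coefficients exactly to zero.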
To simultaneously select active variables and detect outlying cells under cellwise contamination, we combine (7) and (10) as follows: argmin_{β, ∆, ζ} ||y − ζ − (X − ∆)β||_2^2 + ||X − ∆||_F^2 + λ||β||_1 + η||∆||_1 + θ||ζ||_1. (11) Similar to the procedure utilized to solve the optimization problem (7), we can employ an iterative block coordinate descent algorithm to solve (11) and sequentially estimate β, ∆, and ζ. The algorithm consists of two stages: cellwise regularization and sparse regression. In the cellwise regularization stage, given an estimate β, the transformed loss function remains as (8), which can be efficiently solved using Algorithm 1.
In the second stage, after getting ∆ and ζ, a regularized design matrix X̃ = X − ∆ is obtained as well as a regularized response ỹ = y − ζ. Then we obtain a new β via solving argmin_β ||ỹ − X̃β||_2^2 + λ||β||_1, (12) which is a classical Lasso-type optimization. Combining the cellwise regularization with the Lasso, we have the Cellwise Regularized Lasso (CR-Lasso) to select variables and detect outliers simultaneously. Details of the CR-Lasso algorithm are stated in Algorithm 3.
It is important to understand the convergence properties of Algorithm 3. Let L(β(k), ∆(k), ζ(k)) denote the objective value of (11) at iteration k. Because the block coordinate descent iterations are non-increasing, L(β(k+1), ∆(k+1), ζ(k+1)) ≤ L(β(k), ∆(k), ζ(k)) for any k ≥ 0. Therefore, the final estimates converge to a local minimum. The algorithm's efficiency depends on various factors, including the dimension and the data-generating distribution. With the default convergence levels set at ε_1 = 10^-6 and ε_2 = 10^-3 and the simulation settings detailed in Section 3.1, we observed convergence in all simulation runs within 20 iterations.

[Algorithm 3: Cellwise Regularized Lasso, β ← CR-Lasso(X, y, λ, η, θ). Input: X, y, λ, η, θ; initialization: β(0), ∆(0), k ← 0; output: β.]

Data initialization
For a given dataset [y, X], it is possible to execute Algorithm 3 with non-standardized data. However, standardizing the dataset and subsequently back-transforming the coefficients is a strategic approach to enforce equivariance properties, which would otherwise be compromised due to the regularization process. For the estimation of location parameters, utilizing the coordinatewise median may pose a challenge, as it could fall outside the geometric boundaries of the dataset. While this concern can be mitigated by opting for the multivariate L1 median (Dodge and Rousson, 1999), it comes at the expense of increased computational complexity. In this paper, we choose the classic coordinatewise median to enhance computational efficiency.
To achieve robust scale estimation comparable to standard deviation estimates under the normal distribution, various robust scale estimators, such as the median absolute deviation, the Qn estimator (Rousseeuw and Croux, 1993) and the Pn estimator (Tarr et al., 2012), can be employed. We advocate the use of the Qn estimator due to its robustness and efficiency.
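A transparent O(n²) sketch of Qn and the resulting columnwise standardization (illustrative only; production implementations use an O(n log n) algorithm and finite-sample corrections, which we omit):

```python
import numpy as np
from itertools import combinations

def qn_scale(x, c=2.2219):
    """Qn (Rousseeuw and Croux, 1993): c times the k-th order statistic of
    the pairwise distances |x_i - x_j|, i < j, with h = floor(n/2) + 1 and
    k = C(h, 2). The factor c ~ 2.2219 makes Qn consistent for the Gaussian
    standard deviation. Plain O(n^2) version, no finite-sample correction."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = np.sort([abs(a - b) for a, b in combinations(x, 2)])
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return c * d[k - 1]

def robust_standardize(X):
    """Columnwise (median, Qn) standardization, as used before running CR-Lasso."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    scale = np.array([qn_scale(X[:, j]) for j in range(X.shape[1])])
    return (X - med) / scale
```

A useful property: replacing one clean value by a gross outlier can leave Qn unchanged, since the estimator depends only on a central order statistic of the pairwise distances.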
This results in a robustly standardized design matrix X⋆. The effectiveness of the proposed method hinges on a suitable scale estimate, and it is imperative to determine this estimate before applying our algorithm. We obtain a robust estimate of the residual standard deviation, σ, using RLars (Khan et al., 2007). Then we obtain standardized estimates β⋆, ∆⋆, and ζ⋆ from a standardized version of the objective loss, (13), in which the regression residuals are scaled by the estimate of σ. Obtaining good initial values is important as the loss function (13) is nonconvex. To accelerate the algorithm's convergence and improve its accuracy, we recommend using RLars to obtain an initial estimate of β, where the R function rlars is in the R package robustHD (Alfons, 2021). Nevertheless, in our simulation studies, we observed that even when utilizing a zero vector as the initial value for β, we were able to achieve satisfactory estimation results.

The choice of tuning parameters
In Lasso-type problems, a common approach for determining the optimal regularization parameter involves computing the solution path for a sequence of values of λ and then selecting the λ value that provides the best prediction results or the smallest information criterion value. As stated by Friedman et al. (2010), the sequence begins with λ_max, the smallest value of λ for which the entire estimated coefficient vector is zero. The minimum value is set as λ_min = ιλ_max.
Finally, a sequence of K values for λ is constructed, increasing logarithmically from λ_min to λ_max. In our studies we set ι = 0.001 and K = 50.
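This grid construction can be sketched as follows (assuming the Lasso loss ||y − Xβ||_2^2 + λ||β||_1 on centered data, for which the all-zero solution is optimal whenever λ ≥ 2 max_j |x_j⊤ y|):

```python
import numpy as np

def lambda_grid(X, y, iota=0.001, K=50):
    """Log-spaced grid from lambda_min = iota * lambda_max up to lambda_max,
    where lambda_max = 2 * max|X^T y| is the smallest lambda for which the
    KKT conditions of the Lasso give beta = 0 (centered data assumed)."""
    lam_max = 2.0 * np.max(np.abs(X.T @ y))
    lam_min = iota * lam_max
    return np.logspace(np.log10(lam_min), np.log10(lam_max), num=K)
```

The solution path is then computed once per grid value, typically with warm starts from the previous λ.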
Two classical selection criteria for sparse regression models are the AIC (Akaike, 1974) and the BIC (Schwarz, 1978), where AIC = L + 2k and BIC = L + log(n)k, with L denoting the corresponding loss and k the number of active predictors. For the proposed method, L is the loss attained at the estimates for a given λ. We use the BIC as the default selection criterion because of its ease of implementation and good performance in our simulation studies and real data applications. For conciseness, we do not report results for criteria other than the BIC.
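The selection rule can be sketched as follows, where the caller supplies the loss value L of each candidate model along the λ path (an illustration, not the packaged implementation):

```python
import numpy as np

def bic_select(losses, betas, n):
    """Pick the model minimizing BIC = L + log(n) * k, where k is the number
    of active (nonzero) predictors. 'losses' holds the loss value L of each
    candidate model fitted along the lambda path."""
    bics = [loss + np.log(n) * np.count_nonzero(beta)
            for loss, beta in zip(losses, betas)]
    return int(np.argmin(bics))
```

The log(n) complexity penalty favours sparser models than the AIC when n is large, which matches the variable-selection focus here.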
We implement a hard threshold for model selection to avoid potential over-regularisation issues, where the algorithm selects only a few predictors and shrinks almost all the cells in the selected predictors. Such an occurrence is more likely when setting a relatively large λ. To address this, we exclude any model from consideration when the tuning parameter λ results in more than 30% of the cells of any active predictor being shrunk.
Regarding η, a natural choice is to set η = z_0.995 for all cells, where z_0.995 = 2.576 is the 99.5% quantile of the standard normal distribution. A quantile threshold is commonly used, such as in Huberization (Huber, 2004) and DDC (Rousseeuw and Van Den Bossche, 2018). Similarly, setting θ = z_0.995 is also a natural choice. However, this parameter is sensitive to the estimated error scale, σ, which may not be sufficiently accurate under cellwise contamination. Therefore, to ensure robustness in our simulation studies, we set θ = 1. Maronna et al. (2019) showed that good efficiency under Gaussian assumptions can be achieved despite a conservative choice for θ.

Post-cellwise-regularized regression
It is well known that the Lasso estimator can introduce bias. To counteract this, we perform CR-Lasso with post-cellwise-regularized regression based on the selected predictors to enhance the model's performance. The process is as follows:

1. Parameters η and θ are fixed as described in Section 2.4. This choice is made because optimising three tuning parameters simultaneously is time-consuming in practice.
2. For a grid of λ values, we run Algorithm 3 sequentially and use the BIC to choose the optimal tuning parameter λ_opt.
3. After obtaining λ_opt, we run Algorithm 2 using only the variables retained under λ_opt to obtain the final regression coefficient vector estimate β.
Our simulation studies presented in Section 3 consider the performance of CR-Lasso with post-cellwise-regularized regression. In Appendix A, we show the performance of CR-Lasso both with and without post-cellwise-regularized regression. The findings indicate that even without post-cellwise-regularized regression, CR-Lasso still outperforms other methods under cellwise contamination.

Simulation
To demonstrate the effectiveness of the proposed method, we ran simulation studies and compared the performance of six methods for a moderate-dimensional and a high-dimensional setting: CR-Lasso, sparse shooting S (Bottmer et al., 2022, SSS), robust Lars (Khan et al., 2007, RLars), MM-Lasso (Smucler and Yohai, 2017), sparse LTS (Alfons et al., 2013, SLTS) and Lasso (Tibshirani, 1996). For the implementation of SSS, we utilized the R function sparseshooting (Wilms, 2020). RLars and SLTS were implemented using the R functions rlars and sparseLTS in the R package robustHD (Alfons, 2021). The R function mmlasso (Smucler, 2017) was employed for MM-Lasso. Lasso was implemented through the R function glmnet in the R package glmnet (Friedman et al., 2010). By default in the utilized functions, Lasso and MM-Lasso employ cross-validation (CV) to select optimal tuning parameters. In contrast, SSS, RLars, and CR-Lasso utilize the Bayesian information criterion (BIC) for tuning parameter selection. SLTS uses a default tuning parameter without optimization. Among all the compared methods, CR-Lasso, SSS, and RLars are relatively robust to cellwise outliers, MM-Lasso and SLTS are rowwise robust methods, and Lasso is a classical variable selection technique that is not robust to any outliers and is included here for completeness.

Moderate-dimensional setting
In our simulations, we set n = 200, p = 50 and β = (1⊤_10, 0⊤_(p−10))⊤ with an intercept term β_0 = 1. Clean observation vectors x_i were sampled from N(0, Σ), and errors ε_i were sampled from N(0, 3²), indicating a relatively high level of noise in the data. The correlation structure among predictors was given by Σ_ij = ρ^|i−j| and we set ρ = 0.5. We also generated x_i from the multivariate t_4 distribution to simulate distributions with heavier tails.
To introduce cellwise outliers, we set the contamination proportion e to 0%, 2% and 5% for all predictors and generated outliers independently.
Outlying cells ∆_ij and ζ_i were randomly generated from N(γ, 1) and N(−γ, 1) with equal probability, where γ varies to simulate outliers with different magnitudes.
We ran 200 simulations for each scenario and used the root mean squared prediction error (RMSPE) to assess the prediction accuracy of the considered methods. In addition, to assess the accuracy of variable selection, we employed the F1 score, F1 = 2TP / (2TP + FP + FN), where TP, FP, and FN indicate true positives, false positives, and false negatives, respectively. While the F1 score is commonly used for classification problems, it is also used to evaluate the performance of variable selection techniques, as in Bleich et al. (2014).
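Computing the score from the estimated and true supports is straightforward (a small sketch):

```python
import numpy as np

def f1_selection(beta_hat, beta_true):
    """F1 = 2TP / (2TP + FP + FN), comparing the support (nonzero pattern)
    of the estimated coefficient vector with the true one."""
    sel = np.asarray(beta_hat) != 0
    act = np.asarray(beta_true) != 0
    tp = np.sum(sel & act)
    fp = np.sum(sel & ~act)
    fn = np.sum(~sel & act)
    return 2.0 * tp / (2.0 * tp + fp + fn)
```

A score of 1 means the selected support matches the true one exactly; missing active variables or selecting inactive ones both pull the score down.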
The advantage of using the F1 score to measure variable selection is that it takes into account both the precision and recall of the selected variables. Precision measures the proportion of selected variables that are relevant, while recall measures the proportion of relevant variables that are selected. By combining precision and recall, the F1 score provides a balanced evaluation of variable selection performance. This allows us to measure the effectiveness of each method in selection and prediction.

At a 2% contamination rate, CR-Lasso exhibits superior and stable performance, even with a high magnitude of outlyingness. On the other hand, Lasso, MM-Lasso, RLars, SLTS and SSS perform worse as γ increases. At a 5% contamination rate, CR-Lasso maintains stable performance, with prediction results similar to those under 2% contamination. In contrast, the other methods show inferior results compared with their counterparts under 2% contamination.
The behaviour of the six compared methods differs significantly when the predictors follow a t_4 distribution, which is known to generate many high-leverage points. CR-Lasso exhibits inferior performance compared to the normal cases since it treats high-leverage points as outliers and shrinks their values, resulting in biased estimates. However, even in this challenging scenario, CR-Lasso outperforms the other methods with 2% or 5% contamination and high magnitudes of outlyingness, when γ is as high as 6 or 8. The overall performance of all compared methods is similar to the Gaussian setting, except for some extreme RMSPE values observed for MM-Lasso, RLars, and SLTS.

Figure 4 presents the mean F1 scores across all evaluated methods. When the predictors are generated from a normal distribution, as depicted in the top row of the figure, CR-Lasso demonstrates exceptional F1 scores across all scenarios. RLars also exhibits good F1 scores, followed by MM-Lasso and Lasso. On the other hand, SLTS and SSS exhibit inferior F1 scores as they select many inactive variables. When the predictors follow a t_4 distribution, CR-Lasso does not perform as well as when the predictors are multivariate normal. RLars shows results competitive with CR-Lasso since there are more good high-leverage points, strengthening the estimate. The overall performance of the other compared methods for heavy-tailed predictors is similar to the normal setting.

High-dimensional setting
In this subsection, we present the simulation results for high-dimensional settings. Specifically, we set the number of predictors to p = 300 and β = (1⊤_10, 0⊤_(p−10))⊤, keeping all other settings the same as in the moderate-dimensional settings.
Figure 5 reports the results of RMSPEs for the data generated with p = 300. With such a high noise level, all compared methods exhibit significantly different performance in high-dimensional cases compared to their moderate-dimensional counterparts. Specifically, with e = 0% or 2% contamination, only CR-Lasso and RLars demonstrate considerable performance, while Lasso also shows superior performance with a low magnitude of outlyingness. MM-Lasso, SLTS and SSS perform less effectively. When the contamination rate increases to 5%, only CR-Lasso maintains stable performance. The other methods show inferior performance in this case, particularly as γ increases.
As in the moderate-dimensional cases, CR-Lasso performs better with normally distributed predictors than when the predictors follow a multivariate t_4 distribution. In contrast, RLars exhibits better performance because of the presence of good high-leverage points. Even in this case, CR-Lasso still performs better with a 5% contamination rate and a large magnitude of outliers. Besides, MM-Lasso, SLTS and SSS exhibit many extreme RMSPE values, indicating that they are not robust to outliers with such high-level noise. Most of the compared methods demonstrate inferior F1 scores in this setting. RLars exhibits competitive F1 scores with a t_4 distribution. Conversely, Lasso, MM-Lasso, SLTS and SSS show poor F1 scores as they tend to select too many inactive predictors.

Additional simulation results
Additional simulation results are presented in the Appendix. First, Appendix A demonstrates the commendable performance of CR-Lasso even in the absence of post-regression, although it may not reach the same level of superiority observed when post-regression is applied. Second, Appendix B demonstrates the effectiveness of CR-Lasso under rowwise contamination.
It is noteworthy, however, that CR-Lasso, like other methods, faces challenges when dealing with predictors generated from heavy-tailed distributions, such as the Cauchy distribution (see Appendix C).

The bone mineral density data
To illustrate the proposed methodology, we considered the bone mineral density (BMD) data from Reppe et al. (2010). Gene expression data of this kind are often subject to contamination, as Rocke and Durbin (2001) highlighted. This contamination can stem from multiple sources, thereby obscuring the gene expression signal in the data (Zakharkin et al., 2005).
Given the large number of variables in the dataset, a pre-screening step was implemented to identify the subset of variables most correlated with the outcome of interest, the total hip T-score. To accomplish this, we first log-transformed all the predictors and then utilized the robust correlation estimate based on winsorization as in Khan et al. (2007), instead of the Pearson correlation, since winsorization is more robust to outliers that may occur in the dataset. The screened data comprise measurements of p = 100 genes from n = 84 Norwegian women.
Figure 7 displays the outlier detection results based on the screened predictors using DDC (Rousseeuw and Van Den Bossche, 2018). On average, the screened genes exhibit a contamination rate of 3.61%, with probe 236831_at having the highest contamination rate of 9.52%. Among the observations, the thirteenth observation, identified by patient ID 20, shows the highest contamination rate at 22%.
Given the cellwise contamination, we first standardized all variables with the median and Qn. We then conducted a simple simulation study to validate the effectiveness of CR-Lasso, Lasso, MM-Lasso, RLars, SLTS and SSS on the bone mineral density data.
We first obtained a clean (imputed) dataset using DDC (Rousseeuw and Van Den Bossche, 2018). We then generated an artificial response y = Xβ + ε from the screened clean predictors, with ε ∼ N(0, 0.5² I). We randomly picked ten active predictors in each simulation run and set β_j ∼ U(1, 1.5) for each of them. We then randomly selected 80% of the observations from the original (contaminated) dataset for model training, while the remaining 20% of the imputed (clean) dataset was used to assess the prediction accuracy. We repeated the simulation procedure 200 times. For each method, Table 1 reports the MAPE (mean absolute prediction error) and RMSPE values, as well as the true positive numbers (TP), true negative numbers (TN) and F1 scores.

The variables selected by each method when applied to the original data are shown in Table 2. Out of the 100 pre-screened genes, nine genes were commonly chosen by CR-Lasso, Lasso, MM-Lasso, RLars, and SLTS. We ran leave-one-out cross-validation to assess the performance of the selected models; the RMSPE and MAPE values from the leave-one-out prediction residuals are also shown in Table 2. In terms of these values, MM-Lasso and CR-Lasso are close behind the best-performing method. We note that for this data, RLars and SSS have slightly larger prediction errors, highlighting that in this example these cellwise-robust methods would require at least some minimal amount of extreme cellwise outliers to show superior prediction performance.

Discussion
Cellwise outliers can create significant challenges when building a regression model. Due to the propagation of cellwise outliers, most observations may eventually be contaminated in regression models. Most of the existing methods may not effectively identify and manage such outliers.
Therefore, there is a need to identify and manage these outliers.
Motivated by this challenge, we proposed a novel approach called CR-Lasso, which incorporates a constraint on the deviation of each cell in the loss function to detect cellwise outliers in regression models. We developed an iterative procedure for sparse regression and outlier detection by combining the Lasso and cellwise outlier regularisation. Our simulation studies and real data analysis demonstrate that CR-Lasso generally has superior variable selection performance and estimation accuracy compared to other methods when outliers are present, especially when the noise ratio is high and the magnitudes of outliers are extreme. In datasets with only a few outliers that are not too extreme, traditional non-robust estimators, such as the Lasso, perform considerably well.
However, we acknowledge two critical issues that require further investigation. Firstly, estimating σ accurately under cellwise contamination is challenging. Although we used RLars (Khan et al., 2007) to obtain an initial estimate, it can overestimate σ when the noise ratio is high. Therefore, obtaining a more precise initial estimate of σ is necessary and needs further exploration. Secondly, we only considered symmetric, light-tailed predictors in this paper. Handling asymmetric or heavy-tailed predictors with outliers is challenging, as the algorithm may identify clean tails as outliers.
Our simulation results indicate that CR-Lasso may have inferior performance under extremely heavy-tailed distributions such as the Cauchy and t_1. In real data applications, it is recommended to perform data transformations as a preprocessing step to improve model performance. For instance, we applied the log transformation in our real data application. Other transformations such as the Box-Cox transformation (Box and Cox, 1964) may also be beneficial. Nevertheless, it is crucial to conduct further investigation to develop more robust methods that can handle asymmetric or heavy-tailed predictors effectively. Moreover, the loss function (7) can be formulated as a constrained mixed-integer optimization problem, providing an opportunity for resolution through alternative algorithms, similar to the work of Insolia et al. (2022), which can be explored in the future.
In conclusion, our proposed CR-Lasso approach produces competitive results. Meanwhile, addressing the above-mentioned issues is essential for robust outlier detection and feature selection in more general settings.

Figure 1: Illustration of outlying cells in the design matrix of a regression model. Rowwise outliers refer to observations whose entire observation vector is contaminated, such as in row 2 and row 15. Cellwise outliers refer to individual cells that are contaminated.
matrix estimation and regression. Alternatively, Filzmoser et al. (2020) proposed a cellwise robust M-estimator that replaces the detected outliers with imputed values. Recently, there has been substantial interest in robust sparse regression under cellwise contamination. Machkour et al. (2020) investigated giving adaptive weights to predictors and observations based on detected outliers. Onur et al. (2021) proposed to filter and impute outliers before regression modelling to prevent possible damage caused by outliers. Bottmer et al. (2022) considered solving this problem by running successive penalised S-regressions for all predictors, a method they called sparse shooting S (SSS).

Figure 2: Illustration of cellwise regularisation for a simple linear regression model. Grey dashes indicate the boundaries of cellwise regularisation (horizontally) in x. Blue dashes indicate the boundaries of cellwise regularisation (vertically) in y.

Figure 3 reports the distributions of RMSPEs for the data generated with p = 50. When the predictors are generated from a normal distribution, as depicted in the top row of the figure, CR-Lasso demonstrates superiority compared to other methods. Without contamination (e = 0%), CR-Lasso, RLars, MM-Lasso and Lasso exhibit similar performance, with an average RMSPE of around 3.2, while SLTS and SSS display inferior performance.

Figure 4: Variable selection results as summarised by the F1 scores over 200 simulation runs for various contamination rates when p = 50, of which 10 predictors are active. Each column represents a contamination rate e, and each row represents a distribution type of predictor.

Figure 5: Prediction results as summarised by the RMSPE over 200 simulation runs for various contamination rates when p = 300, of which 10 predictors are active. Each column represents a contamination rate e, and each row represents a distribution type of predictor.

Figure 6 depicts the mean F1 scores across all evaluated methods in the high-dimensional cases. Compared to the moderate-dimensional cases, all

Figure 7: Outlier cell map for 100 screened variables on 84 Norwegian women. Cells are flagged as outlying if the observed and predicted values differ too much. Most cells are blank, indicating they are not detected as outliers. A red cell means the observed value is significantly higher than the predicted value, and a blue cell means the observed value is significantly lower.
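The flagging rule behind such a cell map can be sketched as a standardised-difference test on each cell. The cutoff and function names below are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def flag_cells(X_obs, X_pred, scale, cutoff=2.576):
    # Standardised difference between observed and predicted cell values;
    # 2.576 is the 99.5% normal quantile, so roughly 1% of clean cells
    # would flag by chance. Sign separates "red" from "blue" cells.
    z = (np.asarray(X_obs) - np.asarray(X_pred)) / scale
    flags = np.zeros(z.shape, dtype=int)
    flags[z > cutoff] = 1     # observed much higher than predicted ("red")
    flags[z < -cutoff] = -1   # observed much lower than predicted ("blue")
    return flags
```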

Figure A.8 suggests that post-cellwise-regularised regression contributes to enhanced prediction capabilities. Even without post-regression, CR-Lasso exhibits superior performance compared to other methods. It is crucial to note that post-regression is solely employed to enhance prediction and does not influence variable selection results. Consequently, the variable selection outcomes for CR-Lasso with or without post-regression remain unchanged.

Table 1: The averages of RMSPE, MAPE, TP, TN and F1 from 200 simulations. The method that yields the best performance is highlighted in bold.
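For concreteness, the summary metrics in the table can be computed as below. Note that MAPE is taken here as the mean absolute prediction error and F1 is computed over the selected-variable indicators; both are assumptions about the paper's exact definitions.

```python
import numpy as np

def rmspe(y, y_hat):
    # Root mean squared prediction error on held-out data.
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mape(y, y_hat):
    # Taken here as the mean absolute prediction error.
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def selection_f1(selected, truth):
    # F1 score of variable selection: TP counts active predictors correctly
    # selected; F1 balances precision and recall of the selected set.
    selected, truth = np.asarray(selected, bool), np.asarray(truth, bool)
    tp = np.sum(selected & truth)
    fp = np.sum(selected & ~truth)
    fn = np.sum(~selected & truth)
    return 2 * tp / (2 * tp + fp + fn)
```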

Table 2: Model size, RMSPE and MAPE from leave-one-out prediction residuals. The method that yields the best performance is highlighted in bold.