Robustness in sparse high-dimensional linear models: Relative efficiency and robust approximate message passing

Understanding efficiency in high dimensional linear models is a longstanding problem of interest. Classical work on lower-dimensional problems, dating back to Huber and Bickel, has illustrated the clear benefits of efficient loss functions. When the number of parameters p is of the same order as the sample size n, p ≈ n, an efficiency pattern different from Huber's was recently established. In this work, we study the relative efficiency of sparse linear models with p ≫ n. In the interest of deriving the asymptotic mean squared error for l1 regularized M-estimators, we propose a novel, robust and sparse approximate message passing algorithm (RAMP) that is adaptive to the error distribution. Our algorithm accommodates many non-quadratic and non-differentiable loss functions. We derive its asymptotic mean squared error and show its convergence, while allowing p, n, s → ∞, with n/p ∈ (0, 1) and n/s ∈ (1, ∞). We identify new patterns of relative efficiency for l1 penalized M-estimators. We show that the classical information bound is no longer reachable, even for light-tailed error distributions. Moreover, we exhibit new breakdown points of the asymptotic mean squared error. The asymptotic mean squared error of the l1 penalized least absolute deviation estimator (P-LAD) breaks down at a critical ratio of the number of observations to the number of sparse parameters in the case of light-tailed distributions; in the case of heavy-tailed distributions, it breaks down at a critical ratio of the optimal tuning parameter of P-LAD to the optimal tuning parameter of the l1 penalized least squares estimator. MSC 2010 subject classifications: Primary 62G35, 62J07; secondary 60F05.


Introduction
In recent years, scientific communities have faced major challenges from the size and complexity of the data they generate. The size of such contemporary datasets and the number of variables collected make the search for, and exploitation of, sparsity vital to their statistical analysis. Moreover, the presence of heterogeneity, outliers and anomalous data in such samples is very common. However, statistical estimators that are not designed for both sparsity and robustness to data irregularities simultaneously will give biased results, depending on the "magnitude" of the deviation and on the "sensitivity" of the method. Early examples of work on robust statistics include [12, 11]. Specifically, they argue that robust estimators based on the minimization of non-differentiable loss functions are insensitive to changes not involving the parameters. Subsequently, [39], [25], [24] and [9] laid the comprehensive foundations of a theory of robust statistics. In particular, Huber's seminal work on M-estimators [26] established asymptotic properties of a class of M-estimators in the situation where the number of parameters, p, is fixed and the number of samples, n, tends to infinity. Since then, numerous important steps have been taken toward analyzing and quantifying robust statistical methods, notably in the work of [40, 17, 44], among others. Even today, there exist several (related) mathematical concepts of robustness (see [35]). This illustrates the diverse and rich aspects of robustness. However, its intricate dependence on the dimensionality of the parameter space has not been explored much.
Modern datasets, in which the number of parameters is larger than the number of samples, have led statisticians to move away from M-estimators and to consider penalized M-estimators. To fix the focus on penalized M-estimators, we consider a linear regression model with Y = (Y_1, ..., Y_n)^T ∈ R^n a vector of responses, A ∈ R^{n×p} a known design matrix, x_o ∈ R^p a vector of parameters, and the noise vector W = (W_1, ..., W_n)^T ∈ R^n having zero-mean components, each with distribution F = F_W and density f_W. When p ≥ n, a form of sparsity is imposed on the model parameters x_o; that is, it is imposed that supp(x_o) = {1 ≤ j ≤ p : x_{oj} ≠ 0} with |supp(x_o)| = s. Early work on penalized estimators includes the least squares (LS) loss with l1 penalty, the Lasso [38], the concave penalties SCAD [21] and MCP [45], the adaptive l1 penalty [48], the elastic net penalty [47], and many more. However, when the error distribution F_W deviates from the normal distribution, the l2 loss function is typically replaced by −log f_W. Unfortunately, in applications the error distribution F_W is unknown, and a method that adapts to many different distributions is needed. Following the classical literature on M-estimators, penalized robust methods such as penalized quantile regression [7], the penalized least absolute deviation estimator [41], the AR-Lasso estimator [22], the robust adaptive Lasso [1], and many more have been proposed. These methods penalize a convex loss function ρ, leading to an estimator x̂(λ). From here on, we refer to x̂(λ) as the l1 penalized M-estimator. Despite the substantial body of work on robust M-estimators, there is very little work on the robustness properties of l1 penalized M-estimators. Robustness assessments of penalized statistical estimators are customarily made ignoring model selection. Typical properties discussed are model selection consistency or tight upper bounds on the statistical estimation error (e.g., [14, 36, 22, 23, 31, 32, 30, 42, 16]).
In particular, the existing work has been primarily reduced to the tools that are intrinsic to Huber's M-estimators. To do so, the authors establish model selection consistency and then reduce the analysis to the selected model, assuming it is the true model. However, this analysis is unsatisfactory, as the assumptions necessary for model selection consistency are far too restrictive. Hence, departures from such considerations are highly desirable. This is where our work makes progress: our robustness analysis does not assume the restrictive irrepresentable condition (and hence perfect model selection). This enables us to answer questions such as: in the high dimensional regime, which estimator is preferred? In the low-dimensional setting, several independent lines of work provide reasons for using distributionally robust estimators over their least-squares alternatives [27]. However, in the high dimensional setting, it remains an open question what the advantages are of using a complicated loss function over a simple one such as the squared loss. Can we better understand how differences between probability distributions affect penalized M-estimators? One powerful justification exists, from the point of view of statistical efficiency. Huber's proposed measure of robustness [26] allows a comparison of estimators through their asymptotic variances; one caveat is that the two estimators need to be consistent to the same order. For cases with p ≥ n, little or nothing is known about the asymptotic variance of the robust estimator (1.2) as p → ∞ whenever n → ∞. Moreover, the penalized M-estimator is biased and shrinks many coefficients to zero. For such estimators, the set of parameters for which Hodges' super-efficiency occurs is not of measure zero. Hence, asymptotic variance may not be the best criterion for comparison.
This suggests that a different criterion for comparison needs to be considered in the high dimensional asymptotic regime where n → ∞, p → ∞ and n/p → δ ∈ (0, 1). We examine the asymptotic mean squared error (AMSE). The AMSE is an effective measure of efficiency as it combines both the effect of the bias and that of the variance [17]. However, in the p ≫ n regime, it is not obvious that the asymptotic mean squared error will satisfy the classical formula.
The AMSE was studied in [5, 29] for the case of ridge regularization, with the penalty ‖x‖_2^2, and when p ≤ n but p ≈ n. In this setting the AMSE is equal to the asymptotic variance of x̂(λ). They discovered a new Gaussian component in the AMSE of x̂(λ) that cannot be explained by the traditional Fisher information matrix. To analyze the AMSE in the case of no penalization, with p ≈ n, [20] utilized the techniques of Approximate Message Passing (AMP) and discovered the same Gaussian component. The advantage of the AMP framework is that it provides an exact asymptotic expression for the asymptotic mean squared error of the estimator instead of an upper bound. For the case of the least squares loss with p ≥ n, [4] makes a strong connection between penalized least squares and the AMP algorithm of [19]. However, the AMP algorithm of [4] cannot recover the signal when the distribution of the noise is arbitrary. For this setting, we design a new, robust and sparse Approximate Message Passing (RAMP) algorithm.
Our proposed algorithm belongs to the general class of first-order approximate message passing algorithms. However, in contrast to the existing methods, it has three steps. Its iterations are based on gradient descent with an objective that is a scaled and min regularized version of the original loss function ρ. Moreover, it allows non-differentiable loss functions. The three-step estimation method of RAMP is no longer a simple proxy for one-step M-estimation; due to the high dimensionality with p ≥ n, such a step is no longer adequate. Our proof technique leverages the powerful AMP machinery proposed in [3]; however, we require a more refined analysis here in order to extend the results to non-differentiable and robust loss functions while simultaneously allowing p ≥ n. We relate the proposed algorithm to the Lasso penalized M-estimators when p ≫ n and show that a solution to one may lead to a solution to the other. We show its convergence while allowing non-differentiable loss functions and p, n, s → ∞, with n/p → δ ∈ (0, 1) and n/s → a ∈ (1, ∞). This enables us to derive the AMSE of a general class of l1 penalized M-estimators and to study their relative efficiency.
We show that the AMSE depends on the distribution of the effective score and that it takes a form quite different from the classical one, in that it also depends on the sparsity s. Moreover, we present a new study of the relative efficiency of the penalized least squares method and the penalized least absolute deviation method. We discover regimes where one is preferred over the other that do not match the classical findings of Huber. Several important insights follow immediately: relative efficiency is considerably affected by the model selection step; the optimal loss function may no longer be the negative log-likelihood; and sparse high dimensional estimators have an additional Gaussian component in their asymptotic mean squared error that does not disappear asymptotically. Moreover, we find that l1 penalized least squares (P-LS) is preferred over l1 penalized least absolute deviations (P-LAD) when the error distribution is "light-tailed", with a new breakdown point at which the two methods are indistinguishable; furthermore, we find that P-LS is never preferred over P-LAD when the error distribution is "heavy-tailed".

We briefly describe the notation. We write ū ≡ (1/m) Σ_{i=1}^m u_i for the average of a vector u ∈ R^m. For f : R × R → R, ∂_1 f(u, v) denotes the partial derivative with respect to the first argument; similarly, ∂_2 f(u, v) denotes the partial derivative with respect to the second argument. We use ‖·‖_1 and ‖·‖_2 to denote the l1 and l2 norms. We define the sign function as sign(v) = 1{v > 0} − 1{v < 0}, which is zero whenever v = 0. We set δ = n/p and ω = s/p = E‖X_0‖_0 / p, with the vector X_0 following the distribution p_{X_0}, and θ denotes the nonnegative thresholding parameter. This paper investigates the effects of l1 penalization on the robustness properties of penalized estimators; in particular, how to incorporate the bias induced by the penalization into the exploration of robustness.
We present a new approximate message passing algorithm (RAMP) that adapts to different loss functions and sparsity simultaneously, including the Least Absolute Deviation (LAD) and quantile losses (see Section 2). Section 3 establishes a number of important theoretical results concerning the RAMP algorithm, its convergence properties and its connections to the penalized M-estimators. Section 4 studies relative efficiency and establishes lower bounds for the AMSE; it also presents results on the relative efficiency of the P-LAD estimator with respect to the P-LS estimator. Section 6 contains detailed numerical experiments on a number of RAMP losses, including LS, LAD, and several quantile losses, and a number of error distributions, including the normal, mixtures of normals and Student's t. In Sections 6.1-6.3, we demonstrate how to use the RAMP method in practice, its convergence properties and the behavior of the state evolution equations, where we find that RAMP works extremely well. In Section 6.4, we demonstrate the properties of the RAMP algorithm under varying error distributions. Lastly, in Section 6.5 we present an analysis of new patterns of relative efficiency between the P-LS and P-LAD estimators, where we consider p ≤ n, s < n, p ≥ n and s ≈ n.

Robust sparse approximate message passing (RAMP) algorithm
We propose an iterative algorithm called RAMP, for "robust approximate message passing", which begins from the initial estimate x^0 = 0 ∈ R^p and guarantees a sparse estimator at its final iteration. Let the loss function ρ : R → R^+ be a non-negative convex function with a subgradient ρ'. For b > 0, let G(z; b) denote the rescaled, min regularized effective score function, defined through the proximal mapping operator Prox(z, b). Lemma 5 (see Supplemental Materials) explains the use of the effective score G(z; b) in the RAMP algorithm. In particular, it shows that for every λ = θω/(bδ), the solution to the penalized M-estimator problem (1.2) corresponds to a fixed point of the RAMP algorithm described below. For all convex and closed losses ρ, the operator Prox(z, b) exists for all b and is unique for large enough b and all z. The proximal mapping operator is widely used in non-differentiable convex optimization to define proximal-gradient methods. The parameter b controls the extent to which the proximal operator maps points towards the minimum of ρ, with smaller values of b providing a smaller movement towards the minimum. Finally, the fixed points of the proximal operator of ρ are precisely the minimizers of ρ; for an appropriate choice of b, the proximal minimization scheme converges to the optimum of ρ, at rates that are at least geometric and possibly superlinear ([8]; [33]).
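To make the roles of Prox and G concrete, here is a minimal numerical sketch. It assumes the standard Moreau-type definitions used in the robust AMP literature, Prox(z; b) = argmin_x { b ρ(x) + (x − z)^2/2 } and G(z; b) = (z − Prox(z; b))/b; the brute-force grid minimization is only a stand-in for the closed forms derived in the examples below.

```python
import numpy as np

def prox(z, b, rho):
    """Brute-force proximal map Prox(z; b) = argmin_x { b*rho(x) + (x - z)**2 / 2 }.

    A numerical sketch for scalar z; closed forms exist for common losses
    (e.g. soft thresholding for rho(x) = |x|).
    """
    grid = np.linspace(z - 10.0 - 10.0 * b, z + 10.0 + 10.0 * b, 200001)
    vals = b * rho(grid) + 0.5 * (grid - z) ** 2
    return grid[np.argmin(vals)]

def effective_score(z, b, rho):
    """Rescaled effective score G(z; b) = (z - Prox(z; b)) / b."""
    return (z - prox(z, b, rho)) / b
```

For instance, with rho = np.abs the map reduces numerically to soft thresholding, and with the quadratic loss rho(x) = x^2/2 it reduces to the shrinkage z/(1 + b), illustrating how b controls the movement towards the minimum of ρ.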

RAMP algorithm
The RAMP algorithm below applies to sparse estimation with p ≥ n and loss functions ρ that are not necessarily differentiable and do not necessarily satisfy the restricted strong convexity condition [36] (a condition typically used in the literature). The extension is significantly more involved, as the set of fixed points of the proximal operator is no longer necessarily sparse. Important examples of such loss functions ρ are the absolute deviation and quantile losses, as they are neither differentiable nor do they satisfy the restricted strong convexity condition. Each iteration t = 1, 2, 3, . . . is defined through a three-step procedure to update the estimate x^t ∈ R^p. We name the iteration steps the Adjusted Residuals, the Effective Score and the Estimation step.
Adjusted Residuals: Using the previous estimate x^{t−1} and the current estimate x^t, compute the adjusted residuals z^t ∈ R^n, where δ = n/p < 1. We add a rescaled product to the ordinary residuals Y − Ax^t that explicitly depends on n, p and s. This step can be recognized as proximal gradient descent [6] in the variable x of the function ρ, using a suitable stepsize.

Effective Score: Choose the scalar b^t such that the empirical average of the effective score G(z; b) has slope 1. As n/s > 1, for differentiable losses ρ this equation has at least one solution, since G(z; b) is continuous in b and takes values of both 0 and ∞. Whenever ∂_1 G is not continuous, it can be defined uniquely in a one-sided form. For non-differentiable losses ρ, we consider two adaptations. First, we allow the parameter b^t, which controls the amount of min regularization of the robust loss ρ, to adapt at each iteration t. Second, we consider a population equivalent of (2.4) first, then design an estimator of it and solve the fixed point equation. In more detail, for non-differentiable losses we propose to solve 1 = ν̂(b^t), (2.5) for a consistent estimator ν̂ = ν̂(b^t) of a population parameter ν. The particular form of ν̂ depends on the choice of the loss function ρ and the density of the error term f_W.

Estimation: Using the regularization parameter b^t determined by the previous step, update the estimate x^t (see Lemma 4 below). Unlike in least squares problems [19], the rescaling by δ/ω in the above term is absolutely crucial for the convergence of the proposed algorithm.
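Since the full RAMP updates are stated abstractly above, the following minimal sketch shows the shape of the three steps in the simplest special case the paper builds on: the least squares loss with soft thresholding, i.e. the AMP of [19]. The Onsager-type correction added to the ordinary residuals plays the role of the "rescaled product" described in the Adjusted Residuals step; the robust version would replace the soft-threshold derivative average with the corresponding average of ∂_1 G. This is a hedged baseline, not the paper's exact RAMP.

```python
import numpy as np

def soft(x, theta):
    """Soft thresholding, the shrinkage nonlinearity of the estimation step."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def amp_ls(y, A, theta, iters=50):
    """AMP for the least squares loss (the baseline that RAMP generalizes).

    z holds the adjusted residuals with the correction term; x the sparse estimate.
    """
    n, p = A.shape
    delta = n / p
    x, z = np.zeros(p), y.copy()
    for _ in range(iters):
        r = x + A.T @ z                           # effective observation
        x = soft(r, theta)                        # estimation step
        onsager = np.mean(np.abs(x) > 0) / delta  # avg derivative of soft(.)
        z = y - A @ x + onsager * z               # adjusted residuals
    return x
```

The correction term is what distinguishes AMP from plain iterative soft thresholding: it keeps the effective observation r approximately Gaussian around the truth, which is what the state evolution analysis of Section 3 tracks.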
To the best of our knowledge, the RAMP algorithm is the first that allows robustness in the loss function and shrinkage in the estimators simultaneously. The robust AMP of [20] applies only to the p ≤ n case; when p > n, the second step of their algorithm fails to iterate and the other two steps do not match (1.2). The RAMP algorithm has a different Adjusted Residuals step that incorporates sparsity directly, and a different Effective Score step that allows δ < 1. One might attempt to apply the AMP of [20] to a modified proximal mapping operator (2.2) by including the l1 norm (penalty) directly. However, such an algorithm would not be a generalized AMP algorithm, and its solution can be shown not to converge to the penalized M-estimator (1.2).

Examples
In the following we present a few examples of the RAMP algorithm for different choices of the loss function ρ. Let Φ(z, b) = ωG(z; b)/δ.

Example 1.
[Absolute Deviation Loss] The absolute deviation loss function is defined as ρ(x) = |x|. Observe that the resulting proximal map is exactly the soft thresholding operator. Moreover, the absolute deviation effective score function follows, with F_z and f_z denoting the distribution and density functions of z. Given the adjusted residuals z_1^t, . . . , z_n^t provided by (2.3) at any iteration t, b^t is a solution to the implicit equation (2.5).
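A short sketch of these closed forms, assuming the conventions Prox(z; b) = argmin_x { b|x| + (x − z)^2/2 } and G(z; b) = (z − Prox(z; b))/b: the proximal map is soft thresholding at level b, and the effective score saturates at ±1.

```python
import numpy as np

def prox_abs(z, b):
    """Proximal map of rho(x) = |x|: soft thresholding at level b."""
    return np.sign(z) * np.maximum(np.abs(z) - b, 0.0)

def G_abs(z, b):
    """Absolute deviation effective score: sign(z) * min(|z|/b, 1)."""
    return np.sign(z) * np.minimum(np.abs(z) / b, 1.0)
```

One can verify pointwise that G_abs(z, b) equals (z − prox_abs(z, b))/b, and that ∂_1 G(z; b) = (1/b) 1{|z| ≤ b}, which is the quantity whose empirical average over the adjusted residuals enters the fixed point equation for b^t.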

Example 2.
[Quantile Loss] Let τ ∈ (0, 1) be a fixed quantile level. The quantile loss function is defined accordingly, and with it the quantile score function follows. For the quantile loss, νω/δ = E∂_1 Φ(z, b, τ). Adding Condition (R) to the setup, we obtain ν = ∂_1 EΦ(z, b, τ). Refining the equation for ν, we obtain an expression in terms of F_z and f_z, the distribution and density functions of z. Given the adjusted residuals z_1^t, . . . , z_n^t provided by (2.3) at any iteration t, and a fixed τ ∈ (0, 1), b^t is a solution to an implicit equation. In practice, F̂_z^t(bτ) typically takes the form of the empirical cumulative distribution function F̂_z^t(bτ) = (1/n) Σ_{i=1}^n 1{z_i^t ≤ bτ}. In contrast, there are numerous consistent estimators of f_z(bτ); for instance, by the asymptotic linearity results of Lemma 9, we consider a kernel estimator with a bandwidth parameter h > 0. In practice, it is difficult to obtain estimators F̂_z^t(bτ) and f̂_z^t(bτ) that are continuous functions of b. Hence, to solve the fixed point equations we implement a simple grid search and set b to be the average of the first value of b on the grid for which the estimated function is below s/n and the last value of b on the grid for which the estimated function is above s/n.
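The grid-search recipe just described can be sketched as follows. The helper nu_hat_example below is a hypothetical stand-in built from the empirical CDF and a boxcar density estimate at bτ, not the paper's exact ν̂; only the averaging rule for choosing b mirrors the text.

```python
import numpy as np

def solve_b(z, level, b_grid, nu_hat):
    """Grid search for b as described in the text: average the first grid
    value where nu_hat(b, z) falls below `level` (= s/n) and the last grid
    value where it stays above; fall back to the closest value otherwise.
    """
    vals = np.array([nu_hat(b, z) for b in b_grid])
    below = b_grid[vals < level]
    above = b_grid[vals >= level]
    if below.size and above.size:
        return 0.5 * (below[0] + above[-1])
    return b_grid[np.argmin(np.abs(vals - level))]

def nu_hat_example(b, z, tau=0.5, h=0.1):
    """Hypothetical estimated function combining a boxcar kernel density
    estimate at b*tau with the scale b; illustrative only."""
    f_hat = np.mean(np.abs(z - b * tau) <= h) / (2.0 * h)
    return b * f_hat
```

Because the empirical estimates jump as b crosses observed residuals, the averaged crossing point is a pragmatic smoothing of an otherwise discontinuous root-finding problem.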

Theoretical considerations
In order to establish theoretical properties, we impose a number of conditions on the density of the error term W, on the class of robust loss functions ρ, and on the design matrix A. Although we assume that the error terms W_i have a bounded density, we allow for densities with possibly unbounded moments, and we do not assume any a priori knowledge of the density f.

Condition (D):
Let W_1, . . . , W_n be i.i.d. random variables with distribution function F. Let F have two bounded derivatives f and f', with f > 0 in a neighborhood of r_1, · · · , r_k, appearing in Condition (R)(i) below.

Condition (R):
Let i = 1, . . . , n. The loss function ρ is convex with subdifferential ρ'. Moreover, (i) for all u ∈ R, ρ'(u) is an absolutely continuous function which can be decomposed as a sum in which υ_1 has an absolutely continuous derivative υ_1', υ_2 is a continuous, piecewise linear function that is constant outside a bounded interval, and υ_3 is a nondecreasing step function. Condition (i) depicts explicitly the trade-off between the smoothness of ρ' and the smoothness of F. This assumption covers the classical Huber and Hampel loss functions. Although we allow non-differentiable loss functions, we consider a class of loss functions for which the subdifferential ρ' is bounded, a condition easily satisfied by many loss functions such as the LAD, quantile and Tukey's bisquare losses. Condition (iii) ensures uniqueness of the population parameter that we wish to estimate. Condition (iv) is essentially a moment condition that holds, for example, if υ_1' is bounded and either υ_1'(z) = 0 for z < a or z > b with −∞ < a < b < ∞, or E|W|^{2+ε} < ∞ for some ε > 0.

Condition (A): The design matrix
While this setting is admittedly specific, the careful study of such matrix ensembles has a long tradition in both statistics and communications theory, and it is borrowed from the AMP formulation [4]. It simplifies the analysis significantly and can be relaxed if needed. In particular, it implies the restricted eigenvalue condition of [10] with high probability, as long as the sample size n satisfies n > c'(1 + 8c)^2 s log p / κ(s, c)^2, for some universal constant c'. The integer s here plays the role of an upper bound on the sparsity of the vector of coefficients X_0. Note that, with c ≥ 1, square submatrices of size ≤ 2s of the matrix (1/n) Σ_{i=1}^n A_i^T A_i are necessarily positive definite.
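As an illustration of this ensemble, the snippet below draws a design with i.i.d. N(0, 1/n) entries, as in the AMP formulation, and checks numerically that a 2s-column Gram submatrix is positive definite when n exceeds 2s. The sizes n, p, s here are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 400, 10
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))  # i.i.d. N(0, 1/n) entries

cols = rng.choice(p, size=2 * s, replace=False)     # a fixed 2s-column subset
gram = A[:, cols].T @ A[:, cols]                    # 2s x 2s Gram submatrix
min_eig = float(np.linalg.eigvalsh(gram).min())     # > 0 w.h.p. since n > 2s
```

With this scaling the columns of A have squared norm close to 1, which is the normalization under which the state evolution of the next section is stated.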

State evolution of RAMP
In the State Evolution (SE) formalism ([18], [19]), the asymptotic distribution of the residuals and the asymptotic performance of the estimator can be tracked while allowing p → ∞. The parameter τ_t^2 can be considered the state of the AMP algorithm. Moreover, the asymptotic mean squared error (AMSE) is a function of the state evolution parameter τ_t^2. We show that RAMP converges, and we show how to compute τ_t through a new iteration scheme adjusted for p ≥ n and non-differentiable losses ρ, building on [3]. Let σ_0^2 = (1/δ) EX_0^2 and let X_0 and W follow the densities p_{X_0} and f_W respectively, where EW^2 = σ^2. Let Z be a standard normal random variable. Then, for all t ≥ 0, the sequence {τ_t^2}_{t≥0} is obtained by the following iterative system of equations:
In more detail, define the sequence τ_t^2 by setting σ_0^2 as above. Then, the solution to the iterative equations (3.1) and (3.2), τ_t^2, can be defined as the solution to the following iterative system of equations.
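Since (3.1)-(3.2) are stated abstractly here, the following Monte Carlo sketch shows the state evolution recursion in its best-known special case, the least squares loss, where it reduces to the recursion of [4]: τ_{t+1}^2 = σ^2 + (1/δ) E[(η(X_0 + τ_t Z; ατ_t) − X_0)^2], with η the soft threshold. The robust version replaces the inner map by the corresponding effective-score quantities; the three-point prior below mirrors the simulation design of Section 6 and is otherwise an assumption.

```python
import numpy as np

def se_lasso(sigma2, delta, eps, alpha, iters=50, mc=200000, seed=0):
    """Monte Carlo state evolution for the least squares special case:
    tau_{t+1}^2 = sigma^2 + (1/delta) E[(eta(X0 + tau_t*Z; alpha*tau_t) - X0)^2],
    with X0 = +-1 w.p. eps/2 each, 0 otherwise, and Z standard normal.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.choice([-1.0, 0.0, 1.0], size=mc, p=[eps / 2, 1 - eps, eps / 2])
    z = rng.standard_normal(mc)
    eta = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
    tau2 = sigma2 + np.mean(x0 ** 2) / delta  # initialize from sigma_0^2
    for _ in range(iters):
        tau = np.sqrt(tau2)
        mse = np.mean((eta(x0 + tau * z, alpha * tau) - x0) ** 2)
        tau2 = sigma2 + mse / delta
    return tau2
```

Iterating this map from any starting point traces the sequence τ_t^2 whose fixed point τ* governs the AMSE in the results below.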

Lemma 2. Let ρ be a convex function and let Conditions (R), (D) and (A)
hold. For any σ^2 > 0 and α > α_min, the fixed point equation admits a unique solution τ* = τ*(α) for all smooth loss functions ρ. Moreover, lim_{t→∞} τ_t = τ*(α). Further, convergence takes place from any initial solution and is monotone. Additionally, for non-smooth loss functions the fixed point equation above may admit multiple solutions τ* = τ*(α). In such cases, convergence still takes place, but it depends on the initial solution and is monotone for each initialization.
Remark 2. The display above offers explicit expressions for the additional Gaussian variable Z, its effect on the fixed points τ* and σ*, and the role of the loss function ρ. In the case of the simple Lasso estimator, G is a rescaled least squares loss. Let the bandwidth h for the consistent estimator ν̂ be such that h → 0 and nh → ∞.
Then, for not necessarily differentiable losses ρ, the display holds with υ_1 as defined in Condition (R).

Asymptotic mean squared error
We say a function ψ : R × R → R is pseudo-Lipschitz if there exists a constant L > 0 such that |ψ(x) − ψ(y)| ≤ L(1 + ‖x‖_2 + ‖y‖_2)‖x − y‖_2 for all x, y ∈ R^2. Theorem 1. Let Conditions (R), (D) and (A) hold and let ψ : R × R → R be a pseudo-Lipschitz function. Moreover, let X_0 follow p_{X_0}, a non-degenerate distribution. Then, the conclusion holds almost surely for all τ_t and σ_t defined by the recursion (3.1)-(3.2) and θ_t = ατ_t.
Next, we measure the L 2 norm distance between the RAMP iteration and the penalized estimator.
The previous theorem extends the existing work on approximate message passing with shrinkage factors, as the latter focuses only on least squares losses. Our result allows not only non-differentiable loss functions but also loss functions that do not necessarily satisfy a restricted strong convexity condition, as both the least absolute deviation and quantile losses fail to. From the results above, we see that for an optimal value of λ there exists an optimal value of α such that the optimally tuned l1 penalized estimator is approximated by an optimally tuned RAMP solution.

Remark 3. This result offers not only an upper bound on AMSE(x̂, x_0), but also an exact expression for it under appropriate choices of the tuning parameters.
Note that the optimal choice of λ truly depends on the loss function (through τ).

Relative efficiency
The robustness properties of sparse, high-dimensional estimators are difficult to quantify due to the shrinkage effects and the subsequent bias in estimation.
Shrinkage is known to lead to the super-efficiency phenomenon in the domain of classical efficiency studies. Hence, efficiency alone cannot distinguish between two biased estimators. Relative efficiency, however, can capture both the size of the bias and the variance, leading to a relevant robustness evaluation. According to Theorem 3, the asymptotic mean squared error of penalized M-estimators takes the displayed form, with the expectation taken with respect to X_0 and Z. Remark 4. Observe that whenever 1/δ = O(1/n), i.e., p ≤ n and p does not grow with n, σ_t^2 = τ_{t−1}^2. Specifically, when p = o(n), the bias in estimation disappears [20]. In contrast, we observe that the Gaussian component Z and the estimation bias never disappear when 1/δ = p/n ≥ 1. This indicates that efficiency, with p ≥ n, never converges to the low-dimensional case unless perfect model selection is achieved. Whenever we allow deviations from model selection consistency, the additional Z component plays a substantial role in both the size of the asymptotic variance and the asymptotic bias, even if s ≪ n.
The following result computes the asymptotic lower bound of the variance and AMSE of the RAMP estimator.
(iii) for fixed values of α = α(λ) and X_0, with θ = ατ, there exist functions ν_1, ν_2, convex and increasing respectively, such that the asymptotic mean squared error mapping for high dimensional problems satisfies the displayed bound. Recall that the traditional lower bound for M-estimators, with p fixed, p ≤ n and n → ∞, is 1/I(F_W); there, the asymptotic mean squared error equals the variance and the bound is achievable. Theorem 4 implies that with diverging p, s and n, such that p ≫ n ≥ s, the traditional lower bound is not achievable for any s ≥ n/2, i.e., for all "dense" high dimensional problems. Hence, we observe a new phase transition regarding robustness in high dimensional and sparse problems. The effect of sparsity is extremely clear. If the problem is sufficiently sparse, with n/s large enough, then the traditional information bound may be achieved, whereas for all other problems it cannot be, as there is inflation in the variance.

Relative efficiency of l 1 -penalized least squares and l 1 -penalized absolute deviations
Next, we study the relative efficiency of the l1 penalized least squares estimator (P-LS from here on) with respect to the l1 penalized least absolute deviation estimator (P-LAD from here on).
Remark 6. From the results above, we can compute the asymptotic mean squared errors of P-LS and P-LAD through the displayed recursive equations. Here, both σ_{P-LAD} and σ_{P-LS} satisfy equation (3.2), with τ_{P-LAD} and τ_{P-LS} respectively.
Notice that in the sparse, high dimensional setting, the distribution of X_0 can be represented as a convex combination of the Dirac measure at 0 and a measure that has no mass at zero. Let us denote by Δ and U two random variables having these two measures, respectively. Then, the asymptotic mean squared error satisfies the displayed decomposition. We exploit this representation to study the relative efficiency of the P-LS and P-LAD estimators. The relative efficiency of P-LS vs. P-LAD is defined as the quotient of their asymptotic mean squared errors. By the results of the previous sections, this amounts to the quotient σ_{P-LS}^2 / σ_{P-LAD}^2. To evaluate this quotient, we study the behavior of σ_{P-LS}^2 / σ_W^2 and σ_{P-LAD}^2 / σ_W^2 separately. In order to do so, we need the preparatory result below.
Next, we consider a class of distributions f_W for which σ_W^2 exists, and consider the state variable σ_{P-LAD}^2 as a function of σ_W^2. We provide the limiting behavior, as p, n → ∞, of both P-LS and P-LAD. We separate the analysis into two cases: "light-tailed distributions" and "heavy-tailed distributions."
The result above is path-dependent, in the sense that it holds for every value of α as well; that is, for a sequence of λ values we can find an accompanying sequence of α values and the limits above would still apply (note that the right-hand sides depend on α explicitly).

Remark 8.
As 1/(1 − Γ(α)/δ) ≥ Γ(α)/δ and Γ is an increasing function, the optimal P-LAD is more efficient than the optimal P-LS if α_LS and α_LAD are such that α_LS ≥ α_LAD. However, the sizes of the optimal tuning parameters are unknown in general; hence, further studies need to be developed.

Numerical examples
In this section, we illustrate the finite sample performance of RAMP.

Tuning parameter selection & implementation
The policy for choosing the thresholds θ_t is based on [19], which sets θ_t = ατ_t, where α is taken to be fixed. We choose a grid of α values within an interval [α_min, α_max]. For each α, we obtain the RAMP estimator x^t and the SE iterative parameters τ_t and σ_t. We use these parameters to evaluate AMSE(x^t, x_0) and then tune the optimal α by minimizing AMSE(x^t, x_0). In other words, τ_t is calculated from the recursion whose right-hand side V is given in equation (3.1), and σ is calculated from equation (3.2). A number of simulation sections substitute λ for θ(α) as the tuning parameter, based on Lemma 5, in order to perform a pathwise comparison between RAMP and the penalized estimator. However, relative efficiency is always studied only for the optimally tuned RAMP and the optimally tuned penalized estimator. In the simulations, each element of A is i.i.d. N(0, 1/n). Unless otherwise stated, we consider the fixed ratio δ = 0.64. The distribution of the true parameter is set as P(x_0 = 1) = P(x_0 = −1) = 0.064 and P(x_0 = 0) = 0.872.
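The tuning recipe above can be sketched for the least squares loss, where the state evolution map is explicit; the defaults mirror the simulation design here (δ = 0.64, three-point prior with mass 0.128 on ±1, N(0, 0.2) errors), and the returned AMSE proxy is the per-coordinate state evolution MSE at the (approximate) fixed point. This is a hedged illustration, not the paper's full RAMP tuning.

```python
import numpy as np

def amse_hat(alpha, sigma2=0.2, delta=0.64, eps=0.128,
             iters=60, mc=100000, seed=1):
    """State evolution AMSE for the LS loss at threshold parameter alpha."""
    rng = np.random.default_rng(seed)
    x0 = rng.choice([-1.0, 0.0, 1.0], size=mc, p=[eps / 2, 1 - eps, eps / 2])
    z = rng.standard_normal(mc)
    eta = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
    tau2 = sigma2 + np.mean(x0 ** 2) / delta
    mse = np.mean(x0 ** 2)
    for _ in range(iters):
        tau = np.sqrt(tau2)
        mse = np.mean((eta(x0 + tau * z, alpha * tau) - x0) ** 2)
        tau2 = sigma2 + mse / delta
    return mse

# tune alpha over a grid, as described above
alphas = np.linspace(1.0, 3.0, 9)
alpha_star = min(alphas, key=amse_hat)
```

Because the Monte Carlo draws are seeded, the grid minimization is deterministic; in practice one would refine the grid around alpha_star or replace the proxy with the loss-specific recursion of Lemma 1.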

Existence and uniqueness of state evolution parameters
In this section only, we work with α = 2 to illustrate the worst case behavior of the RAMP algorithm. We fix p = 500 and focus on the Gaussian distribution N(0, 0.2) for the errors W. Results of the state evolution equations are presented in Figures 1 and 2 below, where, in the Gaussian setting above, we consider the least absolute deviation loss and the quantile losses with τ = 0.7 and τ = 0.3. We observe that the unique value of the state evolution recursions is easily found even for the non-differentiable losses, under the recommendations of Section 2. Figure 1 shows how τ_t^2 evolves to the fixed points near 2.264, 2.933 and 3.378 for the least absolute deviation and quantile losses, respectively. Simultaneously, the mapping V(τ^2, b, θ) evolves to the fixed points near 2.260, 2.926 and 3.359 for all the non-differentiable losses. Moreover, Figure 2 illustrates that the loss is not great even when we start from a randomly chosen starting value of α.

Limit behavior of the parameters of RAMP
We assess the limit behavior of the parameters of different loss functions used in the iterations of the RAMP algorithm. The error W follows N (0, 0.2) and the sample size is 320. We use ω = s/p = 0.128, based on the specified distribution of x 0 , in equation (2.3) to generate b. We generate a series of α values and set the threshold θ t = ατ t . Then, we use the iterations of σ t and τ t from Lemma 1 to find the stable point τ * , stopping when |τ t − τ t−1 | < tol, where tol is a small positive number, taken to be 10 −6 here. Lastly, we use the expression λ = ατ * ω/(bδ) and the expression of the AMSE in Theorem 2 to find AMSE(x t , x 0 ). The theory of penalized M-estimators suggests cross-validation for the optimal value of λ. For such a value we find its corresponding AMSE(x t , x 0 ) and present it in Table 1 below. Table 1 compares several necessary parameters in the iterations of the RAMP algorithm. We contrast four different loss functions: the Least Squares loss, the Huber loss, the Least Absolute Deviation loss and the Quantile loss. The results presented in the table are averages over 100 repetitions. We notice that within only twenty iteration steps, the RAMP algorithm becomes stable regardless of the loss function considered. Furthermore, we present values of a number of parameters of the RAMP algorithm: the min-regularization b, the regularization λ and the state evolution fixed point τ * . We observe that they all differ according to the loss function considered, illustrating that there is no universal choice of the above parameters that works uniformly well for all loss functions.
Additionally, we present Figure 3, which shows the empirical convergence of the asymptotic variance AVAR(x t , x 0 ) with respect to the tuning parameter λ for different loss functions. The plots illustrate the bias-variance decomposition. For example, when λ becomes larger, the AVAR(x t , x 0 ) of RAMP decreases dramatically and stabilizes around 0.12 for the case of the Least Squares loss and Normal errors W . The reason AVAR(x t , x 0 ) becomes fixed at 0.12 is that, when λ is large enough, the RAMP algorithm shrinks the estimator x t to the zero vector, for which AMSE(x t , x 0 ) = E[x 0 2 ] = 0.064 + 0.064 = 0.128 and the asymptotic Bias 2 = 0.008, so that AVAR = 0.128 − 0.008 = 0.12. We also see that for the optimal value of λ the AVAR of P-LS and P-LAD changes depending on the error distribution: for Normal errors, the optimal AVAR of P-LS is smaller than that of P-LAD (notice that the tuning is done independently and that the scale of λ is different); for Student errors we observe that the optimal AVAR of P-LS is larger than that of P-LAD.
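The plateau value can be checked by direct arithmetic on the three-point prior (a sanity check of the decomposition AMSE = Bias 2 + AVAR, not new theory; the Bias 2 value 0.008 is the one reported above):

```python
# When lambda is large enough RAMP returns the zero vector, so the AMSE
# equals the prior second moment of x0 under P(x0=+1)=P(x0=-1)=0.064.
p_plus = p_minus = 0.064
amse_at_zero = p_plus * 1.0 ** 2 + p_minus * (-1.0) ** 2   # E[x0^2] = 0.128
bias_sq = 0.008                                            # value reported above
avar_plateau = amse_at_zero - bias_sq                      # AMSE = Bias^2 + AVAR
```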

Robustness of RAMP with respect to the error distribution
Further, it is well known that using the squared loss to solve problem (1.2) is very sensitive to the error distribution, which is the reason we relax the loss function from the least squares loss to a general convex loss function satisfying Condition (R). We consider the robustness of the solution when the tail of the error in the model varies. We considered n = 640 observations and compared six scenarios for the error vector w: (a) light-tailed distributions: Normal N (0, 0.2) and MixNormal 0.5N (0, 0.3) + 0.5N (0, 1), and (b) heavy-tailed distributions: t 8 , t 4 , MixNormal 0.7N (0, 1) + 0.3N (0, 3) and Cauchy(0, 1). The Mixture of Normals distribution generates samples from different normal distributions with the corresponding probabilities, and the samples are centered to have mean zero.
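The mixture errors can be generated as follows (a sketch; reading the second argument of N (·, ·) as a variance is our assumption about the notation):

```python
import math
import random

def mixnormal(n, probs, variances, seed=0):
    """Sample n errors from a mixture of mean-zero normals, e.g.
    0.5*N(0,0.3) + 0.5*N(0,1), then centre the sample to have mean zero."""
    rng = random.Random(seed)
    w = []
    for _ in range(n):
        u, c = rng.random(), 0.0
        for p, v in zip(probs, variances):
            c += p
            if u <= c:
                w.append(rng.gauss(0.0, math.sqrt(v)))
                break
        else:
            # guard against floating-point round-off in the cumulative probs
            w.append(rng.gauss(0.0, math.sqrt(variances[-1])))
    m = sum(w) / n
    return [wi - m for wi in w]  # centre the sample to mean zero
```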
Results of this experiment are presented in Figure 4. A few observations immediately follow. The Lasso estimator is sensitive to heavy-tailed error distributions, whereas the Huber loss and the Least Absolute Deviation loss perform better as the tail of the error distribution becomes heavier. Moreover, with heavier tails the Least Absolute Deviation loss is clearly preferred over both the Huber and the Least Squares loss, whereas the situation reverses when the tails are light. The Mixture of Normals errors are particularly difficult due to the bimodality of the error distribution. We see that in both the light- and heavy-tailed cases of the Mixture distribution, the Huber loss is preferred over the Least Squares loss. Lastly, as the tails become even heavier, all estimators face the problem of estimating the unknown parameter accurately.

Relative efficiency
We use the RAMP iteration to calculate the relative efficiency of the Least Squares estimator versus the Least Absolute Deviation estimator. It is known that, in the classical low-dimensional setting, the least squares estimator is preferable under normal errors, but the least absolute deviation estimator beats the least squares estimator under double-exponential errors.
In Table 2, we fix p = 50 and discuss the comparison of relative efficiency between the low-dimensional case (where p < n) and the high-dimensional dense case (where p ≈ n). We discuss AMSE(x t , x 0 ) for different ratios n/p (10, 8, 3, 1.6, 1.4, 1.2) under two error settings, N (0, 0.2) and double exponential(0, 1). When we implement equations (3.1), (3.2) and (3.4), we take the η function to be the identity and ω to be 1, because neither the penalty nor the sparsity is needed.
From the first two rows of Table 2, we see that in the Normal error setting the Least Squares estimator is preferable, and the relative efficiency of the Least Squares estimator vs. the Least Absolute Deviation estimator is around 2/π. Further, we can see that in the Double Exponential error setting the Least Squares estimator performs worse. This result matches the classical inference. From the last two rows, we can see that the Least Squares estimator is preferable whether the error is Normal or Double Exponential. This result is foreseen by [20] and [29].
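The classical 2/π figure quoted above can be reproduced by a small Monte Carlo experiment comparing the mean (the LS location estimator) and the median (the LAD location estimator) under Normal errors. This is an illustrative low-dimensional sketch, not the computation behind Table 2:

```python
import random
import statistics

def relative_efficiency(n=100, reps=4000, seed=1):
    """Ratio MSE(mean)/MSE(median) for N(0,1) samples; tends to 2/pi ~ 0.637."""
    rng = random.Random(seed)
    mse_mean = mse_median = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mse_mean += statistics.fmean(x) ** 2
        mse_median += statistics.median(x) ** 2
    return mse_mean / mse_median
```

Under double-exponential errors the same experiment (swapping the error generator) reverses the ordering, matching the classical result quoted in the text.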
Remarkably, in Table 3 we discuss the high-dimensional and sparse case (p > n). We fix δ = 0.64 and p = 500 and consider optimally tuned (via cross-validation) P-LS and P-LAD. For the number of non-zeros in the true parameter, s, we choose a variety of options ranging from low sparsity, 25, to high sparsity, 300. From the first two rows of Table 3, we see that in the Normal error setting the P-LS estimator is no longer preferred in all settings. When the sparsity s is high and approaches n, the P-LAD estimator is preferred, whereas when the sparsity s is low, the P-LS estimator is preferred. However, from the last two rows, in the setting of the Laplace distribution, we see that the P-LAD estimator is always preferred regardless of the size of s. This contradicts the findings of Table 2 and shows that model selection affects the choice of the optimal loss function.

Proofs
In this section we collect all the detailed proofs of the main and auxiliary results.

Preliminaries
The last term (see (2.3)) is a correction of the residual, called the Onsager reaction term.
This term originates from the theory of belief propagation in factor graphs. Adding the Onsager reaction term in each iteration is the main difference between the AMP iteration and the soft-thresholding iteration. The intuition is that, in each step, this term accounts for undersampling and sparsity simultaneously. The following Lemma 4 shows the relationship between the Onsager reaction term and, in Donoho's [19] terminology, the undersampling-sparsity. Similarly to [20] and [29], we use min regularization to regularize the squared loss with the robust loss ρ; this introduces a family of regularizations of the robust loss ρ. Moreover, the proximal operator Prox(z, b) admits the subgradient characterization: if Prox(z, b) = u, then z − u ∈ b ∂ρ(u). At a fixed point, the averaged derivative appearing in the Onsager term is equal to ||x * || 0 /p, i.e., to ω.
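As a concrete illustration of the proximal operator and its subgradient characterization z − u ∈ b ∂ρ(u), here are the standard closed forms for the absolute value loss (soft-thresholding) and the Huber loss; the corner constant 1.345 is the usual choice and is not taken from the text:

```python
def prox_abs(z, b):
    """Prox of rho(u)=|u|: argmin_u {(z-u)^2/(2b) + |u|} = soft-thresholding.
    The optimality condition z - u in b*drho(u) gives u = 0 when |z| <= b."""
    if z > b:
        return z - b
    if z < -b:
        return z + b
    return 0.0

def prox_huber(z, b, k=1.345):
    """Prox of the Huber loss (quadratic for |u| <= k, linear beyond):
    solving z = u + b*rho'(u) gives u = z/(1+b) when |z| <= (1+b)k,
    and u = z -/+ b*k otherwise."""
    if abs(z) <= (1.0 + b) * k:
        return z / (1.0 + b)
    return z - b * k if z > 0 else z + b * k
```

For instance, prox_abs(3.0, 1.0) returns 2.0, and the residual z − u = 1.0 = b is exactly an element of b ∂|u| at u > 0, as the characterization requires.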

Proofs of the main results
Proof of Theorem 4. Let I(F W ) be the well-defined Fisher information of the errors W i . If the distribution of the errors W i is a convolution D = F W ∗ N (0, σ 2 ), then, by Lemma 3.5 of [20], the lower bound can be further reduced. The proof is finalized by obtaining a lower bound on σ t .

RAMP algorithm 3917
Iterating the previous equation k times, we obtain a corresponding bound for all t > k. Letting k → ∞, so that τ 2 t → τ * 2 , we obtain the claim.
Part (iii). Utilizing the scale-invariance property of the soft-thresholding function η, we decompose the expression into two components. Let us first focus on the second component, ν 2 (τ ). By observing that the last term on the right-hand side of the derivative of ν 2 (τ ) is non-negative for all X 0 > 0 and negative for all X 0 < 0, we conclude that ν 2 (τ ) is an increasing function. We conclude the proof with the analysis of the first term, ν 1 (τ ). The displays above imply that the first and the last term of ν 1 (τ ) together lead to E[(Z 2 + α 2 ) 1{Z + X 0 /τ ≥ α}], whereas the middle term can be rewritten using Stein's lemma. Furthermore, utilizing the variance computation of a truncated random variable, conditional on X 0 , the remaining terms can be computed similarly. Combining all of the above and evaluating the derivative of ν 1 (τ ), we see that for small τ 2 the expression is negative and for large values of τ 2 it is positive. It follows that ν 1 (τ ) is a convex function of τ 2 .
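The Stein's-lemma step used above, E[Z g(Z)] = E[g′(Z)], can be spot-checked numerically for g(·) = η(·; α), since E[η′(Z; α)] = P(|Z| > α) = 2Φ(−α). The value α = 1.5 below is illustrative, not taken from the text:

```python
import math
import random

def std_normal_cdf(x):
    """Phi(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def eta(x, t):
    """Soft-thresholding function."""
    return math.copysign(max(abs(x) - t, 0.0), x)

alpha = 1.5
rng = random.Random(0)
n = 400_000
# Monte Carlo estimate of E[Z * eta(Z; alpha)] ...
lhs = sum(z * eta(z, alpha) for z in (rng.gauss(0.0, 1.0) for _ in range(n))) / n
# ... versus the exact E[eta'(Z; alpha)] = P(|Z| > alpha) = 2*Phi(-alpha)
rhs = 2.0 * std_normal_cdf(-alpha)
```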
Proof of Theorem 5. Notice that in the sparse, high-dimensional setting, the distribution of X 0 can be represented as a convex combination of the Dirac measure at 0 and a measure that has no mass at zero. Let us denote by Δ and U two random variables distributed according to these two measures, respectively. First, we prove that whenever σ 2 W → 0 then τ 2 P-LAD → 0, as long as lim τ →0 Ψ α (τ ) = 0. To accomplish this, we prove that lim τ →0 Ψ α (τ ) = 0 and examine the relationship between τ P-LAD and σ W .
Notice that by the result of Theorem 4 of [46], we conclude that the limit is different from 0 whenever s ≠ 0. Observe that whenever σ 2 W → 0, it holds that Y → σ 2 P-LAD Z and f (W ; τ 2 P-LAD ) → 0. In turn, plugging in τ P-LAD = 0 satisfies both sides of equation (7.1).
Proof of Theorem 6. Notice that in the sparse, high-dimensional setting, the distribution of X 0 can be represented as a convex combination of the Dirac measure at 0 and a measure that has no mass at zero. Let us denote by Δ and U two random variables distributed according to these two measures, respectively. We first discuss the P-LAD estimator. By the state-evolution recursion (3.2) and according to (4.2), plugging into (7.3) and substituting (7.1) in (7.2), and then applying Stein's lemma and some algebra, we arrive at representations of g(τ 2 P-LAD ) and f (W ; τ 2 P-LAD ). Let us first focus on the case σ 2 W → 0. By Lemma 5 we conclude that τ 2 P-LAD → 0 and σ 2 P-LAD → 0. We proceed to show that the last term in the display above converges to ∞.
Observe that whenever σ 2 W → 0, it holds that Y → σ 2 P-LAD Z.

Jelena Bradic
Furthermore, with σ P-LAD → 0 and b > 0, it holds that ξ(b) → 0. For φ denoting the density of the standard normal, an application of L'Hôpital's rule guarantees that the expression diverges to ∞ as σ W → 0. We finish the proof by discussing the P-LS estimator. By Lemma 1 we see that the special case of the RAMP algorithm with the loss function ρ(x) = x 2 is the approximate message passing algorithm of [4]. Hence, results that apply to the algorithm in [4] apply here. In particular, a recent work [46] discusses the properties of the corresponding limit as σ 2 W → 0.

Proof of Theorem 7. We will use the notation defined in the proof of Lemma 6. We first discuss the penalized LAD estimator. Based on the representation proved in Lemma 6, it suffices to discuss the limiting properties of the first, second and third terms on the right-hand side above. Let us discuss the last term first. Observe that we can rewrite it, where in the last step we used its behavior as τ → ∞. Next, we discuss the limit of Ψ α (τ ). Corollary 6 of [46] guarantees that lim τ →∞ Ψ α (τ ) = Eη 2 (Z; α)/δ, that is, Ψ α (∞) = Γ/δ.
In the following, we analyze the limit of g(τ ) as τ → ∞. Since both the numerator and the denominator of g(τ ) converge to 0 when τ → ∞, we use L'Hôpital's rule to determine the limit.

Moreover, evaluating the last expression requires another application of L'Hôpital's rule. The proof is finalized by the analysis of f (W ; τ )/σ 2 W when σ 2 W → ∞ and τ → ∞. We begin with a representation of f (W ; τ ).
We observe that in the limit τ → ∞ the above expression takes the form 0/0; hence, we apply L'Hôpital's rule, where in the last step we used a change of variables to go from E W to E Z . The last expression converges to zero as both σ → ∞ and σ 2 W → ∞.

Lemma 5. Let (z * , b * , x * ) be a fixed point of the RAMP equations (2.3), (2.4) and (2.6), having b * > 0. Then x * is a solution to the penalized M-estimation problem (1.2) with λ = θ * ω/(b * δ). Vice versa, any minimizer x(λ) of the problem (1.2) corresponds to one (or more) RAMP fixed points of the form (z * , θ * ω/(λδ), x * ).

Proofs of preliminary statements
Proof of Lemma 4. Let (x, z) be a fixed point of the RAMP algorithm iteration. Then the fixed point condition at x, applied coordinatewise, implies the stated identities. For the middle term, we observe that x i = 0 if and only if −θ < (A T G(z, b)) i < θ. Therefore, the correction term, defined as the average of the first derivative of η(A T G(z, b) + x; θ), follows. Proof of Lemma 5. The fixed point condition at z, together with Lemma 4, gives ∂ 1 η(A T G(z; b) + x; θ) = ω, and hence z = Y − Ax + (1/δ) ω G(z; b). By the definition of the rescaled effective score G, we then compute the left-hand side of the KKT condition. The equations (i) and (ii) are derived from the definition of Φ(z; b); equation (iii) is based upon the proof of Lemma 4, equation (7.7).
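The conclusion of Lemma 4 — that the averaged derivative of η equals the fraction of nonzero coordinates, ||x|| 0 /p = ω — is easy to verify directly, since η′(v; θ) = 1{|v| > θ} almost everywhere. The inputs below are arbitrary and purely illustrative:

```python
import math
import random

def eta(x, t):
    """Soft-thresholding function."""
    return math.copysign(max(abs(x) - t, 0.0), x)

rng = random.Random(0)
p, theta = 1000, 0.8
v = [rng.gauss(0.0, 1.0) for _ in range(p)]
# <eta'(v; theta)>: the average derivative of the soft threshold
avg_deriv = sum(1.0 for vi in v if abs(vi) > theta) / p
# ||eta(v; theta)||_0 / p: the fraction of surviving (nonzero) coordinates
frac_nonzero = sum(1.0 for vi in v if eta(vi, theta) != 0.0) / p
```

The two quantities agree coordinate by coordinate, since a coordinate survives the threshold exactly when the derivative there is 1.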

Proofs of section 3.1
Proof of Lemma 1. This is an immediate application of state evolution as defined in [3], which considers general recursions. Hence, it suffices to show that the proposed algorithm is a special case of it. In the original notation of [3], two scalars ξ t and λ t are defined, as in (7.12), where · denotes an empirical mean over the entries of a vector and derivatives are taken with respect to the first argument. According to [3], the state evolution recursion involves two variables. To see that the RAMP algorithm in (2.6), (2.4) and (2.3) is a special case of this recursion, we specify the components of the general recursion accordingly, with the initial condition q 0 = x 0 . Now, we verify that the simplification of the series of equations (7.14)-(7.19) recovers the RAMP algorithm iterations. We discuss the first step of the algorithm, then the third, and leave the discussion of the second step for last. The first display recovers the first step of our algorithm, and the second recovers the third step. Further, we need to show that h t+1 in the above special recursion satisfies the equation of h t+1 in the general AMP recursion; this equation is only true when ξ t = 1. Moreover, by the definition of G, we conclude that ξ t = 1. Therefore, we have shown that the RAMP algorithm is a special case of the general recursion, and we can conclude that Theorem 2 of [3] applies. The proof is then completed by the simple observation that Z and −Z have the same distribution.
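The squared-loss special case referenced throughout (the AMP algorithm of [4]) can be written out in a few lines; the Onsager reaction term is the (1/δ) η′ z t−1 correction in the residual update. This is a minimal self-contained sketch on a small noiseless instance; the dimensions, seed and threshold policy θ t = ατ t with τ t estimated by ||z t || 2 /√n are illustrative choices:

```python
import math
import random

def eta(x, t):
    """Soft-thresholding function."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def amp_lasso(A, y, alpha=2.0, iters=30):
    """AMP with soft-thresholding and the Onsager correction (squared loss)."""
    n, p = len(A), len(A[0])
    x, z = [0.0] * p, y[:]
    for _ in range(iters):
        tau = math.sqrt(sum(zi * zi for zi in z) / n)        # scale estimate
        # pseudo-data x + A^T z, then coordinatewise thresholding
        v = [x[j] + sum(A[i][j] * z[i] for i in range(n)) for j in range(p)]
        x = [eta(vj, alpha * tau) for vj in v]
        omega = sum(1 for xj in x if xj != 0.0) / p          # <eta'> = ||x||_0/p
        resid = [y[i] - sum(A[i][j] * x[j] for j in range(p)) for i in range(n)]
        # Onsager reaction term: (1/delta) * <eta'> * z_prev, delta = n/p
        z = [resid[i] + (p / n) * omega * z[i] for i in range(n)]
    return x

# small noiseless instance: n = 80, p = 100, s = 4 spikes of +/- 1
rng = random.Random(0)
n, p, s = 80, 100, 4
A = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(p)] for _ in range(n)]
x0 = [0.0] * p
for j in range(s):
    x0[j] = 1.0 if j % 2 == 0 else -1.0
y = [sum(A[i][j] * x0[j] for j in range(p)) for i in range(n)]
x_hat = amp_lasso(A, y)
mse = sum((a - b) ** 2 for a, b in zip(x_hat, x0)) / p
```

Dropping the Onsager term turns this into plain iterative soft-thresholding, which converges markedly more slowly — this is exactly the difference emphasized in the discussion of the reaction term above.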
Proof of Lemma 2. The statement of the lemma follows if we successfully show that: (a) the total first derivative of V(τ 2 , b(τ ), ατ ) is strictly positive for τ 2 large enough; (b) the function V is concave for all smooth loss functions ρ, but not necessarily for non-smooth loss functions ρ; and (c) lim τ →∞ V(τ 2 , b(τ ), ατ 2 ) is a strictly decreasing function of α. Part (a). According to the definition of Φ and Condition (R), we can represent the partial derivative of V(τ 2 , b, θ), where Prox′ is the derivative of the Prox function with respect to its first argument, f is the density of W , and r ν+1 is defined implicitly. Integrating the implicit relation yields (7.21). Observe that Prox is a strongly convex function with bounded level sets. [4] derive σ to be concave and, for large τ 2 , strictly increasing. Hence, r ν+1 + σZ can be made large and positive for large values of τ 2 . In turn, E[∂r ν+1 /∂(τ 2 , θ)] can be made strictly positive for large values of τ 2 . Together with Condition (D) and the convexity of ρ, we conclude that ∂E[Φ(W + σZ; b)]/∂(τ 2 , θ) is strictly positive for large τ 2 .
By changing the order of differentiation and expectation (allowed by the boundedness of the functions considered, given by Condition (R)), we obtain that b is defined as the solution to the equation ∂ 1 E[Φ(W + σZ; b)] = ω/δ (see Lemma 3 for details). Moreover, as before in equation (7.20), we focus on the last part of the display. The total derivative of (7.22) provides the implicit equation for ∂b/∂τ 2 . Next, we observe that υ 1 can be made positive for large τ 2 and that υ 2 is positive. By (7.21) we can see that ∂ 2 r ν /∂(τ 2 ) 2 decreases, since the curvature of a convex function decays away from the origin. Moreover, [4] prove that σ is strictly concave for α > 0. All of the above implies that ∂σ(τ 2 , θ)/∂τ 2 > 0 for large τ 2 and ∂ 2 σ(τ 2 , θ)/∂(τ 2 ) 2 ≥ 0. Condition (D) guarantees f > 0. Hence, ∂b/∂τ 2 can be made positive for large τ 2 . Next, it suffices to observe that the total derivative of V(τ 2 , b, θ) is given by the sum of the above marginal derivatives, all of which can be made positive.
Part (b). A careful inspection of the second derivative of V(τ 2 , b, θ) shows (by the same arguments as above) that the second derivative is negative, i.e., that the function V is concave, for all smooth ρ, and is not necessarily negative for non-smooth loss functions ρ. We show the analysis for one of the marginals, as the analysis for the rest is done equivalently.
Part (c). For part (c), the result of [4] provides the required limit. Moreover, they show that σ is strictly concave for α > 0. Hence, σ converges to some σ min as τ → ∞.
Proof of Lemma 3. This proof relies on Lemma 1 and a simple modification of Theorem 2 of [3]. That theorem provides a state evolution equation for a general recursion algorithm. As Lemma 1 establishes a connection between our algorithm and the general recursion, the proof is a simple application of Theorem 2 of [3], with a mild relaxation of its conditions. Let τ t and σ t be defined by the recursions (3.1)-(3.2). By Lemma 1, and with b t i defined therein in (7.9), Theorem 2 of [3] holds for any pseudo-Lipschitz function ψ : R 2 → R of order k and for all W i with bounded 2k − 2 moments. A careful inspection of the proof of Lemma 5 of [3] shows that if ψ is uniformly bounded, the restriction on the moments of W i is unnecessary: a version of Hoeffding's inequality suffices, applied to independent and not necessarily identically distributed random variables (see Theorem 12.1 in [13]). Next, we split the analysis into two cases: Φ differentiable and Φ not differentiable. For the first case, it suffices to observe that by Lemma 1 we have the corresponding representation of Φ(z t i ; b), and by Condition (R) ψ is a uniformly bounded function. Thus, the result above applies, and the proof follows by observing that the right-hand side is equal to ω/δ by (2.4). Next, we discuss the case of non-differentiable losses ρ. Let h be a bandwidth parameter of an estimator of ν(b), with h ∈ [0, C] for some constant C, 0 < C < ∞. We define S n (h) and set S o n (h) = S n (h) − ES n (h) for h ∈ [0, C]. Moreover, by Condition (R), Φ(z t i ; b) = b v 1 (Prox(z t i , b)) + b v 2 (Prox(z t i , b)). The absolutely continuous term ν 1 can be handled as in the case above; hence, without loss of generality we may assume it is equal to zero. Then, by the same arguments as for (7.25), we obtain n −1/2 S o n (h) → N (0, γ 2 (h)) in distribution for each h ∈ [0, C], and by the arguments of (7.25) we conclude that the right-hand side of the resulting equality is finite by Condition (R). To establish a uniform statement, we need the compactness, or tightness, of the sequence n −1/2 S n (h), h ∈ [0, C]. This follows by noticing that the sequence is a sequence of differences of two univariate empirical distribution functions, both of which converge weakly to a Wiener process (see Lemma 5.5.1 in [28]). Hence, the rate holds with τ = 1/2 for continuous ψ and τ = 1/4 for discontinuous ψ. By definition, ω/δ is the derivative of a consistent estimator of ν(b) = ∂ 1 EΦ(z t , b t ). Because of the equation above, we see that the conclusion holds for all consistent estimators of ν(b) with a bandwidth choice h → 0 and nh → ∞.

Proofs for section 3.2
Proof of Theorem 1. The proof is split into two parts. In the first step, we show that the proposed algorithm belongs to the class of generalized recursions defined in [3]. This result is presented in Lemma 1.
In the second step, we utilize a conditioning technique and the result of Theorem 2 of [3], designed for generalized recursions. For an appropriate sequence of vectors h t i of the generalized recursion, and with x 0 the true regression coefficient, they establish the statement for a pseudo-Lipschitz function ψ. We now proceed to identify x t with a suitable h t i of the proposed RAMP algorithm. By the definition of RAMP, equation (i) follows from the RAMP iteration, equation (ii) adds and subtracts the same term, and equation (iii) is the special choice of h t+1 in equation (7.14). Therefore, combining x t in equation (7.28) with equation (7.27), we obtain the claim.

Proof of Theorem 2. In order to prove this result we designed a series of Lemmas 4-15, provided in the Appendix. The main part of the proof is provided by the results of Lemma 6. In the next steps we apply Lemma 6 to the specific choice of vectors x = x t and r = |x − x t |. We show that there exist constants c 1 , . . . , c 5 > 0 such that, for each ε > 0 and some iteration t, Conditions (C1)-(C6) of Lemma 6 hold with probability going to 1 as p → ∞.
Moreover, we observe that ||x|| 2 2 /p < ∞ by the assumptions of the Theorem. Condition (C2). By the definition of x̂ as the minimizer of L, we conclude that L(x̂) < L(x) for any x ≠ x̂, and this applies to x = x t . Condition (C3). We need to show that ||sg(L, x t )|| 2 ≤ ε √p. The definition of the RAMP iteration determines the subgradient when x t ≠ 0, while in the case x t = 0 the subgradient sg(L, x t ) must satisfy (7.29). Moreover, by equation (2.3) and Lemma 1 we obtain (7.30), where we used the fact that z t − Φ(z t−1 , b t−1 ) = Prox(z t , b t ). Adding equation (7.30) to (7.29) and to the expression of sg(L, x t ), we can rewrite sg(L, x t ). Then, by the non-negativity of θ t−1 and the triangle inequality, we bound the two resulting terms. We consider the bound on B first. Observe that Prox(z t , b t ) = z t − Φ(z t , b t ). Then, utilizing (7.25) and (7.26), we observe that there exists 0 < q < √p such that ||Prox(z t , b t ) + Φ(z t , b t ) − Φ(z t−1 , b t−1 )|| 2 ≤ q. We define M ≡ sup ||z|| 2 ≤q ρ′(z), where ρ′(z) = v 1 (z). Then, by (7.43), the corresponding term is of the order O P (p log p/n). Now, it can easily be shown that the increments over any two distinct points r 1 and r 2 are bounded, uniformly in r 1 , r 2 , by a constant K, 0 < K < ∞. Also, by the boundedness of v 1 we have a bound of K 1 ||r 1 − r 2 || 2 (7.50), uniformly in r 1 , r 2 , for a constant K 1 , 0 < K 1 < ∞. With all of the above we conclude the claim. To prove compactness, we shall consider increments of S n (r) over small blocks. For r 2 > r 1 , the increments of S n (·) over the block B = B(r 1 , r 2 ) can be controlled by the following lemma, which is a simple modification of Lemma 5.3 of [4]; hence, we omit its proof. We apply this lemma to a specific choice of the set S.
Lemmas 10 and 11 imply the following important result.