On Bayesian robust regression with diverging number of predictors

This paper concerns the robust regression model when the number of predictors and the number of observations grow in a similar rate. Theory for M-estimators in this regime has been recently developed by several authors [El Karoui et al., 2013, Bean et al., 2013, Donoho and Montanari, 2013]. Motivated by the inability of M-estimators to successfully estimate the Euclidean norm of the coefficient vector, we consider a Bayesian framework for this model. We suggest a two-component mixture of normals prior for the coefficients and develop a Gibbs sampler procedure for sampling from relevant posterior distributions, while utilizing a scale mixture of normal representation for the error distribution . Unlike M-estimators, the proposed Bayes estimator is consistent in the Euclidean norm sense. Simulation results demonstrate the superiority of the Bayes estimator over traditional estimation methods.


Introduction
When fitting a linear regression model to data, estimators robust to outliers are often desired. One popular approach to achieve robustness to outliers is to use M-estimators under a penalty function other than the quadratic one. This methodology is often termed as "robust regression". Classical results for robust regression are that the M-estimator of the coefficients vector is consistent and normally distributed [see Huber, 2011, Chap. 7]. These results were obtained for the case p, the number of predictors, is fixed or grows slowly with the number of observations, n. The case where p grows faster than n have been drawing a lot of a attention. In that scenario, a popular approach is to consider penalization based estimation methods, e.g., the Lasso [Tibshirani, 1996].
We consider a different scenario. Assume that p < n, yet p grows at the same rate as n. That is, p/n → κ for some positive constant κ < 1. This scenario was first recognized as an interesting one by Huber [1973]. It is, however, only with the emergence of "big data" that researchers have begun to investigate the robust regression model under this regime. Asymptotic distribution and variance calculations were recently developed for M-estimators in this regime , El Karoui, 2013, Donoho and Montanari, 2013. Arguably the main result is that while the obtained estimator is normally distributed, its variance differs from Huber's classical results.  have further shown that, unlike the classical p ≪ n scenario, the optimal M-estimator, in terms of efficiency, is not obtained by maximizing the log density of the errors. One more striking result is that for Double-Exponential errors, for κ larger than approximately 0.3, least squares regression is superior to least absolute deviations regression.
We argue in the current paper that M-estimation might be the wrong approach to the robust regression model in the p/n → κ, 0 < κ < 1 regime. Our main motivation is as follows. When using M-estimators, the estimation error of a single coefficient is in the usual order of n −1/2 [El . However, the error accumulated over the coefficient vector does not vanish when n → ∞. We will further argue that when signal and the noise are of the same asymptotic order, only a few of the true coefficient values can be larger (in their absolute value) than n −1/2 . Putting it together, the estimation error of small coefficients, which are the majority, is larger than their actual value (in absolute value) when using M-estimators. Thus, apparent large effects can be actually microscopic.
Recognizing this characteristic of the problem, shrinkage methods may be a better fit in this robust regression model. Shrinkage can be achieved by using either the aforementioned regularization methods or by using a Bayesian approach with appropriate scaling of the prior hyperparameters. In this paper we consider the latter option.
It is well known that shrinkage can be achieved using a Bayesian methodology. Most of the discussion is centered in the normal error model. The James-Stein estimator, [James and Stein, 1961], is an empirical Bayes estimator [Efron and Morris, 1973]; Ridge regression estimator is identical to what we get if assuming the regression coefficients are iid with normal prior, and a maximum a posteriori (MAP) estimator is used; and if we replace the normal prior with Laplace distribution prior we get the Lasso. The Laplace prior for the coefficients is actively researched. An efficient Gibbs sampler was suggested by Park and Casella [2008]. However, the Lasso attracts criticism from a Bayesian point of view, since the full posterior distribution of the coefficients vector does not attain the same risk rate as the posterior mode [Castillo et al., 2015]. Another, more recent, prior proposal is the horseshoe prior [Carvalho et al., 2010[Carvalho et al., , 2009, which has some appealing properties, at least when the design matrix is the identity [van der Pas et al., 2014].
Our description above implies that the coefficients can be separated to two groups: small-value and large-value coefficients (in their absolute size). This perspective aligns with existing Bayesian variable selection literature, where priors are assigned hierarchically. First, coefficients are separated into two groups and then a prior distribution is determined according to the group assignment. Often, though not necessarily, one of the priors is the degenerate distribution at zero. A leading example for this framework is SVSS, Stochastic Search Variable Selection McCulloch, 1993, 1997]. An alternative approach is to have a prior distribution on the number of non-zero coefficients, to choose these coefficients uniformly, and then to have a prior on the non-zero coefficients [Castillo et al., 2012], In this paper we suggest a full Bayesian model for the robust regression model when p/n → κ, 0 < κ < 1. We choose the prior distributions and hyperparameters such that our prior knowledge on the design is taken into account. We then utilize a scale mixture of normal representation of the error distribution to construct a reasonably fast Gibbs Sampler.
The rest of the paper is organized as follows. In Section 2, we present notation and model assumptions, and claim that M-estimation should not be used in this p-close-to-n regime. In Section 3, we introduce an hierarchical Bayesian model and then, in Section 4, we present a Gibbs sampler for parameter estimation. Detailed example is given in Section 5 where we also present simulation results. Section 6 offers conclusion remarks. Proofs are given in Section 7.
2 Achilles heel of M-estimators when p/n → κ ∈ (0, 1) We start with notations. We use · , · 1 and · ∞ for the Euclidean norm, the ℓ 1 norm and the maximum norm of a vector, respectively. Throughout the paper we consider the model where ǫ (n) is a vector of i.i.d random variables with a known density function f ǫ (·; θ) characterized by θ, an unknown parameter, X (n) is a matrix of random predictors, β (n) is an unknown parameter vector we wish to estimate. To improve clarity, we henceforth omit the superscript indicating that all model components given in (1) depends on n (θ excluded). We denote X T i for the i th row of X. X and ǫ are assumed to be independent. We denote β 0 for the true value of β. For a given penalty function ρ, the M-estimator of β,β ρ , defined aŝ If ρ is convex one could alternatively solve the equation Huber's classical result [1973] is that if p 2 /n → 0 then √ n(β ρ − β 0 ) is asymptotically normal with a covariance matrix If it is further assumed that E(ǫ 1 ) = 0, then by using general M-estimation theory it can be shown that this result holds for p/n → 0. Portnoy [1984Portnoy [ , 1985 derived consistency and asymptotic normality of M-estimators in the robust regression model under weaker assumptions. See also Maronna and Yohai [1981]. We now claim that in the model described above, a robust regression model where the number of predictors and the number of observations are similar, Mestimators have undesirable properties. But first, we present our model assumptions: (M1) lim n→∞ p n = κ ∈ (0, 1).
(M2) The rows of X are i.i.d N(0, Σ) for a known covariance matrix sequence Σ = Σ p . Furthermore, assume the eigenvalues of Σ are bounded away from zero, for all p.
(M4) f ǫ is symmetric and for the function g ǫ (u) = f ǫ ( √ u) we have, for u > 0 and Assumption (M4) would be exploited when we will consider the Bayesian formulation of the problem. We note in passing that this assumption is fulfilled by rich family of distributions, such as Student's T distribution, the Laplace distribution, and the more inclusive exponential power family [Andrews andMallows, 1974, West, 1987]. We now take the frequentest point of view, which we will later, in the next section, replace in a Bayesian perspective. When estimating a multidimensional parameter a loss function is needed to aggregate over the different components. A natural loss function is the ℓ 2 of the estimation errors. The following proposition motivates our discussion.
Proposition 1. Let assumptions (M1)-(M3) and the regularity assumptions needed for Result 1 in El  hold. Letβ ρ be the M-estimator defined in (2) with respect to a non-linear convex function ρ. Then β ρ − β 0 = O p (1) A proof using the results of El  is given in Section 7. Proposition 1 implies that the so-called ℓ 2 −consistency cannot be achieved by M-estimators under this regime.
However, we now claim M-estimators might be the wrong approach here. Consider specifically the statistical interesting problem arising when the signal and the noise are of the same asymptotic order, i.e, when X T i β 0 = O p (1). For a moment, let Σ = I. Then, X T i β 0 = O p (1) implies β 0 = O p (1). If Σ = I, then since Σ is known, the last statement holds if takingX i = X i Σ −1/2 instead of X i . Informally, having the number of predictors in the same scale as the number of observations while considering a finite signal-to-noise ratio, that does not vanish as n grows, implies additional assumptions on β 0 structure. For example, not too many components of β 0 can be much larger than n −1/2 (in their absolute value), otherwise the signal would be stronger than assumed. On the other hand, if β 0 is concentrated very close to zero for all n, then the signal is too week as X T i β 0 is too small.
M-estimation is invariant to translation of β 0 , but some β 0 values are less expected then others, as we argued in the preceding paragraph. Therefore, another approach, which exploits that knowledge on the parameter vector β 0 is desirable. Since we know that many of the true coefficients are smaller than n −1/2 (in their absolute value) we could potentially gain better estimates if we shrink some coefficients towards zero. This can be done using regularization based methods, or alternatively, using a Bayesian approach in a way that shrinkage is encouraged by the specified prior distribution. In this paper, we choose to take the latter option, and in the next section we develop such a Bayesian hierarchical model.

A Bayesian Model
A Bayesian model for the robust regression involves at least one more level of parameters. Assume model (1) holds with a density function f ǫ that depends on a parameter θ, and obeys assumptions (M3)-(M4). We start with a prior for β. As we argued in the previous section, we expect many of its components to be small, while some are considerably larger. To accommodate for this, we present a mixture prior for each β j , j = 1, .., p, with two normal mixture components. Denote T = (t 1 , ..., t p ) ∈ {1, 2} p for the vector that indicates for each component j if β j has a large variance (t j = 1) or a small one (t j = 2). We also denote φ = E( p j=1 1{t j = 1}) for the number of "large" components in T . Finally, let δ 2 k , k = 1, 2 be the variance in each of the mixture components. Putting all together, the following assumptions depict our prior distributions.
(P1) The prior for β is of an iid mixture of two zero-mean normal variables Assumption (P2) implies that φ grows with n, but yet it is much smaller than p.
Note that under assumptions (P1)-(P3) we have Working prior distributions can be taken for φ, δ 1 and δ 2 , adding another level of hierarchy to the model. In practice, the parameters of these prior distributions are chosen such that assumptions (P2) and (P3) are fulfilled. In Section 5, we present such an example. Alternatively, φ, δ 1 and δ 2 may be taken as known values that obeys Assumptions (P1)-(P3). The prior distribution for β as specified in Assumptions (P1)-(P3) reflects our knowledge on β when we assume the signal and the noise in our model are of the same order. This prior implies that only a small part of the coefficients can be larger than the n −1/2 threshold. This can be stated formally using Chebyshev's inequality; See Lemma 1 in Section 7. Of course, one can think of other prior distributions for β having this property.
As for θ, we assume a prior distribution q(θ), where with some abuse of notation, q always denotes a density, and the particular relevant density would be clear from its argument. Unlike other model parameters, this prior does not change with n.
Before moving to the estimation procedure, we present a theoretical result showing that unlike M-estimators, a Bayes estimator for β can achieve ℓ 2 −consistency. As stated before, M-estimators in our regime are consistent when considering each coordinate of the vector separately, but not when considering the parameter vector as a whole. The following theorem shows that Bayesian estimator in the discussed model is consistent (in the Euclidean norm sense) for the parameter vector.
where p − → denotes convergence in probability. The proof is given in Section 7.

Sampling from the posterior distribution
Modern Bayesian statistics relies on the ability to sample from the posterior distribution. This may pose a challenge especially if the parameter space is high dimensional. Assumption (M4) aids us in construction of a Gibbs sampler Geman and Geman [1984], with blocked sampling available for the full conditional of β. Under the condition in Assumption (M4), we can write f ǫ as a scale mixture of normal distribution, Andrews and Mallows [1974], where ϕ is the density function of a standard normal random variable. This is the density of independent normally distributed random variables with mean zero and variance σ 2 i . The mixing distribution q(σ 2 |θ) can be identified in some cases [Andrews andMallows, 1974, West, 1987]. If f ǫ is the density of a Laplace distribution, as in the example we present in Section 5, then q(σ 2 |θ) ∝ exp(−θ 2 σ 2 /2).
This representation adds n parameters as an individual σ 2 i is artificially introduced to the model for each observation i. However, it allows for direct sampling from all the full conditionals of all the parameters, and a Gibbs sampler can be used. This Gibbs sampler resembles the one suggested for the Bayesian Lasso [Park and Casella, 2008], especially for the case presented in our example in Section 5 when the errors have a Laplace distribution. Note however that in their case the scale mixture of normals representation is taken for the prior β and the errors are normally distributed, where here we apply this representation to the errors, and the prior for β is a mixture of two normal distributions.
We now present the Gibbs sampler when Φ, δ 1 and δ 2 are known hyperparameters. In Section 5 we demonstrate how standard conjugate priors can be used for these parameters in a concrete example. The Gibbs sampler iterates between the full conditionals of θ, σ 2 1 , ..., σ 2 n , t 1 , ..., t p and β. Starting with θ, one can sample from q(θ|σ 2 1 ..., σ 2 n ) ∝ q(θ) n i=1 q(σ 2 i |θ). The mixing distribution q(σ 2 i |θ) is determined by the error distribution f ǫ (t; θ), and depending on the prior q(θ), we may have a conjugate family (as in our Section 5 example). Alternatively, we may use Metropolis-Hastings step for θ only. Since θ is one dimensional, we expect such a step to marginally affect the computation time of the Bayes estimator.
Moving to each σ 2 i , we have and depending on the mixture distribution this may be a known distribution to sample from (see again Section 5). Next, let Γ = diag(σ 2 1 , σ 2 2 , . . . , σ 2 n ) and let V = diag(δ 2 t 1 , δ 2 t 2 , . . . , δ 2 tp ). The full conditional of β is a multivariate normal with mean µ and variance A −1 where A = X T Γ −1 X + V −1 and µ = A −1 X T Γ −1 Y . This is the main advantage of the scale mixture of normals representation for f ǫ ; block sampling of β can be used. Finally, denote p k = P (t j = k|β j , φ, δ 1 , δ 2 ) for the full conditional of each t j . We have We now comment on alternative posterior sampling strategies, all involve the Gibbs sampler described above. When the mixing distribution q(σ 2 |θ) results in a full conditional q(σ 2 i |Y i , X i , β, θ) with no available direct sampling, the proposed Gibbs sampler must use other MCMC procedures for sampling from the full conditionals of σ 2 i . Moreover, if Assumption (M4) does not hold, then the scale mixture of normal representation (3) cannot be used. Then, sampling from cannot necessarily be done in a direct sampling manner, but additional MCMC procedure, such as Metropoils-Hastings should be used. This is not, however, where most of the problem lies. Block sampling for β would have been based upon high-dimensional full conditional of β q(β|Y, X, T, δ 1 , but since p is large, this could be a too ambitious approach, when using Metropolis-Hastings or other MCMC procedures. So as a last resort, one could use a Gibbs sampler that uses q(β j |Y, X, T, δ 1 , δ 2 , β −j ), where β −j is the β vector without its j-th component. Both the approaches we just described are expected to be more computationally heavy than the procedure we suggested before. We remark that sampling from each q(σ 2 i |Y i , X i , β, θ) can be parallelized, to shorten computation time since the σ 2 i 's are independent given θ and β.

Example
We present in this section a specific example. First we describe the model with more detail, then we move to the estimation procedure and then we present simulation results with concrete numerical values. We assume a Laplace distribution for the errors, that is, This implies that the mixing probability is Exponential, with q(σ 2 |θ) ∝ exp(−θ 2 σ 2 /2) [Andrews and Mallows, 1974]. We take an Exponential prior distribution for θ 2 (a Gamma prior distribution would also work). The resulting posterior for θ 2 is a Gamma distribution with a shape parameter of n/2 + 1 and a scale parameter 1 + ( n i=1 σ 2 i )/2. Moving to δ 1 and δ 2 . Let N k = p j=1 1{t j = k}. Then given a prior distribution q(δ k ), the full conditional for δ k is simplified to We take here a conjugate Inverse-Gamma prior distribution for δ k with a shape parameter α k and a scale parameter γ k . The resulting posterior is Inverse-Gamma with parameters α ′ k = α k + N k /2 and γ ′ k = γ k + 1{t j =k} β 2 j /2. An alternative prior distribution here would be the Jeffreys improper prior q(δ 2 k ) ∝ 1/δ 2 k . Next, a standard Beta(α φ , γ φ ) prior is taken for φ/p so the resulting posterior is Beta(α φ + N 1 , γ φ +N 2 ). We note here that the values taken for α 1 , α 2 , α φ , γ 1 , γ 2 and γ φ should be on the background of Assumptions (P2)-(P3). These are working priors that reflect our assumptions on β. The dimension of β grows with n, while its ℓ 2 norm is assumed to be constant. Therefore, φ, δ 2 1 and δ 2 2 must change with n. This results in a working priors that also change with n.
Sampling from the full conditionals of β and T remains the same as described in Section 3, and is generally the same regardless of f ǫ , as long as the scale mixture of normal distributions representation (3) is used. The same property holds for the sampling from the full conditionals of δ 1 , δ 2 and φ as described above.
Moving to σ 2 i , substituting q(σ 2 |θ) ∝ exp(−θ 2 σ 2 /2) in (4), we have Similarly to Park and Casella [2008], this implies an Inverse-Gaussian distribution for σ −2 i with parameters a = θ 2 and b = θ/|Y i − X T i β|, where the Inverse-Gaussian distribution defined as having the density Chhikara [1988]. The full conditionals we just described are used in the proposed Gibbs sampler. We now turn to the presentation of simulation study results.

Simulation Study
We present simulation results for various κ values and for n = 500, 2500. We used 500 and 1000 iterations for the Gibbs sampler described above. Data were simulated in the following way. First, X was simulated under Σ = I, i.e., each of the iid rows of X was simulated as N p (0, I). Then, we simulated φ, δ 1 and δ 2 . We used the values {α φ = 30, β φ = 30(3κ log(n) − 1)}, {α 1 = 2, γ 1 = log(n)/n} and {α 2 = 2, γ 2 = n −1.5 } for the prior distributions of φ/p, δ 2 1 and δ 2 2 , respectively. These values were chosen so E(φ) = n/(3 log(n)), E(δ 2 1 ) = log(n)/n, E(δ 2 2 ) = n −1.5 and also to have prior distributions that are not too concentrated around their means. Given φ, T was simulated as iid Ber(φ/p). Then, we simulated β as independent normally distributed variables where the prior variance of each β j is δ 2 t j . Next, we simulate θ from an Exponential prior q(θ) = exp(−θ)1{θ > 0}, and simulate ǫ as iid random variables that follow Laplace distribution with parameter θ. Finally, the observed data is {X, Y }, with Y = Xβ + ǫ. We compared the Bayesian estimator to classical M-estimators, namely least squares and least absolute deviations estimators. We also considered standard regularization-based estimators. Let the objective function to be minimized written as where P λ is a regularization function and λ is a tuning constant chosen here by cross-validation. We considered the following four estimators, defined by the choice of ρ and P λ : Least squares Lasso (ρ(x) = x 2 , P λ (β) = β 1 ), least squares Ridge regression (ρ(x) = x 2 , P λ (β) = β 2 ), least absolute deviation Lasso (ρ(x) = |x|, P λ (β) = β 1 ) and least absolute deviation Ridge regression (ρ(x) = |x|, P λ (β) = β 2 ). The first two estimators were obtained using the algorithm described in Friedman et al. [2010] and the latter two estimators were calculated using the algorithm described in Yi and Huang [2015] (hqreg package in R). Table 1 presents medians of β −β 2 across 1000 simulations for the estimators we just described. We report the medians and not the means due to a small number of outliers, observed mainly for the regularization based methods. For each method, the ℓ 2 error grew with κ (or with p) for a fixed n, as one may have expected. The superiority of the proposed Bayes estimator is clearly shown. For n = 500 taking 1000 iterations of the Gibbs sampler resulted only in minor gain comparing to using the Gibbs sampler with 500 iterations. For n = 2500, taking 1000 iterations was essential to improve this estimator's performance.
Considering Proposition 1, the relatively poor performance of the non regularized M-estimators was expected. As expected, for κ > 0.3 the least absolute deviation (LAD) estimator was preferable over the least squares (LS) estimator. Among the regularization based methods, however, taking the least absolute deviation as a penalty function remained superior for all κ values, both for Lasso and Ridge regression. Comparing the Lasso and Ridge regression, the former showed better performance for a fixed penalty function. Bayesian methods are often time   consuming, especially when the target parameter is high dimensional and sampling from full conditionals is performed. The Gibbs sampler presented in this section involve direct sampling for all the parameters, with blocked sampling for β, without using additional MCMC steps. Table 2 compares median computation time in minutes between the different methods. Regularized least squares penalty function methods were considerably faster that the alternatives (excluding simple LS and LAD). However, as Table 1 suggests, they were also inferior to the other methods. The suggested Bayesian Gibbs sampler was comparable, and often preferable, in terms of computation time to the regularized least absolute deviations estimators.

Discussion
This paper provided a Bayesian alternative to frequentist robust regression when the number of predictors and the sample size are of the same order. Standard M-estimators are inconsistent when considering the error accumulated over the vector. If it is further assumed that signal and the noise are of the same asymptotic order, then shrinkage of many coefficients is desirable. We presented an hierarchical prior for model parameters and constructed a Bayes estimator suitable for the problem. A scale mixture of normal distribution representation for the errors' distribution allowed us to build an efficient Gibbs sampler with blocked sampling for the coefficients. Theorem 1 shows that under appropriate conditions, the Bayes estimator in this problem is consistent in the ℓ 2 sense. This property does not hold for M-estimators in this design. The Bayesian estimator also outperforms regularization based methods. Although these methods, at least for the quadratic penalty function, can be computed much faster, they provide point estimate only, where the Gibbs sampler provides the entire posterior distribution that can be used to inference. This issue was also pointed out in Park and Casella [2008]. It was previously shown that the asymptotic distribution of M-estimators in this design is nontrivial , Donoho and Montanari, 2013, El Karoui, 2013. In El Karoui [2013], the distribution of the M-estimator was studied as the limit of a Ridge regularized estimators. As standard asymptotic theory does not apply here, the full distribution of the Bayes estimator in this regime remains as a future challenge. withX T i = Σ −1/2 X i being a mean zero multivariate normal vector with the identity matrix as its variance. Their lemma also asserts that β ρ,simp and u are independent. By Result 1 in El , β ρ,simp has a deterministic limit denoted here (and there) by r ρ (κ). They further show how r ρ (κ) can be found. Next, by a multivariate central limit theorem, for large enough p, √ pΣ −1/2 u is approximately N p (0, Σ −1 ). Thus, for large enough n we have where x ∼ N p (0, Σ −1 ). Let Σ = T ΛT T be the spectral decomposition of Σ. For v ∼ N p (0, I) and v D = v * , we have where ≥ st. symbolizes larger in the stochastic sense, λ min is the minimal eigenvalue of Σ, and χ 2 p is a Chi-squared random variable with p degrees of freedom. Therefore, the right hand side of (5) is O p (n) and we are done.

Lemma 1: Presentation and proof
Denote B C,η n := 1 p j 1 |β j | > Cn −η/2 for the proportion of coordinates of β that are of order larger than Cn −η/2 . The following lemma ensures us that if the prior distribution of β admits Assumptions (P1)-(P3) then B C,η n is not far from φ/p, and consequently, B C,η n = O p (φ/p) = o p (log(n) −1 ). Thus, the proportion of "large" coefficient values out of the total number of coefficients goes to zero with probability that goes to one as n grows.
Lemma 1. Let assumptions (M1), (P1) and (P2) hold. Assume (P3) holds with some ξ. Then, for all 1 ≤ η < ξ, for all ζ > 0 and for any constant C we have Proof. First note that B C,η n is a mean of p independent Bernoulli random variables with success probability of with Φ being the CDF of a standard normal random variable. The second equality results from n η δ 2 2 = n η−ξ (Assumption (P3)) and since η < ξ. Now, this problem is symmetric, so it is suffice to show lim n→∞ P B C,η n > φ p (1 + ζ) = 0.
By Chebyshev's inequality we have and the last expression goes to zero as n → ∞.

Proof of Theorem 1
For the simplicity of the proof we will assume that Σ = I.
Letβ M AP = arg max β q(β|Y, X) be the MAP estimator. First we will show that β M AP − β 0 = o p (1), and then that β M AP −β ⋆ = o p (1). We Starting withβ M AP , will first show the weaker result β M AP − β 0 = O p (1). We will then build on this result to strengthen the conclusion. For any vector β, we partition β to two subvectors β M and β M c , were M is the subset of relatively large coefficients. We will then use the fact that β M AP By Taylor expansion, the log-posterior of β can be written as log q(β|Y, X) for some α β ∈ [0, 1]. Since the empirical distribution function converges to the cumulative distribution function there are M < ∞ and γ < 1 such that P ( 1(|ǫ i | > M) < γn) → 1.
Let U n ≈ V n if U n = O p (V n ) and V n = O p (U n ). Now Hence n −1 n i=1 ℓ(ǫ i ) p → Eℓ(ǫ) > −∞. Thus, ℓ(ǫ i ) ≈ n. Sinceβ M AP is the maximizer we have that q(β M AP |Y, X) − q(β 0 |Y, X) > 0. Consider this difference and recall also line (6). Consider first the difference in the prior contributions. If t j = 2 (see Assumption (P1)) then the contribution of the prior can be made larger by making β j smaller, but then |β j − β 0 j | = O p (n −ξ/2 ). If t j = 1 we can increase the prior contribution, but there are only O p (n/ log n) such terms. Thus, | p j=1 log(q β M AP Having established that the MAP estimator is in o(1) neighborhood of the truth, we consider nowβ * . Again, consider the partition to β M and β M c . The log-prior of any β j is log c 1/2 1 φ 3/2 p e −c 1 φβ 2 j + c 1/2 2 n ξ/2 e −c 2 n ξ β 2 j , where c 1 , c 2 = O(1). Thus, for |β M AP j | < L n n −ξ/2 , for L n → ∞ slowly, the central component of the prior dominates, and the a posteriori mass of this ball is negligible. Forβ M AP j ≈ φ −1/2 , we argued that the peak at 0 is much smaller than the value atβ M AP j . Since the radius of this peak is of order n −ξ . It can be ignored. Ignoring this neighborhood of 0, the a posteriori is strictly concave with maximum atβ M AP j , and curvature of order n and henceβ * j −β M AP j = O p (n −1/2 ).