Hierarchical Bayes, maximum a posteriori estimators, and minimax concave penalized likelihood estimation

Abstract: Priors constructed from scale mixtures of normal distributions have long played an important role in decision theory and shrinkage estimation. This paper demonstrates equivalence between the maximum a posteriori estimator constructed under one such prior and Zhang’s minimax concave penalization estimator. This equivalence and related multivariate generalizations stem directly from an intriguing representation of the minimax concave penalty function as the Moreau envelope of a simple convex function. Maximum a posteriori estimation under the corresponding marginal prior distribution, a generalization of the quasi-Cauchy distribution proposed by Johnstone and Silverman, leads to thresholding estimators having excellent frequentist risk properties.


Introduction
Most penalized likelihood estimators have a formal Bayesian interpretation. In particular, the estimates obtained from maximizing a penalized likelihood function may be viewed as the posterior mode, or maximum a posteriori (MAP) estimator, under a (possibly improper) prior distribution implied by the choice of penalty. For example, in the case of the LASSO with design matrix given by the identity, the minimization with respect to θ of (1/2)||Z − θ||^2 + λ||θ||_1, for ||θ||_1 = Σ_{i=1}^p |θ_i| and ||θ|| = (θ′θ)^{1/2}, is observed to be equivalent to computing the MAP estimator of θ under the model specification Z ∼ N_p(θ, I_p), where θ has a prior distribution satisfying θ_i ∼ iid DoubExp(λ), i = 1, …, p, for a constant λ > 0. The solution to this optimization problem is known to be θ̂_i(Z) = sign(Z_i)(|Z_i| − λ)_+, i = 1, …, p [36]. The hyperparameter λ, held fixed for the purposes of estimating θ, is usually estimated in some ad hoc manner (e.g., cross-validation), resulting in an estimator with an empirical Bayes flavor.
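As a concrete numerical illustration of the soft-thresholding solution above (a minimal sketch; the function name is ours, not from the paper):

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise LASSO/MAP solution under an identity design:
    sign(z_i) * (|z_i| - lam)_+ for each coordinate."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# components with |z_i| <= lam are set exactly to zero;
# the rest are pulled toward zero by lam
print(soft_threshold(np.array([-2.0, -0.5, 0.3, 1.7]), 1.0))
```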
The double exponential prior implicit in the LASSO penalization has broader connections to estimation under hierarchical prior specifications involving scale mixtures of normal distributions. For example, the double exponential distribution is an obvious special case of π_p(θ|λ) ∝ λ^p exp{−λ||θ||}, p ≥ 1, a class of densities that is itself a special case of a very general class of normal scale mixtures known as the multivariate exponential power distributions [cf. 19, Thm. 2.1]. Treating λ as a fixed hyperparameter and considering the corresponding generalization of the LASSO penalization, computation of the resulting MAP estimator under the likelihood specification Z ∼ N_p(θ, I_p) reduces to determining the value of θ that minimizes (1/2)||Z − θ||^2 + λ||θ||. The resulting estimator is easily shown to be θ̂_GL(Z) = (1 − λ||Z||^{−1})_+ Z, an estimator that (i) coincides with the solution to the canonical version of the grouped LASSO problem involving a single group of parameters [38]; and, (ii) for p = 1, reduces to the soft-thresholding operator θ̂(Z) = sign(Z)(|Z| − λ)_+.
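A numerical sketch of the single-group estimator θ̂_GL (illustrative code; for p = 1 it reduces to scalar soft thresholding):

```python
import numpy as np

def group_soft_threshold(z, lam):
    """Single-group grouped-LASSO solution: (1 - lam/||z||)_+ * z."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.zeros_like(z)
    return max(1.0 - lam / norm, 0.0) * z

# the whole block is either shrunk toward zero or set to zero entirely
print(group_soft_threshold(np.array([3.0, 4.0]), 2.5))  # shrinks ||z|| = 5 by half
```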
The marginal prior on θ is again observed to correspond to a scale mixture of normal distributions. However, in contrast to the LASSO and grouped LASSO estimators above, the minimax estimator θ̂_JS+(Z) is obtained as a MAP estimator when the posterior distribution under the Takada prior is maximized jointly in both θ and κ, not in θ alone for a fixed value of κ. It is interesting to note that θ̂_GL(Z) provides an empirical Bayes interpretation for θ̂_JS+(Z) upon replacing λ in θ̂_GL(Z) with λ̂ = (p − 2)/||Z||; see [34] for further discussion.
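The empirical Bayes connection just noted is easy to check numerically: substituting λ̂ = (p − 2)/||Z|| into θ̂_GL reproduces the positive-part James–Stein form (1 − (p − 2)/||Z||²)_+ Z. A small sketch (function names are ours):

```python
import numpy as np

def theta_gl(z, lam):
    """Grouped-LASSO estimator (1 - lam/||z||)_+ * z."""
    return max(1.0 - lam / np.linalg.norm(z), 0.0) * z

def theta_js_plus(z):
    """Positive-part James-Stein estimator (1 - (p-2)/||z||^2)_+ * z."""
    return max(1.0 - (z.size - 2) / float(z @ z), 0.0) * z

z = np.array([1.0, -2.0, 0.5, 3.0])
lam_hat = (z.size - 2) / np.linalg.norm(z)
print(np.allclose(theta_gl(z, lam_hat), theta_js_plus(z)))  # True
```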
Remarkably, upon setting p = 1, the marginal prior (1.2) can also be derived directly from the prior class (1.3), a result easily proved by integrating out ψ (i.e., instead of γ) in (1.1). The conditional prior π(θ|γ, α) takes the same form as π_p(θ|λ); in view of the fact that α = ∞ corresponds to placing a point mass at γ = λ, the prior (1.3) contains π_p(θ|λ), hence the double exponential density with scale parameter λ, as special cases. The results of [35], combined with this observation, motivate [34] to propose jointly maximizing the posterior distribution of (θ, γ)|Z induced by the likelihood specification Z ∼ N_p(θ, I_p) and the modified prior distribution (1.4), where α, λ > 0 are fixed hyperparameters. Comparing (1.3) and (1.4), the difference lies in replacing the proper prior on γ with an improper prior on γ. The nature of the modification that leads from (1.3) to (1.4) is analogous to the use of the factor (1 − κ)^{p/2} in the Takada prior π_T(κ). The prior distribution (1.4) is evidently improper; however, the corresponding posterior distribution for (θ, γ)|Z remains proper and can thus be used as a statistical model for the purposes of estimation and inference. In particular, the MAP estimator for (θ, γ) under (1.4) is obtained by jointly minimizing (1.5) for θ ∈ R^p and γ > 0. The solution obtained evidently corresponds to solving the LASSO (p = 1) or grouped LASSO (p > 1) problem with an additional penalty on the LASSO penalty parameter γ.
In this article, we focus on the implications of using the priors (1.3) and (1.4) in two MAP estimation problems when p = 1: assuming Z ∼ N (θ, 1), derive (i) the MAP estimator for (θ, γ) under the improper prior (1.4) by minimizing (1.5) for p = 1 in both θ and γ; and, (ii) the MAP estimator for θ under the proper marginal prior (1.3), or equivalently, (1.2). We show that the optimization problem in (i) results in an estimator for θ that coincides with the univariate version of the MCP threshold estimator [39]. We further show that the MCP penalty function can be characterized as the Moreau envelope of a simple convex function and subsequently generalize these observations to a much broader class of estimation problems. Finally, considering (ii), we establish a new class of thresholding estimators with desirable frequentist properties, as developed in [2] and [16].
As noted above, (2.1) is formally motivated as the posterior distribution of (θ, γ) under the hierarchically specified improper prior (1.4). It is well known that improper priors must be used with care in Bayesian estimation and inference problems, as these can lead to marginalization paradoxes and other problems; for example, see Kass and Wasserman [24] and Robert [28, p. 29]. Such concerns are less relevant in settings where the focus lies on the corresponding posterior distribution, provided this posterior distribution is well defined and generates a useful statistical model [e.g. 28, p. 128]. Indeed, the literature on Bayesian inference is rich with examples of improper priors that lead to useful estimators with good properties [e.g., 6,7,17]. The following result shows that the objective function (2.1) is derived from a valid posterior distribution; hence, the MAP for (θ, γ) is well-defined from a generalized Bayes perspective. The proof of this result may be found in Appendix A; the argument used there can be extended in a straightforward manner to similarly justify (1.5) for general p and vectors Z and θ.
Then, m(z) < ∞ for each z ∈ R, implying the existence of the posterior distribution.

Returning to the problem of minimizing (2.1), results in Strawderman and Wells [34, Thm. 4.2] imply that (2.1) is strictly convex for (θ, γ) ∈ R × R_+, with the unique solution for θ given by (2.3). The estimator (2.3) is equivalent to the univariate MCP threshold estimator, derived as the minimizer of (2.4) in θ only for fixed λ > 0 and a > 0. Indeed, (2.4) corresponds to a profiled version of (2.1) with γ set equal to its minimizer γ̂ = (λ − |θ|/a)_+ in Theorem 2.3 and t = |θ| ≥ 0; see Schifano [31, Ch. 6]. As a result, in the given form, (2.4) does not correspond to the estimator of θ that would be obtained using the marginal prior "density" of θ derived from (1.4). This fact can also be seen directly from the proof of Theorem 2.1, where an expression for the marginal prior "density" of θ is obtained in (A.2).
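This equivalence can be checked numerically. The sketch below assumes the standard closed form of the univariate MCP threshold for a > 1 and compares it against a brute-force grid minimization of the joint objective (1/2)(z − θ)² + γ|θ| + (a/2)(γ − λ)², our reading of (2.1):

```python
import numpy as np

def mcp_threshold(z, lam, a):
    """Univariate MCP (firm) threshold, assuming a > 1:
    returns z beyond a*lam, and a scaled soft-threshold inside."""
    if abs(z) > a * lam:
        return z
    return np.sign(z) * a * max(abs(z) - lam, 0.0) / (a - 1.0)

def joint_min_theta(z, lam, a):
    """Brute-force joint minimizer in theta of
    0.5*(z - th)^2 + g*|th| + 0.5*a*(g - lam)^2 over a (th, g) grid."""
    ths = np.linspace(-5.0, 5.0, 2001)
    gs = np.linspace(0.0, 5.0, 1001)
    TH, G = np.meshgrid(ths, gs)
    obj = 0.5 * (z - TH) ** 2 + G * np.abs(TH) + 0.5 * a * (G - lam) ** 2
    return TH[np.unravel_index(np.argmin(obj), obj.shape)]
```

For λ = 1 and a = 2, the two agree to grid accuracy for any |z| ≤ 5.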
The appearance of (2.3) as the solution for θ when jointly minimizing (2.1) is very surprising and suggests a direct connection between the respective optimization problems associated with (2.1) and (2.4). Theorem 2.3 establishes the exact nature of this connection.

Theorem 2.3. Let λ > 0 and a > 0. Then, for any t ≥ 0,

    ρ_{λ,a}(t) = min_{γ ≥ 0} { γt + (a/2)(γ − λ)^2 },    (2.5)

where ρ_{λ,a}(t) = λt − t^2/(2a) for t ≤ aλ and ρ_{λ,a}(t) = aλ^2/2 for t > aλ denotes the MCP penalty appearing in (2.4), and γ̂_λ = λ(1 − t/(aλ))_+ is the unique solution to the right-hand side of (2.5).
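The identity (2.5) is easy to verify numerically. The sketch below takes the closed form of the MCP penalty (λt − t²/(2a) for t ≤ aλ, and aλ²/2 otherwise) as given and evaluates the right-hand side of (2.5) on a fine grid:

```python
import numpy as np

def mcp_penalty(t, lam, a):
    """Closed-form MCP penalty."""
    return lam * t - t * t / (2.0 * a) if t <= a * lam else a * lam * lam / 2.0

def envelope(t, lam, a):
    """Grid evaluation of min_{g >= 0} { g*t + (a/2)(g - lam)^2 };
    the minimizer always lies in [0, lam], so the grid [0, 2*lam] suffices."""
    gs = np.linspace(0.0, 2.0 * lam, 100001)
    return float(np.min(gs * t + 0.5 * a * (gs - lam) ** 2))

def gamma_hat(t, lam, a):
    """Closed-form minimizer lam*(1 - t/(a*lam))_+ from Theorem 2.3."""
    return lam * max(1.0 - t / (a * lam), 0.0)
```

Both the envelope value and its minimizer match the closed forms over a range of t.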
A proof of this result is provided in Appendix B. Remarkably, the equivalence result in Theorem 2.3 can also be established using results from convex analysis. The minimization problem on the right-hand side of (2.5) is equivalent to the unconstrained problem

    min_{γ ∈ R} { γt + ι_C(γ) + (a/2)(γ − λ)^2 },    C = [0, ∞),

where ι_C(γ) is zero for γ ∈ C and infinity for γ ∉ C; see, for example, Rockafellar and Wets [29, p. 7]. That is, (2.5) identifies the MCP penalty as the Moreau envelope of the simple convex function γ ↦ γt + ι_C(γ), evaluated at λ.

In view of (2.1), the recovery of the LASSO solution (i.e., soft thresholding) at a = ∞ is initially surprising, for (2.1) reduces to the LASSO objective function if one sets a = 0 and treats γ as a fixed nonnegative constant. However, careful inspection of (2.1) shows that any value of γ other than γ = λ will result in an infinite objective function as a → ∞; for γ = λ, we evidently recover the LASSO objective function and corresponding solution at a = ∞.
The recovery of soft thresholding at a = ∞ also has an interesting quasi-Bayesian interpretation. Considering (1.4) for λ > 0 and recalling that a = 2α, the prior distribution on γ (i.e., π(γ|α, λ)) evidently becomes increasingly concentrated about γ = λ as α → ∞. Asserting that this also implies π(θ, γ|α, λ) → π(θ|γ = λ, α, λ) as α → ∞, the MAP estimation problem leading to (2.1) again reduces to that for the univariate LASSO problem. It is interesting that the hard thresholding estimator arises as the solution when a = 1, the boundary at which (2.1) transitions from being jointly convex to non-convex. However, in contrast to the case where a = ∞, a similarly intuitive Bayesian interpretation of this result does not appear to be available. Nevertheless, the indicated formulation of MCP shows that a and λ may reasonably be interpreted as hyperparameters, suggesting new methods of tuning parameter selection.
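These limiting cases can be illustrated numerically (a sketch assuming the standard closed-form MCP threshold for a > 1): very large a recovers soft thresholding, while a near 1 recovers hard thresholding away from the transition point |z| = λ.

```python
import numpy as np

def mcp_threshold(z, lam, a):
    """Univariate MCP threshold for a > 1."""
    if abs(z) > a * lam:
        return z
    return np.sign(z) * a * max(abs(z) - lam, 0.0) / (a - 1.0)

def soft(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def hard(z, lam):
    return z if abs(z) > lam else 0.0

# a -> infinity: MCP threshold approaches the soft-thresholding rule;
# a -> 1+:       MCP threshold approaches the hard-thresholding rule.
```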

Thresholding rules derived as MAP estimators under marginal priors
The estimator (2.3) is derived as a MAP estimator through jointly optimizing the negative log-posterior (2.1) in θ and γ. From a frequentist perspective, such a procedure is sensible; one merely transfers the need to estimate the LASSO penalty parameter γ to the need to estimate (or otherwise specify) the hyperparameter λ. However, from a Bayesian point of view, γ is really a nuisance parameter, and it is arguably more natural to derive the MAP estimator, as well as other Bayesian estimators (e.g., posterior means and medians), using the marginal prior distribution for θ [e.g., 20, 3, 27]. In the case of (1.1), equivalently (1.3), the relevant proper marginal prior is given by (1.2). It is straightforward to verify that this marginal prior distribution reduces to a double exponential density as α → ∞; hence, as in the previous section, the MAP estimator under (1.2) also converges to the soft-thresholding estimator as α → ∞. More generally, for s ≥ 0, define the penalty p_{λ,α}(s) as in (2.7), where α, λ > 0 are constants and c_{α,λ} = log[1 + λ(2α)^{1/2} M{−λ(2α)^{1/2}}] ensures that p_{λ,α}(s) ≥ 0 whenever s ≥ 0. Then, under the normal model of Section 2.1, and with θ|α, λ having the prior distribution (1.2), the computation of the MAP estimator for θ is equivalent to minimizing (1/2)(θ − z)^2 + p_{λ,α}(|θ|) for θ ∈ R, the constant c_{α,λ} being irrelevant.
Properties of (2.7), presented with a view towards deriving thresholding rules satisfying the requirements of Antoniadis and Fan [2], are recorded in the theorem below; the proof may be found in Appendix C.

MAP estimation and MCP: Extensions for multivariate problems
Theorem 2.3 confirms that minimizing (2.4) for θ ∈ R is equivalent to joint minimization of (2.1) for θ ∈ R and γ ≥ 0. Existence and uniqueness, hence equivalence, of the minimizers of (2.1) and (2.4) are ensured by the strict convexity of these objective functions that results from imposing the additional condition that a > 1. Under suitable regularity conditions, this result can be generalized in a very substantial way. As a prelude to this result, we note the following corollary to Theorem 2.3.
Since Σ_{i=1}^L { w_i γ_i t_i + (a/2)(γ_i − λ)^2 } is a separable sum of positive functions, the proof of this result is an easy consequence of Theorem 2.3. The equivalence result in (2.8) is also what is needed to prove the following general result for MCP estimators; see Appendix D for the proof.
Theorem 2.7. Let θ ∈ Ω ⊂ R^p and, for i = 1, …, L, write θ = (θ′_1, …, θ′_L)′, where θ_i has dimension p_i ≤ p and Σ_{i=1}^L p_i = p. Suppose g(θ) is a strictly convex, twice continuously differentiable function on Ω, where Ω is a bounded convex set. Let H(θ) denote the Hessian matrix of g(θ) and assume λ_min > 0, where λ_min is the minimum eigenvalue of H(θ) over θ ∈ Ω. Finally, let w_i > 0, i = 1, …, L, be constants and define w_max = max_{i=1,…,L} w_i. Then, for λ > 0 and a > w_max/λ_min > 0, it follows that the minimization problem (2.9) is equivalent to (2.10).

The stated equivalence holds in much greater generality, in the sense that the sets of global minima for general g(θ) and/or a > 0 failing to satisfy the indicated eigenvalue constraint also coincide with each other [e.g., 29, Prop. 1.35].
The class of problems represented by (2.9), hence (2.10), includes many interesting special cases. For example, with p_i = 1 for each i, hence L = p, taking g(θ) = (1/2)||y − Xθ||^2 for a response vector y and design matrix X corresponds to minimax concave penalized estimation for a linear model [39]. Similarly, with p_i ≥ 1 for each i and p_j > 1 for at least one j, we have L < p and hence a "grouped" version of minimax concave penalization; see, for example, Breheny and Huang [9]. Taking g(θ) to be the negative log-likelihood function for a generalized linear model extends this framework to a much wider class of problems. While a geometric interpretation and algorithm exist for fitting linear models with the minimax concave penalty [39], the fitting of generalized linear models with the minimax concave penalty relies heavily on iterative optimization algorithms. From an algorithmic point of view, (2.10) provides a direct route for solving such problems iteratively, proceeding by estimating θ for fixed γ and then estimating γ for fixed θ, the latter update existing in closed form (see Theorem 2.3). In the case where L = p, p_i = 1 for each i, and g(θ) corresponds to the negative log-likelihood function for a generalized linear regression model, the representation (2.10) directly justifies the use of the local linear approximation (LLA) algorithm suggested in [41] for minimax concave penalization, and can be fruitfully combined with majorization-minimization algorithms [e.g., 32] and related coordinate-wise optimization methods [e.g., 10]. For example, the MIST algorithm builds directly on the work of Zou and Li [41] using a suitable modification of the majorization-minimization algorithm that facilitates coordinatewise optimization. In the case of a generalized linear model, the majorization function used in MIST is exactly of the form (2.10); see Schifano, Strawderman and Wells [32] for comparisons to the original LLA algorithm in the case of the minimax concave penalty.
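A sketch of this alternating scheme for a linear model with p_i = 1 and w_i = 1 (illustrative code, not the MIST implementation): the θ-step is one coordinate-descent pass on the weighted-LASSO surrogate with per-coordinate penalties γ_i, and the γ-step uses the closed form γ̂_i = (λ − |θ_i|/a)_+ from Theorem 2.3.

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def mcp_linear_alternating(X, y, lam, a, n_iter=200):
    """Alternate between (i) one coordinate-descent pass on
    min_theta 0.5*||y - X@theta||^2 + sum_i gamma_i*|theta_i|
    and (ii) the closed-form update gamma_i = (lam - |theta_i|/a)_+."""
    n, p = X.shape
    theta = np.zeros(p)
    gamma = np.full(p, lam)       # gamma = lam corresponds to theta = 0
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ theta             # running residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * theta[j]      # remove coordinate j from the fit
            zj = X[:, j] @ r
            theta[j] = soft(zj, gamma[j]) / col_sq[j]
            r -= X[:, j] * theta[j]      # restore residual
        gamma = np.maximum(lam - np.abs(theta) / a, 0.0)
    return theta
```

With X = I, the iteration converges to the componentwise MCP threshold of y (each coordinate map is a contraction with rate 1/a when a > 1).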
Coordinate descent methods for solving (2.9) have been proposed for linear models with the minimax concave penalty [25]; we are not currently aware of published work that extends these results to directly solving (2.9) in the case of generalized linear models. It would also be interesting to study how such an algorithm would perform in comparison to a coordinate descent algorithm specifically designed for (2.10).
The results and discussion above are focused solely on the equivalence of the optimization problems (2.9) and (2.10). In order to interpret (2.10) as a MAP estimation problem in a generalized Bayes context, the function g(θ) should correspond to a suitable negative log-likelihood function for the response Z and, similarly to Theorem 2.1, we must show that m(Z) < ∞, where m(Z) denotes the marginal density of Z defined analogously to (2.2). Calculations similar to those used to establish (A.2) may be used here as well; hence, m(Z) < ∞ and propriety of the resulting posterior distribution follow if (2.11) is finite. Because r_{w_i a/2, λ}(w_i θ_i) > −λ(w_i a)^{1/2} for all θ ∈ Ω, 0 < M(s) < ∞ for all s > −∞, and M(s) is strictly decreasing, a sufficient condition for (2.11) to be finite can be stated in terms of some cumulative distribution function F(·), provided F(·) has at least k = dim(θ) moments, the design matrix X has full column rank, and a certain verifiable constraint involving the binary response vector Z and X holds. Under related conditions on the design matrix and sample configuration, Chen, Ibrahim and Shao [13, Thms. 1 & 2] establish propriety of the posterior for Cox's regression model [15] under a flat prior on the regression parameter θ using either the partial likelihood function or the full likelihood function, the latter additionally involving a gamma process prior specification on the cumulative hazard function.

Discussion
In various forms, the class of normal scale mixture priors (1.1) has long played an important role in decision theory and shrinkage estimation. This paper demonstrates that several interesting thresholding estimators with good frequentist properties are connected either directly or indirectly to this class of priors, including the minimax concave penalized estimator of [39] (i.e., derived as a MAP estimator under the improper modification (1.4)), thresholding estimators based on generalizations of the quasi-Cauchy prior of [23] (i.e., MAP estimators derived using the marginal prior (1.2)), and a wide class of multivariate generalizations of these results.
Theorem 2.3 further shows that penalties constructed via infimal convolutions of convex functions have natural connections to MAP estimators derived under the hierarchical prior (1.4). Certainly, not all penalty functions of interest can be represented in this way. Moreover, not all penalty functions that can be represented in this way will necessarily have an attractive Bayesian interpretation. For example, consider for t = |θ| ≥ 0 the "smoothly clipped absolute deviation" (SCAD) penalty p_{λ,a}(·) discussed in Fan and Li [16], which is designed to exhibit certain sparsity, continuity, and unbiasedness properties. However, beyond these key features, the motivation underlying p_{λ,a}(·) is largely heuristic. Define the function g_t(γ) = c_t γ^2 + d_t γ + 0.5(a + 1)(γ − λ)^2, where d_t = (λ + t)1[t ≥ aλ] and c_t is defined accordingly. Arguments like those used to prove Theorem 2.3 show that min_{γ≥0} g_t(γ) = p_{λ,a}(t) for t ≥ 0. Formulated similarly to (1.4), these results suggest a joint prior distribution of the form π(θ, γ|a, λ) ∝ exp{−(c_{|θ|}γ^2 + d_{|θ|}γ) − 0.5(a + 1)(γ − λ)^2}. Inspecting c_{|θ|} and d_{|θ|}, this prior is observed to have a discontinuity at |θ| = aλ: it approaches zero as |θ| ↑ aλ and a density proportional to exp{−0.5(a + 1)(γ^2 + λ^2)} as |θ| ↓ aλ. Such behavior presents an interesting contrast to the minimax concave penalty, which has a simpler and arguably more attractive Bayesian motivation while sharing the same sparsity, continuity, and unbiasedness properties. Hierarchical priors often lead to Bayes estimators with good robustness and frequentist properties [e.g., admissibility and minimaxity; 5]. However, the implications of using penalty functions constructed from hierarchical priors have received only limited attention in the literature on penalized estimation.
This paper has demonstrated strong links between hierarchical priors associated with Bayesian procedures having good decision-theoretic and shrinkage properties and frequentist sparse regularization procedures exhibiting good risk performance. The connections outlined here, including those with convex regularization and proximal operator theory, extend well beyond the univariate setting (e.g., Section 2.3) and suggest several novel avenues of investigation in penalized estimation problems.

Acknowledgement
We thank two referees, the associate editor and editor for their helpful comments and encouraging feedback.

Appendix A: Proof of Theorem 2.1

Let z ∈ R and λ > 0 be finite and assume a > 1. We wish to prove that (2.2) is finite; that is, m(z) < ∞, where m(z) is given by (A.1). We may write m(z) = ∫_R φ(z − θ)π̃(θ) dθ, where π̃(θ) denotes the implied (unnormalized) marginal prior density of θ. For any finite a, straightforward computations show that π̃(θ) takes the form given in (A.2), where r_{a/2,λ}(s) is defined as in (1.2). Letting a → ∞, it can be shown that (A.2) converges to √π e^{−λ|θ|}.
Supposing first that a → ∞, it is immediately clear that m(z) is finite. Assume now that a is finite. Observe that r_{a/2,λ}(|θ|) > 0 if and only if |θ| > aλ.
Defining C = {θ : |θ| ≤ aλ} and c_{a,λ} = φ(r_{a/2,λ}(0))√π, we may rewrite (A.1) as the sum of three terms, (I), (II), and (III). The function π̃(θ) is evidently bounded on C; hence, term (I) is finite. For term (II), we may use the fact that M{r_{a/2,λ}(s)} is a strictly decreasing function (e.g., see Lemma C.1 in Appendix C) to conclude that term (II) is finite; arguing similarly, term (III) is also finite. As all three terms are finite, it follows that m(z) < ∞, completing the proof.