Jan 2011

MM Algorithms for Minimizing Nonsmoothly Penalized Objective Functions

The use of regularization, or penalization, has become increasingly common in high-dimensional statistical analysis over the past decade, where a common goal is to simultaneously select important variables and estimate their effects. It has been shown by several authors that these goals can be achieved by minimizing some parameter-dependent "goodness of fit" function (e.g., a negative loglikelihood) subject to a penalization that promotes sparsity. Penalty functions that are nonsmooth (i.e., not differentiable) at the origin have received substantial attention, arguably beginning with the LASSO (Tibshirani, 1996). The current literature tends to focus on specific combinations of smooth data fidelity (i.e., goodness-of-fit) and nonsmooth penalty functions. One result of this combined specificity has been a proliferation in the number of computational algorithms designed to solve fairly narrow classes of optimization problems involving objective functions that are not everywhere continuously differentiable. In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework utilizes the majorization-minimization (MM) algorithm as its core optimization engine. The resulting algorithms rely on iterated soft-thresholding, implemented componentwise, allowing for fast, stable updating that avoids the need for any high-dimensional matrix inversion. We establish a local convergence theory for this class of algorithms under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the EM algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm's capabilities and versatility.


Introduction
Variable selection is an important and challenging issue in the rapidly growing realm of high-dimensional statistical modeling. In such cases, it is often of interest to identify a few important variables in a veritable sea of noise. Modern methods, increasingly based on the principle of penalized likelihood estimation applied to high-dimensional regression problems, attempt to achieve this goal through an adaptive variable selection process that simultaneously permits estimation of regression effects. Indeed, the literature on the penalization of a "goodness of fit" function (e.g., negative loglikelihood), with a penalty singular at the origin, is quickly becoming vast, proliferating in part due to the consideration of specific combinations of data fidelity (i.e., goodness-of-fit) and penalty functions, the associated statistical properties of resulting estimators, and the development of several combination-specific optimization algorithms (e.g., Tibshirani, 1996; Zou, 2006; Zou and Hastie, 2005; Zou and Zhang, 2009; Fan and Li, 2001; Park and Hastie, 2007; Friedman et al., 2008).
In this paper, we propose a unified optimization framework that appeals to the Majorization-Minimization (MM) algorithm (Lange, 2004) as the primary optimization tool. The resulting class of algorithms is referred to as MIST, an acronym for Minimization by Iterative Soft Thresholding. The MM algorithm has been considered before for solving specific classes of singularly penalized likelihood estimation problems (e.g., Daubechies et al., 2004; Hunter and Li, 2005; Zou and Li, 2008); to a large extent, this work is motivated by these ideas. A distinct advantage of the proposed work is the exceptional versatility of the class of MIST algorithms, their associated ease of implementation and numerical stability, and the development of a fixed point convergence theory that permits weaker assumptions than existing papers in this area. We emphasize here that the focus of this paper is on the development of a stable and versatile class of algorithms applicable to a wide variety of singularly penalized estimation problems. In particular, the consideration of asymptotic and oracle properties of estimators derived from particular combinations of fidelity and penalty functions, as well as methods for effectively choosing associated penalty parameters, are not focal points of this paper. A comprehensive treatment of these results may be found in Johnson et al. (2008), where asymptotics and oracle properties for estimators derived from a general class of penalized estimating equations are developed in some detail.
The paper is organized as follows. In Section 2, we introduce notation and provide sufficient conditions for local convergence of the MM algorithm applied to a large class of data-fidelity and non-smooth penalty functions. In Section 3, we present a specialized version of this general algorithm, demonstrating in particular how the minimization step of the MM algorithm can be carried out using iterated soft-thresholding. In its most general form, iterated soft-thresholding is required at each minimization step; we further demonstrate how to carry out this minimization step in one iteration through a judicious choice of majorization function. As a consequence, we present a simplified class of iterative algorithms that are applicable to a wide class of singularly penalized, generalized linear regression models.
Simulation results are provided in Section 4, while an application in survival analysis to Diffuse Large B Cell Lymphoma expression data (Rosenwald et al., 2002) is presented in Section 5. We conclude with a discussion in Section 6. Proofs and other relevant results are collected in the Appendix.
Provided that the objective function, its surrogate, and the mapping M(·) satisfy certain regularity conditions, one can establish convergence of this algorithm to a local or global solution. Lange (2004, Ch. 10) develops such a theory assuming that the objective functions ξ(β) and ξ_SUR(β, α) are twice continuously differentiable. For problems that lack this degree of smoothness (e.g., all singularly penalized regression problems, including the lasso, Tibshirani (1996); the adaptive lasso, Zou (2006); and SCAD, Fan and Li (2001)), a more general theory of local convergence is required. One such theory is summarized in Appendix A.1; related results for the EM algorithm may be found in Wu (1983), Tseng (2004) and Chrétien and Hero (2008). Let ‖·‖ denote the usual Euclidean vector norm. Based on the theory summarized in Appendix A.1, we propose a new and general class of algorithms for minimizing penalized objective functions of the form

ξ(β) = g(β) + p(β; λ) + λε‖β‖², λ > 0, ε ≥ 0, (2)

where g(β) and p(β; λ) are respectively data fidelity (e.g., negative loglikelihood) and penalty functions that satisfy regularity conditions to be delineated below. As will be shown later, the class of problems represented by (2) contains all of the penalized regression problems commonly considered in the current literature. It also covers numerous other problems by expanding the class of permissible fidelity and penalty functions in a substantial way. We assume throughout that g(β) is convex and coercive for β ∈ B, where B is an open convex subset of R^p. We further assume that the penalty decomposes componentwise as

p(β; λ) = Σ_{j=1}^p p̃(|β_j|; λ_j), (3)

where the vector λ = (λ_1^T, …, λ_p^T)^T and λ_j denotes the block of λ associated with β_j. It is assumed that each λ_j has dimension greater than or equal to one, that all blocks have the same dimension, and that λ_{j1} = λ for each j ≥ 1.
Evidently, the case where dim(λ_j) = 1 for j ≥ 1 simply corresponds to the setting in which each coefficient is penalized in exactly the same way; permitting the dimension of λ_j to exceed one allows the penalty to depend on additional parameters (e.g., weights, such as in the case of the adaptive lasso considered in Zou (2006)). We are interested in problems with penalization; therefore, λ is assumed bounded and strictly positive throughout this paper. Several specific examples will be discussed below. For any bounded θ with λ > 0 as the first element, and the remainder of θ collecting any additional parameters used to define the penalty, the scalar function p̃(r; θ) is assumed to satisfy the following condition: (P1) p̃(r; θ) > 0 for r > 0; p̃(0; θ) = 0; p̃(r; θ) is a continuously differentiable concave function with p̃′(r; θ) ≥ 0 for r > 0; and p̃′(0+; θ) ∈ [M_θ^{−1}, M_θ] for some finite M_θ > 0.
Evidently, (P1) implies that p̃′(r; θ) > 0 for r ∈ (0, K_θ), where K_θ > 0 may be finite or infinite. The combination of the concavity and nonnegative derivative conditions thus implies that the penalty increases away from the origin, but with a decreasing rate of growth that may become zero. The case where (3) is identically zero for r > 0 is ruled out by the positivity of the right derivative at the origin imposed in (P1); similarly, the concavity assumption also rules out the possibility of a strictly convex penalty term. Neither of these restrictions is particularly problematic. Our specific interest lies in the development of algorithms for estimation problems subject to a penalty singular at the origin. Were (3) absent, or replaced by a strictly convex penalty term, the convexity of g(β) implies that (2) could be minimized directly using any suitable convex optimization algorithm, such as that discussed in Theorem 3.2 below. Theorem 2.1 establishes local convergence of the indicated class of MM algorithms for minimizing objective functions of the form (2). A proof is provided in Appendix A.2, where it is shown that the conditions imposed in the statement of the theorem are sufficient for the application of the general MM local convergence theory summarized in Appendix A.1.
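As a concrete instance, the derivative of the SCAD penalty (Fan and Li, 2001) exhibits exactly the behavior described above: it is positive and constant near the origin, nonincreasing away from it (reflecting concavity), and identically zero beyond aλ. The following sketch, with function name and checks of our own choosing rather than the paper's code, encodes this derivative for r ≥ 0:

```python
def scad_deriv(r, lam, a=3.7):
    """Derivative p~'(r; lam) of the SCAD penalty for r >= 0 (Fan and Li, 2001).

    Constant (= lam) on [0, lam], linearly decreasing on (lam, a*lam],
    and identically zero for r > a*lam; requires a > 2."""
    if r <= lam:
        return lam
    return max(a * lam - r, 0.0) / (a - 1)
```

In the notation of (P1), the right derivative at the origin equals λ > 0, while the derivative vanishes for all r ≥ aλ, which is precisely the situation in which one or more thresholds τ_j can be exactly zero.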
Condition (iii) of Theorem 2.1 establishes convergence under the assumption that ξ_SUR(β, α) strictly majorizes ξ(β) and has a unique minimizer in β for each α. Such a uniqueness condition is shown by Vaida (2005) to ensure convergence of the EM and MM algorithms to a stationary point under more restrictive differentiability conditions. Importantly, the assumption of globally strict majorization is only a sufficient condition for convergence; this condition is only important insofar as it guarantees a strict decrease in the objective function at every iteration. As can be seen from the proof, it is possible to relax this condition to locally strict majorization, in which ξ_SUR(β, α) majorizes ξ(β), with strict majorization being necessary only in an open neighborhood containing M(α).
The use of the MM algorithm and the selection of (4) are motivated by the results of Zou and Li (2008); we refer the reader to Remark 3.1 below for further comments in this direction. The assumptions on g(β) clearly cover the case of linear and canonically parameterized generalized linear models upon setting g(β) = −ℓ(β), where ℓ(β) denotes the corresponding loglikelihood function. Estimation under the semiparametric Cox regression model (Cox, 1972) and accelerated failure time models is also covered upon setting g(β) to be either the negative logarithm of the partial likelihood function (e.g., Andersen et al., 1993, Thm VII.2.1) or the Gehan objective function (e.g., Fygenson and Ritov, 1994; Johnson and Strawderman, 2009).
Remark 2.2. The SCAD and MCP penalties are not strictly concave and lead to surrogate majorizers that fail to satisfy the globally strict majorization condition in (iii) of Theorem 2.1 unless h(β, α) is strictly positive whenever β ≠ α; see Remark 3.1 for further discussion and also Theorem 3.4 below.

The MIST algorithm
In general, the statistical literature on penalized estimation has proposed optimization algorithms tailored for specific combinations of fidelity and penalty functions. The class of MM algorithms suggested by Theorem 2.1 provides a very general and useful framework for proposing new algorithms, the key to which is a methodology for solving the minimization problem (1), a step repeated with each iteration of the MM algorithm. In this regard, it is helpful to note that the problem of minimizing ξ_SUR(β, α) for a given α is equivalent to minimizing

g(β) + λε‖β‖² + h(β, α) + Σ_{j=1}^p p̃′(|α_j|; λ_j)|β_j| (10)

in β. In particular, if g(β) + λε‖β‖² + h(β, α) is strictly convex for each bounded α, which clearly occurs if both g(β) and h(β, α) are convex in β and at least one is strictly convex, then (10) is also strictly convex and the corresponding minimization problem has a unique solution.
Remark 3.1. For ε = h(β, α) = 0 and g(β) = −ℓ(β) with ℓ(β) = Σ_{i=1}^n ℓ_i(β) for twice continuously differentiable loglikelihood contributions ℓ_i(β), the majorizer used by the MM algorithm induced by the surrogate function (10) corresponds (up to sign) to the minorizer employed in the LLA algorithm of Zou and Li (2008), an improvement on the so-called LQA algorithm proposed in Hunter and Li (2005). Zou and Li (2008, Proposition 1) assert convergence of their LLA algorithm under imprecisely stated assumptions and are additionally unclear as to the nature of the convergence result actually established. For example, while Zou and Li (2008, Theorem 1) demonstrate that the LLA algorithm does indeed have an ascent property, their result appears to be insufficient to ensure that the proper analog of condition Z3(ii) of Theorem A.1 holds in the case of the SCAD penalty. As a consequence, it is unclear whether even weak "subsequence" convergence results (cf. Wu, 1983) hold with useful generality in the case of the LLA algorithm. In contrast, Theorem 2.1 shows that strict majorization, under a few precisely stated conditions, is sufficient to ensure local convergence of the resulting MM algorithm to a stationary point of (2). In Section 3.2, it is further demonstrated how a particular choice of h(β, α) yields a strict majorizer that permits both closed-form minimization and componentwise updating at each step of the MM algorithm, even in the case of penalties that fail to be strictly concave.
Numerous methods exist for minimizing a differentiable convex objective function (e.g., Boyd and Vandenberghe, 2004). However, because (10) is not differentiable, such methods do not apply in the current setting. Specialized methods exist for nonsmooth problems of the form (10) in settings where g(β) has a special structure; a well-known example here is LARS (Efron et al., 2004), which can be used to efficiently solve lasso-type problems in the case where g(β) is replaced by a least squares objective function. Recently, Combettes and Wajs (2005, Proposition 3.1; Theorem 3.4) proposed a very general class of fixed point algorithms for minimizing f_1(h) + f_2(h), where f_i(·), i = 1, 2 are each convex and h takes values in some real Hilbert space H. Hale et al. (2008, Theorem 4.5) specialize the results of Combettes and Wajs (2005) to the case where H is some subset of R^p and f_2(h) = Σ_{j=1}^p |h_j|. The collective application of these results to the problem of minimizing (10) generates an iterated soft-thresholding procedure with an appealingly simple structure. Theorem 3.2, given below, states the algorithm, along with conditions under which the algorithm is guaranteed to converge; a proof is provided in Appendix A.3. The resulting class of procedures, that is, MM algorithms in which the minimization of (10) is carried out via this iterated soft-thresholding procedure, is hereafter referred to as MIST, an acronym for (M)inimization by (I)terated (S)oft (T)hresholding. Two important and useful features of MIST include the absence of high-dimensional matrix inversion and the ability to update each individual parameter separately.
Theorem 3.2. Suppose the conditions of Theorem 2.1 hold. Let m(β) = g(β) + h(β, α) + λε‖β‖² be strictly convex with an L⁻¹-Lipschitz continuous gradient for some L > 0 and each bounded α. Then, for any such α and a constant ϖ ∈ (0, 2L), the unique minimizer of (10) can be obtained in a finite number of iterations using iterated soft-thresholding:

1. Set n = 0 and select a bounded starting value b^(0).
2. Compute v^(n) = b^(n) − ϖ∇m(b^(n)).
3. For j = 1, …, p, set e_j^T b^(n+1) = S(e_j^T v^(n); ϖτ_j), where e_j denotes the j-th unit vector for R^p, S(x; t) = sgn(x)(|x| − t)_+ is the univariate soft-thresholding operator, and τ_j = p̃′(|α_j|; λ_j).
4. Stop if converged; else, set n = n + 1 and return to Step 2.
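The iteration in Theorem 3.2 can be sketched in a few lines. The following is a minimal illustration rather than the paper's implementation; the function names, the tolerance defaults, and the quadratic m(·) used in the test are all our own assumptions:

```python
import numpy as np

def soft_threshold(x, t):
    """Univariate soft-thresholding S(x; t) = sgn(x) * max(|x| - t, 0), elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def iterated_soft_threshold(grad_m, b0, varpi, tau, tol=1e-10, max_iter=100000):
    """Fixed-point iteration b <- S(b - varpi * grad_m(b); varpi * tau) of Theorem 3.2.

    grad_m: gradient of the smooth, strictly convex part m(beta);
    varpi:  stepsize in (0, 2L), with grad_m assumed (1/L)-Lipschitz;
    tau:    componentwise thresholds tau_j = p~'(|alpha_j|; lambda_j)."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        b_new = soft_threshold(b - varpi * grad_m(b), varpi * np.asarray(tau))
        if np.linalg.norm(b_new - b) < tol:
            break
        b = b_new
    return b_new
```

For a separable quadratic such as m(b) = ‖b‖² − c′b, the fixed point is available in closed form componentwise, which makes the sketch easy to check against a hand calculation.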
The proposed algorithm, as originally developed in Combettes and Wajs (2005), is suitable for minimizing the sum of a differentiable convex function m(·) and another convex function; hence, under similar conditions, one could employ this algorithm directly to minimize (2) in cases where the penalty (3) is derived from some convex function p̃(·; θ). Theorem 3.4 of Combettes and Wajs (2005) further shows that the update in Step 3 can be generalized to

b^(n+1) = b^(n) + δ_n [ S(b^(n) − ϖ_n∇m(b^(n)); ϖ_nτ) − b^(n) ],

where, for every n, ϖ_n ∈ (0, 2L) and δ_n ∈ (0, 1] is a suitable sequence of relaxation constants. Judicious selection of these constants, possibly updated at each step, may improve the convergence rate of this algorithm. Theorem 3.2 imposes the relatively strong condition that the gradient of m(β) is L⁻¹-Lipschitz continuous. The role of this condition, also imposed in Combettes and Wajs (2005, Proposition 3.1; Theorem 3.4), is to ensure that the update at each step of the proposed algorithm is a contraction, thereby guaranteeing its convergence to a fixed point. To see this, note that the update from b^(n) to b^(n+1) in the algorithm of Theorem 3.2 involves the mapping S(b − ϖ∇m(b); ϖτ). For any bounded b and a, it is easily shown that

‖S(b − ϖ∇m(b); ϖτ) − S(a − ϖ∇m(a); ϖτ)‖ ≤ ‖(b − a) − ϖ(∇m(b) − ∇m(a))‖.

When ∇m(b) = ∇m(a), the right-hand side reduces to ‖b − a‖, and the resulting mapping is only nonexpansive (i.e., not necessarily contractive). However, under strict convexity, this situation can occur only if b = a. In particular, suppose that b^(n) ≠ b^(n−1); then, ∇m(b^(n)) ≠ ∇m(b^(n−1)) and, using the mean value theorem,

∇m(b^(n)) − ∇m(b^(n−1)) = H(b^(n), b^(n−1)) (b^(n) − b^(n−1)),

where H(b^(n), b^(n−1)) denotes an average of second-derivative matrices of m along the segment joining the two points. The assumption that the gradient of m(β) is L⁻¹-Lipschitz continuous now implies that choosing ϖ as indicated guarantees ‖I − ϖH(b^(n), b^(n−1))‖ < 1, thereby producing a contraction.
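The nonexpansiveness invoked above is easy to confirm numerically: the univariate soft-thresholding operator is 1-Lipschitz, so |S(x; t) − S(y; t)| never exceeds |x − y|. A quick check, with function names and the random design our own rather than anything from the paper:

```python
import numpy as np

def soft_threshold(x, t):
    """S(x; t) = sgn(x) * max(|x| - t, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def max_expansion_ratio(t=0.7, trials=10000, seed=0):
    """Largest observed |S(x;t) - S(y;t)| / |x - y| over random pairs (x, y).

    Nonexpansiveness of S means this ratio can never exceed 1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=trials)
    y = rng.normal(size=trials)
    num = np.abs(soft_threshold(x, t) - soft_threshold(y, t))
    den = np.abs(x - y)
    keep = den > 1e-12          # guard against near-identical pairs
    return float(np.max(num[keep] / den[keep]))
```

The ratio equals 1 exactly when both arguments land on the same linear piece of S, which mirrors the boundary case ∇m(b) = ∇m(a) discussed in the text.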
In view of the generality of the Contraction Mapping Theorem (e.g., Luenberger and Ye, 2008, Thm. 10.2.1), it is possible to relax the requirement that ∇m(β) is globally L⁻¹-Lipschitz continuous provided that one selects a suitable starting point. The relevant extension is summarized in the corollary below; one may prove this result in a manner similar to Theorem 4.5 of Hale et al. (2008). Corollary 3.3. Let the conditions of Theorem 2.1 hold. Suppose α is a bounded vector and assume that m(β) = g(β) + h(β, α) + λε‖β‖² is strictly convex and twice continuously differentiable. Then, for a given bounded α, there exists a unique minimizer β*. Let Ω be a bounded convex set containing β* and define λ_max(β) to be the largest eigenvalue of ∇²m(β). Then, the algorithm of Theorem 3.2 converges to β* in a finite number of iterations provided that b^(0) ∈ Ω, λ* = max_{β∈Ω} λ_max(β) < ∞, and ϖ ∈ (0, 2/λ*).
Some useful insight into the form of the proposed thresholding algorithm can be gained by considering the behavior of the penalty derivative term p̃′(r; θ). As suggested earlier, (P1) implies that p̃′(r; θ) decreases from its maximum value towards zero as r moves away from the origin. For some penalties (e.g., SCAD, MCP), this derivative actually becomes zero at some finite value of r > 0, resulting in situations for which τ_j = p̃′(|α_j|; λ_j) = 0 for at least one j. If this occurs for component j, then the j-th component of the vector S(b^(n) − ϖ∇m(b^(n)); ϖτ) simply reduces to the j-th component of the argument vector b^(n) − ϖ∇m(b^(n)). In the extreme case where τ = 0, the proposed update reduces to b^(n+1) = b^(n) − ϖ∇m(b^(n)), an inexact Newton step in which the inverse hessian matrix is replaced by ϖI_p, I_p denoting the p × p identity matrix, and with step-size chosen to ensure that this update yields a contraction. Hence, if each of the components of b^(n) − ϖ∇m(b^(n)) is sufficiently large in magnitude, the proposed algorithm simply takes an inexact Newton step towards the solution; otherwise, one or more components of this Newton-like update are subject to soft-thresholding.

Penalized estimation for generalized linear models
The combination of Theorems 2.1 and 3.2 and Corollary 3.3 leads to a useful and stable class of algorithms with the ability to deal with a wide range of penalized regression problems. In settings where g(β) is strictly convex and twice continuously differentiable, one can safely assume that h(β, α) = 0 for all choices of β and α provided that p̃′(r; θ) in (P1) is strictly positive for r > 0; important examples of statistical estimation problems here include many commonly used linear and generalized linear regression models, semiparametric Cox regression (Cox, 1972), and smoothed versions of the accelerated failure time regression model (cf. Johnson and Strawderman, 2009). The SCAD and MCP penalizations, as well as other penalties having p̃′(r; θ) ≥ 0 for r > 0, can also be used; however, additional care is required. In particular, and as pointed out in an earlier remark, if one sets h(β, α) = 0 for all β and α, then convergence of the resulting algorithm to a stationary point is no longer guaranteed by the above results due to the resulting failure of these penalties to induce strict majorization.
The need to use an iterative algorithm for repeatedly minimizing (10) is not unusual for the class of MM algorithms. However, it turns out that for certain choices of g(β), a suitable choice of h(β, α) in Theorem 3.2 both guarantees strict majorization and permits one to minimize (10) in a single iteration, resulting in a single soft-thresholding update at each step of the MM algorithm. Below, we demonstrate how the MIST algorithm simplifies in settings where g(β) corresponds to the negative loglikelihood function of a canonically parameterized generalized linear regression model having a bounded hessian function. The result applies to all penalties satisfying condition (P1), including SCAD and MCP. A proof is provided in Appendix A.4.

For each bounded α,
(a) the minimizer β* of ξ_SUR^glm(β, α) is unique and satisfies a componentwise soft-thresholding fixed-point equation. In view of previous results, part 3 of Theorem 3.4 shows that the resulting MM algorithm takes a very simple form: given the current iterate β^(n), 1. update the unpenalized intercept β_0^(n); 2. update the remaining parameters β^(n) componentwise by soft-thresholding. The specific choice of the function h(β, α) clearly serves two useful purposes: (i) it leads to componentwise soft-thresholding; and (ii) it leads to strict majorization, as required in condition (iii) of Theorem 2.1, allowing one to establish the convergence of MIST for SCAD and other penalties having p̃′(r; θ) = 0 at some finite r > 0.
Evidently, the algorithm above immediately covers the setting of penalized linear regression. For example, suppose that y has been centered to remove β_0 from consideration and that the problem has also been rescaled so that X, which is now N × p, satisfies the indicated conditions. Then, the results of Theorem 3.4 apply directly, with ϖ defined as in Corollary 3.3. For the class of adaptive elastic net penalties (i.e., p̃(r; λ_j) = λω_j r in (3)), the resulting iterative scheme is exactly that proposed in De Mol et al. (2008, pg. 17), specialized to the setting of a Euclidean parameter. In particular, we have τ_j = λω_j and γ_j = 0 in Theorem 3.4, and the proposed update reduces to a componentwise soft-thresholding iteration with ν = 2ϖ⁻¹. Setting ν = 1 and ε = 0 yields the iterative procedure proposed in Daubechies et al. (2004), provided that X′X is scaled such that I − X′X is positive definite. The proposed MIST algorithm extends these iterative componentwise soft-thresholding procedures to a much wider class of penalty and data fidelity functions. The restriction to canonical generalized linear models in Theorem 3.4 is imposed to ensure strict convexity of the negative loglikelihood. Our results are easily modified to handle non-canonical generalized linear models, provided the negative loglikelihood remains strictly convex in β and the hessian can be appropriately bounded. Interestingly, not all canonically parameterized generalized linear models satisfy the regularity conditions of Theorem 3.4. One important class of such problems is penalized likelihood estimation for Poisson regression models. For example, in the classical setting of N independent Poisson observations with the canonical (log) link, it is easy to see that ∇ℓ(β), hence ∇m(β), is locally but not globally Lipschitz continuous; hence, it is not possible to choose a matrix C = ϖ⁻¹I such that (14) everywhere majorizes ξ_glm(β). Nevertheless, progress remains possible.
For example, Corollary 3.3 implies that one can still use a single update of the form (17) provided that a suitable Ω, hence C and β^(0), can be identified. Alternatively, using results summarized in Becker et al. (1997), one can instead majorize −ℓ(β) for any bounded α using a function k(β, α) = Σ_j k_j(β_j; α_j), where, for every i, the constants θ_{ij} ≥ 0 satisfy Σ_{j=0}^p θ_{ij} = 1 and θ_{ij} > 0 if x_{ij} ≠ 0. Of importance here is the fact that k_j(β_j; α_j) is a strictly convex function of β_j and does not depend on β_k for k ≠ j. One may now take h(β, α) in Theorem 2.1 as being equal to k(β, α) + ℓ(β), leading to the minimization of (18). In particular, componentwise soft-thresholding is replaced by componentwise minimization of (18), the latter being possible using any algorithm capable of minimizing a continuous nonlinear function of one variable.
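To make the linear-model specialization above concrete, here is a sketch of the basic iterative soft-thresholding scheme for lasso-penalized least squares (the special case p̃(r; λ_j) = λr with ε = 0); the stepsize follows the spirit of Corollary 3.3 by bounding the largest eigenvalue of X′X, though the function names and defaults are our own rather than the paper's:

```python
import numpy as np

def soft_threshold(x, t):
    """S(x; t) = sgn(x) * max(|x| - t, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ist(X, y, lam, n_iter=5000):
    """Iterative soft-thresholding for (1/2)||y - Xb||^2 + lam * sum_j |b_j|.

    The stepsize 1/L, with L the largest eigenvalue of X'X (the Lipschitz
    constant of the gradient), plays the role of varpi; each sweep is a
    single componentwise soft-thresholding update."""
    L = float(np.linalg.eigvalsh(X.T @ X).max())
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)                   # gradient of least squares fidelity
        b = soft_threshold(b - grad / L, lam / L)  # componentwise update
    return b
```

When X has orthonormal columns, the solution is available in closed form as S(X′y; λ), which provides a simple sanity check on the iteration.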
Remark 3.5. The Cox proportional hazards model (Cox, 1972), while not a generalized linear model, shares the essential features of the generalized linear model utilized in Theorem 3.4. In particular, the negative log partial likelihood, say g(β) = −ℓ_p(β), is strictly convex, twice continuously differentiable, and has a bounded hessian (e.g., Bohning and Lindsay, 1988; Andersen et al., 1993). Consequently, Theorem 3.4 and its proof are easily modified for this setting upon taking g(β) as indicated and replacing the hessian bound by one appropriate to the negative log partial likelihood.

Accelerating Convergence
Similarly to the EM algorithm, the stability and simplicity of the MM algorithm frequently come at the price of a slow convergence rate. Numerous methods of accelerating the EM algorithm have been proposed in the literature; see McLachlan and Krishnan (2008) for a review. Recently, Varadhan and Roland (2008) proposed a new method for EM called SQUAREM, obtained by "squaring" an iterative Steffensen-type (STEM) acceleration method. Both STEM and SQUAREM are structured for use with iterative mappings of the form θ_{n+1} = M(θ_n), n = 0, 1, 2, …, and hence are applicable to both the EM and MM algorithms. Specifically, the acceleration update for SQUAREM is given by

θ_{n+1} = θ_n − 2γ_n r_n + γ_n² v_n,

where r_n = M(θ_n) − θ_n and v_n = (M(M(θ_n)) − M(θ_n)) − r_n for an adaptive steplength γ_n. Varadhan and Roland (2008) suggest several steplength options, with a preference for the choice γ_n = −‖r_n‖/‖v_n‖; see also Roland and Varadhan (2005).
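A single SQUAREM step, in the notation above, can be sketched as follows; the trailing stabilizing application of M and the guard against v_n = 0 are implementation choices of ours rather than details from the paper:

```python
import numpy as np

def squarem_step(M, theta):
    """One SQUAREM acceleration step for a fixed-point mapping theta -> M(theta).

    Uses r_n = M(theta_n) - theta_n, v_n = (M(M(theta_n)) - M(theta_n)) - r_n,
    and the steplength gamma_n = -||r_n|| / ||v_n|| preferred by
    Varadhan and Roland (2008)."""
    m1 = M(theta)
    m2 = M(m1)
    r = m1 - theta                       # first difference
    v = (m2 - m1) - r                    # second difference
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:                    # at (or stalled near) a fixed point
        return m2
    gamma = -np.linalg.norm(r) / norm_v
    theta_acc = theta - 2.0 * gamma * r + gamma**2 * v
    return M(theta_acc)                  # stabilizing fixed-point step
```

For a one-dimensional affine contraction the extrapolation is exact: a single accelerated step lands on the fixed point, whereas the plain iteration θ_{n+1} = M(θ_n) converges only geometrically.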

Simulation Results
The simulation results summarized below are intended to compare the estimates of β obtained from existing methods to those obtained using the simplified MIST algorithm of Theorem 3.4. In particular, we consider the performance of MIST for the class of penalized linear and generalized linear models, demonstrating its capability of recovering the solutions provided by existing algorithms when both algorithms are forced to use the same set of "tuning" parameters (i.e., penalty parameter(s), plus any additional parameters required to define the penalty itself). In cases where multiple local minima can arise, we further show that the MIST algorithm often tends to find solutions with lower objective function evaluations in comparison with existing algorithms, provided these algorithms utilize the same choice of starting value.

Example 1: Linear Model
Let 1_m and 0_m respectively denote m-dimensional vectors of ones and zeros. Then, following Zou and Zhang (2009), we generated data from the linear regression model considered therein, in which x follows a p-dimensional multivariate normal distribution with zero mean and covariance matrix Σ whose elements are governed by a correlation parameter ρ. Penalized least squares estimation is considered for five popular choices of penalty functions, all of which are currently implemented in the R software language (R Development Core Team, 2005): the lasso (LAS), adaptive lasso (ALAS), elastic net (EN), adaptive elastic net (AEN), and SCAD. The LAS, ALAS, EN and AEN penalties are all convex and lead to unique solutions under mild conditions; the SCAD penalty is concave and the resulting minimization problem may have multiple solutions. In each case, we used existing software for computing solutions subject to these penalizations and compared those results to the solutions computed using the MIST algorithm.
Regarding existing methods, we respectively used the lars (Hastie and Efron, 2007) and elasticnet (Zou and Hastie, 2008) packages for computing solutions in the case of the LAS and EN penalties. For the ALAS and AEN penalties, we used software kindly provided by Zou and Zhang (2009) that also makes use of the elasticnet package. The weights for the AEN penalty are computed using ω_j = |β_j^EN|^{−γ}, j = 1, …, p, where β^EN is an EN estimator and γ is a positive constant. Using EN-based weights in the AEN fitting algorithm necessitates tuning parameter specification for both EN and AEN. As in Zou and Zhang (2009), the ℓ_1 parameters λ (λ_1 in their notation) are allowed to differ, whereas the ℓ_2 parameters ε (λ_2 in their notation) are forced to be the same. Evidently, setting ε = 0 (λ_2 = 0) yields the ALAS solution. For the SCAD penalty, we considered the estimator of Kim et al. (2008) (HD), as well as the one-step SCAD (1S) and LLA estimators of Zou and Li (2008). The code for the first two methods was kindly provided by their respective authors; the LLA estimator was computed using the R package SIS. The choice a = 3.7 was used for all implementations of SCAD. We considered finding solutions using penalties in the set Λ = {0.1, 1, 5, 10, 20, 100}. In particular, for LAS and SCAD, λ = λ_1 ∈ Λ. For EN, both λ = λ_1 ∈ Λ and λε = λ_2 ∈ Λ. For simplicity, we fixed the weights for AEN for a given λ_2 by selecting the 'best' β^EN among the six estimators involving λ = λ_1 ∈ Λ based on a BIC-like criterion. Likewise, for ALAS, the weights were computed using the 'best' β^LAS among the six estimators involving λ = λ_1 ∈ Λ. The parameter γ for the ALAS and AEN penalties was respectively set to three and five for p = 35 and p = 81. For the strictly convex objective functions associated with the LAS, ALAS, EN, and AEN penalties, we simply used a starting value of β^(0) = 0_p.
For SCAD, three different starting values for the MIST, HD, and LLA SCAD algorithms were considered: β^(0) = 0_p, β^(0) = β_ml (i.e., the unpenalized least squares estimate), and β^(0) = β_{1S,λ} (i.e., the one-step estimate computed using the penalty λ). As in Zou and Li (2008), the one-step estimator 1S is computed using β_ml, an appropriate choice when N > p.
The convergence criteria used by the existing software packages were used without alteration in this simulation study. The convergence criteria used for MIST were as follows: the algorithm stopped if either (i) the normed difference of successive iterates was less than 10⁻⁶ (convergence of coefficients); or (ii) the difference of the objective function evaluated at successive iterates was less than 10⁻⁶ and the number of iterations exceeded 10⁶ (convergence of optimization). Due to the large number of comparisons and the highly intensive nature of the computations, we ran B = 100 simulations for each choice of ρ, σ, and p. We report the results for the convex penalties in Table 1 and those for the SCAD penalty in Tables 2 and 3.
In Table 1, we summarize the average normed difference between the solution obtained using existing software and that obtained using MIST, ‖β_exist − β_mist‖, over the B = 100 simulations; in particular, we report in the two leftmost panels the maximum value of this difference, computed across all combinations of tuning parameters. These maximum differences (all of which are multiplied by 10⁵) are remarkably small for all (A)LAS and (A)EN penalties, indicating that MIST recovers the same (unique) solutions as the existing algorithms. Interestingly, the values for LAS are slightly larger than the rest, where the maximum differences all resulted from the smallest value of λ considered (λ = 0.1). In these cases, the algorithm tended to stop using the objective function criterion rather than the (stricter) coefficient criterion, resulting in slightly larger differences on average.
The results for SCAD are reported in Tables 2 (p = 35) and 3 (p = 81), which display (i) the average normed differences, multiplied by 10^3, for each combination of λ, ρ, σ, p, and starting value; and (ii) the proportion of simulated datasets in which the MIST solution yields a lower evaluation of the objective function in comparison with the solution obtained using another method for the indicated choice of starting value. The results for λ = 100 are not shown, as the solution was 0_p in all cases. In comparison with the convex penalties, larger normed differences are observed, even when controlling for the use of the same starting value. Such differences are a result of two important features of the SCAD optimization problem: (i) the possible existence of several local minima; and (ii) the fact that the MIST, HD, and LLA algorithms each take a different path from a given starting value towards one of these solutions. For example, while each of the LLA, MIST, and HD algorithms involves majorization of the objective function using a lasso-type surrogate objective function, both the majorization and the minimization of the resulting surrogate function are carried out differently in each case. In particular, the LLA algorithm, as implemented in SIS, majorizes only the penalty term and adapts the lasso code in glmpath in order to minimize the corresponding surrogate objective function at each step. The HD algorithm is similar in spirit, but instead decomposes the penalty term into a sum of a concave and a convex function and utilizes the algorithm of Rosset and Zhu (2007) to minimize the corresponding surrogate objective function. The MIST algorithm instead uses the same penalty majorization as the LLA algorithm, but additionally majorizes the negative loglikelihood term in a way that permits minimization of the surrogate function in a single soft-thresholding step.
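The single soft-thresholding update just described can be illustrated for the simplest case, lasso-penalized least squares. The following is a minimal Python sketch of this style of update (the function names and the specific separable majorization are illustrative, not the paper's exact MIST algorithm):

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mist_step_ls(beta, X, y, lam, L=None):
    """One illustrative MM update for lasso-penalized least squares.

    The term (1/2)||y - X beta||^2 is majorized at beta by a separable
    quadratic with curvature L >= lambda_max(X'X); the surrogate is then
    minimized exactly by a single soft-thresholding step, with no matrix
    inversion required.
    """
    if L is None:
        L = np.linalg.eigvalsh(X.T @ X).max()
    grad = X.T @ (X @ beta - y)
    return soft_threshold(beta - grad / L, lam / L)
```

Because the surrogate majorizes the objective, iterating this update yields a monotonically non-increasing objective sequence from any starting value, which is the descent property exploited throughout the paper.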
Each procedure therefore takes a different path towards a solution, even when given the same starting value. We remark here that differences must also be expected between any of LLA, HD, MIST and the one-step solution 1S; from an optimization perspective, the one-step estimate is the result of running just one iteration of the LLA algorithm, starting from the unpenalized least squares estimator β_ml (Zou and Li, 2008), and only provides an approximation to the solution of the desired minimization problem. All other methods (LLA, MIST, HD) iterate until some local minimizer (or stationary point) is reached. For example, when using either β_ml or β_1S,λ as the starting value, MIST always found a solution that produced a lower evaluation of the objective function in comparison to β_1S,λ. However, when using the null starting value of 0_p, the one-step estimator did occasionally result in a lower objective function evaluation in cases involving smaller values of λ. This behavior is not terribly surprising; with small λ, the one-step solution should generally be close to the unpenalized least squares solution, as the objective function itself is likely to be dominated by the least squares term.
Of all the SCAD algorithms considered here, MIST and LLA tended to find the most similar solutions (i.e., have the smallest normed differences). For the cases in which the LLA solution had lower objective function evaluations, all of the MIST solutions were also LLA solutions; i.e., when starting the LLA algorithm with the MIST solution, the algorithm terminated at the starting value (the LLA solution coincides with the MIST solution). With the exception of three of these cases, starting the MIST algorithm with the LLA solution also resulted in the same behavior. For the most part, the HD and MIST algorithms also gave similar results, with one source of difference being the respective stopping criteria used. The stopping criteria for HD, assessed in order, are as follows: (1) 'convergence of optimization': stop if the absolute value of the difference of the objective evaluated at successive iterates is less than 10^{-6}; (2) 'convergence of penalty gradient': stop if the sum of the absolute values of the differences of the derivative of the centered penalty evaluated at successive iterates is less than 10^{-6}; (3) 'convergence of coefficients': stop if the sum of the absolute values of the differences of successive iterates is less than 10^{-6}; and (4) 'jump-over' criterion: stop if the objective at the previous iterate plus 10^{-6} is less than the objective at the current iterate. After careful analysis of the results, we can assert the following: • The MIST solution usually has the same or a lower evaluation of the objective function in comparison with HD, regardless of starting value.
• HD tends to have the greatest difficulty in cases of high correlation between predictors, a likely result of the fact that this algorithm relies on the variance of the unpenalized least squares estimator, and hence on matrix inversion, to take steps towards a solution. In contrast, MIST requires no matrix inversion.
On balance, the MIST algorithm performs as well as or better than LLA and HD in locating minimizers in nearly all cases. As suggested above, variation in the solutions found can be traced to the path each algorithm takes towards a solution and to differences in stopping criteria. Remarkably, in cases where the correlation among predictors was low, the choice of starting value made little difference for MIST: either the same solution was found for all starting values, or no starting value dominated in terms of finding lower or equivalent objective evaluations. In settings involving higher correlation, however, using either 0_p or the 1S starting value tended to result in lower evaluations of the objective function in comparison with using the unpenalized least squares solution. Similar behavior was observed for the LLA algorithm. In contrast, the choice of starting value had a much larger impact on the performance of the HD estimator; in particular, the use of 0_p as a starting value typically resulted in the lowest objective function evaluations when compared to using a non-null starting value.

Example 2: Binary Logistic Regression
As in Example 1, we considered the LAS, ALAS, EN, AEN, and SCAD penalties. There are at least two R packages that allow penalization using the LAS and EN penalties: glmpath (Park and Hastie, 2007), which handles binomial and Poisson regression using a "predictor-corrector" method, and glmnet (Friedman et al., 2008), which handles binomial and multinomial regression using cyclical coordinate descent. Both methods can be tuned to find the same solutions, so for ease of presentation we only consider the results of glmnet for comparison in the tables and analysis below. The SIS package (Fan et al., 2009a) permits computations with the ALAS, AEN, and SCAD penalties using modifications of the Park and Hastie (2007) code. For SCAD, we compared the results of MIST to the results from the one-step (1S) algorithm (GLM version; Zou and Li, 2008), using the code provided by the authors, and the LLA algorithm as implemented in Fan et al. (2009a). As before, we only considered comparing solutions that use the same combination of tuning parameters; for the present example, the set considered here is Λ = {0.001, 0.01, 0.05, 0.1, 0.2, 1.00}, reflecting a need to accommodate the different scaling of the problem.

Table 2: Algorithm performance in Example 1 (LM: p = 35, N = 100) for the SCAD penalty. The column 'avg' is the average normed difference (×10^3) between the MIST solution and the existing method's solution; 'prop' is the proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing method's solution.

Table 3: Algorithm performance in Example 1 (LM: p = 81, N = 100) for the SCAD penalty. The column 'avg' is the average normed difference (×10^3) between the MIST solution and the existing method's solution; 'prop' is the proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing method's solution.
The data generation scheme for this example was loosely based on the simulation study found in Friedman et al. (2008). Binary response data were generated according to a logistic (rather than linear) regression model with q nonzero regression coefficients and the remaining 100 − q components set to zero. Here, x_i follows a p-dimensional multivariate normal distribution with zero mean and covariance Σ = 3^{-2} P, where the correlation matrix P is such that each pair of predictors has the same population correlation ρ. We considered three such correlations, ρ ∈ {0.0, 0.5, 0.75}.
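The scheme above can be sketched as follows; this is an illustration under the stated covariance structure only, since the exact coefficient vector used in the paper is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_logistic(N, p, rho, beta):
    """Simulate x_i ~ N_p(0, Sigma) with Sigma = 3^{-2} * P, where P is the
    equicorrelation matrix (1 on the diagonal, rho elsewhere), and binary
    responses from the logistic model P(y = 1 | x) = 1 / (1 + exp(-x'beta)).
    """
    P = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    Sigma = P / 9.0                      # 3^{-2} * P
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, prob)
    return X, y
```

The factor 3^{-2} keeps each predictor's variance at 1/9, matching the scaling remark above about the smaller tuning-parameter grid needed in this example.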
For the B = 100 simulations, the maximum (across different tuning parameters) average normed differences between the existing and proposed methods, multiplied by 10^5, are reported for each of the strictly convex cases in the rightmost panel of Table 1. As before, these maxima are generally remarkably small, indicating that MIST can recover the same (unique) solutions as the existing algorithms. The results for SCAD are reported in Table 4, which displays the same information as the corresponding tables from Example 1; the HD comparisons are omitted here, as that methodology and code were developed only for the case of penalized least squares. In the GLM setting, the 1S estimator is computed by applying the LARS (Efron et al., 2004) algorithm to a quadratic approximation of the negative loglikelihood function evaluated at the MLE. Thus, 1S takes 'one step' towards minimizing the objective function; in contrast, both MIST and LLA iterate until a stationary point, usually a local minimizer, is found. As in the linear model case, LLA uses glmpath to minimize the surrogate at each step, whereas the MIST algorithm uses a single application of the soft-thresholding operator to minimize the surrogate at each step.
In this example, the starting value carried even greater importance in comparison with the linear model setting. In particular, in the case of MIST, the combination of a 0_p starting value and a small penalty parameter led to solutions with objective function evaluations that were substantially larger in comparison with those obtained using either β_ml or β_1S,λ. Such behavior may be directly attributed to the fact that the ML and 1S starting values either minimize or nearly minimize the negative loglikelihood portion of the objective function, the dominant term in the objective function when λ is "small." In contrast, a 0_p starting value led to the best minimization performance for "large" λ; upon reflection, this is also not very surprising, since large penalties induce greater sparsity and 0_p is the sparsest possible solution.
There were a few settings in which the 1S estimator resulted in a lower objective function evaluation in comparison with applying MIST started at β ml . This reflects the fact that several local minima can exist for non-convex penalties like SCAD. In addition, and as was observed before, using the 1S solution as a starting value always led to MIST finding a solution with a lower evaluation of the objective function in comparison with that provided by the 1S solution. Regarding the use of LLA, which also requires a starting value specification, we again examined the cases for which LLA resulted in lower objective function evaluations. For these cases, all MIST solutions were LLA solutions, and all LLA solutions were MIST solutions with the exception of one. Hence, both methods find valid, if often different, solutions, a behavior that we again attribute to the differences in paths taken towards a solution. Table 4: Algorithm performance in Example 2 (GLM) for SCAD penalty. The column 'avg' is the average normed differences (×10 3 ) between the MIST solution and the existing method's solution ; 'prop' is the proportion of MIST solutions whose objective function evaluation was less than or equal to that of the existing method's solution.

Effectiveness of SQUAREM 2
We explored the effectiveness of SQUAREM 2 , defined in Section 3.3, when applied to several simulated datasets taken from the previous two simulation studies. Table 5 indicates the relative reduction in elapsed time ('RRT') and the number of MM updates, i.e., invocations of the mapping M(·), required for the original and SQUAREM 2 -accelerated algorithms to converge for five randomly chosen simulation datasets across the five penalty functions. The SQUAREM 2 algorithm converged without difficulty in these cases and required substantially fewer MM updates than the original algorithm; the percent reduction in time was as high as 96%. We remark here that the regularity conditions imposed in Roland and Varadhan (2005) and Varadhan and Roland (2008), particularly smoothness conditions, are not satisfied in this particular class of examples. Hence, while the simulation results are certainly very promising, the question of convergence (and its associated rate) of SQUAREM 2 in this class of problems remains an interesting open problem.
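The acceleration itself can be sketched generically: given the MM fixed-point map M(·), a squared-extrapolation step combines two ordinary MM updates. The steplength below is one standard choice (scheme S3 of Varadhan and Roland, 2008); the paper's SQUAREM 2 variant and its safeguards may differ in detail.

```python
import numpy as np

def squarem_step(x, M, eps=1e-12):
    """One squared-extrapolation (SQUAREM-type) step for a fixed-point map M."""
    x1 = M(x)
    x2 = M(x1)
    r = x1 - x                       # first difference
    v = (x2 - x1) - r                # second difference
    nv = np.linalg.norm(v)
    if nv < eps:                     # already (numerically) at a fixed point
        return x2
    alpha = -np.linalg.norm(r) / nv  # steplength, scheme S3
    x_acc = x - 2.0 * alpha * r + alpha**2 * v
    return M(x_acc)                  # stabilizing M-update after extrapolation
```

For a linear contraction such as M(x) = 0.5 x + 1 (fixed point 2), a single accelerated step already lands on the fixed point, illustrating why the MM-update counts in Table 5 drop so sharply.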

Example: Identifying genes associated with the survival of lymphoma patients
Diffuse large-B-cell lymphoma (DLBCL) is an aggressive type of non-Hodgkin lymphoma and is one of the most common forms of lymphoma occurring in adults. Rosenwald et al. (2002) utilized Lymphochip DNA microarrays, specialized to include genes known to be preferentially expressed within the germinal centers of lymphoid organs, to collect and analyze gene expression data from 240 biopsy samples of DLBCL tumors. For each subject, 7399 gene expression measurements were obtained. The expression profiles along with corresponding patient information can be downloaded from their supplemental website http://llmpp.nih.gov/DLBCL/. Since the original profiles had some missing expression measurements, we used the dataset subsequently analyzed by Li and Gui (2004), which estimated the missing values using a nearest neighbor approach. During the time of followup, 138 patient deaths were observed with a median death time of 2.8 years. Rosenwald et al. (2002) used hierarchical clustering to group the genes into four gene-expression signatures: Proliferation (PS), which includes cell-cycle control and checkpoint genes, and DNA synthesis and replication genes; Major Histocompatibility Complex Class II (MHC), which includes genes involved in antigen presentation; Lymph Node (LNS), which includes genes encoding for known markers of monocytes, macrophages, and natural killer cells; and Germinal Center B (GCB), which includes genes that are characteristic of germinal center B cells; see Alizadeh et al. (2000) for more information on gene signatures. Based on the gene clusters, they built a Cox proportional hazards model (Cox, 1972, 1975) to predict survival outcomes in the DLBCL patients. Subsequently, this dataset has been analyzed numerous times, typically to evaluate methods related to subgroup identification and/or survival prediction (e.g., Li and Gui, 2004; Gui and Li, 2005b,a; Li and Luan, 2005; Annest et al., 2009; Engler and Li, 2009; Tibshirani, 2009).
Here, we instead focus on the performance of two different penalties, namely SCAD and MCP, with regard to the identification of genes associated with DLBCL survival. The simulation results of Zhang (2007) suggest that MCP has superior selective accuracy over the SCAD penalty, at least for the case of a linear model. There, selection accuracy was measured as the proportion of simulation replications with correct classification of both the zero and non-zero coefficients, with MCP outperforming SCAD in all simulation settings. To illustrate the utility and flexibility of the MIST algorithm, we reanalyzed the DLBCL data, fitting a penalized Cox regression model respectively using SCAD and MCP penalty functions, and running these procedures in combination with the Iterative Sure Independence Screening procedure (ISIS, Fan et al., 2009b) in order to ensure that the dimension of the parameter space was maintained at a manageable level. For SCAD, we considered both the 1S and LLA estimators. The existing optimization functions provided in the SIS package for the ISIS procedure were used for the 1S estimator, whereas relevant modifications to the ISIS code were made in order to accommodate the fully iterative LLA and MCP estimators. Optimization at each step of the ISIS algorithm in the case of the MCP penalty utilized the MIST algorithm, as we are aware of no other algorithm capable of fitting the Cox regression model subject to MCP penalization. The default settings in the SIS package were used to determine the maximum number of predictors ([n/(4 log n)] = 10) and to define the additional ISIS parameters (e.g., use of the unpenalized MLE as a starting value, ranking method, tuning parameter selection) for all three analyses (1S-SCAD, LLA-SCAD, MIST-MCP). The parameter a was set to 3.7 for all analyses; hence, only the selection of λ required any tuning. Table 6 displays the 11 genes identified by at least one of the three analyses.
The x's in a given column indicate the genes with non-zero coefficients resulting from the corresponding penalization. The final column provides references for genes previously linked to DLBCL in the literature. Genes belonging to the original Rosenwald et al. (2002) gene expression signatures are indicated with parenthetical initials. Note that the references provided are not meant to be an exhaustive list, but instead to demonstrate the relevance of certain genes and/or their altered expression levels in DLBCL survival.
Interestingly, the LLA-SCAD and MIST-MCP penalizations selected the same subset of genes, with a nearly complete overlap with those selected by the 1S-SCAD penalization. The number of genes selected in each case is 10, the maximum specified by ISIS; 9 of these were shared across the three penalizations. According to an NCBI Entrez Gene search (http://www.ncbi.nlm.nih.gov/), many of these genes are biologically relevant. For example, CDK7 codes for a protein that regulates cell cycle progression and is represented in the Proliferation Signature, although reported under a different Lymphochip ID, as this gene was spotted multiple times on the array. Also members of the Proliferation Signature are SEPT1, coding for a protein involved in cytokinesis, and BUB3, coding for a mitotic checkpoint protein. DNTTIP2 regulates transcriptional activity of DNTT, a gene for a protein expressed in a restricted population of normal and malignant pre-B and pre-T lymphocytes during early differentiation. HLA-DRA, a member of the MHC Signature, plays a central role in the immune system and is expressed in antigen-presenting cells, such as B lymphocytes, dendritic cells, and macrophages. From the GCB Signature, the ESTs weakly similar to thyroxine-binding globulin precursor are highly cited. Additionally, RFTN1 plays a pivotal role in regulating B-cell antigen receptor-mediated signaling (Saeki et al., 2003).
A description of AI568329 was not provided in the original dataset, thus its function is unknown. Similarly, although cited at least twice, a description for AA830781 was also not provided in the original dataset. However, both of these may be related to lymphoma or risk of death from lymphoma, as it is possible that these genes (and potentially others) were selected because of coexpression or correlation with other relevant genes.
Interestingly, the two genes not commonly identified across the three penalizations were both cited in Martinez-Climent et al. (2003), who found altered gene expression of TSC22D3 and ITGAL (both involved in a variety of immune phenomena) in one case who initially presented with follicle center lymphoma and subsequently transformed to DLBCL. The results of this analysis demonstrate equivalence in selection performance between MCP and LLA-SCAD for the case of the Cox proportional hazards model. Increasing the maximum number of predictors to 21 again resulted in equivalent selection performance between MCP and LLA-SCAD, with 21 predictors ultimately selected (results not shown). The 1S estimator also resulted in the selection of 21 predictors, but with increased dissimilarity between MCP/LLA-SCAD and 1S: only 13 of the 21 genes were selected by all three methods. It should be noted that Zhang (2007) did not use any form of iterative variable selection (e.g., ISIS) in his comparisons between SCAD and MCP for the case of the linear model; in addition, Zhang (2007) fixed values for both penalty parameters in his simulations and did not use a = 3.7. Thus, use of the ISIS procedure, the particular method used for selecting λ, and the use of a = 3.7 (as suggested in Fan et al. (2009b)) in both the MCP and SCAD penalties may all play a role in the results summarized above.
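For reference, the two penalties compared in this analysis have the following standard forms (SCAD from Fan and Li, 2001; MCP from Zhang, 2007), sketched here in Python on the absolute coefficient value:

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty evaluated at |t|, with a > 2: linear near zero,
    quadratic in the middle, and constant beyond a*lam."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 lam**2 * (a + 1) / 2))

def mcp_penalty(t, lam, a):
    """MCP penalty evaluated at |t|, with a > 1: quadratic up to a*lam,
    then constant at a*lam^2/2."""
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t**2 / (2 * a), a * lam**2 / 2)
```

Both penalties coincide with the lasso near the origin but level off for large coefficients, which is the source of the (near-)unbiasedness that motivates their use over the LAS penalty.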

Discussion
This paper proposed a versatile and general algorithm capable of dealing with a wide variety of nonsmoothly penalized objective functions, including but not limited to all presently popular combinations of data fidelity and penalty functions. We established a suitable convergence theory, as well as new results on the convergence of general MM algorithms. We also demonstrated the remarkable effectiveness of the SQUAREM 2 acceleration procedure in these problems as a tool for accelerating the slow but steady convergence of the proposed class of MM algorithms. Beyond specification of the penalty parameter(s) λ, virtually no effort was expended in tuning or otherwise specializing the MIST algorithm for solving a given problem. Thus, at the expense of greater analytical work, the convergence rate of the MIST algorithm can likely be improved. Through the use of relaxation techniques and other methods for controlling the step-size behavior (e.g., line searches) of MIST, we further expect that the local convergence theory presented here can be made global in nature.
The simulation results of this paper highlight the fact that nonconvex penalties tend to endow the corresponding objective function with multiple local minima. The resulting sensitivity of computational algorithms to the choice of starting value, while known, has arguably been deemphasized in the current literature. In this regard, the one-step method of Zou and Li (2008) provides a meritorious choice of starting value for fully iterative SCAD-based algorithms. In addition to being unique under mild regularity conditions, it is easily generalized to other nonconvex penalties, such as MCP. Unfortunately, the utility of this approach for identifying starting values is also limited to settings where N > p, for the justification of the 1S estimator relies heavily on the use of the unpenalized MLE as its starting value.
The simulated examples in this paper only consider settings with N > p, mainly to ensure that m(β) is strictly convex. Specifying ε > 0 in the ridge-like penalty term ensures that m(β) is strictly convex provided only that g(β) + h(β, α) is convex, as might be encountered in cases where N < p. Thus, for example, one might consider combining the ridge term with any penalty satisfying condition (P1) (e.g., SCAD), providing alternatives to the elastic net penalty; our results on the convergence of the proposed algorithms to some stationary point of the objective function would continue to apply in this setting. It would be interesting to investigate the statistical properties of estimators derived under such combinations in settings where p > N but p_0 ≪ N, with p_0 denoting the number of "important" predictors.
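Written out, the suggestion above amounts to augmenting the objective with a ridge term (a sketch in generic notation; the paper's exact definition of m(β) and its constant conventions may differ):

\[
m(\beta) \;=\; g(\beta) \;+\; \sum_{j=1}^{p} p(|\beta_j|;\lambda_j) \;+\; \frac{\epsilon}{2}\,\|\beta\|_2^2,
\qquad \epsilon > 0,
\]

where p(·; λ) is any penalty satisfying condition (P1), such as SCAD. The ridge term is strictly convex, so the sum is strictly convex whenever the remaining terms are convex, even when N < p.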

A Appendix
This appendix is divided into several sections. Section A.1 reviews and extends the convergence theory for the EM algorithm established in Wu (1983); the extension utilizes results of Meyer (1976) to establish stronger convergence results for general MM algorithms. Section A.2 contains the proof of Theorem 2.1 and makes direct use of these new convergence results. Finally, Sections A.3 and A.4 respectively contain the proofs of Theorems 3.2 and 3.4, establishing the convergence of iterated soft thresholding when used to minimize (10) and convergence of the proposed class of MIST algorithms in the case of the generalized linear model.

A.1 Local convergence of MM algorithms in nonsmooth problems
Using convergence theory for algorithms derived from point-to-set maps developed by Zangwill (1969), Wu (1983) established the convergence of the EM algorithm assuming twice differentiability of the loglikelihood function. In what follows, we first restate the key convergence result of Zangwill (1969); this result, given in Theorem A.1 and adapted from Wu (1983), is stated in a form convenient for use with the MM algorithm and provides for a very general (and comparatively weak) form of convergence. We then draw on stronger convergence results due to Meyer (1976) in order to establish a more useful convergence theory for MM algorithms designed to minimize nondifferentiable objective functions; this result is stated in Theorem A.3. Finally, we provide a set of sufficient regularity conditions that ensure the validity of the conditions of both theorems in a wide class of statistical estimation problems.
Let ξ(β) be the real-valued function to be minimized, where β ∈ B and B is some convex subset of R^p. Let M : B → B be the minimization map (1), where ξ_SUR(·, ·) is any function that majorizes ξ(β) for β ∈ B. In general, M(·) is a point-to-set map, and M(β) is therefore a set. We say that β̄ is a generalized fixed point of M(·) if β̄ ∈ M(β̄); we say that β̄ is a fixed point of M(·) if M(β̄) = {β̄} (i.e., a singleton).
The main result of Zangwill (1969, Theorem A), also utilized in Wu (1983), is stated below.

Remark A.2. Assumptions [Z1]-[Z3] are imposed in Theorem A.1. In place of Z3, one may instead impose the following condition:
Z3′. For each α ∈ B_0 and β ∈ M(α): ξ(β) < ξ(α) if α is not a generalized fixed point of M(·) (i.e., a strict decrease occurs at points α that are not generalized fixed points); and ξ(β) ≤ ξ(α) if α is a generalized fixed point of M(·) (i.e., if α is a generalized fixed point, it is possible to observe no change in the objective function).
The above theorem essentially guarantees convergence of subsequences, but not global convergence of the iteration sequence itself. Subsequential convergence permits, for example, oscillatory behavior in the limit sequence. Meyer (1976, 1977) offers several refinements of Theorem A.1, strengthening the statements of convergence. His results, adapted for the MM algorithm, follow below; in particular, see Theorem 3.1, Corollary 3.2, and Theorems 3.5 and 3.6 of Meyer (1976).

Proposition A.5. Suppose β^(n) is bounded for a given n ≥ 0. Then, β^(n+1) = M(β^(n)) exists, is bounded, and is unique. In addition, for n ≥ 0, ξ(β^(n+1)) ≤ ξ_SUR(β^(n+1), β^(n)) ≤ ξ(β^(n)), where the second inequality is strict unless β^(n+1) = M(β^(n)) = β^(n).

A.2 Proof of Theorem 2.1
The assumptions stated in the theorem immediately yield that ξ(β) is locally Lipschitz continuous and coercive for each bounded λ > 0, hence (i) is satisfied.
In order to establish the convergence of the corresponding MM algorithm in (iii), it suffices to prove that the assumptions of the theorem and the assertions established thus far are sufficient to ensure that Conditions R1-R5 of Appendix A.1 are met, in which case Theorem A.3 applies directly. The result (i), combined with the assumption that the stationary points are all isolated, immediately establishes Condition R1; as proved above, Conditions R2 and R3 also hold. If ψ(β, α) = h(β, α) + q(β, α; λ) − p(β; λ) is continuous in α and β and locally Lipschitz continuous in β near α, then (i) implies that R4 also holds. By assumption, h(β, α) is continuous in α and continuously differentiable in β, hence locally Lipschitz in β. Continuity of q(β, α; λ) − p(β; λ) in both α and β is also immediate. Hence, R4 holds provided that q(β, α; λ) − p(β; λ) is locally Lipschitz continuous in β near α. To see that this is the case, we note that (24) is a linear combination of functions of β_j of the form p′(|α_j|; λ_j)|β_j| − p(|β_j|; λ_j), where | · | and −p(·; λ) are both convex, hence locally Lipschitz. Since both the sum and the composition of two locally Lipschitz functions are locally Lipschitz, the result now follows. Finally, R5 is ensured by R1-R4 and the condition in (iii) that ξ_SUR(β, α) is uniquely minimized in β for each α.