Convex and non-convex regularization methods for spatial point processes intensity estimation

This paper deals with feature selection procedures for spatial point process intensity estimation. We consider regularized versions of estimating equations based on the Campbell theorem and derived from two classical criteria: the Poisson likelihood and the logistic regression likelihood. We provide general conditions on the spatial point process and on the penalty function which ensure consistency, sparsity, and asymptotic normality. We discuss the numerical implementation and assess finite-sample properties in a simulation study. Finally, an application to tropical forestry datasets illustrates the use of the proposed methods.


Introduction
Spatial point pattern data arise in many contexts where interest lies in describing the distribution of an event in space. Some examples include the locations of trees in a forest, gold deposits mapped in a geological survey, stars in a star cluster, animal sightings, locations of some specific cells in the retina, or road accidents (see e.g. Møller and Waagepetersen, 2004; Illian et al., 2008; Baddeley et al., 2015). Interest in methods for analyzing spatial point pattern data is rapidly expanding across many fields of science, notably in ecology, epidemiology, biology, geosciences, astronomy, and econometrics.
One of the main interests when analyzing spatial point pattern data is to estimate the intensity, which characterizes the probability that a point (or an event) occurs in an infinitesimal ball around a given location. In practice, the intensity is often assumed to be a parametric function of spatial covariates (e.g. Waagepetersen, 2007; Møller and Waagepetersen, 2007; Waagepetersen, 2008; Waagepetersen and Guan, 2009; Guan and Shen, 2010; Coeurjolly and Møller, 2014). In this paper, we assume that the intensity function ρ is parameterized by a vector β and has the log-linear specification

ρ(u; β) = exp{β⊤ z(u)},  u ∈ D ⊂ R^d,

where z(u) = {z_1(u), . . . , z_p(u)}⊤ are the p spatial covariates measured at location u, β = {β_1, . . . , β_p}⊤ is a real p-dimensional parameter, D is the domain of observation, and R^d is the state space of the spatial point process (usually d = 2, 3). Methods to estimate β when p is moderate are now quite standard. Instead of maximum likelihood estimation, which is computationally expensive (Møller and Waagepetersen, 2004), standard methods are based on estimating equations derived from the Campbell theorem and include the Poisson likelihood (e.g. Waagepetersen, 2007) and logistic regression likelihood methods (e.g. Baddeley et al., 2014) (see Appendix A for details on these methods). An important advantage of such methods is their simple implementation. From a numerical point of view, it has been demonstrated (see e.g. Baddeley et al., 2015) that the Poisson likelihood and the logistic regression likelihood can be efficiently approximated by a generalized linear model (more precisely, a weighted quasi-Poisson regression for the former and a logistic regression for the latter). GLM software can, therefore, be adapted to accurately estimate β. This is exactly what is proposed by the R package spatstat (Baddeley et al., 2015), devoted to the analysis of spatial point patterns.
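As a minimal numerical sketch of this log-linear specification (the covariate values and coefficients below are invented for illustration and do not come from the paper):

```python
import numpy as np

def log_linear_intensity(z, beta):
    """Evaluate rho(u; beta) = exp(beta^T z(u)) at a set of locations.

    z    : array of shape (n_locations, p) -- covariates z_1(u), ..., z_p(u)
    beta : array of shape (p,)             -- regression coefficients
    """
    z = np.asarray(z, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return np.exp(z @ beta)

# Hypothetical example: an intercept plus one covariate at two locations.
z = np.array([[1.0, 0.0],
              [1.0, 2.0]])
beta = np.array([np.log(3.0), 0.5])
rho = log_linear_intensity(z, beta)   # rho(u) = 3 * exp(0.5 * z_2(u))
```

The first component of z plays the role of the intercept, exactly as in the simulation study of Section 4 where z(u) = {1, x_1(u), . . .}.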
In recent decades, with the advancement of technology and huge investment in data collection, applications of intensity estimation involving a large number of covariates have rapidly become available (e.g. Hubbell et al., 2005; Renner and Warton, 2013; Thurman et al., 2015). When the intensity is a function of many variables, covariate selection becomes inevitable. Variable selection in the context of spatial point processes is a recent topic. Thurman and Zhu (2014) focus on using the adaptive lasso to select variables for inhomogeneous Poisson point processes. This study is extended to clustered spatial point processes by Thurman et al. (2015), who establish asymptotic properties (consistency, sparsity, and asymptotic normality) of the estimates. Yue and Loh (2015) consider modeling spatial point data with Poisson, pairwise interaction point processes, and Neyman-Scott cluster models, incorporating lasso, adaptive lasso, and elastic net regularization methods. The latter work does not provide any theoretical result.
In this paper, we intend to extend, from a theoretical point of view, the previous papers by considering more estimation methods and more penalty functions. We propose regularized versions of either the Poisson or logistic regression likelihoods to estimate the intensity of the spatial point process. The penalty functions we consider are either convex or non-convex. We provide general conditions on the characteristics of the spatial point process (finite moments, mixing conditions) and on the penalty function to ensure an oracle property and a central limit theorem. It is also to be noted that our theoretical results hold under less restrictive assumptions on the model and on the asymptotic covariance matrix than the ones required by Thurman et al. (2015) (see Remark 3). Since we outline the link between the criteria we maximize and penalized generalized linear models, our work is mainly based on the pioneering paper by Fan and Li (2001). Our contribution is to exploit and extend this paper: first, the asymptotic framework we consider is increasing-domain asymptotics, i.e. the domain of observation, say D_n ⊂ R^d, increases to R^d with n (so |D_n|, the volume of D_n, plays the same role as n in the standard literature); second, unlike the work by Fan and Li (2001), which assumes the independence of observations, our results apply to spatial point processes which exhibit dependence (e.g. Neyman-Scott processes, log-Gaussian Cox processes).
From a numerical point of view, we are led to implement regularization methods for generalized linear models. This is quite straightforward since we only need to combine the spatstat R package with the two R packages implementing penalized estimation for generalized linear models, glmnet (Friedman et al., 2010) and ncvreg (Breheny and Huang, 2011).
The rest of the paper is organized as follows. Section 2 gives the necessary background on spatial point processes, details briefly how a parametric intensity function is classically estimated, and formulates the problem we tackle. This section is quite short, but the non-expert reader can find more details in Appendices A-D. Our main contribution is to obtain asymptotic properties for various spatial point process models, estimation methods, and penalty functions. These results are detailed in Section 3. Section 4 investigates the finite-sample properties of the proposed method in a simulation study, followed by an application to tropical forestry datasets in Section 5, and by a conclusion and discussion in Section 6. Proofs of the main results are postponed to Appendices E-G.

Spatial point processes and intensity functions
Let X be a spatial point process on R^d. Let D ⊂ R^d be a compact set of Lebesgue measure |D| which will play the role of the observation domain. We view X as a locally finite random subset of R^d, i.e. the random number of points of X in B, N(B), is almost surely finite whenever B ⊂ R^d is a bounded region. A realization of X in D is thus a set x = {x_1, x_2, . . . , x_m}, where x_i ∈ D and m is the observed number of points in D. Note that m is the realization of a random variable and 0 ≤ m < ∞.
Suppose X has intensity function ρ and second-order product density ρ^(2). The Campbell theorem (see e.g. Møller and Waagepetersen, 2004) states that, for any function k : R^d → [0, ∞) (resp. k : R^d × R^d → [0, ∞)),

E Σ_{u∈X} k(u) = ∫_{R^d} k(u) ρ(u) du  and  E Σ^{≠}_{u,v∈X} k(u, v) = ∫_{R^d} ∫_{R^d} k(u, v) ρ^(2)(u, v) du dv,

where Σ^{≠} means that the sum runs over pairs of distinct points. In particular, the Campbell theorem provides an intuitive interpretation of ρ and ρ^(2). We may interpret ρ(u)du as the probability of occurrence of a point in an infinitesimally small ball with center u and volume du. In the same way, ρ^(2)(u, v)dudv is the probability of observing a pair of distinct points from X occurring jointly in each of two infinitesimally small balls with centers u, v and volumes du, dv. Without entering into details, we can define ρ^(k), the k-th order intensity function (see Møller and Waagepetersen, 2004, for more details). For further background material on spatial point processes, see for example Møller and Waagepetersen (2004); Illian et al. (2008). In order to study whether a point process deviates from independence (i.e., from a Poisson point process), we often consider the pair correlation function given by

g(u, v) = ρ^(2)(u, v) / {ρ(u) ρ(v)}

when both ρ and ρ^(2) exist, with the convention 0/0 = 0. For a Poisson point process (Appendix B.1), we have ρ^(2)(u, v) = ρ(u)ρ(v), so that g(u, v) = 1. If, for example, g(u, v) > 1 (resp. g(u, v) < 1), this indicates that pairs of points are more likely (resp. less likely) to occur at locations u, v than for a Poisson point process with the same intensity function as X. If, for any u, v, g(u, v) depends only on u − v, the point process X is said to be second-order reweighted stationary.
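The first-order Campbell formula can be checked by simulation. The sketch below is our own illustration (not from the paper): for a homogeneous Poisson process of intensity ρ = 50 on the unit square and the test function k(u) equal to the first coordinate of u, it compares a Monte Carlo average of Σ_{u∈X} k(u) with ∫ k(u) ρ du = ρ/2:

```python
import numpy as np

rng = np.random.default_rng(42)
rho, n_rep = 50.0, 4000   # homogeneous intensity on the unit square, |D| = 1

totals = []
for _ in range(n_rep):
    n = rng.poisson(rho)              # N(D) ~ Poisson(rho * |D|)
    xy = rng.uniform(size=(n, 2))     # given N(D), points are uniform on D
    totals.append(xy[:, 0].sum())     # k(u) = first coordinate of u

# Campbell: E sum_{u in X} k(u) = integral of k(u) rho du = rho / 2 = 25
mc_estimate = float(np.mean(totals))
```

The Monte Carlo average settles near 25, as the theorem predicts for this k.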

Parametric intensity estimation
In our study, we assume that the intensity function depends on a vector of parameters β, i.e. ρ(·) = ρ(·; β). As outlined in the introduction, maximum likelihood estimation is almost unfeasible for general spatial point process models. Instead, the Campbell formula provides a nice tool for defining methods based on estimating equations. These methods are now standard in the context of spatial point processes, but we refer the reader to Appendix A for a more detailed presentation. The standard parametric methods for estimating β are obtained by maximizing the weighted Poisson likelihood (e.g. Guan and Shen, 2010) or the weighted logistic regression likelihood (e.g. Baddeley et al., 2014), given respectively by

ℓ_PL(w; β) = Σ_{u∈X∩D} w(u) log ρ(u; β) − ∫_D w(u) ρ(u; β) du,  (2.3)

ℓ_LRL(w; β) = Σ_{u∈X∩D} w(u) log{ρ(u; β)/(δ(u) + ρ(u; β))} + ∫_D w(u) δ(u) log{δ(u)/(δ(u) + ρ(u; β))} du,  (2.4)

where w(·) is a non-negative weight function depending on the first- and second-order characteristics of X and δ(·) is a non-negative real-valued function. Appendix A recalls the pertinence of (2.3)-(2.4): the Campbell theorem shows that the gradient vectors of (2.3)-(2.4) constitute unbiased estimating equations. The solution obtained by maximizing (2.3) (resp. (2.4)) is called the Poisson estimator (resp. the logistic regression estimator). We refer the reader to Appendix A for further details on the weight function w(·) and on the role of the function δ(·). From a numerical point of view, it has been demonstrated that (2.3) and (2.4) can be efficiently approximated by a weighted generalized linear model (more precisely, a weighted quasi-Poisson regression for the former and a logistic regression for the latter). GLM software can therefore be adapted to accurately estimate β. More details about this numerical implementation can be found in Appendices C.1 and C.2 respectively.
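To make (2.3) concrete, the sketch below (toy data and names are ours) evaluates the weighted Poisson likelihood with w ≡ 1, approximating the integral over D by a Riemann sum on a covariate grid. In the intercept-only case on a unit-area domain with m observed points, ℓ(β) = mβ − e^β, maximized at β = log m, which a simple grid search recovers:

```python
import numpy as np

def poisson_loglik(beta, z_points, z_grid, cell_area, w=1.0):
    """Weighted Poisson likelihood (2.3) with constant weight w, the
    integral over D being replaced by a Riemann sum on a covariate grid."""
    beta = np.asarray(beta, dtype=float)
    log_rho_pts = z_points @ beta          # log rho at the observed points
    rho_grid = np.exp(z_grid @ beta)       # rho on the quadrature grid
    return w * log_rho_pts.sum() - w * cell_area * rho_grid.sum()

# Toy check: intercept-only model, unit-area domain, m = 40 points.
# Then l(beta) = m*beta - exp(beta), maximized at beta = log(m) ~ 3.689.
m = 40
z_points = np.ones((m, 1))
z_grid = np.ones((100, 1))
cell_area = 1.0 / 100
grid = np.linspace(0.0, 6.0, 6001)
vals = [poisson_loglik(np.array([b]), z_points, z_grid, cell_area) for b in grid]
beta_hat = grid[int(np.argmax(vals))]
```

In spatstat the same discretization is done much more carefully via quadrature (dummy) points, which is the role of nd² in Section 4.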

Regularization techniques
Regularization techniques are introduced as alternatives to stepwise selection for variable selection and parameter estimation. In general, a regularization method attempts to maximize the penalized likelihood function ℓ(θ) − η Σ_{j=1}^{p} p_{λ_j}(|θ_j|), where ℓ(θ) is the likelihood function of θ, η is the number of observations, and p_λ(·) is a nonnegative penalty function parameterized by a real number λ ≥ 0. The same general strategy is adopted here in the context of spatial point processes.
Let ℓ(w; β) be either the weighted Poisson likelihood function (2.3) or the weighted logistic regression likelihood function (2.4). In a similar way, we define the penalized weighted likelihood function given by

Q(w; β) = ℓ(w; β) − |D| Σ_{j=1}^{p} p_{λ_j}(|β_j|),  (2.5)

where |D| is the volume of the observation domain, which plays the same role as the number of observations η in our setting, λ_j is a nonnegative tuning parameter corresponding to β_j for j = 1, . . . , p, and p_λ is a penalty function which we now describe. For any λ ≥ 0, we say that p_λ(·) : R^+ → R is a penalty function if p_λ is a nonnegative function with p_λ(0) = 0. Examples of penalty functions considered in this paper are the ridge, lasso, elastic net, adaptive lasso, adaptive elastic net, SCAD, and MC+ penalties.
The first and second derivatives of the above functions are given in Table 1. It is to be noticed that p_λ is not twice differentiable at θ = λ and θ = γλ for the SCAD penalty (resp. at θ = γλ for the MC+ penalty).
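For reference, the SCAD (Fan and Li, 2001) and MC+ (Zhang, 2010) penalties and their first derivatives can be written down explicitly; the sketch below follows the standard definitions (our own transcription, not the paper's display):

```python
import numpy as np

def scad(theta, lam, gamma=3.7):
    """SCAD penalty p_lambda(theta), theta >= 0, gamma > 2."""
    theta = np.asarray(theta, dtype=float)
    return np.where(
        theta <= lam, lam * theta,
        np.where(theta <= gamma * lam,
                 (2 * gamma * lam * theta - theta**2 - lam**2) / (2 * (gamma - 1)),
                 lam**2 * (gamma + 1) / 2))

def scad_deriv(theta, lam, gamma=3.7):
    """SCAD first derivative: lam on [0, lam], then linear decay to 0."""
    theta = np.asarray(theta, dtype=float)
    return np.where(theta <= lam, lam,
                    np.maximum(gamma * lam - theta, 0.0) / (gamma - 1))

def mcp(theta, lam, gamma=3.0):
    """MC+ penalty, theta >= 0, gamma > 1."""
    theta = np.asarray(theta, dtype=float)
    return np.where(theta <= gamma * lam,
                    lam * theta - theta**2 / (2 * gamma),
                    gamma * lam**2 / 2)

def mcp_deriv(theta, lam, gamma=3.0):
    """MC+ first derivative: (lam - theta / gamma)_+ ."""
    theta = np.asarray(theta, dtype=float)
    return np.maximum(lam - theta / gamma, 0.0)
```

Both penalties satisfy p_λ(0) = 0 and have a derivative that vanishes for large θ, which is what keeps large coefficients nearly unbiased, in contrast with the lasso whose derivative is constant.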
Penalty functions give rise to specific well-known methods which are summarized in Table 2. More details can be found in Appendix D.
The solution obtained by maximizing (2.5) is called the regularized Poisson estimator or the regularized logistic estimator. From the previous section, the numerical maximization of (2.5) can be done using procedures which estimate a penalized weighted generalized linear model. This is now quite standard, for instance in R with packages such as glmnet and ncvreg. More details about this can be found in Appendix C.3.
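Internally, packages such as glmnet fit these penalized GLMs by iteratively reweighted least squares, solving each weighted least-squares subproblem by cyclic coordinate descent whose elementary update is a soft-thresholding step. Below is a minimal sketch of that inner loop for the lasso (toy data and names are ours; this is not the packages' actual code):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0): the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Lasso for the least-squares loss (1/2n)||y - X beta||^2 + lam * ||beta||_1
    by cyclic coordinate descent (the subproblem glmnet iterates inside IRLS)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z_j = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z_j, lam) / (X[:, j] @ X[:, j] / n)
    return beta

# Toy orthogonal design with unit column norms (x_j'x_j / n = 1):
x1 = np.array([1.0, 1.0, -1.0, -1.0])
x2 = np.array([1.0, -1.0, 1.0, -1.0])
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * x2
beta_hat_cd = lasso_coordinate_descent(X, y, lam=0.5)
# soft-thresholding shrinks 2.0 to 1.5 and sets the small coefficient to zero
```

The exact zero produced for the small coefficient is what yields sparsity; ridge, by contrast, only shrinks and never zeroes a coordinate.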
What is expected from maximizing (2.5) is that the procedure correctly selects the true covariates and that the estimate is consistent and still satisfies a central limit theorem. To obtain such properties when the observation domain increases to R^d, specific conditions on the point process, the covariates, the regularity of the penalty function and, most of all, on the tuning parameters λ_j are required. This is investigated in the next section.

Asymptotic theory
In this section, we present the asymptotic results for the regularized Poisson estimator when considering X as a d-dimensional point process observed over a sequence of observation domains D = D_n, n = 1, 2, . . ., which expands to R^d as n → ∞. The regularization parameters λ_j = λ_{n,j}, for j = 1, . . . , p, are now indexed by n. For the sake of conciseness, we do not present the asymptotic results for the regularized logistic estimator. The results are very similar; the main difference lies in the conditions (C.6) and (C.7), for which the matrices A_n, B_n, and C_n have a different expression (see Remark 2). So, from now on, we let ℓ_n = ℓ_{n,PL} and Q = Q_n be indexed by n.
We define the p × p matrices A_n(w; β_0), B_n(w; β_0), and C_n(w; β_0) by

A_n(w; β_0) = ∫_{D_n} w(u) z(u) z(u)⊤ ρ(u; β_0) du,
B_n(w; β_0) = ∫_{D_n} w(u)^2 z(u) z(u)⊤ ρ(u; β_0) du,
C_n(w; β_0) = ∫_{D_n} ∫_{D_n} w(u) w(v) z(u) z(v)⊤ {g(u, v) − 1} ρ(u; β_0) ρ(v; β_0) du dv.

In what follows, for a square symmetric matrix M_n, ν_min(M_n) denotes the smallest eigenvalue of M_n. Consider the following conditions (C.1)-(C.8), which are required to derive our asymptotic results:

(C.4) There exists an integer t ≥ 1 such that for k = 2, . . . , 2 + t, the product density ρ^(k) exists and satisfies ρ^(k) < ∞.
(C.5) For the strong mixing coefficients (3.1), we assume that there exists some …
(C.8) The penalty function p_λ(·) is nonnegative on R^+, satisfies p_λ(0) = 0, and is continuously differentiable on R^+ \ {0}, with derivative p′_λ assumed to be a Lipschitz function on R^+ \ {0}. Furthermore, given (λ_{n,j})_{n≥1}, for j = 1, . . . , s, we assume that there exists (r_{n,j})_{n≥1}, where |D_n|^{1/2} r_{n,j} → ∞ as n → ∞, such that, for n sufficiently large, p_{λ_{n,j}} is thrice continuously differentiable in the ball centered at |β_{0j}| with radius r_{n,j}, and we assume that the third derivative is uniformly bounded.
Under condition (C.8), we define the sequences a_n, b_n and c_n, where K_1 is any positive constant. These sequences a_n, b_n and c_n, detailed in Table 3 for the different methods considered in this paper, play a central role in our results. Although this will be discussed later in Section 3.3, we specify right now that we require a_n |D_n|^{1/2} → 0, b_n |D_n|^{1/2} → ∞ and c_n → 0.

Main results
We state our main results here. Proofs are relegated to Appendices E-G.
We first show in Theorem 1 that the regularized Poisson estimator is consistent and exhibit its rate of convergence.
This implies that, if a_n = O(|D_n|^{−1/2}) and c_n = o(1), the regularized Poisson estimator is root-|D_n| consistent. Furthermore, we demonstrate in Theorem 2 that such a root-|D_n| consistent estimator ensures the sparsity of β̂; that is, the estimator will correctly set β̂_2 to zero with probability tending to 1 as n → ∞, and β̂_1 is asymptotically normal.

Remark 1.
For lasso and adaptive lasso, Π_n = 0. For the other penalties, since c_n = o(1), we have Π_n = o(1). Since A_{n,11}(w; β_0) = O(|D_n|) from conditions (C.2) and (C.3), |D_n| Π_n is asymptotically negligible with respect to A_{n,11}(w; β_0).

Remark 2. Theorems 1 and 2 remain true for the regularized logistic estimator if we replace, in the expressions of the matrices A_n, B_n, and C_n, w(u) by w(u)δ(u)/(ρ(u; β_0) + δ(u)), u ∈ D_n, and extend condition (C.3) accordingly. The proofs of Theorems 1 and 2 for this estimator are slightly different, mainly because, unlike the Poisson likelihood for which we have ℓ_n^(2)(w; β) = −A_n(w; β), the Hessian of the logistic regression likelihood is stochastic. Despite this additional difficulty, we stress that no additional assumption is required.
Remark 3. We want to highlight here the main theoretical differences with the work by Thurman et al. (2015). First, our methodology and results are also available for the logistic regression likelihood. Second, we consider very general penalty functions, while Thurman et al. (2015) only consider the adaptive lasso method. Third, Thurman et al. (2015) assume that |D_n|^{−1} M_n → M as n → ∞, where M_n is A_n, B_n or C_n, and where M, i.e. either A, B or C, is a positive definite matrix. Instead, we assume the weaker condition lim inf_{n→∞} ν_min(|D_n|^{−1} M_n) > 0, where M_n is either A_n or B_n + C_n. The latter point makes the proofs a little more technical.

Discussion of the conditions
We split the conditions we assume into two different categories: conditions (C.1)-(C.7) and condition (C.8) combined with the assumptions on the behavior of the sequences a n , b n and c n .
Conditions (C.1)-(C.7) are standard in the literature, see e.g. Coeurjolly and Møller (2014). Essentially, these assumptions ensure that, when there is no regularization, the estimator β̂ is consistent and satisfies a central limit theorem. To help the reader, we reproduce the comments that can be made on these assumptions. In condition (C.1), the assumption that E contains o in its interior can be made without loss of generality. If instead u is an interior point of E, then condition (C.1) could be modified so that any ball with centre u and radius r > 0 is contained in D_n = nE for all sufficiently large n. Condition (C.3) is quite standard. From conditions (C.2)-(C.5), the matrices A_n(w; β_0), B_n(w; β_0) and C_n(w; β_0) are bounded by |D_n| (see e.g. Coeurjolly and Møller, 2014). As mentioned, conditions (C.1)-(C.6) are used to establish a central limit theorem for |D_n|^{−1/2} ℓ_n^(1)(w; β_0), using a general central limit theorem for triangular arrays of nonstationary random fields obtained by Karácsony (2006), an extension of the theorem of Bolthausen (1982), itself extended to nonstationary random fields by Guyon (1995). As pointed out by Coeurjolly and Møller (2014), condition (C.6) is a spatial average assumption. For linear models, this assumption is similar to requiring that ν_min(n^{−1} X⊤X) is bounded away from zero, where n plays the role of the number of observations and X represents the design matrix.
Conditions (C.6)-(C.7) ensure that the matrix Σ_n(w; β_0) is invertible for sufficiently large n. We refer the reader to e.g. Coeurjolly and Møller (2014), where these conditions are shown to hold for a large class of models, including the Poisson and Cox processes discussed in Appendix B.
Condition (C.8) controls the higher-order terms in the Taylor expansion of the penalty function. Roughly speaking, we expect the penalty function to be at least Lipschitz and thrice differentiable in a neighborhood of the true parameter vector. As it stands, the condition looks technical; however, it is obviously satisfied for ridge, lasso, elastic net (and their adaptive versions). Depending on the choice of λ_n, it is satisfied for SCAD and MC+ when |β_{0j}|, for j = 1, . . . , s, is not equal to γλ_n and/or λ_n.
As a consequence of the previous discussion, the main assumptions we require in this paper are the ones related to the sequences a_n, b_n and c_n. We require that a_n |D_n|^{1/2} → 0, b_n |D_n|^{1/2} → ∞ and c_n → 0 as n → ∞ simultaneously. For the ridge regularization method, b_n = 0, which prevents us from applying Theorem 2 to this penalty. For lasso and elastic net, a_n = K_2 b_n for some constant K_2 > 0 (K_2 = 1 for lasso), so the two conditions a_n |D_n|^{1/2} → 0 and b_n |D_n|^{1/2} → ∞ as n → ∞ cannot be satisfied simultaneously. This is different for the adaptive versions, where a compromise can be found by adjusting the λ_{n,j}'s, as well as for the two non-convex penalties SCAD and MC+, for which λ_n can be adjusted. For the regularization methods considered in this paper, the condition c_n → 0 is implied by the condition a_n |D_n|^{1/2} → 0 as n → ∞.
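To see how the adaptive versions escape this incompatibility, suppose λ_{n,j} = λ_n/|β̃_j| with β̃ a root-|D_n| consistent pilot estimate: then a_n behaves like λ_n (pilot estimates of true coefficients stay bounded away from zero), while b_n behaves like λ_n |D_n|^{1/2} (pilot estimates of noise coefficients are O(|D_n|^{−1/2})). The numerical sketch below is an illustration under these assumptions, with the admissible choice λ_n = |D_n|^{−3/4}:

```python
import numpy as np

D = 10.0 ** np.arange(1, 7)   # |D_n| growing: 10, 100, ..., 10^6
lam = D ** (-0.75)            # one admissible tuning: lambda_n = |D_n|^(-3/4)

# a_n ~ lambda_n, so  a_n |D_n|^{1/2} ~ |D_n|^{-1/4}  -> 0
a_scaled = lam * np.sqrt(D)
# b_n ~ lambda_n |D_n|^{1/2}, so  b_n |D_n|^{1/2} ~ |D_n|^{+1/4}  -> infinity
b_scaled = lam * D
```

Any rate λ_n with λ_n |D_n|^{1/2} → 0 and λ_n |D_n| → ∞ works here; a single common λ_n (plain lasso) cannot satisfy both.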

Simulation study
We conduct a simulation study with three different scenarios, described in Section 4.1, to compare the estimates of the regularized Poisson likelihood (PL) and those of the regularized weighted Poisson likelihood (WPL). We also want to explore the behavior of the estimates using different regularization methods. Empirical findings are presented in Section 4.2. Furthermore, we compare, in Section 4.3, the regularized Poisson and logistic estimators.

Simulation set-up
The setting is quite similar to that of Waagepetersen (2007) and Thurman et al. (2015). The spatial domain is D = [0, 1000] × [0, 500]. We center and scale the 201 × 101 pixel images of elevation (x_1) and gradient of elevation (x_2) contained in the bei dataset of the spatstat library in R (R Core Team, 2016), and use them as the two true covariates. In addition, we create three different scenarios to define extra covariates:

Scenario 1. We generate eighteen 201 × 101 pixel images of covariates as standard Gaussian white noise and denote them by x_3, . . . , x_20. We define z(u) = {1, x_1(u), . . . , x_20(u)} as the covariate vector. The regression coefficients for z_3, . . . , z_20 are set to zero.

Scenario 2. First, we generate eighteen 201 × 101 pixel images of covariates as in Scenario 1. Second, we transform them, together with x_1 and x_2, to introduce multicollinearity, using a covariance matrix Ω over the indices 1, . . . , 20, except that (Ω)_12 = (Ω)_21 = 0 in order to preserve the correlation between x_1 and x_2. The regression coefficients for z_3, . . . , z_20 are set to zero.

Scenario 3. We consider a more complex situation. We center and scale the thirteen 50 × 25 pixel images of soil nutrient covariates obtained from the study of the tropical forest of Barro Colorado Island (BCI) in central Panama (see Condit, 1998; Hubbell et al., 1999, 2005), resample them to 201 × 101 pixel images to match x_1 and x_2, and use them as the extra covariates. Together with x_1 and x_2, we keep the structure of the covariance matrix to preserve the complexity of the situation. In this setting, we have z(u) = {1, x_1(u), . . . , x_15(u)}. The regression coefficients for z_3, . . . , z_15 are set to zero.
The different maps of the covariates obtained from Scenarios 2 and 3 are depicted in Appendix H. Except for z 3 which has high correlation with z 2 , the extra covariates obtained from Scenario 2 tend to have a constant value ( Figure 3). This is completely different from the ones obtained from Scenario 3 ( Figure 4).
The mean number of points over the domain D, μ, is chosen to be 1600. We set the true intensity function to ρ(u; β_0) = exp{β_0 + β_1 z_1(u) + β_2 z_2(u)}, where β_1 = 2 represents a relatively large effect of elevation, β_2 = 0.75 reflects a relatively small effect of gradient, and β_0 is selected such that each realization has 1600 points on average. Furthermore, we erode the domain D regularly such that, with the same intensity function, the mean number of points over the new domain D_R becomes 400. The erosion is used to observe the convergence of the procedure as the observation domain expands. We consider the default number of dummy points for the Poisson likelihood, denoted by nd², as suggested in the spatstat R package, i.e. nd² ≈ 4m, where m is the number of points. With these scenarios, we simulate 2000 spatial point patterns from a Thomas point process (see Appendix B.2) using the rThomas function in the spatstat package. We also consider two values of the κ parameter (κ = 5 × 10⁻⁴, κ = 5 × 10⁻⁵) as different levels of spatial interaction, and let ω = 20. For each of the four combinations of κ and μ, we fit the intensity to the simulated point pattern realizations. We also fit the oracle model, which only uses the two true covariates.
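For readers who prefer a language-agnostic description of what rThomas simulates, the following sketch draws a Thomas process directly: parents from a homogeneous Poisson process of intensity κ, a Poisson number of offspring per parent, and Gaussian displacements with standard deviation ω. Unlike rThomas, it ignores edge effects from parents outside the window, a deliberate simplification of this illustration:

```python
import numpy as np

def simulate_thomas(kappa, mu_off, omega, width, height, seed=0):
    """Thomas process on [0, width] x [0, height]: Poisson(kappa) parents
    per unit area, each with Poisson(mu_off) offspring displaced by
    N(0, omega^2 I).  Parents outside the window are not generated, so
    counts near the boundary are slightly deflated in this sketch."""
    rng = np.random.default_rng(seed)
    n_parents = rng.poisson(kappa * width * height)
    parents = rng.uniform([0.0, 0.0], [width, height], size=(n_parents, 2))
    pts = []
    for c in parents:
        n_off = rng.poisson(mu_off)
        pts.append(c + omega * rng.standard_normal((n_off, 2)))
    xy = np.vstack(pts) if pts else np.empty((0, 2))
    keep = ((xy[:, 0] >= 0) & (xy[:, 0] <= width)
            & (xy[:, 1] >= 0) & (xy[:, 1] <= height))
    return xy[keep]

# Roughly matching the study: kappa = 5e-4 parents per unit area on a
# 1000 x 500 window, offspring mean chosen so E[N] is about 1600.
xy = simulate_thomas(kappa=5e-4, mu_off=1600 / (5e-4 * 1000 * 500),
                     omega=20, width=1000, height=500, seed=1)
```

With κ = 5 × 10⁻⁴ there are about 250 parents of 6.4 offspring each; decreasing κ to 5 × 10⁻⁵ (with μ fixed) concentrates the same number of points in far fewer, much tighter clusters.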
All models are fitted using modified internal functions of spatstat (Baddeley et al., 2015), glmnet (Friedman et al., 2010), and ncvreg (Breheny and Huang, 2011). A modification of the ncvreg R package is required to include the penalized weighted Poisson and logistic likelihoods.

Simulation results
To better understand the behavior of the Thomas processes designed in this study, Figure 1 shows plots of four realizations using different κ and μ. The smaller the value of κ, the tighter the clusters, since there are fewer parents. When μ = 400, i.e. considering the realizations observed on D_R, the mean number of points over the 2000 replications and the standard deviation are 396 and 47 (resp. 400 and 137) when κ = 5 × 10⁻⁴ (resp. κ = 5 × 10⁻⁵). When μ = 1600, the mean number of points and the standard deviation are 1604 and 174 (resp. 1589 and 529) when κ = 5 × 10⁻⁴ (resp. κ = 5 × 10⁻⁵).

Tables 4 and 5 present the selection properties of the estimates using the penalized PL and the penalized WPL methods. Similarly to Bühlmann and Van De Geer (2011), the indices we consider are the true positive rate (TPR), the false positive rate (FPR), and the positive predictive value (PPV). TPR corresponds to the ratio of the selected true covariates over the number of true covariates, while FPR corresponds to the ratio of the selected noisy covariates over the number of noisy covariates. TPR measures how well the model correctly selects both z_1 and z_2, while FPR measures how often the model incorrectly selects among z_3 to z_p (p = 20 for Scenarios 1 and 2 and p = 15 for Scenario 3). PPV corresponds to the ratio of the selected true covariates over the total number of selected covariates in the model; it describes how well the model approximates the oracle model in terms of selection. Therefore, we look for methods with a TPR and a PPV close to 100% and an FPR close to 0.
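The three indices are straightforward to compute from an estimated coefficient vector; a small sketch (our own helper, with an invented selection outcome):

```python
import numpy as np

def selection_rates(beta_hat, true_support):
    """TPR, FPR and PPV of a selected model.

    beta_hat     : estimated coefficient vector (intercept excluded)
    true_support : boolean mask of the truly nonzero coefficients
    """
    selected = np.asarray(beta_hat) != 0.0
    truth = np.asarray(true_support, dtype=bool)
    tp = np.sum(selected & truth)           # true covariates kept
    fp = np.sum(selected & ~truth)          # noisy covariates kept
    tpr = tp / truth.sum()
    fpr = fp / (~truth).sum()
    ppv = tp / selected.sum() if selected.sum() > 0 else np.nan
    return tpr, fpr, ppv

# Scenario-1-like toy: 20 covariates, the first two truly active.
truth = np.array([True, True] + [False] * 18)
beta_hat = np.zeros(20)
beta_hat[[0, 2]] = [1.8, 0.1]   # hypothetical fit: keeps z1 and one noise covariate
tpr, fpr, ppv = selection_rates(beta_hat, truth)
```

Here the hypothetical fit misses z_2, so TPR = 1/2, FPR = 1/18 and PPV = 1/2: exactly the pattern discussed below for the penalized WPL with lasso-type penalties.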
Generally, for both the penalized PL and the penalized WPL methods, the best selection properties are obtained for the larger value of κ, which corresponds to weaker spatial dependence. For more clustered patterns, indicated by a smaller value of κ, it seems more difficult to select the true covariates. As μ increases from 400 (Table 4) to 1600 (Table 5), the TPR tends to improve, so the model selects both z_1 and z_2 more frequently.
Ridge, lasso, and elastic net are the regularization methods that cannot satisfy our theorems. First, note that ridge always selects all covariates, so its rates are unchanged whichever estimation method is used. For the penalized PL with lasso and elastic net regularization, the estimators tend to have quite large values of FPR, meaning that they wrongly keep the noisy covariates rather frequently. When the penalized WPL is applied, we obtain a smaller FPR, but we suffer from a smaller TPR at the same time. This smaller TPR actually comes from the failure to select z_2, which has a smaller coefficient than z_1.
When we apply adaptive lasso, adaptive elastic net, SCAD, and MC+, we achieve better performance, especially for FPR, which is closer to zero, automatically improving the PPV. The adaptive elastic net (resp. elastic net) has a slightly larger FPR than the adaptive lasso (resp. lasso). Among all regularization methods considered in this paper, the adaptive lasso seems to outperform the others.
Considering Scenarios 1 and 2, we observe the best selection properties for the penalized PL combined with adaptive lasso. As the design gets more complex in Scenario 3, the penalized PL suffers from a much larger FPR, indicating that this method may not be able to cope with the complicated situation. However, when we use the penalized WPL, the properties seem more stable across the different designs of the simulation study. One more advantage of the penalized WPL is that we can remove almost all extra covariates. It is worth noticing that we may suffer from a smaller TPR when we apply the penalized WPL, but we only lose the less informative covariate. From Tables 4 and 5, when faced with a complex situation, we would recommend the penalized WPL method with the adaptive lasso penalty if the focus is on selection properties. Otherwise, the penalized PL combined with the adaptive lasso penalty is preferable.
Tables 6 and 7 give the prediction properties of the estimates in terms of biases (Bias), standard deviations (SD), and square roots of mean squared errors (RMSE), criteria defined in terms of Ê(β̂_j) and σ̂²_j, the empirical mean and variance of the estimates β̂_j, for j = 1, . . . , p, where p = 20 for Scenarios 1 and 2 and p = 15 for Scenario 3.
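Whatever the exact aggregation over j, each per-coefficient RMSE decomposes as RMSE_j² = bias_j² + σ̂²_j, a useful sanity check when computing such tables. A quick numerical illustration (with invented estimates, not the paper's simulation output):

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = 0.75                      # the true gradient effect beta_2
# 2000 hypothetical replicated estimates: biased by +0.2, noisy with sd 0.1
estimates = beta_true + 0.2 + 0.1 * rng.standard_normal(2000)

bias = estimates.mean() - beta_true                    # E^(beta_j) - beta_0j
sd = estimates.std()                                   # sigma^_j (ddof = 0)
rmse = np.sqrt(np.mean((estimates - beta_true) ** 2))  # sqrt of empirical MSE
```

The identity rmse² = bias² + sd² holds exactly (up to floating-point error) when the empirical variance is computed with denominator n, and it explains the trade-offs discussed below: the penalized WPL trades extra bias for a smaller SD.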
In general, the properties improve with larger values of κ and μ, due to weaker spatial dependence and larger sample size. For the oracle model, which contains only z_1 and z_2, the WPL estimates are more efficient than the PL estimates, particularly in the more clustered case, agreeing with the findings of Guan and Shen (2010) in the unregularized setting.
When the regularization methods are applied, the bias increases in general, especially for the penalized WPL method. The regularized WPL has a larger bias since this method fails to select z_2 more frequently. Furthermore, the weighted method seems to introduce extra bias even when no regularization is considered, as in the oracle model. For a weakly clustered process, the SD using the penalized WPL is similar to that of the penalized PL, which may be because the weaker dependence, represented by a larger κ, makes the weight surface w(·) closer to 1; however, a larger RMSE is obtained for the penalized WPL. For the more clustered process, we obtain a smaller SD using the penalized WPL, which explains why in some cases (mainly Scenario 3) the RMSE gets smaller.
For the ridge method, the bias is closest to that of the oracle model, but it has the largest SD. Among the regularization methods, the adaptive lasso has the best performance in terms of prediction. Considering Scenarios 1 and 2, we obtain the best properties when we apply the penalized PL with the adaptive lasso penalty. As the design gets much more complex in Scenario 3, when we use the penalized PL with adaptive lasso, the SD is doubled and even quadrupled due to the over-selection of many unimportant covariates. In particular, for the more clustered process, better properties are even obtained by applying the regularized WPL combined with adaptive lasso. From Tables 6 and 7, when the focus is on prediction properties, we would recommend applying the penalized WPL combined with the adaptive lasso penalty when the observed point pattern is very clustered and the covariates have a complex covariance structure. Otherwise, the penalized PL combined with the adaptive lasso penalty is more favorable. Our recommendations in terms of prediction agree with those in terms of selection.

Logistic regression
Our concern here is to compare the regularized Poisson estimator to the regularized logistic estimator with different numbers of dummy points. Recall that the number of dummy points comes up when we discretize the integral terms in (2.3) and (2.4). We consider three different numbers of dummy points, denoted by nd², corresponding to three situations: (a) nd² < m, (b) nd² ≈ m, and (c) nd² > m, where m is the number of points. In the following, m ≈ 1600 and nd² = 400, 1600, and 6400. Note that the default choice for the Poisson likelihood in spatstat corresponds to case (c). Baddeley et al. (2014) show that for datasets with a very large number of points and for very structured point processes, the logistic likelihood method is clearly preferable, as it requires a smaller number of dummy points to perform quickly and efficiently. We want to investigate a similar comparison when these methods are regularized.
We repeat the results only for κ = 5 × 10⁻⁵ and μ = 1600, and for Scenarios 2 and 3. We use the same selection and prediction indices examined in Section 4.2 and consider only the adaptive lasso method. Table 8 presents selection properties for the regularized Poisson and logistic estimators using adaptive lasso regularization. For unweighted versions of the procedures, the regularized logistic method outperforms the regularized Poisson method when nd = 20, i.e. when the number of dummy points is much smaller than the number of points. When nd² ≈ m or nd² > m, the two methods have similar performances. For weighted versions of the methods, the results change little with nd, and the regularized Poisson likelihood slightly outperforms the regularized logistic likelihood. In addition, for Scenario 3, which considers a more complex situation, the methods select the noisy covariates much more frequently.
Empirical biases, standard deviations and square roots of mean squared errors are presented in Table 9. We include all empirical results for the standard Poisson and logistic estimates (i.e. without regularization). Let us first consider the unweighted methods with no regularization. The logistic method clearly has a smaller bias, especially when nd = 20, which explains why in most situations its RMSE is smaller. However, among the weighted methods, although the logistic method has a smaller bias in general, it produces a much larger SD, leading to a larger RMSE in all cases. Comparing the weighted and unweighted logistic estimates, in general, not only do we fail to reduce the SD, but we also incur a larger bias. When adaptive lasso regularization is combined with the unweighted methods, we can in general preserve the bias while improving the SD, and hence the RMSE; the logistic likelihood method slightly outperforms the Poisson likelihood method. When the weighted methods are considered, we obtain a smaller SD but a larger bias. For weighted versions of the Poisson and logistic likelihoods, the results change little with nd, and the weighted Poisson method slightly outperforms the weighted logistic method. From Tables 8 and 9, when the number of dummy points can be chosen so that nd² ≈ m or nd² > m, we recommend applying the Poisson likelihood method. When the number of dummy points must satisfy nd² < m, the logistic likelihood method is more favorable. Our recommendations regarding weighted versus unweighted methods follow those of Section 4.2.

Application to forestry datasets
In a 50-hectare region (D = 1,000m × 500m) of the tropical moist forest of Barro Colorado Island (BCI) in central Panama, censuses have been carried out in which all free-standing woody stems at least 10 mm diameter at breast height were identified, tagged, and mapped, resulting in maps of over 350,000 individual trees of more than 300 species (see Condit, 1998; Hubbell et al., 1999, 2005). It is of interest to understand how such a high number of different tree species can coexist, profiting from different habitats determined by e.g. topography or soil properties (see e.g. Waagepetersen, 2007; Waagepetersen and Guan, 2009). In particular, our main concern is the selection of covariates, among topographical attributes and soil minerals, together with the estimation of their coefficients.
We are particularly interested in analyzing the locations of 3,604 Beilschmiedia pendula Lauraceae (BPL) tree stems. We model the intensity of BPL trees as a log-linear function of two topographical attributes and 13 soil properties as covariates. Figure 2 contains maps of the locations of BPL trees, elevation, slope, and concentration of Phosphorus. BPL trees seem to appear in greater abundance in areas of high elevation, steep slope, and low concentration of Phosphorus. The covariate maps are depicted in Figure 4.
We apply the regularized Poisson and logistic likelihoods, combined with adaptive lasso regularization, to select and estimate parameters. Since the dataset does not contain a very large number of points, we can use the default number of dummy points of the Poisson likelihood in the spatstat package, i.e. a number of dummy points larger than the number of points, and still perform quickly and efficiently. It is worth emphasizing that we center and scale the 15 covariates in order to observe which ones have the largest effect on the intensity. The results are presented in Table 10: 12 covariates for the Poisson likelihood and 11 for the logistic method are selected out of the 15 covariates using the unweighted methods, while only 5 covariates (for both the Poisson and logistic methods) are selected using the weighted versions. The unweighted methods tend to overfit the model by over-selecting unimportant covariates.

Fig 2. Maps of locations of BPL trees (top left), elevation (top right), slope (bottom left), and concentration of Phosphorus (bottom right).
The weighted methods tend to discard the uninformative covariates. The Poisson and logistic estimates yield similar selection and estimation results. First, we find some differences in estimation between the unweighted and weighted methods, especially for slope and Manganese (Mn), for which the weighted methods give approximately two times larger estimates. Second, we may lose some nonzero covariates when applying the weighted methods, although only covariates with relatively small coefficients. Boron (B) has a high correlation with many of the other covariates, particularly with those that are not selected. This is possibly why Boron, which is selected and may have a nonnegligible coefficient under the unweighted methods, is not chosen in the model, and may explain why the weighted methods introduce extra bias. However, since the situation appears to be quite close to Scenario 3 of the simulation study, the weighted methods are more favorable in terms of both selection and prediction.
In this application, we do not face any computational problem. Nevertheless, if we have to model a species of trees with much more points, the default value for nd will lead to numerical problems. In such a case, the logistic likelihood would be a good alternative.
These results suggest that BPL trees favor living in areas of higher elevation and slope. Further, higher levels of Manganese (Mn) and lower levels of both Phosphorus (P) and Zinc (Zn) concentrations in soil are associated with a higher occurrence of BPL trees.

Conclusion and discussion
We develop regularized versions of estimating equations based on Campbell theorem derived from the Poisson and the logistic regression likelihoods. Our procedure performs covariate selection for modeling the intensity of spatial point processes. Furthermore, it is generally easy to implement in R, since it only requires combining the spatstat package with the glmnet and ncvreg packages. We study the asymptotic properties of both regularized weighted Poisson and logistic estimates in terms of consistency, sparsity, and asymptotic normality. We find that, among the regularization methods considered in this paper, adaptive lasso, adaptive elastic net, SCAD, and MC+ satisfy our theorems. We consider several scenarios in a simulation study to assess the selection and prediction properties of the estimates. We compare the penalized Poisson likelihood (PL) and the penalized weighted Poisson likelihood (WPL) with different penalty functions. From the results, when we deal with covariates having a complex covariance matrix and when the point pattern looks quite clustered, we recommend applying the penalized WPL combined with adaptive lasso regularization. Otherwise, the regularized PL with the adaptive lasso is preferable.
Further, more careful investigation of the choice of tuning parameters may be needed to improve the selection properties. We note that the bias increases quite significantly when the regularized WPL is applied. When the penalized WPL is considered, a two-step procedure may be needed to improve the prediction properties: (1) use the penalized WPL combined with the adaptive lasso to select the covariates, then (2) use the selected covariates to obtain the estimates. This post-selection inference procedure has not been investigated in this paper.
We also compare the estimates obtained from the Poisson and the logistic likelihoods. When the number of dummy points can be chosen to be either similar to or larger than the number of points, we recommend the use of the Poisson likelihood method. Nevertheless, when the number of dummy points should be chosen to be smaller than the number of points, the logistic method is more favorable.
Further work would consist in studying the situation where the number of covariates is much larger than the sample size. In such a situation, the coordinate descent algorithm used in this paper may run into numerical trouble. The Dantzig selector introduced by Candes and Tao (2007) might be a good alternative, as its implementation for linear models (and for generalized linear models) reduces to a linear program. It would be interesting to bring this approach to the spatial point process setting.
Another direction could consist in extending the intensity model itself for more flexibility, for instance using single-index type models. Such models have already been proposed for spatial point processes by Fang and Loh (2017) with a moderate number of covariates. Following e.g. Zhu et al. (2011), combining such models with regularization techniques for inhomogeneous spatial point processes seems feasible. Kernel-type regression methods also appear to be an interesting perspective, and the work by Crawford et al. (2018) could serve as a basis to investigate such methods for spatial point process feature selection problems.

Appendix A: Parametric intensity estimation
One of the standard ways to fit models to data is by maximizing the likelihood of the model for the data. While the maximum likelihood method is feasible for parametric Poisson point process models (Appendix A.1), computationally intensive Markov chain Monte Carlo (MCMC) methods are needed otherwise (Møller and Waagepetersen, 2004). As MCMC methods are not yet straightforward to implement, estimating equations based on Campbell theorem have been developed (see e.g. Waagepetersen, 2007; Møller and Waagepetersen, 2007; Waagepetersen, 2008; Guan and Shen, 2010; Baddeley et al., 2014). We review the estimating equations derived from the Poisson likelihood in Appendices A.2-A.3 and from the logistic regression likelihood in Appendix A.4.

A.1. Maximum likelihood estimation
For an inhomogeneous Poisson point process with intensity function ρ parameterized by β, the likelihood function is
\[ L(\beta) = \exp\Big(|D| - \int_D \rho(u;\beta)\,\mathrm{d}u\Big) \prod_{u \in X \cap D} \rho(u;\beta), \]
and the log-likelihood function of β is
\[ \ell(\beta) = \sum_{u \in X \cap D} \log \rho(u;\beta) - \int_D \rho(u;\beta)\,\mathrm{d}u, \tag{A.1} \]
where we have omitted the constant term \(\int_D 1\,\mathrm{d}u = |D|\). As the intensity function has the log-linear form (1.1), (A.1) reduces to
\[ \ell(\beta) = \sum_{u \in X \cap D} \beta^\top z(u) - \int_D \exp\big(\beta^\top z(u)\big)\,\mathrm{d}u. \]
Rathbun and Cressie (1994) show that the maximum likelihood estimator is consistent, asymptotically normal and efficient as the sample region goes to \(\mathbb{R}^d\).
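As an illustration, the log-linear Poisson log-likelihood above can be evaluated numerically by discretizing the integral over D on a regular grid. The following Python sketch is ours (function and variable names are assumptions, not from the paper), with covariates given at the observed points and at grid cells:

```python
import numpy as np

def poisson_loglik(beta, points_z, grid_z, cell_area):
    """Log-likelihood of an inhomogeneous Poisson process with
    log-linear intensity rho(u) = exp(beta' z(u)).

    points_z : (m, p) covariates at the m observed points
    grid_z   : (M, p) covariates at M grid cells discretizing D
    cell_area: area of one grid cell (M * cell_area ~ |D|)
    """
    # sum over observed points of log rho(u; beta) = beta' z(u)
    term1 = (points_z @ beta).sum()
    # Riemann approximation of the integral of rho(u; beta) over D
    integral = cell_area * np.exp(grid_z @ beta).sum()
    return term1 - integral
```

In the homogeneous case (a single constant covariate), the two terms reduce to m·β − |D|·exp(β), which is a quick sanity check of the discretization.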

A.2. Poisson likelihood
Let β₀ be the true parameter vector. The score function, i.e. the gradient vector of ℓ(β), denoted ℓ⁽¹⁾(β), is
\[ \ell^{(1)}(\beta) = \sum_{u \in X \cap D} z(u) - \int_D z(u)\,\rho(u;\beta)\,\mathrm{d}u. \]
By applying Campbell theorem (2.1), we have
\[ \mathrm{E}\,\ell^{(1)}(\beta_0) = \int_D z(u)\,\rho(u;\beta_0)\,\mathrm{d}u - \int_D z(u)\,\rho(u;\beta_0)\,\mathrm{d}u = 0, \]
so the score is an unbiased estimating equation. The properties of the Poisson estimator have been carefully studied. Schoenberg (2005) shows that the Poisson estimator is still consistent for a class of spatio-temporal point process models. Asymptotic normality for a fixed observation domain is obtained by Waagepetersen (2007), while Guan and Loh (2007) establish asymptotic normality under an increasing domain assumption and for suitable mixing point processes.
Regarding the parameter ψ (see Appendix B.2), Waagepetersen and Guan (2009) study a two-step procedure to estimate both β and ψ, and prove that, under certain mixing conditions, the parameter estimates (β̂, ψ̂) are consistent and asymptotically normal.

A.3. Weighted Poisson likelihood
Although the estimating equation approach derived from the Poisson likelihood is simpler and faster to implement than maximum likelihood estimation, it potentially produces a less efficient estimate (Waagepetersen, 2007; Guan and Shen, 2010) because information about the interaction of events is ignored. To regain some of this efficiency, Guan and Shen (2010) propose a weighted Poisson log-likelihood function given by
\[ \ell(w;\beta) = \sum_{u \in X \cap D} w(u) \log \rho(u;\beta) - \int_D w(u)\,\rho(u;\beta)\,\mathrm{d}u, \tag{A.2} \]
where w(·) is a weight surface. Looking at (A.2), we see that a larger weight w(u) makes the observations in the infinitesimal region du more influential. By Campbell theorem, the score ℓ⁽¹⁾(w;β) is still an unbiased estimating equation. In addition, Guan and Shen (2010) prove that, under some conditions, the parameter estimates are consistent and asymptotically normal. Guan and Shen (2010) show that a weight surface w(·) that minimizes the trace of the asymptotic variance-covariance matrix of the estimates maximizing (A.2) can result in more efficient estimates than the Poisson estimator. In particular, the proposed weight surface is
\[ w(u) = \{1 + \rho(u)\,f(u)\}^{-1}, \qquad f(u) = \int_D \{g(v-u) - 1\}\,\mathrm{d}v, \]
where g is the pair correlation function. For a Poisson point process, note that f(u) = 0 and hence w(u) = 1, which reduces to maximum likelihood estimation. For general point processes, the weight surface depends on both the intensity function and the pair correlation function, and thus incorporates information on both the inhomogeneity and the dependence of the spatial point process. When clustering is present, so that g(v−u) > 1, then f(u) > 0 and hence the weight decreases with ρ(u). The weight surface can be estimated by setting ŵ(u) = {1 + ρ̂(u)f̂(u)}⁻¹. To obtain the estimate ρ̂(u), β is substituted by the Poisson estimate β̂, that is, ρ̂(u) = ρ(u;β̂). Alternatively, ρ(u) can also be computed nonparametrically by a kernel method.
Furthermore, Guan and Shen (2010) suggest approximating f(u) by K(r) − πr², where K(·) is Ripley's K-function, estimated by its standard nonparametric estimator. Guan et al. (2015) extend the study by Guan and Shen (2010) and consider more complex estimating equations. Specifically, w(u)z(u) is replaced by a function h(u;β) in the derivative of (A.2) with respect to β. The procedure results in a slightly more efficient estimate than the one obtained from (A.2). However, the computational cost is higher and, since we combine estimating equations and penalization methods (see Section 2.3), we have not considered this extension.
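With this approximation, the estimated weight surface takes the simple closed form ŵ(u) = {1 + ρ̂(u)(K̂(r) − πr²)}⁻¹. A minimal Python sketch (function names are ours, for illustration only):

```python
import numpy as np

def guan_shen_weights(rho_hat, K_hat, r):
    """Estimated weight surface w(u) = {1 + rho(u) f(u)}^{-1} of
    Guan and Shen (2010), with f(u) approximated by K(r) - pi r^2.

    rho_hat : intensity estimates at the quadrature points
    K_hat   : estimate of Ripley's K-function at distance r
    """
    f_hat = K_hat - np.pi * r ** 2
    return 1.0 / (1.0 + rho_hat * f_hat)
```

For a Poisson process K(r) = πr², so f̂ = 0 and all weights equal 1; for a clustered process K̂(r) > πr², and points in high-intensity regions are down-weighted.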

A.4. Logistic regression likelihood
Although the estimating equations discussed in Appendices A.2 and A.3 are unbiased, their practical implementations do not, in general, produce unbiased estimators. Waagepetersen (2008) and Baddeley et al. (2014) propose another estimating function, close to the score of the Poisson log-likelihood, which yields a less biased estimator than the Poisson estimates. In addition, their proposed estimating equation is in fact the derivative of a logistic regression likelihood.
Following Baddeley et al. (2014), we define the weighted logistic regression log-likelihood function by
\[ \ell(w;\beta) = \sum_{u \in X \cap D} w(u) \log \frac{\rho(u;\beta)}{\rho(u;\beta)+\delta(u)} + \int_D w(u)\,\delta(u) \log \frac{\delta(u)}{\rho(u;\beta)+\delta(u)}\,\mathrm{d}u, \tag{A.3} \]
where δ(u) is a nonnegative real-valued function. Its role, as well as an explanation of the name 'logistic method', is given in Appendix C.2. Note that the score of (A.3) is an unbiased estimating equation. Waagepetersen (2008) shows asymptotic normality, for Poisson and some clustered point processes, of the estimator obtained from a similar procedure. The methodology and results are further studied by Baddeley et al. (2014) for spatial Gibbs point processes.
To determine the optimal weight surface w(·) for the logistic method, we follow Guan and Shen (2010) and minimize the trace of the asymptotic covariance matrix of the estimates. The resulting weight surface depends on ρ(u) and f(u), which can be estimated as in Appendix A.3.

Appendix B: Examples of spatial point processes models with prescribed intensity function
We discuss spatial point process models specified by a deterministic or random intensity function. In particular, we consider two important model classes, namely Poisson and Cox processes. Poisson point processes serve as a tractable model class for no interaction, or complete spatial randomness. Cox processes form major classes for clustering or aggregation. For conciseness, we focus on these two classes of models. We could also have presented determinantal point processes (e.g. Lavancier et al., 2015), which constitute an interesting class of repulsive point patterns with explicit moments; this has not been further investigated for the sake of brevity. In this paper, we focus on log-linear models of the intensity function given by (1.1).

B.1. Poisson point process
A point process X on D is a Poisson point process with intensity function ρ, assumed to be locally integrable, if the following conditions are satisfied:
1. for any B ⊆ D with 0 ≤ μ(B) < ∞, where μ(B) = ∫_B ρ(u)du, N(B) ∼ Poisson(μ(B));
2. conditionally on N(B), the points in X ∩ B are i.i.d. with joint density proportional to ρ(u), u ∈ B.
A Poisson point process with a log-linear intensity function is also called a modulated Poisson point process (e.g. Møller and Waagepetersen, 2007; Waagepetersen, 2008). In particular, for Poisson point processes, the pair correlation function satisfies g ≡ 1.
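The definition above also suggests a simple simulation recipe via thinning (Lewis-Shedler): generate a homogeneous Poisson process at a dominating rate ρ_max and retain each point u independently with probability ρ(u)/ρ_max. A minimal Python sketch (names are ours, for illustration only):

```python
import numpy as np

def simulate_poisson(rho, rho_max, width, height, rng):
    """Simulate an inhomogeneous Poisson process on [0,width] x [0,height]
    by thinning a homogeneous process of rate rho_max.
    rho must satisfy rho(x, y) <= rho_max everywhere."""
    n = rng.poisson(rho_max * width * height)  # dominating homogeneous count
    x = rng.uniform(0.0, width, n)
    y = rng.uniform(0.0, height, n)
    # retain point u with probability rho(u) / rho_max
    keep = rng.uniform(0.0, rho_max, n) < rho(x, y)
    return x[keep], y[keep]
```

Thinning is exact (no discretization error) but wasteful when ρ is far below ρ_max over most of the window.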

B.2. Cox processes
A Cox process is a natural extension of a Poisson point process, obtained by considering the intensity function of the Poisson point process as a realization of a random field. Suppose that Λ = {Λ(u) : u ∈ D} is a nonnegative random field. If the conditional distribution of X given Λ is a Poisson point process on D with intensity function Λ, then X is said to be a Cox process driven by Λ (see e.g. Møller and Waagepetersen, 2004). There are several types of Cox processes. Here, we consider two types of Cox processes: a Neyman-Scott point process and a log Gaussian Cox process.

Neyman-Scott point processes.
Let C be a stationary Poisson process (mother process) with intensity κ > 0. Given C, let X_c, c ∈ C, be independent Poisson processes (offspring processes) with intensity function
\[ \rho_c(u;\beta) = \exp\big(\beta^\top z(u)\big)\, k(u-c;\omega)/\kappa, \]
where k is a probability density function, parameterized by ω, determining the distribution of offspring points around the mother points. Then X = ∪_{c∈C} X_c is a special case of an inhomogeneous Neyman-Scott point process with mothers C and offspring X_c, c ∈ C. The point process X is a Cox process driven by
\[ \Lambda(u) = \exp\big(\beta^\top z(u)\big) \sum_{c \in C} k(u-c;\omega)/\kappa \]
(e.g. Waagepetersen, 2007; Coeurjolly and Møller, 2014), and we can verify that the intensity function of X is indeed ρ(u;β) = exp(β⊤z(u)).
One example of a Neyman-Scott point process is the Thomas process, where
\[ k(u;\omega) = (2\pi\omega^2)^{-d/2} \exp\big(-\|u\|^2/(2\omega^2)\big) \]
is the density of N_d(0, ω²I_d). Conditionally on a parent event at location c, children events are normally distributed around c. Smaller values of ω correspond to tighter clusters, and smaller values of κ correspond to fewer parents. The parameter vector ψ = (κ, ω) is referred to as the interaction parameter, as it modulates the spatial interaction (or dependence) among events.
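The homogeneous Thomas process can be simulated directly from this construction: Poisson mothers, a Poisson number of offspring per mother, Gaussian displacements. A Python sketch under these assumptions (edge effects from mothers just outside the window are ignored for brevity; names are ours):

```python
import numpy as np

def simulate_thomas(kappa, mu, omega, width, height, rng):
    """Thomas process: stationary Poisson(kappa) mothers, each with a
    Poisson(mu) number of offspring displaced by N(0, omega^2 I_2)."""
    n_mothers = rng.poisson(kappa * width * height)
    cx = rng.uniform(0.0, width, n_mothers)
    cy = rng.uniform(0.0, height, n_mothers)
    xs, ys = [], []
    for mx, my in zip(cx, cy):
        n_off = rng.poisson(mu)
        xs.append(mx + rng.normal(0.0, omega, n_off))
        ys.append(my + rng.normal(0.0, omega, n_off))
    x = np.concatenate(xs) if xs else np.empty(0)
    y = np.concatenate(ys) if ys else np.empty(0)
    # keep offspring that fall inside the observation window
    inside = (x >= 0) & (x <= width) & (y >= 0) & (y <= height)
    return x[inside], y[inside]
```

Decreasing omega tightens the clusters around their mothers, while decreasing kappa yields fewer (and hence more visually distinct) clusters, matching the role of ψ = (κ, ω) described above.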
Log Gaussian Cox process. Suppose that log Λ is a Gaussian random field. Given Λ, the point process X follows a Poisson process; X is then said to be a log Gaussian Cox process driven by Λ (Møller and Waagepetersen, 2004). We consider the case where the random intensity function can be written as
\[ \Lambda(u) = \exp\big(\beta^\top z(u) + \phi(u)\big), \]
where φ is a zero-mean stationary Gaussian random field with covariance function c(u,v;ψ) = σ²R(v−u;ζ), which depends on the parameter ψ = (σ², ζ) (Møller and Waagepetersen, 2007; Coeurjolly and Møller, 2014). The intensity function of this log Gaussian Cox process is indeed given by ρ(u;β) = exp(β⊤z(u)).
One example of a correlation function is the exponential form (e.g. Waagepetersen and Guan, 2009)
\[ R(v-u;\zeta) = \exp\big(-\|v-u\|/\zeta\big). \]
Here, ψ = (σ², ζ) constitutes the interaction parameter vector, where σ² is the variance and ζ is the correlation scale parameter.
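To make the construction concrete, the following Python sketch builds the exponential covariance matrix on a set of locations and draws one realization of Λ(u) = exp(β⊤z(u) + φ(u)); function names are ours, and β⊤z(u) is passed precomputed at each location:

```python
import numpy as np

def exp_covariance(coords, sigma2, zeta):
    """Exponential covariance c(u, v) = sigma^2 exp(-||v - u|| / zeta)
    evaluated at all pairs of rows of coords (n, 2)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return sigma2 * np.exp(-dist / zeta)

def simulate_lgcp_field(coords, beta_z, sigma2, zeta, rng):
    """One realization of Lambda(u) = exp(beta'z(u) + phi(u)) at coords,
    where phi is a zero-mean Gaussian field with exponential covariance."""
    C = exp_covariance(coords, sigma2, zeta)
    phi = rng.multivariate_normal(np.zeros(len(coords)), C)
    return np.exp(beta_z + phi)
```

Given such a realization of Λ on a fine grid, points of the LGCP can then be generated as an inhomogeneous Poisson process with that intensity.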

Appendix C: Numerical methods
We present numerical aspects in this section. For nonregularized estimation, we consider two approaches. Weighted Poisson regression is explained in Appendix C.1, while logistic regression is reviewed in Appendix C.2. Penalized estimation is performed using a coordinate descent algorithm (Appendix C.3). We treat convex and non-convex penalties separately in Appendices C.3.1 and C.3.2.

C.1. Weighted Poisson regression

Berman and Turner (1992) develop a numerical quadrature method to approximate maximum likelihood estimation for an inhomogeneous Poisson point process. They approximate the likelihood by a finite sum that has the same analytical form as the weighted likelihood of a generalized linear model with Poisson response. This method is extended to Gibbs point processes by Baddeley and Turner (2000). Suppose we approximate the integral term in (A.1) by the Riemann sum
\[ \int_D \rho(u;\beta)\,\mathrm{d}u \approx \sum_{i=1}^{M} v_i\, \rho(u_i;\beta), \]
where u_i, i = 1, …, M, are points in D consisting of the m data points and M − m dummy points. The quadrature weights v_i > 0 are such that Σ_i v_i = |D|. To implement this method, the domain is first partitioned into M rectangular pixels of equal area, denoted by a, and one dummy point is placed at the center of each pixel. Let Δ_i indicate whether u_i is an event of the point process (Δ_i = 1) or a dummy point (Δ_i = 0). Without loss of generality, let u_1, …, u_m be the observed events and u_{m+1}, …, u_M be the dummy points. The Poisson log-likelihood (A.1) can thus be approximated and rewritten as
\[ \ell(\beta) \approx \sum_{i=1}^{M} v_i \big( y_i \log \rho(u_i;\beta) - \rho(u_i;\beta) \big), \qquad y_i = \Delta_i / v_i, \tag{C.1} \]
and its weighted version (A.2) as
\[ \ell(w;\beta) \approx \sum_{i=1}^{M} w_i v_i \big( y_i \log \rho(u_i;\beta) - \rho(u_i;\beta) \big), \tag{C.2} \]
where w_i is the value of the weight surface at u_i. The estimate ŵ_i is obtained as suggested by Guan and Shen (2010). The similarity of (C.1) and (C.2) to the weighted log-likelihood of independent Poisson responses y_i allows us to compute the estimates using software for generalized linear models. This fact is in particular exploited in the ppm function of the spatstat R package (Baddeley and Turner, 2005; Baddeley et al., 2015) with option method="mpl". To make the presentation more general, the number of dummy points is denoted by nd² in the next sections.
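As a numerical sketch of the Berman-Turner device, the approximation can be evaluated directly in Python, assuming the standard form with pseudo-responses y_i = Δ_i/v_i (the function name is ours):

```python
import numpy as np

def berman_turner_loglik(beta, Z, delta, v, w=None):
    """Berman-Turner approximation of the (weighted) Poisson
    log-likelihood at quadrature points.

    Z     : (M, p) covariates at the m data + (M - m) dummy points
    delta : (M,) indicator, 1 for data points, 0 for dummy points
    v     : (M,) quadrature weights with sum(v) = |D|
    w     : (M,) optional weight surface values (defaults to 1)
    """
    if w is None:
        w = np.ones_like(v)
    rho = np.exp(Z @ beta)
    y = delta / v            # pseudo Poisson responses
    # for dummy points y = 0, so y * log(rho) contributes 0
    return np.sum(w * v * (y * np.log(rho) - rho))
```

Since the summand has the form of a Poisson log-density in y_i with mean ρ(u_i;β) and prior weight w_i v_i, maximizing it is exactly a weighted Poisson GLM fit, which is what spatstat delegates to GLM software.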

C.2. Logistic regression
To perform well, the Berman-Turner approximation often requires a large number of dummy points. Fitting such generalized linear models can hence be computationally intensive, especially when dealing with a large number of points. Moreover, when the unbiased estimating equations are approximated using a deterministic numerical approximation as in Appendix C.1, the resulting estimator is not unbiased in general. To achieve an unbiased estimator, we estimate (A.3) by
\[ \ell(w;\beta) \approx \sum_{u \in X \cap D} w(u) \log \frac{\rho(u;\beta)}{\rho(u;\beta)+\delta(u)} + \sum_{u \in \mathcal{D}} w(u) \log \frac{\delta(u)}{\rho(u;\beta)+\delta(u)}, \tag{C.3} \]
where 𝒟 is a dummy point process, independent of X, with intensity function δ. The form (C.3) is related to the estimating equation defined by Baddeley et al. (2014, eq. 7). Moreover, applying Campbell theorem to the last term of (C.3), we obtain
\[ \mathrm{E} \sum_{u \in \mathcal{D}} w(u) \log \frac{\delta(u)}{\rho(u;\beta)+\delta(u)} = \int_D w(u)\,\delta(u) \log \frac{\delta(u)}{\rho(u;\beta)+\delta(u)}\,\mathrm{d}u, \]
which is exactly the last term of (A.3). In addition, conditional on X ∪ 𝒟, (C.3) is the weighted likelihood function for Bernoulli trials y(u) = 1{u ∈ X}, u ∈ X ∪ 𝒟, with
\[ \mathrm{P}\big(y(u)=1\big) = \frac{\rho(u;\beta)}{\rho(u;\beta)+\delta(u)}. \]
Precisely, (C.3) is a weighted logistic regression with offset term −log δ. Thus, parameter estimates can be straightforwardly obtained using standard software for generalized linear models. This approach is provided in the spatstat package in R by calling the ppm function with option method="logi" (Baddeley et al., 2014, 2015). In spatstat, the dummy point process 𝒟 generates nd² points on average in D from a Poisson, binomial, or stratified binomial point process. Baddeley et al. (2014) suggest choosing δ(u) = 4m/|D|, where m is the number of points (so nd² = 4m). This choice can be considered a starting point for a data-driven approach to determine δ (see Baddeley et al., 2014, for further details).
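Conditional on the superposition of data and dummy points, the computation thus reduces to a weighted Bernoulli log-likelihood with success probability ρ/(ρ+δ), i.e. a logistic regression with offset −log δ. A Python sketch under this assumption (function name is ours):

```python
import numpy as np

def logistic_loglik(beta, Z, is_data, delta, w=None):
    """Weighted logistic-regression log-likelihood: conditional on the
    superposition of data and dummy points, y(u) = 1{u in X} is a
    Bernoulli trial with P(y = 1) = rho(u) / (rho(u) + delta(u)).

    Z      : (M, p) covariates at the data + dummy points
    is_data: (M,) 1 for data points of X, 0 for dummy points
    delta  : (M,) dummy intensity at each point
    """
    if w is None:
        w = np.ones(len(Z))
    eta = Z @ beta - np.log(delta)   # logistic link with offset -log(delta)
    p = 1.0 / (1.0 + np.exp(-eta))   # equals rho / (rho + delta)
    return np.sum(w * (is_data * np.log(p) + (1 - is_data) * np.log(1 - p)))
```

The identity sigmoid(log ρ − log δ) = ρ/(ρ+δ) is what turns the offset logistic regression into the Bernoulli model above.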

C.3. Coordinate descent algorithm
The LARS algorithm (Efron et al., 2004) is a remarkably efficient method for computing an entire path of lasso solutions. For linear models, its computational cost is of order O(Mp²), the same order as a least squares fit. The coordinate descent algorithm (Friedman et al., 2007, 2010) is an even more competitive algorithm, computing the regularization paths at a cost of O(Mp) operations. We therefore adopt cyclical coordinate descent methods, which are very fast on large datasets and can take advantage of sparsity. Coordinate descent algorithms optimize a target function with respect to a single parameter at a time, iteratively cycling through all parameters until a convergence criterion is reached. We detail this for some convex and non-convex penalty functions in the next two sections. We only present the coordinate descent algorithm for fitting the generalized weighted Poisson regression; a similar approach is used to fit the penalized weighted logistic regression.

C.3.1. Convex penalty functions
Since ℓ(w;β) given by (C.2) is a concave function of the parameters, the Newton-Raphson algorithm used to maximize the penalized log-likelihood can be implemented via iteratively reweighted least squares (IRLS). If the current estimate of the parameters is β̃, we construct a quadratic approximation of the weighted Poisson log-likelihood using a Taylor expansion:
\[ \ell_Q(w;\beta) = -\frac{1}{2} \sum_{i=1}^{M} \nu_i \big(y_i^* - z_i^\top \beta\big)^2 + C(\tilde\beta), \tag{C.4} \]
where C(β̃) is a constant, the y_i* are the working responses and the ν_i are the weights,
\[ y_i^* = z_i^\top \tilde\beta + \frac{y_i - \rho(u_i;\tilde\beta)}{\rho(u_i;\tilde\beta)}, \qquad \nu_i = w_i v_i\, \rho(u_i;\tilde\beta). \]
The regularized Poisson linear model works by first identifying a decreasing sequence of λ ∈ [λ_min, λ_max], starting from the smallest value λ_max for which the entire vector β̂ = 0. For each value of λ, an outer loop computes ℓ_Q(w;β) at β̃. Second, a coordinate descent method is applied to solve the penalized weighted least squares problem
\[ \min_{\beta}\ \Omega(\beta), \qquad \Omega(\beta) = -\ell_Q(w;\beta) + \sum_{j=1}^{p} p_{\lambda}\big(|\beta_j|\big). \tag{C.5} \]
The coordinate descent method works as follows. Suppose we have estimates β̃_l for l ≠ j, l, j = 1, …, p. The method consists in partially optimizing (C.5) with respect to β_j, that is, min_{β_j} Ω(β̃_1, …, β̃_{j−1}, β_j, β̃_{j+1}, …, β̃_p). Friedman et al. (2007) provide the form of the coordinate-wise update for several penalized regression estimators. For instance, the coordinate-wise update for the elastic net, which embraces the ridge and lasso regularizations by setting γ to 0 or 1 respectively, is
\[ \tilde\beta_j \leftarrow \frac{S\Big( \sum_{i=1}^{M} \nu_i z_{ij} \big(y_i^* - \tilde y_i^{(j)}\big),\ \lambda \gamma \Big)}{\sum_{i=1}^{M} \nu_i z_{ij}^2 + \lambda (1-\gamma)}, \tag{C.6} \]
where \(\tilde y_i^{(j)} = \tilde\beta_0 + \sum_{l \neq j} z_{il} \tilde\beta_l\) is the fitted value excluding the contribution from covariate z_{ij}, and S(z,λ) is the soft-thresholding operator with value
\[ S(z,\lambda) = \operatorname{sign}(z)\,(|z| - \lambda)_+ = \begin{cases} z - \lambda & \text{if } z > \lambda, \\ 0 & \text{if } |z| \le \lambda, \\ z + \lambda & \text{if } z < -\lambda. \end{cases} \tag{C.7} \]
The update (C.6) is repeated for j = 1, …, p until convergence. The coordinate descent algorithm for several convex penalties is implemented in the R package glmnet (Friedman et al., 2010). In (C.6), we can set γ = 0 to implement ridge and γ = 1 for the lasso, while 0 < γ < 1 gives the elastic net regularization.
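The inner loop can be sketched in a few lines of Python. This toy implementation of cyclic coordinate descent for the penalized weighted least squares problem is ours for illustration; it omits the intercept, standardization, and the active-set and warm-start tricks used by glmnet:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lambda) = sign(z) (|z| - lambda)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cd_elastic_net(Z, y_star, nu, lam, gamma, n_iter=200):
    """Cyclic coordinate descent for penalized weighted least squares
    with an elastic-net penalty; gamma = 1 is the lasso, gamma = 0 ridge.
    nu are the IRLS weights, y_star the working responses."""
    M, p = Z.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding covariate j's contribution
            r_j = y_star - Z @ beta + Z[:, j] * beta[j]
            num = soft_threshold(np.sum(nu * Z[:, j] * r_j), lam * gamma)
            beta[j] = num / (np.sum(nu * Z[:, j] ** 2) + lam * (1 - gamma))
    return beta
```

With lam = 0 the update reduces to weighted least squares; increasing lam with gamma = 1 shrinks coefficients and sets small ones exactly to zero, which is the source of the sparsity exploited throughout the paper.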
For the adaptive lasso, we follow Zou (2006): take γ = 1 and replace λ by λ_j = λ/|β̃_j|^τ, where β̃ is an initial estimate, say β̂(ols) or β̂(ridge), and τ is a positive tuning parameter. To avoid the computational cost of choosing τ, we follow Zou (2006, Section 3.4) and Wasserman and Roeder (2009), who also consider τ = 1, and choose λ_j = λ/|β̂_j(ridge)|, where β̂(ridge) is the estimate obtained from ridge regression. Implementing the adaptive elastic net follows along similar lines.
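In practice, the direction-specific parameters λ_j are computed once from the initial ridge fit; in R they would typically be passed to glmnet through its penalty.factor argument. A small Python sketch of this computation (the function name and the eps safeguard are ours):

```python
import numpy as np

def adaptive_lasso_penalties(lam, beta_init, tau=1.0, eps=1e-8):
    """Direction-specific penalties lambda_j = lambda / |beta_init_j|^tau
    (Zou, 2006), with tau = 1 by default. Coefficients that are large in
    the initial fit are penalized lightly; near-zero ones heavily.
    eps guards against division by exactly zero initial estimates."""
    return lam / np.maximum(np.abs(beta_init) ** tau, eps)
```

This data-dependent reweighting is what restores the oracle property that the plain lasso lacks.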

C.3.2. Non-convex penalty functions
Breheny and Huang (2011) investigate the application of the coordinate descent algorithm to fit penalized generalized linear models using SCAD and MC+, for which the penalty is non-convex. Mazumder et al. (2011) also study coordinate-wise optimization algorithms in linear models, considering more general non-convex penalties. Mazumder et al. (2011) conclude that, for a known current estimate θ̃, the univariate penalized least squares function Q_u(θ) = ½(θ − θ̃)² + p_λ(|θ|) should be convex to ensure that the coordinate-wise procedure converges to a stationary point. They find that this turns out to be the case for the SCAD and MC+ penalties, but not for the bridge (or power) penalty and some cases of the log-penalty. Breheny and Huang (2011) derive the coordinate descent updates for SCAD and MC+ in the generalized linear model case, implemented in the ncvreg package of R. Let β̃_l, l ≠ j, l, j = 1, …, p, be the current estimates, and suppose we wish to partially optimize (C.5) with respect to β_j. Defining
\[ \tilde g_j = \sum_{i=1}^{M} \nu_i z_{ij} \big(y_i^* - \tilde y_i^{(j)}\big), \qquad \tilde\eta_j = \sum_{i=1}^{M} \nu_i z_{ij}^2, \]
the coordinate-wise update for SCAD is
\[ \tilde\beta_j \leftarrow \begin{cases} S(\tilde g_j, \lambda)/\tilde\eta_j & \text{if } |\tilde g_j| \le \lambda(\tilde\eta_j + 1), \\[2pt] S\big(\tilde g_j,\ \gamma\lambda/(\gamma-1)\big) \big/ \big(\tilde\eta_j - 1/(\gamma-1)\big) & \text{if } \lambda(\tilde\eta_j + 1) < |\tilde g_j| \le \lambda\gamma\tilde\eta_j, \\[2pt] \tilde g_j/\tilde\eta_j & \text{if } |\tilde g_j| > \lambda\gamma\tilde\eta_j, \end{cases} \]
for any γ > max_j(1 + 1/η̃_j). Then, for γ > max_j(1/η̃_j) and the same definitions of g̃_j and η̃_j, the coordinate-wise update for MC+ is
\[ \tilde\beta_j \leftarrow \begin{cases} S(\tilde g_j, \lambda) \big/ \big(\tilde\eta_j - 1/\gamma\big) & \text{if } |\tilde g_j| \le \lambda\gamma\tilde\eta_j, \\[2pt] \tilde g_j/\tilde\eta_j & \text{if } |\tilde g_j| > \lambda\gamma\tilde\eta_j, \end{cases} \]
where S(z,λ) is the soft-thresholding operator given by (C.7).
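The univariate solutions have simple closed forms. The following Python sketch gives the SCAD and MC+ threshold functions in the standardized case (η̃_j = 1), in the spirit of Breheny and Huang (2011) and Mazumder et al. (2011); it is our illustration, not the ncvreg implementation:

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding S(z, lambda) = sign(z) (|z| - lambda)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, gamma):
    """Univariate SCAD solution, standardized case; requires gamma > 2."""
    az = abs(z)
    if az <= 2 * lam:
        return soft(z, lam)                 # lasso-like near zero
    if az <= gamma * lam:
        return soft(z, gamma * lam / (gamma - 1)) / (1 - 1 / (gamma - 1))
    return z                                # large z left unpenalized

def mcp_threshold(z, lam, gamma):
    """Univariate MC+ (MCP) solution; requires gamma > 1."""
    az = abs(z)
    if az <= gamma * lam:
        return soft(z, lam) / (1 - 1 / gamma)
    return z
```

Both functions act like the lasso near zero but return z unchanged for |z| large, which is exactly the "unbiased for large coefficients" behavior that motivates these non-convex penalties.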

C.4. Selection of regularization or tuning parameter
It is worth noticing that coordinate descent procedures (and other procedures computing penalized likelihood estimates) rely on the tuning parameter λ, so that the choice of λ is also an important task. The elastic net combines the ℓ1 and ℓ2 penalties; it is particularly useful when the number of predictors is much larger than the number of observations, since it can select or eliminate strongly correlated predictors together. The lasso procedure suffers from nonnegligible bias and does not satisfy an oracle property asymptotically (Fan and Li, 2001). Fan and Li (2001) and Zhang (2010), among others, introduce non-convex penalties to get around these drawbacks. The idea is to bridge the gap between ℓ0 and ℓ1, trying to keep the estimates of the nonzero coefficients unbiased while shrinking the less important variables to exactly zero. The rationale behind non-convex penalties such as SCAD and MC+ can also be understood by considering their first derivatives (see Table 1): they start by applying a rate of penalization similar to the lasso, then continuously relax that penalization until it drops to zero. However, when employing non-convex penalties in regression analysis, the main challenge is often the minimization of a possibly non-convex objective function, when the non-convexity of the penalty is no longer dominated by the convexity of the likelihood. This issue has been carefully studied. Fan and Li (2001) propose the local quadratic approximation (LQA). Zou and Li (2008) propose a local linear approximation (LLA), which yields an objective function that can be optimized using the least angle regression (LARS) algorithm (Efron et al., 2004). Finally, Breheny and Huang (2011) and Mazumder et al. (2011) investigate the application of the coordinate descent algorithm to non-convex penalties.
In (2.5), it is worth emphasizing that we allow each direction to have a different regularization parameter. By doing so, the ℓ1 and elastic net penalty functions are extended to the adaptive lasso (e.g. Zou, 2006) and the adaptive elastic net (e.g. Zou and Zhang, 2009). Table 2 details the regularization methods considered in this study.
Proof. We now focus on the proof of Theorem 2. Since Theorem 2(i) is proved by Lemma 2, we only need to prove Theorem 2(ii), the asymptotic normality of β̂₁. As shown in Theorem 1, there is a root-|D_n| consistent local maximizer β̂ of Q_n(w;β), and it can be shown that there exists an estimator β̂₁ in Theorem 1 that is a root-|D_n| consistent local maximizer of Q_n(w; (β₁⊤, 0⊤)⊤), regarded as a function of β₁, and that satisfies
\[ \frac{\partial Q_n(w;\hat\beta)}{\partial \beta_j} = 0 \quad \text{for } j = 1, \ldots, s, \qquad \hat\beta = (\hat\beta_1^\top, \mathbf{0}^\top)^\top. \]