Sparse Poisson Regression with Penalized Weighted Score Function

We proposed a new penalized method in this paper to solve sparse Poisson Regression problems. Being different from $\ell_1$ penalized log-likelihood estimation, our new method can be viewed as penalized weighted score function method. We show that under mild conditions, our estimator is $\ell_1$ consistent and the tuning parameter can be pre-specified, which shares the same good property of the square-root Lasso.


Introduction
Poisson regression is a special generalized linear model (Nelder and Baker, 1972) which is widely used to model count data. Let (x 1 , y 1 ), · · · , (x n , y n ) be independent pairs of observed data which are realizations of random vector (X, Y ), with p-dimensional covariates X ∈ R p and univariate response variable Y ∈ R. (X, Y ) is assumed to satisfy the conditional distribution Y |X = x ∼ Poisson(µ(x)) with log(µ(x)) = x T β * , where β * ∈ R p is an unknown parameter vector to be estimated.
In this paper, we are concerned with a sparse Poisson regression problem when the number of covariates (or predictors) is much larger than the number of observations, i.e. p n, which is a variable selection (or model selection) problem for high-dimensional data. For linear models, now researchers have developed several methods such as Lasso (Tibshirani, 1996), adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001) and so on. Lasso is a very popular method not only due to its interpretability and prediction performance (Tibshirani, 1996), but also because it is a convex problem and can be computed easily and fast (Friedman et al., 2010). It is well known that when incoherent condition (or irrepresentable condition) holds, Lasso estimator is sign consistent (Zhao and Yu, 2006;Zou, 2006;Jia et al., 2013). When a restricted eigenvalue condition holds, Lasso estimator can be 2 consistent (Bickel et al., 2008). Because incoherent condition might not hold, more steps are used to relax this condition such as adaptive Lasso (Zou, 2006) and Puffer transformation (Jia and Rohe, 2015). Other non-convex penalized methods like SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) can also be used to study sparse models.
Variable selection problems for generalized linear models have also gained great attentions in recent years. For instance, Ravikumar et al. (2010) studied 1 regularized logistic regression models, while Li and Cevher (2015) analyzed the consistency of 1 penalized Poisson regression models. Moreover, Raginsky et al. (2010) studied the performance bounds for compressed sensing (CS) under Poisson models, Ivanoff et al. (2016) also considered a data-driven tuning method for sparse and structure sparse functional Poisson regression models.
It is now well known that the Lasso problem in linear regression models, a good tuning parameter choice depends on the unknown parameter σ 2 which is the homogeneous noise variance in linear models (Bickel et al., 2008). To solve this problem, Belloni et al. (2011) proposed squareroot Lasso, which alternatively replaces the original score function in (Bickel et al., 2008) by the square root of this function. In previous studies (Raginsky et al., 2010;Li and Cevher, 2015;Ivanoff et al., 2016), variable selections for sparse Poisson models are obtained via penalized loglikelihood methods, which have the same problem as the Lasso problem. Moreover, Poisson noises are not homogeneous any more, a unique penalty for all of the different coefficient is not a good choice (Ivanoff et al., 2016). In this paper, we propose a new penalized weighted score function method to study sparse Poisson regression, and show that it gives consistent estimator of the parameters in sparse Poisson models and provides a direct choice for the tuning parameter. The simulations show that our proposed method is much more robust than traditional sparse Poisson models using 1 penalized log-likelihood method.
The rest of the paper is arranged as follows. In Section 2 we first review square root Lasso and explain why it could be viewed as a penalized weighted score function. Then we apply this idea to sparse Poisson models and propose our method. Section 3 provides finite sample and asymptotic bounds for our new estimator. In Section 4, we conduct experiments to show the robustness of our method. Section 5 gives the detailed proofs for our theoretical results.

Penalized weighted score function
We first briefly give a few notations used in this paper.
Denote by {a n } n i=1 and {b n } n i=1 two sequences, the notation b n = O(a n ) means that there exists a constant C > 0 such that b n ≤ Ca n for all n ≥ 1 and the notation b n = o(a n ) means that lim n→∞ bn an = 0. If f is a function, we denote by ∇f the gradient of f .

Square root Lasso revisited
We now review square root Lasso and treat it as a penalized weighted score function method. The Lasso is defined as follows. min where Y ∈ R n×1 , X ∈ R n×p and β ∈ R p×1 . The solution of the Lasso satisfies KKT conditions defined as follows: where s j is a subgradient of |β j | which is the sign of β j if β j = 0 and can be any value belonging to [0, 1] when β j = 0. From Equation (2.2), we see Note that X T (Y − Xβ) is the score function for linear model with Gaussian noises, that is (X, Y ) follows Y = Xβ * + , with ∼ N (0, σ 2 I n ). To have a good estimator ( 2 consistent for example), the choice of λ should satisfy the condition λ > c X T (Y − Xβ * ) ∞ for some positive constant c (Bickel et al., 2008). The score function evaluated at β * is X T (Y − Xβ * ) which has a multi-normal distribution with mean 0 and co-variance matrix σ 2 X T X, this suggests that the choice of λ also depends on the unknown parameter σ 2 . One way to remove this unknown parameter is to use a weighted (or scaled) score function defined as whose distribution at β * does not depend on σ 2 any more. Setting the penalized weighted score function to be 0, i.e.
leads to the following optimization problem: which is the square root Lasso defined in Belloni et al. (2011). This viewpoint of treating square root Lasso as a penalized weighted score function method could be applied to other problems especially for heteroscedastic models. In the next subsection, we give details on how to apply this idea to sparse Poisson models.

Penalized weighted score function method for sparse Poisson regression
Let (x 1 , y 1 ), · · · , (x n , y n ) be independent pairs of observed data which are realizations of random vector (X, Y ), with p-dimensional covariates X ∈ R p and univariate response variable Y ∈ R. (X, Y ) is assumed to satisfy the conditional distribution Y |X = x ∼ Poisson(µ(x)) with log(µ(x)) = x T β * , where β * ∈ R p is an unknown parameter vector to be estimated. Denoting x i = (x i1 , · · · , x ip ) T , without loss of generality, we assume Under the above settings, the negative loglikelihood (up to a scale and a constant shift) is defined as follows: (2.5) Sparse Poisson regression model could be gained via 1 penalized loglikelihood defined as follows: where the penalty levelλ > 0. From the KKT optimality condition, we getλ ≥ ∇ (β) ∞ . To get a good estimator, usually we require that the inequalityλ ≥ c ∇ (β * ) ∞ for some constant c > 1 holds with high probability (Li and Cevher, 2015). However, for the score function valued at β = β * : the random part y i − e x T i β * has variance e x T i β * , which is also the rate parameter of the Poisson random variable Y i |X i = x i . If the rate parameter is very large, the penalty coefficient λ will be very sensitive.
In this paper, we apply the idea from square root Lasso to solve the above problem on choosing penalty level, let us briefly introduce our new method as follows. We give weights to the items in score function and develop an 1 penalized weighted score function method, which solves the following equation: where s is the sub-gradient of β 1 . By a careful derivation, we found that the solution to Equation (2.6) is equivalent to solve the following convex optimization problem: where λ > 0 is a new penalty level which will not depend on any rate parameter e x T i β * .
and denote H = ∇f (β * ) ∞ . We will choose a suitable λ such that it is greater than cH with high probability and with such a choice of λ, the estimator is good in the sense that β − β * 1 is bounded by a small value which goes to 0 when n → ∞ under mild conditions.
In the next section, we study the statistical performance of our proposed method, including how to select a good λ such that our estimator has small errors.
3 Finite-sample and asymptotic bounds on β − β * 1 We first show that when the tuning parameter λ defined in our new penalized method (2.7) is greater than cH = c ∇f (β * ) ∞ for some c > 1, the estimation error can be bounded. Based on this result, we will prove a finite sample result for selecting a good tuning parameter λ such that the estimator has very small error with high probability. For our theoretical analysis, we give a few regularity conditions as the following: (II) n, p satisfy that n ≤ p ≤ o(e n 1/5 ) and p/α > 8 for α ∈ (0, 1).
Conditions (I) and (II) are considerably mild, while (III) is a restricted eigenvalue condition (Bickel et al., 2008), similar to compatibility condition (Geer, 2007) and restricted strong convexity conditions (Negahban and Yu, 2010). Although these conditions usually cannot be verified from data, researchers found that they are not strong for Lasso problems in the linear model case and hold with high probability when the elements of design matrix are randomly from Gaussian distributions (Raskutti et al., 2010). For Poisson regression models, there are a few literatures on when condition (III) holds, we conjecture that it holds with high probability under very mild conditions and leave this to future study. Now we give a deterministic result on when we can have a good estimator in the sense that the error could be bounded.
Theorem 3.1. Letβ be the estimator defined by (2.7). Suppose that the assumptions (I), (II) are satisfied. For some constant c > 1, assumption The estimatorβ is also 2 consistent.
From Theorem 3.1, we see that if we can choose a λ such that λ > cH = c ∇f (β * ) ∞ and λs ≤ 2κ 2 3L(1+L)R holds with high probability, then conclusions (3.1) and (3.2) hold with high probability. This motivates us on how to select a good tuning parameter λ.
is a random variable. Define by H(1 − α|X) the 1 − α quantile of H|X for α ∈ (0, 1). If we choose λ as follows, it is easy to know that P(λ ≥ cH) ≥ 1 − α with λ choosing by (3.3). By a careful analysis, we can prove that λ in (3.3) is order of log (p/α) n , which means that λ ≤ const. × log (p/α) n for some const. > 0. We shall prove in the appendix the following lemma: (ii) Suppose the assumption (I) and (II) are satisfied. Then, λ ≤ const. × log (p/α) n for some const > 0.
Although the λ defined in Equation (3.3) satisfies good property that λ > cH and λs ≤ 2κ 2 3L(1+L)R , we notice that this λ can not be determined from data in practice because the distribution of H still depends on β * . Note that the quantity y i −e x T i β * √ e x T i β * , i = 1, 2, . . . , n are i.i.d random variables with mean 0 and variance 1, one can approximate choice of λ by λ = cH(1−α|X), wherẽ from N (0, 1). H should have a limiting distribution that is the same asH under mild conditions. Motivated by the limiting normal distributions, we can give an asymptotic choice of λ such that λ ≥ cH with high probability when n → ∞ as the following: where Φ(·) is the cumulative distribution function of the standard norm random variable and Φ −1 (·) is its inverse function. This choice of λ has the following properties: Lemma 3.5. (i) If λ is chosen as (3.4), where b = 6C 1 K 1 log p/p 3 with some positive constants C 1 and K 1 . In particular, as n, p → ∞, Together with Theorem 3.1 and Lemmas 3.3 and 3.5, we have the following non-asymptotic results. and (3.2) hold for some constant C ∈ (2, 3]. By Theorem 3.6, we can obtain the following asymptotic results (or consistency results) immediately as stated in Corollary 3.7.

Experiments
We use the R package "lbfgs" to solve 1 penalized convex optimization problems (Coppola and Stewart). The lbfgs package implements both the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithms. The L-BFGS algorithm solves the problem of minimizing an objective, given its gradient, by iteratively computing approximations of the inverse Hessian matrix. The OWL-QN algorithm finds the optimum of an objective plus the L1-norm of the problem's parameters. The package offers a fast and memory-efficient implementation of these optimization routines, which is particularly suited for high-dimensional problems. We first use simulations to show that our proposed method is much more robust than traditional sparse Poisson models using 1 penalized log-likelihood method. For this purpose, we first generate a design matrix X ∈ R n×p with n = 500, p = 20 and each element X ij i.i.d. from the standard normal distribution. Then we do centralization and normalization X such that n j=1 X ij = 0, and 1 n n j=1 X 2 ij = 1, i = 1, 2, . . . , n. We set the number of nonzero elements of β * as 5 and each element randomly from N (0, 1). Y i ∼ P oisson(exp{ 5 j=1 X ij β * j }). We first use R package "glmnet" to solve the sparse Poisson regression which returns 1 regularized log-likelihood estimator. For our proposed method, we set λ = c 2 ( We repeat this simulation 100 times and find that there are about 20 times glmnet does not converge and gives warning message or error messages, while our proposed method always converges. If we increase β * , glmnet fails more. Below, we provide a plot showing successful convergence rates of glmnet and our proposed method, L1-penalized weighted score (LPWS) method, when the 5 nonzero coefficients are generated from N (0, 1) × c and c varies in the set of {0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0}. From Figure 1, we see clearly that our proposed method is much more robust in the sense that it always converges.
To validate the solution of our proposed method using "lbfgs" package gives good estimator, we use the above simulation settings and choose the nonzero elements of β * from N (0, 1). For glmnet, we use cross validation to set the tuning parameter and for our proposed method (LPWS), . The solutions and the real coefficients are plotted in Figure  2, from which we see that our new estimator is also a good one. To evaluate the accuracy of our new estimator we do more simulation experiments below.
Finally, we compare different ways of tuning parameter selection for our proposed method. Recall that H = ∇f (β * ) ∞ . We compare THREE different ways of selecting λ. (1) As defined in (3.3), λ = H(1−α). This tuning depends on the real β * , which is unknown when we analyze real data, but we still list it here as a benchmark. (2) As defined in (3.4), we choose λ = ( x i z i with z i , i = 1, 2, . . . , n i.i.d. from N (0, 1), and define λ =H(1 − α). This is an approximation of the exact selection of tuning parameter defined in (3.3). For comparison, we also calculate the solution of glmnet with λ selected using cross validation. In this simulation study, the simulate procedure is almost the same as the previous examples, but here we choose n = 100, p = 1000 and s = 10. We repeat the simulation 100 times. For each time, we choose each non-zero element of β * from N (0, 1) * 0.5 to make sure that glmnet converges. We calculate the 1 estimation error defined as β − β * 1 . The errors are reported in Figure  3, from which we see again that our proposed method outperforms the traditional 1 penalized loglikelihood method for sparse Poisson regression, at the same time, our new method does not need heavy procedure like cross-validation. Hence, our pre-specified tuning parameter works.

Proofs
Proof of Theorem 3.1. Let δ =β − β * . Recall that T = {j : β * j = 0}. By definition ofβ, we have where the last inequality used the choice of λ such that λ > cH = c ∇f (β * ) ∞ . Combining (5.1) and (5.2), we obtain that Defining a new functionf (t) = f (β * + tv) from R to R for any vector v ∈ R p , we compute the second and third order derivatives below It is easy to obtain that |f (t)| ≤ 1 2 sup Setting v = δ =β − β * , we have (5.3) Figure 3: The errors for different methods. "Err glmnet" denotes the errors for estimators with glmnet and tuning parameter is selected via cross-validation. "Err new" is for our proposed method with λ defined as λ = ( √ n) −1 Φ −1 (1 − α 2p ). "Err opt" is for our proposed method with exact selection of λ and "Err approx" is for the new proposed method with an approximate of the exact selection.
(ii). If there exists t n = O( log (p/α) n ) such that P(H > t) < α, then by definition of quantile we , it suffices to prove that It is obvious t n = O( log (p/α) n ). Then we shall show that as n, p → ∞ with n ≤ p ≤ o(e n 1/5 ).
x ijˆ i | > √ na) and P 2 = P(sup i∈ [n] | i | > M ), then the above inequality can be written as (5.14) By inequality (5.13) with M = 3K 1 log p, we obtain that To estimate the P 1 , we need the following Sakhanenko type moderate deviation theorem of Sakhanenko (1991), i.e.
Since E( i ) = E(ˆ i ) + E(ˇ i ) = 0, then it is easy to obtain that Furthermore, with assumption A1 we have