AIC for the Lasso in generalized linear models

Abstract: The Lasso is a popular regularization method that can simultaneously do estimation and model selection. It contains a regularization parameter, and several information criteria have been proposed for selecting its proper value. While any of them would assure consistency in model selection, we have no appropriate rule to choose between the criteria. Meanwhile, a finite correction to the AIC has been provided in a Gaussian regression setting. The finite correction is theoretically assured not from the viewpoint of consistency but from that of minimizing the prediction error, and it does not have the above-mentioned difficulty. Our aim is to derive such a criterion for the Lasso in generalized linear models. Towards this aim, we derive a criterion from the original definition of the AIC, that is, as an asymptotically unbiased estimator of the Kullback-Leibler divergence. This becomes the finite correction in the Gaussian regression setting, and so our criterion can be regarded as its generalization. Our criterion can be obtained easily and requires fewer computational tasks than does cross-validation, but simulation studies and real data analyses indicate that its performance is almost the same as or superior to that of cross-validation. Moreover, our criterion is extended to a class of other regularization methods.


Introduction
The Lasso (Tibshirani 1996) is a regularization method that imposes an ℓ1 penalty term λ||β||_1 on an estimating function with respect to an unknown parameter vector β = (β_1, . . . , β_p)^T, where λ (> 0) is the regularization parameter. If β̂_λ = (β̂_{λ,1}, . . . , β̂_{λ,p})^T is the Lasso estimator of β, several of its components will be shrunk to exactly 0 when λ is not close to 0, which means that the Lasso can simultaneously do estimation and model selection. In addition, the Lasso is computationally feasible in general, and so it is one of the key methods in current and future research directions in statistics and machine learning. On the other hand, as mentioned in Meinshausen and Bühlmann (2010), a remaining challenge is to select the proper value of the regularization parameter λ. The estimated parameters continuously shrink toward 0 as λ increases, and so the selection of λ is important for the selection of an appropriate model.
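The shrinkage-to-zero behavior can be seen in the simplest case: under an orthonormal design, each Lasso coefficient is the soft-thresholded least-squares estimate. A minimal illustrative sketch (not from the paper, whose code is in R):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: the Lasso solution for one coefficient under an
    orthonormal design, shrinking the least-squares estimate z by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta_ls = np.array([2.0, 0.4, -1.1])   # illustrative least-squares estimates
small = soft_threshold(beta_ls, 0.5)   # only the smallest component hits 0
large = soft_threshold(beta_ls, 1.5)   # two of three components are now 0
```

As λ (here `t`) grows, more components are set exactly to 0, which is the simultaneous estimation and selection described above.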
One of the simplest methods for selecting λ is to use cross-validation (CV; Stone 1974). Meinshausen and Bühlmann (2010), on the other hand, proposed a stability selection method based on subsampling in order to avoid the problems caused by selecting a model based on only one value of λ. Stability selection avoids the problem they discussed and is a new and attractive model selection scheme; however, it requires a considerable number of computational tasks, comparable to the number required for CV.
As analytical methods for selecting λ, there are in general two approaches. The first obtains a class of λ that has desirable properties for model selection or estimation accuracy, under some regularity conditions. For example, Zhao and Yu (2006) and Meinshausen and Yu (2009) obtained an expression λ = λ_n, depending on at least the data size n, that is required for the model selection to be consistent. In addition, for example, Bunea, Tsybakov and Wegkamp (2007), van de Geer (2008), Wainwright (2009), Sun and Zhang (2012), and Chételat, Lederer and Salmon (2014) obtained a more rigorous evaluation that is essentially of the form P(estimation error ≤ δ_λ) ≥ 1 − ε_λ for a class of λ, where δ_λ and ε_λ are constants depending on at least λ. The second approach uses an information criterion of the form −2l(β̂_λ) + η_λ, where l(·) is the log-likelihood function and η_λ is a penalty term depending on at least λ; it has been shown that model selection based on the λ that minimizes such a criterion is consistent (e.g., Yuan and Lin 2007; Wang, Li and Leng 2009; Zhang, Li and Tsai 2010; Fan and Tang 2013). Both approaches include results for the case in which the dimension p of the parameter vector goes to infinity, and these are valuable because the Lasso with such a value of λ has been shown to have desirable properties. However, the choice of λ remains somewhat arbitrary. When λ_n and η_λ, as defined above, satisfy the consistency requirement, c × λ_n and c × η_λ also satisfy it for any fixed coefficient c. For the rigorous evaluation in the first approach, no value of λ minimizes both δ_λ and ε_λ, and so we have no appropriate rule for choosing the optimal value of λ. This is a serious problem, because the arbitrariness in the choice of λ leads to arbitrariness in model selection.
In a Gaussian linear regression setting, an appropriate selection of λ that is theoretically assured from the viewpoint of classic statistics can be achieved through an information criterion obtained by Efron et al. (2004) and Zou, Hastie and Tibshirani (2007). They derived an unbiased estimator of the true prediction error as a C_p-type criterion, through an elegant use of Stein's unbiased estimation theory (Stein 1981). In other words, we can say that they derived a finite correction to Akaike's information criterion (AIC; Akaike 1973) (Sugiura 1978; Hurvich and Tsai 1989) for the Lasso in Gaussian settings with known variance, because in these settings the true prediction error becomes essentially the same as the Kullback-Leibler divergence (Kullback and Leibler 1951). The corrected AIC can be expressed as −2l(β̂_λ) + 2|{j : β̂_{λ,j} ≠ 0}|, and so it is easy to use for model selection. Our aim in this paper is to derive such an information criterion for the Lasso in more general settings, one that is theoretically assured and can be computed without heavy computational tasks. Towards this aim, we obtain an asymptotically unbiased estimator of the Kullback-Leibler divergence, that is, the AIC for the Lasso, based on its original definition, under the framework of generalized linear models (see McCullagh and Nelder 1983).
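The corrected criterion −2l(β̂_λ) + 2|{j : β̂_{λ,j} ≠ 0}| is trivial to compute once β̂_λ is available. A hedged sketch for the Gaussian case with known unit variance (the function name is ours):

```python
import numpy as np

def aic_corrected_gaussian(y, X, beta_hat):
    """-2 * log-likelihood + 2 * |active set| for Gaussian regression with
    known unit variance (an illustrative sketch; the name is ours)."""
    n = len(y)
    resid = y - X @ beta_hat
    loglik = -0.5 * resid @ resid - 0.5 * n * np.log(2.0 * np.pi)
    return -2.0 * loglik + 2.0 * np.count_nonzero(beta_hat)
```

Selecting λ then amounts to computing β̂_λ on a grid of λ values and keeping the minimizer of this criterion.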
The remainder of this paper is organized as follows. After introducing in Section 2 generalized linear models and the assumptions for our asymptotic theory, some limiting distributions for the Lasso estimator are given in Section 3. The purpose of our asymptotic theory is to approximate some statistics for the given finite samples, and the limiting distributions obtained based on this theory are different from those in Knight and Fu (2000). In Section 4, we use the limiting distributions to achieve our main goal, the derivation of the AIC for the Lasso, and in Sections 5 and 6, its validity is demonstrated for several models through simulation studies and real data analyses. In Section 7, the AIC for the Lasso is extended to several cases, including the use of other penalty terms, and a discussion is provided in Section 8. The program code used in Sections 5 and 6 is available from https://sites.google.com/site/shuichikawanoen/research/aic.r.

Setting and assumptions
Let us consider a natural exponential family with a natural parameter θ (∈ Θ ⊂ R^r) for an r-dimensional random variable y, whose density is f(y; θ) = exp{y^T θ − a(θ) + b(y)} with respect to a σ-finite measure μ. We assume that every θ in Θ satisfies 0 < ∫ exp{y^T θ + b(y)} dμ(y) < ∞, that is, Θ is the natural parameter space. Then all the derivatives of a(θ) and all the moments of y exist in the interior Θ^int of Θ, and, in particular, E_θ(y) = a'(θ) and V_θ(y) = a''(θ). For a function c(η), we denote ∂c(η)/∂η and ∂²c(η)/∂η∂η^T by c'(η) and c''(η), respectively. We assume that V_θ(y) = a''(θ) is positive definite, so that −log f(y; θ) is a convex function with respect to θ.
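The moment identities E_θ(y) = a'(θ) and V_θ(y) = a''(θ) can be checked numerically; for the Poisson family, a(θ) = exp(θ), so mean and variance both equal exp(θ). A small illustrative sketch (not from the paper):

```python
import numpy as np

# Poisson family: a(theta) = exp(theta), so the mean a'(theta) and the
# variance a''(theta) are both exp(theta).
theta = 0.7
rng = np.random.default_rng(0)
y = rng.poisson(lam=np.exp(theta), size=200_000)
mean_err = abs(y.mean() - np.exp(theta))   # should be near 0
var_err = abs(y.var() - np.exp(theta))     # should be near 0
```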
Let (y_i, X_i) be the i-th set of responses and regressors (1 ≤ i ≤ n); we assume that the y_i are independent r-dimensional random vectors and the X_i are (r × p)-matrices of known constants. We will consider generalized linear models with natural link functions for such data, that is, we consider a class of density functions {f(y; X_iβ) : β ∈ B} for y_i, where β = (β_1, . . . , β_p)^T is a coefficient vector and B is an open convex set. We denote the true value of β by β*.
Before developing the asymptotic theory for this model, let us explain two types of aims in using asymptotic theory. The first aim is to approximate something well for the case where the data size goes to infinity in the future. The second aim is to approximate something well for the given real data, whose size is large but of course finite. Assumptions that suit the first aim sometimes do not suit the second. For example, for a regression model with regressors {x_i}, consider the case where the limiting frequency distribution p_∞(·) of {x_i : 1 ≤ i < ∞} is quite different from the frequency distribution p_{n_0}(·) of the given real data {x_i : 1 ≤ i ≤ n_0}. In this situation, the asymptotic variance of the estimator of the regression coefficients should be evaluated based on p_∞(·) if we want to evaluate the variance in the future. If we want to evaluate the variance for the given real data, however, we should evaluate it based not on p_∞(·) but on p_{n_0}(·). That is, although assuming p_∞(·) as the limiting frequency distribution of {x_i} suits the first aim, it is better for the second aim to assume p_{n_0}(·). Roughly speaking, our asymptotic theory is intended to approximate some statistics for the given real data {(y_i, X_i) : 1 ≤ i ≤ n_0}. In light of this, we assume (C1) {X_i} lies in a compact set X with Xβ ∈ Θ^int for all X ∈ X and β ∈ B, and (C2) for each β, the limit of Σ_{i=1}^n X_i^T a''(X_iβ)X_i/n exists and is positive definite. Let g_{y_i,X_i}(β) be the log-likelihood for y_i, i.e., g_{y_i,X_i}(β) = log f(y_i; X_iβ). Under the above-mentioned model with the conditions (C1) and (C2), we obtain several expressions, (R1)-(R3), that will be used in our asymptotic theory; in particular, (R1) asserts that there exists a convex and differentiable function h(β) to which Σ_{i=1}^n {g_{y_i,X_i}(β*) − g_{y_i,X_i}(β)}/n converges in probability. The expression (R3) is a direct consequence of (C2) because −g''_{y_i,X_i}(β) = X_i^T a''(X_iβ)X_i. The other expressions will be proved in the appendix.

Limiting distribution
Let ||·||_1 be the ℓ1 norm, i.e., ||β||_1 = Σ_{j=1}^p |β_j|. For the above-mentioned model, the Lasso estimator of β* is β̂_λ = argmin_{β∈B} {−Σ_{i=1}^n g_{y_i,X_i}(β) + nλ||β||_1}, where λ is a regularization parameter. Here we put n in the penalty term for our asymptotic theory; this corresponds to the setting in Theorem 1 of Knight and Fu (2000), which provided the limiting value but did not discuss the limiting distribution. If we assume that the penalty term is o(n), β̂_λ converges in probability to β*. We think, however, that this closeness between β̂_λ and β* does not reflect the character of β̂_λ for the given real data, because β̂_λ is shrunk from β* toward 0. As mentioned in the previous section, the purpose of our asymptotic theory is to approximate some statistics for the given real data. The penalty term is assumed to be O(n) because, in this case, β̂_λ converges to a vector made by moving β* close to 0 (this will be shown below). Actually, for more tractable regularization methods, information criteria have already been derived by assuming O(n) for the penalty term (see, e.g., Konishi and Kitagawa 2008).
To consider the limiting value of β̂_λ, we define the random function u_n(β) = Σ_{i=1}^n {g_{y_i,X_i}(β*) − g_{y_i,X_i}(β)}/n + λ||β||_1. The function u_n(β) is convex with respect to β, and argmin_{β∈B} u_n(β) is equal to β̂_λ. In Knight and Fu (2000), the same type of random function was defined for the Gaussian case, but their function did not sum over the g_{y_i,X_i}(β*); we include this term so that we need not be concerned with b(y) in g_{y_i,X_i}(β). From (R1), we see that u_n(β) converges in probability to h(β) + λ||β||_1 for each β. Because u_n(β) is a convex function with respect to β, as in Knight and Fu (2000), we can apply the convexity lemma of Andersen and Gill (1982) or Pollard (1991).
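The minimizer β̂_λ of u_n(β) can be computed by a proximal-gradient (ISTA) iteration, alternating a gradient step on the negative log-likelihood with soft thresholding. A rough sketch for logistic regression with r = 1 (purely illustrative; the paper itself uses the R package glmnet):

```python
import numpy as np

def lasso_logistic(y, X, lam, step=0.1, iters=3000):
    """Proximal gradient (ISTA) for -sum_i g_i(beta)/n + lam*||beta||_1 in
    logistic regression. A rough sketch, not the paper's implementation."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # a'(X_i beta)
        grad = X.T @ (mu - y) / n                # gradient of the smooth part
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))
beta_hat = lasso_logistic(y, X, lam=0.1)
```

The soft-thresholding step is what drives several components of β̂_λ to exactly 0 for moderate λ.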
is convex. As in Knight and Fu (2000), we can apply the convexity lemma of Hjort and Pollard (1993) or Geyer (1996), and we then obtain the limiting distribution in (5) under the conditions (C1), (C2), and (C3).
A small generalization of the above-mentioned convexity lemma is required to prove (5), and so the proof will be given in the appendix.

Bias evaluation
Model selection can be approached by trying to reduce twice the Kullback-Leibler divergence (Kullback and Leibler 1951) between the true distribution and the estimated distribution, where (ỹ_1, . . . , ỹ_n) is a copy of (y_1, . . . , y_n); in other words, (ỹ_1, . . . , ỹ_n) is distributed according to the distribution of (y_1, . . . , y_n) and is independent of (y_1, . . . , y_n), and Ẽ denotes the expectation with respect to only (ỹ_1, . . . , ỹ_n). Because the first term on the right-hand side does not depend on the model selection, we need to consider only the second term. A simple estimator of the second term is −2Σ_{i=1}^n g_{y_i,X_i}(β̂_λ), but this underestimates it. We therefore consider correcting the bias, as is done for AIC-type information criteria (see, e.g., Chapter 3 in Konishi and Kitagawa 2008). Because the bias depends on the true distribution, it cannot be given explicitly. In a Gaussian linear regression setting with a known common variance, that is, when g_{y_i,X_i}(β) can be written as −(y_i − X_iβ)^T(y_i − X_iβ)/2 − (r/2)log(2π) by standardizing the data, it can be shown by an elegant use of Stein's unbiased estimation theory (Stein 1981) that the number of nonzero coefficients in β̂_λ, |{j : β̂_{λ,j} ≠ 0}|, is an unbiased estimator of the second term (Efron et al. 2004; Zou, Hastie and Tibshirani 2007). This means that (7) can be regarded as the AICc, a finite correction of the AIC (Sugiura 1978; Hurvich and Tsai 1989). However, this criterion cannot be extended to the general case, and so we evaluate the second term asymptotically, in the same way as was done for the AIC. That is, we use in place of (6) the quantity in (9), where z_limit is the limit to which (8) converges in distribution; we call E(z_limit) the asymptotic bias. Now we evaluate (8). Using a Taylor expansion around β**, the first term can be expressed in terms of β†, a vector on the segment from β̂_λ to β**.
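The underestimation by the plug-in value −2Σ_i g_{y_i,X_i}(β̂) can be seen in a toy Monte Carlo for a Gaussian mean with known unit variance and no penalty, where the expected gap between the test-data and training-data versions equals twice the parameter count (a purely illustrative sketch, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 20, 4000
train_term = test_term = 0.0
for _ in range(reps):
    y = rng.normal(size=n)    # training sample, true mean 0, unit variance
    yt = rng.normal(size=n)   # independent copy (the y-tilde of the text)
    mu = y.mean()             # maximum likelihood estimate of the mean
    # -2 * sum of log-densities, dropping the constant n*log(2*pi)
    train_term += np.sum((y - mu) ** 2) / reps
    test_term += np.sum((yt - mu) ** 2) / reps

gap = test_term - train_term   # expectation is exactly 2 here
```

Averaged over replications, the training-data value is systematically smaller; the gap of 2 is the bias the AIC-type correction must estimate.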
Using (4) in Theorem 1, the terms including β̂_λ can be handled, and Σ_{i=1}^n g'^{(2)}_{y_i,X_i}(β**)/√n − √nλ sgn(β**^{(2)}) converges in distribution to s^{(2)}. In addition, we can easily check from (R3) and Lemma 3 that Σ_{i=1}^n g''^{(22)}_{y_i,X_i}(β†)/n = −J^{(22)}(β**) + o_P(1). Thus, by also using (5) in Theorem 1, we obtain (10). Using the same type of Taylor expansion, the second term on the right-hand side of (8) can be expressed as in (11), where β‡ is a vector on the segment from β̂_λ to β**. Similarly to the above analysis, we see that Σ_{i=1}^n g''^{(22)}_{y_i,X_i}(β‡)/n converges to −J^{(22)}(β**). In addition, there exists a |J^{(2)}|-dimensional random vector s̃^{(2)} having a N(0, J^{(22)}(β*)) distribution. Thus, it follows from (10) and (11) that the limit z_limit can be written in terms of s^{(2)} and s̃^{(2)}. Because s^{(2)} and s̃^{(2)} are independently distributed according to N(0, J^{(22)}(β*)), we obtain the following theorem.
Theorem 2. Under the conditions (C1), (C2), and (C3), the asymptotic bias in (9) is given by E(z_limit) = 2 tr{J^{(22)}(β*) J^{(22)}(β**)^{-1}}.
We cannot know the values of β* or β**, and so we replace tr{J^{(22)}(β*) J^{(22)}(β**)^{-1}} by a consistent estimator. Let Ĵ^{(2)} = {j : β̂_{λ,j} ≠ 0} for β̂_λ = (β̂_{λ,1}, . . . , β̂_{λ,p})^T, which is called the active set, and let J_n(β) = Σ_{i=1}^n X_i^T a''(X_iβ)X_i. Writing Ĵ*^{(22)}_n and Ĵ**^{(22)}_n for (J_n(β̂_0)_{jk})_{j∈Ĵ^{(2)},k∈Ĵ^{(2)}} and (J_n(β̂_λ)_{jk})_{j∈Ĵ^{(2)},k∈Ĵ^{(2)}}, respectively, we have tr{Ĵ*^{(22)}_n (Ĵ**^{(22)}_n)^{-1}} → tr{J^{(22)}(β*) J^{(22)}(β**)^{-1}} in probability; see the appendix for the proof. Thus, we propose the following index as an AIC for the Lasso:

AIC_Lasso = −2 Σ_{i=1}^n g_{y_i,X_i}(β̂_λ) + 2 tr{Ĵ*^{(22)}_n (Ĵ**^{(22)}_n)^{-1}}. (13)

Here we use β̂_0 (the estimator with λ = 0) as a consistent estimator of β*, as is done in the adaptive Lasso (Zou 2006). When β̂_0 is expected to be unstable, for example when p is large, we propose the use of a more stable but consistent estimator in place of β̂_0. We thus have only to select the λ that minimizes this AIC_Lasso. In the Gaussian linear regression setting with a known common variance, a''(X_iβ) = I_r, where I_r is the r × r identity matrix. That is, J_n(β) does not depend on β, and so Ĵ*^{(22)}_n = Ĵ**^{(22)}_n, which means that (13) reduces to (7). Hence, the AIC in (13) can be regarded as a generalization of the AICc for the Gaussian linear regression when the variance is known.
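Once the active set and J_n(·) are available, the penalty 2 tr{Ĵ*^{(22)}_n (Ĵ**^{(22)}_n)^{-1}} in (13) is a few lines of linear algebra. A hedged sketch for logistic regression (function names are ours; the paper's own code is in R):

```python
import numpy as np

def J_n(X, beta):
    """J_n(beta) = sum_i X_i^T a''(X_i beta) X_i for logistic regression
    (r = 1), where a''(theta) = exp(theta) / (1 + exp(theta))^2."""
    eta = X @ beta
    w = np.exp(eta) / (1.0 + np.exp(eta)) ** 2
    return (X * w[:, None]).T @ X

def aic_lasso_penalty(X, beta0, beta_lam):
    """2 * tr{J*^(22) (J**^(22))^{-1}}: J_n restricted to the active set,
    evaluated at a consistent estimator beta0 and at the Lasso estimate."""
    active = np.flatnonzero(beta_lam)   # the active set
    J_star = J_n(X, beta0)[np.ix_(active, active)]
    J_sstar = J_n(X, beta_lam)[np.ix_(active, active)]
    return 2.0 * np.trace(J_star @ np.linalg.inv(J_sstar))
```

When the two evaluation points coincide, the trace equals the active-set size, so the penalty reduces to 2|Ĵ^{(2)}|, matching (7).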

Simulation study
In this section, to check the performance of the AIC in (13), we perform simulation studies using logistic regression, Poisson regression, and Gaussian graphical models, and we compare it with CV and with the criterion having the penalty term derived by Efron et al. (2004) and Zou, Hastie and Tibshirani (2007). This last criterion can be written as

−2 Σ_{i=1}^n g_{y_i,X_i}(β̂_λ) + 2|{j : β̂_{λ,j} ≠ 0}|. (14)

It is not theoretically assured, that is, it is not a finite correction of the AIC for the above models, but for simplicity we will call it the AICc.
We assessed the performance in terms of the second term of twice the Kullback-Leibler divergence, namely −2Ẽ[Σ_{i=1}^n g_{ỹ_i,X_i}(β̂_λ̂)] (referred to as the KL below), where λ̂ is the value of λ selected by each criterion, (ỹ_1, . . . , ỹ_n) is a copy of (y_1, . . . , y_n), and Ẽ denotes the expectation with respect to only (ỹ_1, . . . , ỹ_n). We evaluated the expectation using test datasets of size 1000. As a secondary index for the assessment, we also determined the rates of false positives and false negatives, FP = |{j : β̂_j ≠ 0 ∧ β*_j = 0}|/|{j : β*_j = 0}| and FN = |{j : β̂_j = 0 ∧ β*_j ≠ 0}|/|{j : β*_j ≠ 0}|, for each of the three criteria. When a criterion has a larger FP but a smaller FN compared to the other criteria, no conclusion can be drawn from this secondary index.
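The FP and FN rates above are straightforward to compute from the estimated and true coefficient vectors; a small sketch (the guard against empty sets is ours):

```python
import numpy as np

def fp_fn_rates(beta_hat, beta_true):
    """False-positive and false-negative rates as defined in the text
    (the max(..., 1) guard against division by zero is ours)."""
    sel = beta_hat != 0
    truth = beta_true != 0
    fp = np.sum(sel & ~truth) / max(np.sum(~truth), 1)
    fn = np.sum(~sel & truth) / max(np.sum(truth), 1)
    return fp, fn
```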

Logistic and Poisson regression models
As simple examples of the generalized linear model, here we consider a logistic regression model and a Poisson regression model, where X_i is a (1 × p)-matrix of known constants and β is a p-dimensional coefficient vector. For these models, J_n(β) can be written as Σ_{i=1}^n X_i^T [exp(X_iβ)/{1 + exp(X_iβ)}²] X_i and Σ_{i=1}^n X_i^T exp(X_iβ) X_i, respectively, and so we can easily obtain the AIC in (13). The simulation settings were as follows. For the p-dimensional regressors, we used vectors obtained from the multivariate Gaussian distribution N_p(0, Σ), where the (i, j)-element of Σ was set to 0.5^{|i−j|}. Here we do not regard the regressors as random vectors. The true coefficient vector β* was determined by a pair (p, k), and seven cases were considered: (p, k) = (8, 1), (8, 2), (8, 3), (16, 2), (32, 2), (500, 5), (1000, 5). We generated a dataset of size n = 100 or n = 200 and used the Lasso to estimate the coefficient vector (we used the package glmnet in R). One hundred simulations were conducted. Tables 1 and 2 show the averages and standard deviations of the KL, and the averages of the FP and FN, for the logistic and Poisson regression models, respectively. In all cases, the average of the KL for the AIC is almost equal to or smaller than those for CV and the AICc. In the logistic regression, the average KL for CV tends to be clearly larger than that for the AIC when n = 100 and p is large, and the average KL for the AICc is sometimes considerably larger than those for the AIC and CV, especially when p is large. In the Poisson regression, because the average KL is almost the same for all criteria, we check the secondary index. We then see that the sum of the FP and FN values for CV is sometimes clearly larger than those for the AIC and AICc, especially when p is small. Thus, we can say that the AICc performs poorly in the logistic regression, and CV performs poorly in the Poisson regression, in comparison with the other criteria.
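For reference, the simulation design's covariance with (i, j)-element 0.5^{|i−j|} and the Poisson-regression matrix J_n(β) = Σ_i X_i^T exp(X_iβ) X_i can be formed as follows (an illustrative sketch; the paper uses R):

```python
import numpy as np

p, n = 8, 100
# covariance with (i, j)-element 0.5^{|i-j|}, as in the simulation design
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def J_n_poisson(X, beta):
    """J_n(beta) = sum_i X_i^T exp(X_i beta) X_i for Poisson regression,
    since a''(theta) = exp(theta)."""
    w = np.exp(X @ beta)
    return (X * w[:, None]).T @ X
```

At β = 0 the weights are all 1, so J_n reduces to X^T X.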
We can thus conclude that, overall in these simple examples, the AIC is superior to CV and the AICc.

Gaussian graphical model
Suppose that a q-dimensional random vector z_i is distributed according to a multivariate Gaussian distribution N_q(μ, Σ). Without loss of generality, we can assume that the mean vector is the zero vector. The graphical Lasso estimates the covariance matrix Σ under the assumption that the precision matrix C = Σ^{−1} is sparse (Yuan and Lin 2007; Friedman, Hastie and Tibshirani 2008). Let us denote the p-dimensional vectors vech(z_i z_i^T) and −vech{C − diag(C)/2} by y_i and β, respectively, where p = q(q + 1)/2 and vech(·) is the half-vectorization of a symmetric matrix. Then the log-likelihood of y_i takes the natural exponential family form above, and the elements of J_n(β) are obtained in closed form, so we can easily obtain the AIC in (13). The simulation settings were those used by Yuan and Lin (2007), who specify the models for the precision matrix. We considered four cases for the pair of n and q: (n, q) = (25, 5), (50, 5), (50, 10), (100, 10). The parameter vector β was estimated by the graphical Lasso using the package glasso in R. The simulations were conducted 100 times. Table 3 shows the averages and standard deviations of the KL, along with the averages of the FP and FN. Also in this model, the average of the KL for the AIC is almost equal to or smaller than those for CV and the AICc, but the differences are small. For the cases of (n, q) = (25, 5), (50, 10), the difference between the KL averages for the AIC and CV becomes slightly large.
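The half-vectorization vech(·) used to define y_i = vech(z_i z_i^T) can be sketched as follows (illustrative; index-ordering conventions for vech vary):

```python
import numpy as np

def vech(M):
    """Half-vectorization: stack the entries of the lower-triangular part
    of a symmetric matrix into a vector of length q(q+1)/2."""
    rows, cols = np.tril_indices(M.shape[0])
    return M[rows, cols]

z = np.array([1.0, 2.0, -1.0])     # an illustrative z_i with q = 3
y = vech(np.outer(z, z))           # the "response" y_i = vech(z_i z_i^T)
```

With q = 3 the vector y has length q(q + 1)/2 = 6, matching the dimension p of β.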

Real data analyses
We investigated the effectiveness of our criterion through real data analyses. We used eight benchmark datasets, which are described in Table 4. The "pima" and "biodegradation" datasets were available from the UCI database (http://archive.ics.uci.edu/ml/index.html). The "colon", "leukemia", "takeover.bids.case", "doctor.visits", and "mathmark" datasets were obtained from the packages HiDimDA, plsgenomics, CountsEPPM, AER, and gRbase in R, respectively. The "flow.cytometry" dataset was obtained from Sachs et al. (2005). Logistic regression models were applied to the "pima", "biodegradation", "colon", and "leukemia" datasets; Poisson regression models were applied to the "takeover.bids.case" and "doctor.visits" datasets; and Gaussian graphical models were applied to the "flow.cytometry" and "mathmark" datasets. The covariates were standardized for each dataset. We randomly divided the observed data into training samples for constructing the models and test samples for computing the KL. The numbers of training samples are shown in Table 4; the remaining observations were regarded as test samples. Note that we randomly selected 1000 observations in advance if the number of observations was larger than 1000. We repeated these procedures 10 times. Table 5 shows the results. In almost all cases, the AIC is superior to the other criteria. In particular, the results for the "colon" and "leukemia" datasets suggest that the AIC is sometimes clearly superior to CV.

Extensions
The AIC in (13) can be extended to more general cases. In this section, we indicate the broad possibilities of this extension by providing an explicit AIC for some particular cases.

Model with a nuisance parameter
For the Gaussian linear regression model in which each random vector z_i is independently distributed according to the r-dimensional Gaussian distribution N_r(X_iγ, σ²I_r) with unknown variance parameter σ², the estimation of σ² is usually performed without any penalization, even if γ is estimated by the Lasso. We will begin by considering such an example, that is, a case in which there are several parameters with no penalty terms. For simplicity, we denote all parameters by β as before, and let J be the index set of the β_j that are estimated without penalization. In this setting, the estimator of β can be written as β̂_λ ≡ argmin_{β∈B} {−Σ_{i=1}^n g_{y_i,X_i}(β) + nλ Σ_{j∉J} |β_j|}. Let us define β** by argmin_{β∈B} {h(β) + λ Σ_{j∉J} |β_j|}, J^{(1)} by {j : β**_j = 0} \ J, and J^{(2)} by {j : β**_j ≠ 0} ∪ J. As a result, Lemma 3, Theorem 1, and Theorem 2 still hold. Their derivations are a little more complicated than those in Sections 3 and 4, because (2) and (3) hold only when j ∉ J, while ∂h(β)/∂β_j|_{β=β**} = 0 always holds when j ∈ J. Noting that Ĵ^{(2)} includes J with probability one, the AIC in this case is also given by (13).
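Computationally, leaving the coordinates in J unpenalized only changes the proximal (soft-thresholding) step: coordinates in J are passed through unchanged. A minimal sketch (the function name is ours):

```python
import numpy as np

def prox_partial(z, t, unpenalized):
    """Soft-thresholding step that leaves the coordinates in the
    unpenalized index set J untouched (illustrative sketch)."""
    out = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    out[unpenalized] = z[unpenalized]   # no shrinkage for j in J
    return out
```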
For the above-mentioned Gaussian linear regression model with unknown variance, if γ is estimated by the Lasso and σ² is estimated without penalization, one might think that the penalty term in the AIC takes a simple form. The remaining evaluation of the bias is the same as in Section 7.2, and the AIC in this case can be given by (20).

Discussion
By generalizing and modifying the original definition of the AIC, various criteria have been proposed, e.g., the GIC (Konishi and Kitagawa 1996), the DIC (Spiegelhalter et al. 2002), the FIC (Claeskens and Hjort 2003), and the GAIC (Lv and Liu 2014). For the Lasso, a popular regularization method, there was not even a naive AIC, except in the case of Gaussian linear regression. In this study, we used the original definition of the AIC to obtain an AIC for the Lasso in generalized linear models. For several settings, BICs for the Lasso have been proposed, such as those by Yuan and Lin (2007) and Wang, Li and Leng (2009). Such BICs, however, have not been derived from Bayes factors, and so the AIC in (13) can be regarded as the only criterion for the Lasso that has the same roots as the classic information criteria.
The penalty term in (13) is written using an information matrix restricted to the active set, and simulation studies indicated that its value is close to twice the number of members of the active set. We can interpret this to mean that the parameters in the active set each contribute to the penalty by approximately the usual amount, while the other parameters do not. While the active set consists of the parameters of the model selected by the Lasso, it is adaptively selected from the full set, and so the above interpretation is not necessarily obvious. As was remarked for the Gaussian case in Lockhart et al. (2014), the adaptive selection costs extra bias, and shrinking the nonzero coefficients decreases the bias by approximately the same amount. This is an interesting phenomenon, but it is not clear why the two amounts should be almost the same.
The selection of the regularization parameter for the Lasso by the AIC in (13) requires few computations compared to those required by CV. Nevertheless, its performance is almost the same as or better than that of CV. It is particularly worth noting that the AIC is not inferior to CV even when the dimension p of the coefficient vector is particularly large, although the AIC is based on asymptotic theory with fixed p. It would be interesting to determine why the AIC works well for large p, and the theoretical clarification of this will be an area for our future work.
As mentioned above, previously there was not even a naive AIC for the Lasso. Thus, in this paper, we have considered only the most basic setting for our theorems, and they have been extended to a few settings. Using these extensions, we will be able to obtain, for example, the AIC for the generalized Lasso (Tibshirani and Taylor 2011) and the AIC for more general penalty terms. Checking their performance in specific problems will be an important area for our future work. Beyond the selection of the regularization parameter for the Lasso, the "inference after selection" problem is a challenging current topic, and several methods have recently been proposed (e.g., Lee et al. 2013; Javanmard and Montanari 2014). Checking the compatibility between the AIC in (13) and such methods will also be an important area for our future work.