Selection and Fusion of Categorical Predictors with L 0 -Type Penalties

In regression modelling, categorical covariates have to be coded. Depending on the number of categorical covariates and on the number of levels they have, the number of coefficients can become huge. To reduce the model complexity, coefficients of similar categories should be fused and coefficients of non-influential categories should be set to zero. To this end, Lasso-type penalties on the differences of coefficients are a standard approach. However, the clustering/selection performance of this approach is sometimes poor–especially when the adaptive weights are badly conditioned or not existing. In some situations, there is no incentive to cluster similar categories. To overcome this, a L 0 penalty on the differences of coefficients is proposed, whereby the L 0 ‘norm’ is defined as the number of non-zero entries in a vector. The proposed penalty favours to find clusters of categories that share the same effect on the response variable while the estimation accuracy is comparable to Lasso-type penalties. Numerical experiments within the framework of generalized linear models are promising. For illustration, data on the unemployment rates in Germany is analyzed.


Introduction
In the majority of regression problems, at least some of the available covariates are categorical. A categorical covariate has to be coded. Depending on the number of categorical covariates and on the number of levels they have, the number of coefficients can become huge. Hence, the accuracy of estimates can be poor. Moreover, when including categorical variables, users want to know if and how these predictors determine the response, and, in particular, which categories have to be distinguished. Typically, there are subsets of categories that have the same effect on the response variable. Recently, various approaches to obtain selection and fusion of categories by regularized estimation have been proposed: Bondell and Reich (2009) propose to apply the fused Lasso (Tibshirani et al., 2005) to the coefficients of a nominal predictor; all pairwise differences of coefficients are penalized. For ordered factors, it is more appropriate to penalize differences of adjacent coefficients, see Gertheiss and Tutz (2010) and Tutz and Gertheiss (2014). However, Lasso-type penalties come with shrinkage effects that depend on the coefficients' absolute values (Fan and Li, 2001). As a consequence, there are often strong shrinkage effects for large (differences of) coefficients while small (differences of) coefficients are estimated to be non-zero. When the focus is on the fusion and the selection of categories, one wants to avoid such effects. To enhance Lasso-type penalties, Zou (2006) proposes adaptive weights; each penalty term is weighted by its inverse maximum likelihood (ML) estimate. It yields asymptotically normal and consistent model selection. However, the quality of the adaptive weights depends on the quality of the ML estimate that can be poor.
As an alternative, we propose L 0 penalization for categorical effects; where the L 0 'norm' is defined as the number of non-zero entries in a vector. As Bondell and Reich (2009) or Gertheiss and Tutz (2010), we consider differences of coefficients; but instead of the absolute value, the L 0 norm is applied to the differences of coefficients. The difference between unordered and ordered factors is taken into account by using all pairwise differences or only adjacent differences. Computational issues are met by local quadratic approximations. The optimization problem is related to model selection with information criteria like the Akaike information criterion (AIC; see for example, Bozdogan, 1987) or the Bayesian information criterion (BIC; Schwarz, 1978). As the proposed penalty allows for the fusion of categories, it extends this approach. L 0 penalization is an established approach in some fields of statistics: it is applied to wavelets (Antoniadis and Fan, 2001) and to signals (see Zhang, 2010, Rippe et al., 2012).
The article is organized as follows: Section 2 motivates L 0 penalization for categorical effects in generalized linear models. In Section 3, we introduce the method; computational issues, the relation to best subset selection and some generalizations are discussed. Section 4 investigates the numerical properties of the proposed method. In Section 5, the unemployment rates in Germany between 2005 and 2010 are analyzed. We investigate which state-specific intercepts are clustered in a model with a global temporal trend.

Framework and L 1 -Type Fusion Penalties
In what follows, we assume a generalized linear model (GLM) with response y i for observation i, i = 1, . . . , n. As a start, we consider only one categorical covariate x with levels 0, . . . , k. Let the rows of the design matrix X be given by x T i = (1, x i1 , . . . , x ik ) with x ir = 1 if x i takes the value r and x ir = 0 otherwise, r = 1, . . . , k. This representation refers to dummy coding with category 0 as reference category,ˇ0 = 0. The corresponding predictor is defined as Á i = x T iˇ, whereˇ= (ˇi nt ,ˇ1, . . . ,ˇk) T is the coefficient vector andˇi nt denotes the intercept. For the response y i |x i , a simple exponential family with log-likelihood l n (ˇ) is assumed: where ϑ i ( i ) denotes the natural parameter, b(·) is a specific function corresponding to the type of the exponential family, c(·) is the log-normalization constant and ϕ the dispersion parameter (compare Fahrmeir and Tutz, 2001). The observations y i are assumed to be conditionally independent. Response and predictor are linked by the response function h(Á i ) which is twice continuously differentiable with det(∂h/∂Á i ) / = 0 ∀i. That is, we assume: For more details on GLMs, see, for example, Fahrmeir and Tutz (2001). Estimatesâ re obtained by minimizing the negative log-likelihood l n (ˇ). Accounting for a penalty, the objective function is defined as: where P(ˇ) denotes the penalty and where is a tuning parameter. The larger is, the stronger is the impact of the penalty. For = 0, the ML estimate is obtained. The choice of the penalty P(ˇ) is crucial. The Lasso (Tibshirani, 1996) penalizes the absolute values of coefficients and enforces variable selection. One obtains sparse but shrunken estimates. For dummy-coded categorical predictors, this is not the best choice; setting parameters to zero corresponds to the fusion with the reference category which can be chosen arbitrarily. Even though this problem can be handled by coding the categorical covariates differently-for example, as the deviation from a mean level ('effect coding') or as the deviation from adjacent categories ('split coding')-penalties that contain differences of parameters as proposed by (Tibshirani et al. (2005), Bondell and Reich (2009) or Gertheiss and Tutz (2010), are a common choice. They encourage the fusion of coefficients and thus, of categories irrespectively of the coding, and they allow one to fuse coefficients subject to more than k constraints. However, fusion-type penalties come along with some problems: For an ordered categorical predictor, fusion-type penalties consider the differences of parameters that refer to adjacent categories, including the reference category 0. The corresponding Lasso-type penalty has the form: However, the penalty does not always enforce fusion efficiently. If coefficients are ordered, for example in the form 0 =ˇ0 ≤ˇ1 ≤ . . . ≤ˇk, and if one is close to the true values, that is, in the range where the estimated parameters are ordered, the effective penalty is P(ˇ) = k r=1 |ˇr −ˇr −1 | = |ˇk −ˇ0| = |ˇk|. That means, the approach basically penalizes the range of the coefficients. The problem is even more obvious in an orthonormal linear model with one ordered predictor and without an intercept-that is, X T X is the identity matrix I (k+1)×(k+1) . Situations like this, are for example, typical for models with categorical effect modifiers or models with group specific intercepts. In this cases, one can derive an explicit solution of the objective function (2.2) with penalty (2.3): Proposition 1. Assume a penalized linear model with orthonormal design; that is and l as before.
The proof of Proposition 1 is given in Supplement A. The structure of the explicit estimate reveals that the coefficients of the outer categories are always merged first. There is no shrinkage for coefficients that are not yet fused with one of the outer categories-no matter how close the corresponding ML estimates are. For the minimal penalty parameter that causes the fusion of all coefficients, the estimate of all coefficients is equal to¯M L . For a fixed value of , the mean of the penalized estimate equals¯M L in the assumed setting. Similar results in the context of signal processing can be found in Pollak et al. (2005).
The left panel of Figure 1 shows the (exact) coefficient path of an exemplary model with k = 7. One can see that a coefficientˇr is not fused with any otherˇs, r / = s, unlessˇr is fused with one of the outer coefficientsˇ0,ˇk. The right panel of Figure 1 shows the same situation but the coefficient path is obtained with an L 0 norm instead of the L 1 norm in penalty (2.3). Categories with similar effects are fused-no matter which position they take in the order of ML estimates. This is the main motivation to consider L 0 penalties as an alternative when investigating categorical predictors. For a nominal categorical predictor, L 1 fusion-type penalties consider all pairwise differences of coefficients: (2.4) Assume a fixed value of the tuning parameter and letˇ( 1) , . . . ,ˇ( k) denote the (arbitrary) ordering of the solution. Then, a short transformation (see Supplement A) shows that r>s |ˇ( r) −ˇ( s) | = k r=1 w (r) |ˇ( r) −ˇ( r−1) |, where w (r) = r(k − r + 1). For the 'outer' differences r ∈ {1, k}, w (r) = k; for medium values of r, the weights w (r) are higher. That is, penalty (2.4) can be represented as a weighted version of penalty (2.3). The issues for nominal predictors are essentially the same as for ordered predictors. Similar to Proposition 1, one can show that the slopes of the coefficient path depend on the order of the corresponding ML estimate-even though not only the 'outer' coefficients are fused.
Efficiency of L 1 -penalized estimates can be improved by using adaptive weights (Zou, 2006) that weigh each penalty term by its inverse ML estimate. This results in heavy weights on penalty terms with small ML estimates and in small weights on penalty terms with large ML estimates. When the adaptive weights of Zou (2006) are combined with fusion-type penalties, there is an incentive to fuse categories that have close ML estimates and one obtains asymptotically normal and consistent results (see for example, Gertheiss and Tutz, 2010;Oelker et al., 2014). However, adaptive weighting requires ML estimates; its quality depends on the quality of the ML estimates.

L 0 -Type Fusion Penalties
In what follows, the fusion of categories is enforced by penalizing differences of coefficients as in the approaches discussed previously, but to overcome the drawbacks of Lasso-type penalties, the L 0 norm is employed. For an ordered predictor, we propose where · 0 = |·| 0 and where we define 0 0 = 0. In contrast to Lasso-type penalties, it does not matter whether a difference is small or huge; the penalty is reduced only if one of the differences equals zero. As a consequence, when two different values of , for example 1 > 2 , yield solutions with the same set of zero and non-zero differences, P ord (ˆ 1 ) = P ord (ˆ 2 ) holds. The set of zero/non-zero differences changes for specific thresholds. When the predictor x is nominal, an appropriate coefficient profile does not only relate to the coefficients of adjacent categories but to the comparison of all coefficients. The penalty considers all pairwise differences of coefficients, Penalty (3.2) is more complex; with k levels, there are k(k + 1)/2 pairwise differences; but apart from that, the effect of the penalty is the same as for ordered predictors. Note that for large values of the tuning parameter , the coefficientsˇ1, . . . ,ˇk are set to zero as the differences in penalties (3.1) and (3.2) include the difference to the reference categoryˇ0 = 0. As it will be seen in Section 5, it can be useful to weight the penalty terms. To this end, we introduce general weights w r , w r,s respectively, and use the modified penalties: To enhance the performance, we will, for example, combine the L 0 approach with adaptive weights as employed for the L 1 penalties in Section 4.1. When analyzing the unemployment rates in Germany in Section 5, the weights allow one to account for the spatial structure of the federal states. Therefore, we define the weights w r,s as an indicator for a common border of two states -such that we obtain a coefficient profile that is consistent with geography.

Computational Issues
In the literature, there is a wide range of strategies to handle optimization problems that contain L 0 norms. In order to represent for example, signals sparsely, the objective minˇ ˇ 0 subject to y = Xˇhas to be optimized. With some assumptions on X and assuming that there is a sufficiently sparse representation of y, Donoho and Elad (2003) find this representation by solving a convex optimization problem instead. Wipf and Rao (2005) derive a method based on sparse Bayesian learning including local optimality conditions to solve the same problem. In the framework of wavelets, the L 0 norm acts as penalty. The objective minˇf (ˇ) + ˇ 0 is, for example, solved by penalty decomposition methods that are based on rank optimization procedures (see Zhang, 2010, 2013). Rippe et al. (2012) and Johnson (2013) ˇi −ˇi −1 0 to smooth segmented observations y 1 , . . . , y n . While Johnson (2013) proposes a dynamic programming algorithm, Rippe et al. (2012) solve the problem iteratively employing a weighted Ridge penalty.
In order to minimize the objective function (2.2), we propose to approximate the L 0 norm by a modified logistic function and to derive a quasi Newton method for the approximated objective function. In detail, the L 0 norm is approximated by: where is a relatively large scalar and where c is a small, positive constant. Figure 2 gives some illustration: the circles denote the L 0 norm for a scalar argument . The continuous line denotes the approximation of the L 0 norm. For → ∞ and c → 0, the approximation approaches the L 0 norm.  To obtain a penalized iteratively re-weighted least squares (PIRLS) algorithm, in addition to approximation (3.4), we employ a local quadratic approximation ifˆ( k+1) is close toˆ( k) as proposed byFan and Li (2001) and as described by Oelker and Tutz (2013). It allows for derivatives of the approximated objective that fit exactly in the framework of known PIRLS algorithms. Sketching the idea shortly, penalties (3.1) and (3.2) are rewritten as P(ˇ) = L l=1 a T lˇ 0 , where vectors a l build all needed differences of coefficients. For a nominal predictor x with five levels including the reference category, where all pairwise differences are considered, one obtains, for example, The 'diagonal part' of A nom gives the differences with the reference categoryˇ0. Finally, starting with an initial valueˆ( 0) , one obtains the following iterative update for the current estimateˆ( k) : The parameter is a step length parameter that usually equals 1; if < 1, it allows one to control the algorithm's convergence and to stabilize the algorithm in the case of marginal solutions. Approximating the L 0 norm, smaller values of seem to be a good choice. Matrix A basically contains the derivatives of the approximated L 0 norm: The algorithm is terminated when |ˆ( k+1) −ˆ( k) |/|ˆ( k) | ≤ , for a fixed > 0. Concerning the tuning, the constant c > 0 guarantees differentiability, determines the steepness of the logistic function. Both parameters have to be determined subject to the scale of the (coded) covariate x. However, in penalized regression models, the covariates are usually scaled and/or standardized (E(x) = 0 and V(x) = 1). In such settings, c = 10 −5 works quite well in our experience. When is sufficiently large, the coefficients paths look like step functions; the steps occur when the coefficients are merged and as the shift of the estimates is relatively large compared to the change of .
As long as the approximation is close enough to the L 0 norm, the concrete choice of has no major impact on the result's quality. However, for different tuning parameters, the scale of changes; if is too large, there may be convergence problems.
The structure of the objective function is not trivial. As the penalty is not convex, there is no guarantee that the proposed algorithm finds the global minimum of the objective function. However, the results are very plausible. Given that the tuning parameter is in a realistic range, the results for different initial values do not differ essentially in a majority of cases. We recommendˆ( 0) = 0 T or to combine the default approach of the R function glm() ( (0) = y for the loss function) with the initial valueˆ( 0) = 1 T for the approximation of the penalty (referred to as 'default set of initial values'). Furthermore, the results for different initial values should be checked. Comparisons with a simulated annealing algorithm (Xiang et al., 2013) that is appropriate for complex optimization problems, show that the deviations of the PIRLS algorithm from the simulated annealing are small for the relevant range of . The L 0 approach of Rippe et al. (2012) for signals works with a different approximation but also with a PIRLS algorithm and obtains similar results. Fan and Li (2001) propose to approximate the SCAD penalty that has a similar curvature, by a PIRLS algorithm; comparisons with the exact estimate in an orthonormal setting approves this procedure. Oelker and Tutz (2013) give detailed information on the local quadratic approximation. The algorithm is implemented in the R package gvcm.cat (Oelker, 2013;R Core Team, 2013).

The General Case with Multiple Predictors
So far, we assumed that there is only one predictive factor x. Of course, this is not the standard case and, in what follows, we assume that there are p nominal and/or ordered predictive factors x j with k j + 1 levels each. The design matrix is still denoted by X , but now X ∈ R n×q , where q = 1 + p j=1 k j ; X contains p dummy coded predictors and an intercept. The according penalty is defined as: where J j equals penalty (3.1) for ordered factors and penalty (3.2) for nominal factors. The parameterˇj = (ˇj 1 , . . . ,ˇj k j ) T denotes the vector of coefficients linked to the j-th covariate. The computational issues are not affected by this generalization.
However, the tuning should be adjusted as the penalty terms of several factors with different numbers of levels and measured on different scales (nominal/ordered) should be comparable. Bondell and Reich (2009) argue for example, that there is a bijective relation between the standardization of the data and weighting the penalty terms if one penalty term relates to one covariate and if the penalty is a norm; for example, in case of the Lasso for p continuous covariates. Bondell and Reich (2009)  is appropriate; see, Gertheiss and Tutz (2010). The weights consider the number of observations per level and the number of differences in the penalty. They can be combined with the L 0 penalty easily. Depending on the concrete content, different weighting schemes can be reasonable; alternatively, one could for example, think of J j (ˇj) = const. ∀j.

L 0 Penalization and Information Criteria
There is a relation between L 0 penalization and model selection by information criteria like the AIC or the BIC: one minimizes where AIC = 1 for the AIC and BIC = log(n)/2 for the BIC. The degrees of freedom df(model) are the number of influential parameters in the model and therefore equal p j=0 ˇr 0 . Let us first consider a model with binary predictors only. Then the proposed L 0 penalty has the form P bin (ˇ) = p j=1 ˇj 0 . Hence, it holds that P bin (ˇ) + 1 = df(model).
That is, when the tuning parameter of the proposed L 0 approach is fixed to the values AIC or BIC , the objectives of the L 0 approach and of model selection based on information criteria AIC/BIC coincide. However, the computational approach differs: for model selection based on information criteria, unconstrained models with all possible subsets of coefficients are compared by criterion (3.5). The L 0 approach optimizes the approximated, constrained objective and does it without subsets.
Selection problems are much more complex if one has p categorical covariates with k j + 1 levels each, because then, in addition to simple selection of relevant variables, one also wants to investigate which categories of the categorical predictor have to be distinguished. In best subset selection with categorical predictors, all models that can be built by the fusion of categories must be considered as candidate models. It is well known from cluster theory that the number of such candidate models increases strongly with the number of categories per predictor (Jain and Dubes, 1988 In this general case, the degrees of freedom are defined as the number of coefficients in a model that are unlike and unequal to zero. They are given by df(model) = 1 + p j=1 k j r=1 ˇj r 0 s<r ˇj r −ˇj s 0 -which is unequal to the proposed penalties. However, the proposed penalties can be applied to the same situations as best subset selection for categorical covariates. In contrast to best subset selection, the proposed penalties do not need candidate models and-as we will see later on-they are nevertheless feasible for more complex models. As the tuning parameter can be varied, information on the order of fusions of coefficients is obtained. Hence, the proposed penalty is not only an attractive alternative to Lassotype penalties, but as well an alternative to model selection based on information criteria.

Illustration and Numerical Experiments
In this section, we investigate some aspects of the proposed approach by numerical experiments. We start with an illustrative example followed by some experiments on the estimation accuracy and on the clustering/selection performance when the tuning parameter is chosen according to cross-validation criteria.

Illustrative Example
We consider a linear model with the two ordered predictors x 1 and x 2 . Both predictors have four levels and are drawn from a multinomial distribution with equal probabilities for each level. The response is Gaussian;ˇt rue = (ˇi nt ,ˇ1 2 ,ˇ1 3 , 14 ,ˇ2 2 ,ˇ2 3 ,ˇ2 4 ) T = (2, 0.8, 0.6, 0.4, −0.6, −0.6, −0.4) T and Var(y i |X ) = 1 ∀i. All levels of x 1 have different impact on the response whereas the levels 2 and 3 of x 2 are influential but do not need to be distinguished. We generate n = 100 observations and consider two models: the proposed L 0 penalty, that is, penalty (3.1) for the two ordered factors, and a penalty with the same differences but with the L 1 norm instead of the L 0 norm. There is one global tuning parameter in both models, which is chosen by a generalized cross-validation criterion (GCV) as for example defined in O' Sullivan et al. (1986) and as used in the R package mgcv (Wood, 2011). The GVC criterion is given by GCV = n · dev/(n − df(model)) 2 , where the deviance is defined as dev(y,ˆ ) = −ϕ(l n (y,ˆ , ϕ) − l n (y, y, ϕ)), where l n (·) denotes the log-likelihood; df(model) is estimated by the trace of the hat matrix W T/2 (k * ) X (X T W (k * ) X + A ) −1 X T W 1/2 (k * ) of the final iteration (k * ) of the PIRLS algorithm. Figure 3 (top) shows the resulting coefficients paths and the GCV score for the L 0 penalization with tuning parameters c = 10 −5 , = 0.05, = 60. At the very right end of the coefficient paths, the ML estimates are displayed; at the very left end, the estimate for the minimal tuning parameter that gives maximal penalization is shown. As there are no shrinkage effects, for some values of , the estimates are the same. The coefficient path looks like a horizontal tree. The GCV score in the right panel is a step function that jumps when the estimate changes. This happens because the GVC criterion does not depend on the tuning parameter ; for identical estimatesˆ 1 =ˆ 2 , the GCV scores are the same. Hence, the function has no clear minimum; we choose the maximal value of with minimal GCV score as CV . The optimal model is marked by a dotted line at CV = 0.36. In this model, levels 2 and 3 of predictor x 2 have the same impact on the response (ˆ2 2 =ˆ2 3 = −0.41, 24 = −0.40). Levels 3 and 4 of predictor x 1 are falsely fused (ˆ1 3 =ˆ1 4 = 0.30). The estimates of the remaining effects are:ˆi nt = 2.05,ˆ1 1 = 0.81. In the optimal model with the same differences but with a L 1 norm in the penalty, two coefficients are falsely fused (ˆL 1 = (2.04, 0.60, 0.27, 0.27, −0.37, −0.27, −0.26) T ). Figure 3 (bottom) shows the coefficient paths and the GCV score accordingly. In contrast to the L 0 penalty, the path is characterized by steady shrinkage effects; the GCV score is a continuous function with a clear minimum. For the L 1 penalty, the shrinkage effect is slightly bigger as for the L 0 penalty: the sum of squared errors are

Choice of the Tuning Parameter
As for every penalized approach, the choice of the tuning parameter is a crucial issue for L 0 penalization. In the illustrative example, we employ a GCV criterion that requires an estimate of df(model) and that gives concurrent jumps in the coefficient paths and in the GCV score. An alternative, frequently used approach to choose the tuning parameter is K-fold cross-validation with the predictive deviance as loss criterion. K-fold cross-validation relies on models estimated on different training/test data sets for different values of . As we approximate the L 0 norm which is not continuous, the estimate can change abruptly even if the tuning varies only slightly. The values of at which the estimate changes, will not be the same for all training data set. Thus, depending on the chosen folds, for K-fold cross-validation, the overall cross-validation score can be quite wiggly. Therefore, we compare the performance of the GCV criterion and of 5-fold cross-validation in Section 4.3.

Performance
To evaluate the overall performance of the penalties, we consider the estimation accuracy, the prediction accuracy and the error rates of the selection and clustering process. The estimation accuracy is assessed by the squared errors in terms of coefficients: SSE = 1 q q j=1 (ˇt rue j −ˆj) 2 , whereˇt rue denotes the vector of true coefficients andˆthe estimate of the current simulation run. The median of all squared errors is the robust estimate for the mean squared error (MSE) of a method. The prediction accuracy is assessed by the predictive deviance and referred to as MSEP. To judge the model selection process, we consider the selection and the clustering of coefficients separately; the selection of coefficients refers to the coefficients (ˇj l = 0) whereas the clustering process refers to the differences of coefficients (ˇj l =ˇj m ). We distinguish between false positive rates (fraction of truly zero coefficients that are set to non-zero, FP) and false negative rates (fraction of truly non-zero coefficients that are set to zero, FN). We focus on four settings. A setting similar to the illustrative example of Section 4.1 is considered in more detail, it is referred to as G3. In addition, a setting with Gaussian response and 50 nominal predictors is investigated (G50). Settings with Poisson distributed and with binomial distributed response are analyzed (P8, B8). For each setting, 100 replications are considered; for each replication, we compute the ML estimate, the estimate obtained with the L 1 penalty, the estimate obtained with the adaptively weighted L 1 penalty and the estimate obtained with the proposed L 0 penalty. Moreover, we combine the proposed L 0 penalty with the same adaptive weights as employed for the adaptively weighted L 1 penalty. For all penalized approaches, the tuning parameter is chosen by the GCV criterion and by 5-fold cross-validation with the predictive deviance as loss criterion (CV). In addition, a model selection method for categorical predictors is implemented. The method is based on the information criteria AIC and BIC and compares not only all possible subsets of coefficients, but as well all possibilities to fuse different numbers of levels of a categorical predictor. This method is referred to as AIC, BIC respectively.
In the setting G3, there are three nominal covariates with four levels each; true = (ˇi nt ,ˇT 1 ,ˇT 2 ,ˇT 3 ) T = (1, (0, −1.5, −1.5), (0, 0, 2), (−3, −3.5, 4)) T ; in each replication, n = 50 observations are generated. The upper left panel of Figure 4 shows the boxplots of the squared errors. Apart from some outliers, the estimation accuracy of all considered approaches is approximately the same. This is typical: in standard situations, (adaptive) L 1 and L 0 penalization do not show substantial differences. However, as seen in Table 1, the L 0 approach produces more parsimonious and interpretable models. The methods based on information criteria are characterized by the highest FN rates. Comparing the L 1 penalization with and without adaptive weights, the adaptive weighting improves the FP rates substantially. Comparing the adaptively weighted L 1 and the adaptively weighted L 0 , the clustering performance is substantially enhanced by the L 0 penalty. Again, this is typical: with the L 0 penalty, the false positive rates are lower while it can happen that the false negative rates increase slightly in comparison with L 1 penalization.
In setting G50, there are 50 nominal covariates with four levels each;ˇt rue is a vector of length 151, it contains 72 non-influential coefficients and 54 truly different effects; n = 500. In contrast to setting G3, for this and the following settings, model selection based on information criteria is not feasible anymore on a default computer. The lower panel of Figure 4 depicts the squared errors of setting G50. It stands out that the L 0 penalized models perform slightly better than the pure L 1 penalized approaches. Regarding the FP/FN rates in Table 1, it is even more obvious that the proposed L 0 approach generates more parsimonious models. Overall, the approach 'L 0 , CV' performs the best.
In setting P8, there are four influential nominal covariates;ˇt rue = (ˇi nt ,ˇT 1 ,ˇT 2 ,ˇT 3 ) T = (2, (0, −1.2, −1.2), (1.4, 1.4, 0), (0.4, 0.6, 0.8), (−0.7, −1, −1.3)) T . We assume four more non-influential, nominal covariates which are to be detected. For an observation i, the assumed predictor is Á model i =ˇi nt + 8 j=1 x T ijˇj ; n = 100. In Figure 5, the squared errors and the PMSE of the penalized methods are distinctively smaller than of the ML estimates. In Table 1, the L 0 approach reduces the false positive rates even more as the adaptively weighted L 1 penalty. However, for the L 0 approach, the CV performs better than the GCV criterion. A possible explanation is that df(model) is not estimated as good as before. In the optimal model obtained by L 0 penalization and GCV, df(model) is estimated adequately in only eight of 100 cases (was 52 in setting G3).
In setting B8, there are four influential and four non-influential ordered predictors. In contrast to the previous settings, the distribution of the categorical covariates is not balanced; for the n = 400 observations, the covariates are drawn from a multinomial distribution with sampled probabilities between 0.12 and 0.44. In this setting, it can happen that the unpenalized estimate is quite extreme. As the adaptive weights depend on the quality of the ML estimate, they rely on estimates with a slight Ridge penalty for all coefficients (in the PIRLS algorithm A is replaced by A ridge = diag(0.2)). As seen in Figure 5, the estimation accuracy of the approximated L 0 penalty is slightly worse than that of the Lasso penalties. Concerning the selection and clustering performance in Table 1, it stands out that the FN rates are quite high when the L 0 penalty is combined with the CV criterion. For these approaches, the optimal values of are relatively large, too. Apparently, the sample size of the training data sets is too small for differentiated estimates. Hence, in settings like B8, the CV criterion is not recommended for L 0 -type penalties. In contrast, with the GCV criterion, the clustering and selection performance of the L 0 /GCV penalized models is better than that of the corresponding L 1 penalized approaches. In general, L 0 penalized models seem to be sparser; the L 1 penalized models tend to have smaller FN rates. Even though there is less shrinkage in the coefficients paths obtained with L 0 penalization, in general, the MSE of the L 0 approach is not smaller than the MSE of the L 1 approach as the estimates obtained with the L 0 approach are more sensitive to variations in the data. In standard situations, the adaptively weighted L 1 penalty and the L 0 approach perform comparably in terms of the estimation accuracy. In terms of variable selection, the L 0 approach has a higher incentive to cluster categories and reaches smaller FP while slightly enlarged FN are possible. Combining the L 0 approach with adaptive weights enhances the clustering and variable selection performance distinctly. With the L 0 penalization, we obtain stable results in settings where the computation of all possible subsets of coefficients which is needed for model selection based on information criteria, is not possible/efficient.

Unemployment Rates in Germany
We analyze the unemployment rates of the federal states of Germany in the years 2005 to (Weise et al., 2011. The data is given in Table 2. For each of the 16 federal states, there are six annual unemployment rates observed: (state it , rate it ), i = 1, . . . , 16, t = 2005, . . . , 2010. The aim is to find states with the same trends in the unemployment rates while accounting for the heterogeneity amongst the 16 units. Mixed models are the default approach to such data; see, for example, Molenberghs and Verbeke (2005). If one wants to model the unemployment rates by a mixed model, a potential predictor is Á it =ˇi nt + b int,i +ˇ1 · time; whereby b int,i , i = 1, . . . , 16, are random effects for which a distribution is assumed, typically a normal distribution with variance 2 b : b int,i ∼ N(0, 2 b ). Clustering federal states with similar effects relates to identical random effects and hence, requires sophisticated distributional assumptions; for example, a mixture distribution of Gaussian components. This in turn requires elaborate estimation theory, for details, see for example Heinzl (2013). Moreover, the data is positively skewed. There are high unemployment rates, but they occur rarely. The response seems to be rather Gamma than Gaussian distributed.
Hence, we assume a fixed effects model with Gamma distributed response and a logarithmic link function. The predictor contains one intercept per federal state and a global temporal trend: Á it =ˇi nt,i +ˇ1 · time, with i = 1, . . . , 16.
(5.1) To cluster the federal states, in a first model, the subject-specific intercepts are penalized by penalty (3.2), whereby differences to the reference category are omitted as there is none; that is, all pairwise differences of intercepts are penalized by the L 0 norm: In a second model with the same predictor, the spatial structure of the federal states is considered. Weights w r,s are defined as indicators for states with a common border (w r,s = 1 if neighbored, w r,s = 0 else). We will refer to this model as the 'spatial' model. For both models, the tuning of the algorithm is similar: = 36, spatial = 26, = 0.5. The tuning parameter is chosen by the generalized cross-validation criterion of O' Sullivan et al. (1986). It yields CV = 0.14 and spatial CV = 0.05. Figure 6 shows the coefficient paths of the subject specific intercepts for both models. The left panel relates to the first model with penalty (5.2), the right one to the spatial model. In both models, there are basically two clusters of federal states. The upper cluster contains the states of the former German Democratic Republic (GDR) including Berlin plus the city state Bremen. Interestingly, with the spatial weights, the city state Bremen switches the cluster for a relatively large value of . Figure 7 illustrates the resulting clusters for the optimal choice of in a map of Germany. The darker the coloring, the bigger is the subject-specific intercept and the higher is the unemployment rate over the time.
In the left panel, the ML estimates are illustrated. Even though all estimates differ, the pattern of the former GDR in the north-east is clearly seen. In the middle, the subject specific intercepts are clustered by the pairwise penalty (5.2). Here, the states  of the former GDR plus Bremen form one cluster with the biggest impact on the response (ˆi nt,i = 2.89). The effects of the southern states Baden-Würtenberg and Bayern are the lowest (ˆi nt,BW = 1.90,ˆi nt,BW = 1.91). The remaining states except for Rheinland-Pfalz (ˆi nt,RP = 2.12) are clustered; the according intercept isˆi nt,i = 2.39. The right panel of Figure 7 illustrates the results of the spatial model. The results resemble the middle panel; however, the picture is more differentiated: the states of the former GDR form one cluster (ˆs patial int,i = 2.93), but there are slightly different estimates for the states Thüringen and Bremen (ˆs patial int,TH = 2.78,ˆs patial int,HB = 2.80). The effects of Baden-Würtenberg and Bayern are the same as before; but in the west, only Hamburg, Niedersachen and Schleswig-Holstein are clustered (ˆs  Heinzl (2013) obtains similar clusters for the same data by fitting a linear mixed model with Dirichlet process mixtures using the EM algorithm. However, the computational burden for such models is higher.

Summary
In this article, we propose L 0 penalization for categorical effects in GLMs. The penalty works on differences of coefficients and accounts for the different amount of information contained in nominal and ordered factors. Unlike Rippe et al. (2012), we provide a classical regression framework for L 0 penalization. Computational issues are met by a local quadratic approximation which can be traced back to (Fan and Li, 2001). The approximation allows one to derive a PIRLS algorithm; that is, all features of Fisher scoring algorithms are sustained. It is possible to obtain coefficient paths.
Applying L 0 penalization to plain coefficients with fixed tuning has a close relation to best subset selection. As the L 0 approach allows for more flexible terms such as difference in the penalty and as it works for more complex models, L 0 penalization is an attractive alternative to model selection based on information criteria. In an illustrative example and several numerical experiments, the proposed method is competitive; however, it requires carefully tailored tuning. The range of applications for L 0 penalization is huge: applied adequately to continuous covariates, it is an alternative computational approach to model selection based on information criteria; the application to subject specific intercepts is convincing; the computational framework allows one to combine L 0 penalization easily with other types of (smooth) covariates.

SUPPLEMENT A
Proofs for Section 2.

SUPPLEMENT B
Data and R Code to replicate the results of Section 5.