Automatic grouping using smooth-threshold estimating equations

Abstract: Redundant statistical models are common in practical data analysis. The most widely investigated form of redundancy is the inclusion of irrelevant predictors, which is resolved by setting their coefficients to zero. It is also useful, however, to consider overlapping parameters whose values are similar. Grouping, i.e., regarding a set of such parameters as a single parameter, contributes to a parsimonious parameterization and improves estimation accuracy through dimension reduction. This paper proposes a data-adaptive automatic grouping of parameters that simultaneously enables variable selection, and hence can yield sparse solutions, by applying smooth-thresholding. The new procedure is applicable to several estimating equation-based methods and is shown to possess the oracle property. No convex optimization is needed for its implementation. Numerical examinations, including a large p small n situation, are performed. The proposed automatic grouping is applied to interaction modeling for Ohio wheeze data and for credit scoring data.


Introduction
It is typical in regression modeling, for example with economic, social, and clinical data, that as many kinds of predictors as possible are sampled to avoid overlooking important factors. The most widely investigated form of redundancy is the inclusion of irrelevant predictors, which is resolved by setting the corresponding regression coefficients to zero through model selection or hypothesis testing, i.e., the variable selection problem. On the other hand, it can be useful to consider redundancy caused by the inclusion of overlapping parameters whose regression coefficients are similar, which might arise for functional reasons. This similarity does not necessarily require that the predictors be highly correlated, because a latent factor common to the predictors may influence the response even though the predictors behave independently. Thus, the concept of overlapping parameters is different from the grouping effect of [26] in terms of how the association with the response is treated.
Consider a d-dimensional multiple regression model $y_i = \sum_{j=1}^{d} X_{ij}\theta_j + \epsilon_i$, $i = 1, \ldots, n$, with regression coefficients $\theta_j$, predictors $X_{ij}$, and random errors $\epsilon_i$. Suppose that there is a set of group indices $Q \subset \{1, \ldots, d\}$ such that $\theta_j = \theta_k$ for every pair $j, k \in Q$. Clearly, provided that we know Q, the $\theta_j$ for $j \in Q$ are better estimated by regarding them as a single parameter rather than as separate parameters. This is equivalent to considering the revised regression model $y_i = X_{iQ}\alpha_Q + \sum_{j \in \{1,\ldots,d\} \setminus Q} X_{ij}\alpha_j + \epsilon_i$, $i = 1, \ldots, n$, in which $X_{iQ} = \sum_{j \in Q} X_{ij}$ is the grouped predictor, $\alpha_Q$ is the corresponding regression coefficient, and $\alpha_j = \theta_j$ for $j \in \{1,\ldots,d\} \setminus Q$. This in turn implies that careful consideration of the validity of introducing $X_{iQ}$ into the model is needed. For example, simply adding continuous and categorical predictors sounds rather odd, and rescaling predictors alters the relationships between regression coefficients. The simplest case is that in which the predictors are all binary and mutually independent, which turns out to be the problem of pooling categories in a contingency table. Even without independence between predictors, the regression model can represent an additive effect on the response of the number of categories a sample falls under, which seems a reasonable model in practical statistical modeling. Thus, we start with binary predictors in the following examples, although the application is not limited to binary predictors. Binary predictors are often useful in handling categorical predictors as well as their interaction effects (see Section 5).
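As a toy illustration of this reparameterization (not from the paper; the simulated data and names are illustrative), the following sketch sums the columns of a hypothetical group Q into one grouped predictor and refits by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
# Hypothetical binary predictors and a model in which theta_2 = theta_3,
# i.e. Q = {2, 3} (1-based) is a group of overlapping parameters.
X = rng.binomial(1, 0.3, size=(n, d)).astype(float)
theta = np.array([1.0, 2.0, 2.0, -1.0])
y = X @ theta + rng.normal(size=n)

# Revised model: replace the columns in Q by the grouped predictor
# X_iQ = sum_{j in Q} X_ij, leaving one free coefficient alpha_Q.
Q = [1, 2]  # 0-based indices of the group
X_grouped = np.column_stack([X[:, 0], X[:, Q].sum(axis=1), X[:, 3]])

full_fit, *_ = np.linalg.lstsq(X, y, rcond=None)             # 4 free parameters
grouped_fit, *_ = np.linalg.lstsq(X_grouped, y, rcond=None)  # 3 free parameters
```

With Q known, the grouped fit estimates $\alpha_Q$ from the pooled information of both columns, which is the accuracy gain from dimension reduction described above.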
Since the overlapping parameters are unknown in advance, we need to infer them from the data. An exhaustive search over possible combinations of parameters is virtually impossible if the number of parameters is large. To address this issue, this paper develops an automatic grouping method using contemporary variable selection techniques [16, 26, 25, 18, 1, 4], enabling data-adaptive grouping of parameters as in Bondell and Reich [1]. The proposed method is an extension of smooth-thresholding [18] that carries out simultaneous grouping and variable selection, i.e., zero parameter estimates are also produced. It is shown that the method possesses model selection consistency [6, 25, 18] in this context, termed the grouping consistency. The optimization technique required in the smooth-thresholding of Ueki [18] is a Newton-Raphson-type method that does not require convex programming; hence it is computationally efficient and stable. $L_1$-type penalization [1] is an alternative choice, but it needs convex programming. Smooth-thresholding is therefore more advantageous, particularly for estimation problems in which iterative optimization is needed, such as logistic regression and generalized estimating equations.
Our method requires an initial estimate, as in the adaptive lasso [25]. The simplest choice is an unpenalized estimator when the number of parameters is smaller than the sample size. When the model has more parameters than samples, we suggest employing the elastic net as the initial estimate, since it can handle the grouping effect [26]. The tuning parameters involved are selected by a BIC-type criterion [21, 20, 18]. Because the method is estimating equation-based, it has wide applicability, e.g., to the generalized estimating equation [14, 17] and the Buckley-James estimator [2, 13].
Our proposal is applicable to interaction modeling for categorical predictors. The aim is similar to that of Choi, Li and Zhu [4], who propose a variable selection method for the model with the strong heredity constraint [9]. The model obtained from their method is more interpretable, since it allows an interaction to be present only when the corresponding main effects exist. The strong heredity constraint is appropriate in analyzing designed experiments [3, 12]. However, other interaction models may be plausible in other fields, e.g., where an interaction effect appears without a main effect. In such situations, our method is useful owing to its flexibility in grouping separate categories of an interaction. Real data examples given in Section 5 demonstrate our procedure in detail.

Grouping by smooth-thresholding
Suppose that we have a dataset $X = (X_1, \ldots, X_n)$ of size n from a distribution having a parameter vector $\theta^* = (\theta^*_1, \ldots, \theta^*_d)^T$ defined in $\Theta$, a subset of $\mathbb{R}^d$. This paper considers estimation of $\theta^*$ via estimating equations $u(\theta) = 0$, where $u(\theta) = \sum_{i=1}^{n} u(X_i; \theta)$ and $u(x; \theta) = \{u_j(x; \theta)\}_{j=1,\ldots,d}$ is the d-dimensional vector-valued estimating function for estimating $\theta^*$. For example, if u is the score function, then the procedure becomes maximum likelihood estimation. Hereafter we assume that u satisfies $E\{u(X; \theta^*)\} = 0$, $E\{\|u(X; \theta^*)\|^2\} < \infty$, and suitable regularity conditions for consistency and asymptotic normality of the full model estimator [e.g., 19, Ch. 5]. The additional assumption in this paper is that the full model redundantly involves overlapping parameters. The specific underlying parameter structure is given in Section 3.1. Returning to the example in Section 1, the saturated model has d parameters, $\theta = (\theta_1, \ldots, \theta_d)^T$, but in reality consists of only $d - |Q| + 1$ intrinsic parameters such that $\theta_j = \alpha_Q$ for $j \in Q$ and $\theta_j = \alpha_j$ otherwise. The notation $|\cdot|$ represents the cardinality of a set. Our purpose is to identify the genuine relationship, which is of course unknown, and simultaneously to estimate it as accurately as possible. Smooth-thresholding [18] is intended to connect the estimating equations of several submodels smoothly.
The procedure of [18] can be viewed as a weighted $L_2$-penalization of $L(\theta)$, where L represents the criterion to be minimized corresponding to u, with the weighted $L_2$-penalty function $\sum_{j=1}^{d} \hat{w}_j \theta_j^2/2$ in which $\hat{w}_j = \hat{\delta}_j/(1 - \hat{\delta}_j)$. Enlarging this perspective leads to a simple automatic grouping procedure, as used by Bondell and Reich [1], who employ $L_1$-penalization. Instead of $\sum_{j=1}^{d} \hat{w}_j \theta_j^2/2$, we propose to use the penalty function
$$h(\theta) = \sum_{1 \le j < k \le d} \hat{w}_{jk} (\theta_j - \theta_k)^2/2, \qquad \hat{w}_{jk} = \frac{\hat{\delta}_{jk}}{1 - \hat{\delta}_{jk}}, \quad \hat{\delta}_{jk} = \min\left(1, \frac{\lambda}{|\hat{\theta}^{\rm ini}_j - \hat{\theta}^{\rm ini}_k|^{\gamma+1}}\right),$$
for given tuning parameters $\lambda$ and $\gamma$, where $\hat{\theta}^{\rm ini}$ is an initial estimate. Other choices of $\hat{\delta}_{jk}$ may be possible depending on the problem, such as choices based on the correlation between the jth and kth predictors. Analogously to variable selection, the restriction $\theta_j = \theta_k$ is produced when $\hat{\delta}_{jk} = 1$, which is equivalent to $|\hat{\theta}^{\rm ini}_j - \hat{\theta}^{\rm ini}_k|^{\gamma+1} \le \lambda$. Although it may happen that $\hat{\delta}_{12} = \hat{\delta}_{23} = 1$ but $\hat{\delta}_{13} < 1$, the restriction $\theta_1 = \theta_3$ is automatically imposed owing to the fact that there is only one free parameter under the restrictions $\theta_1 = \theta_2$ and $\theta_2 = \theta_3$. Furthermore, it is possible to perform variable selection as well as grouping. To this end, introduce a 0th parameter $\theta_0$, which is always restricted to zero, and construct $h(\theta)$ in the same manner with $\hat{\theta}^{\rm ini}_0 = 0$. The penalty function is then modified as
$$h(\theta) = \sum_{1 \le j < k \le d} \hat{w}_{jk} (\theta_j - \theta_k)^2/2 + \sum_{k=1}^{d} \hat{w}_{0k} \theta_k^2/2, \qquad (2.1)$$
where $\hat{w}_{0k} = \hat{w}_k$ for each k; the second term is the penalty function of [18] for variable selection.
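A minimal sketch of the pairwise smooth-thresholding quantities, assuming the reconstruction above of $\hat{\delta}_{jk}$ and $\hat{w}_{jk}$ (the function and variable names are illustrative):

```python
import numpy as np

def st_pairwise_weights(theta_ini, lam, gamma=2.0):
    """Pairwise smooth-thresholding: delta_{jk} = min(1, lam/|diff|^(gamma+1))
    and w_{jk} = delta_{jk}/(1 - delta_{jk}); w_{jk} = inf marks a fused pair."""
    d = len(theta_ini)
    delta = np.zeros((d, d))
    for j in range(d):
        for k in range(j + 1, d):
            gap = abs(theta_ini[j] - theta_ini[k]) ** (gamma + 1.0)
            delta[j, k] = delta[k, j] = 1.0 if gap <= lam else lam / gap
    with np.errstate(divide="ignore"):
        w = delta / (1.0 - delta)  # inf exactly when delta == 1
    return delta, w

# A pair of nearly equal initial estimates is fused; a distant one is not.
delta, w = st_pairwise_weights(np.array([0.50, 0.52, 3.0]), lam=0.01, gamma=2.0)
```

An infinite weight forces $\theta_j = \theta_k$ in the penalized fit, which is exactly the grouping restriction discussed above.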

Implementation
Simple algebra rewrites the penalty function in (2.1) in a more convenient quadratic form, $h(\theta) = \theta^T \hat{W} \theta/2$, where $\hat{W}$ is the $d \times d$ matrix having kth diagonal entry $\sum_{j=0, j \neq k}^{d} \hat{w}_{jk}$ and off-diagonal (j, k)-entry $-\hat{w}_{jk}$. Let $S_{\rm full} = \{1, \ldots, d\}$, and denote the jth unit vector in d-dimensional Euclidean space by $e_j$; we also let $e_0$ denote the d-dimensional zero vector. Define the active set A as the complement of the inactive set; the active set consists of the smallest index in each group. We also define d row vectors $R_1, \ldots, R_d$, where $R_j = e_j^T$ for $j \in A$ and $R_j = e_{\xi(j)}^T$ for $j \in A^c$. Here the function $\xi(j)$ is defined as follows. Let $f(j) = \min(i \in S_{\rm full} \cup \{0\} : \hat{\delta}_{ji} = 1)$ and define $f^{(l)}(j) = f\{f^{(l-1)}(j)\}$ with $f^{(0)}(j) = f(j)$ for $l = 1, 2, \ldots$. Then $\xi(j)$ is defined to be the fixed point $f^{(l)}(j)$ satisfying $f^{(l)}(j) = f\{f^{(l)}(j)\}$ for some non-negative integer l. The collection of $R_1, \ldots, R_d$ is denoted by the $d \times d$ matrix R, and the set $\{\theta \in \Theta : \theta = R\alpha\}$ with an intrinsic parameter $\alpha \in \Theta$ represents the resulting parameter space, whose dimensionality is |A|. Indeed, if the jth column of R is the zero vector, which corresponds to the event that index j is included in $A^c$, then $\alpha_j$ does not appear in $R\alpha$. Notably, $\theta_j = (R\alpha)_j$ for $j \in A^c$ with $\xi(j) = 0$ is exactly zero, i.e., a sparse solution.
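The construction of $\xi(j)$ and R can be sketched as follows (a simplified 0-based reading of the definitions above; the helper names are our own, and the index $-1$ plays the role of the 0th pseudo-parameter):

```python
import numpy as np

def build_R(delta, delta0=None):
    """Build the d x d matrix R from pairwise fusion indicators.

    delta[j, k] >= 1 marks theta_j = theta_k; delta0[j] >= 1 marks fusion
    with the 0th pseudo-parameter, i.e. theta_j = 0 (index -1 here).
    Row j of R is e_j for active indices, e_{xi(j)} for fused ones, and
    the zero vector when xi(j) reaches the 0th parameter."""
    d = delta.shape[0]
    delta0 = np.zeros(d) if delta0 is None else delta0

    def f(j):
        # smallest index fused with j; -1 stands for the 0th parameter
        if delta0[j] >= 1.0:
            return -1
        fused = [i for i in range(j) if delta[j, i] >= 1.0]
        return min(fused) if fused else j

    def xi(j):
        # iterate f to its fixed point
        prev, cur = j, f(j)
        while cur >= 0 and cur != prev:
            prev, cur = cur, f(cur)
        return cur

    R = np.zeros((d, d))
    for j in range(d):
        t = xi(j)
        if t >= 0:
            R[j, t] = 1.0
    return R

# theta_1 = theta_2 and theta_2 = theta_3 (so theta_1 = theta_3 by
# transitivity), and theta_4 is set to zero.
delta = np.zeros((4, 4))
delta[0, 1] = delta[1, 0] = 1.0
delta[1, 2] = delta[2, 1] = 1.0
R = build_R(delta, delta0=np.array([0.0, 0.0, 0.0, 1.0]))
```

Iterating f handles the transitivity issue noted above: even if the (1, 3) pair is not directly fused, both rows end up pointing at the same smallest index.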
Denote by $R_A$ the $d \times |A|$ matrix consisting of the jth column vectors of R for $j \in A$; similarly, denote by $\alpha_A$ the |A|-dimensional active parameter vector and by $\hat{W}_A$ the $|A| \times |A|$ sub-matrix of $\hat{W}$. Note that each component of $\hat{W}_A$ is finite. Estimation of the active parameters $\alpha_A$ can then be done by minimizing $L(R_A \alpha_A) + \alpha_A^T \hat{W}_A \alpha_A/2$. Differentiating this with respect to $\alpha_A$, we obtain the estimating equation
$$R_A^T u(R_A \alpha_A) - \hat{W}_A \alpha_A = 0, \qquad (2.2)$$
for which Newton-Raphson-type procedures work effectively. The estimator in the full model space $\Theta$ is given by $\hat{\theta}_{\lambda,\gamma} = R_A \hat{\alpha}_{\lambda,\gamma,A}$, where $\hat{\alpha}_{\lambda,\gamma,A}$ is a solution to (2.2). We could introduce other types of penalization that carry out automatic grouping instead of smooth-thresholding, e.g., the $L_1$-penalty [1] and other penalties, including the elastic net, adaptive lasso, and smoothly clipped absolute deviation, in the manner of Johnson, Lin and Zeng [11]. However, finding the solution can be difficult with penalties that require convex programming. In contrast, smooth-thresholding is free from this issue, which is an advantage over other penalization methods.
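Assuming the reconstructed equation (2.2), a Newton-Raphson sketch for the logistic regression case might look as follows (hypothetical names; R_A and W_A are supplied by the grouping step):

```python
import numpy as np

def grouped_logistic(X, y, R_A, W_A, n_iter=50, tol=1e-10):
    """Newton-Raphson for the grouped, weighted-ridge logistic fit:
    solve R_A^T u(R_A a) - W_A a = 0, where u is the logistic score.
    R_A (d x |A|) and W_A (|A| x |A|) come from the grouping step."""
    Z = X @ R_A                                  # grouped design matrix
    a = np.zeros(R_A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ a))
        score = Z.T @ (y - p) - W_A @ a          # penalized estimating function
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None]) + W_A
        step = np.linalg.solve(hess, score)
        a = a + step
        if np.max(np.abs(step)) < tol:
            break
    return a
```

Each Newton step only solves an |A|-dimensional linear system; no convex programming is involved, which is the computational advantage noted above.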

Illustrative example
It may be somewhat difficult to picture the mechanism underlying our grouping method, so an application to a simple estimation problem is helpful for illustration. Consider the loss function $L(\theta) = \|\theta - x\|^2/2$ with a d-dimensional data vector x.
If $\hat{w}_{jk} \in [0, \infty)$ for all j and k, the matrix $\hat{W}$ defined in Section 2.2 is symmetric, and thus so is $I + \hat{W}$. Since, by definition, $\theta^T \hat{W} \theta = 2h(\theta) \ge 0$ for each $\theta$, we see that $I + \hat{W}$ is invertible. The solution to (2.2) can thus be written as $\hat{\theta} = (I + \hat{W})^{-1} x$. The following result is helpful in understanding our procedure.
Theorem 2.1. All elements of $(I + \hat{W})^{-1}$ are non-negative.
If we further assume that $\hat{w}_{0k} = 0$ for $k = 1, \ldots, d$, we have $\hat{W}\mathbf{1} = 0$, where $\mathbf{1}$ denotes the d-dimensional vector of ones. Using this fact, it is readily seen that $(I + \hat{W})^{-1}\mathbf{1} = \mathbf{1}$, implying that each row of $(I + \hat{W})^{-1}$ sums to unity. Thus, the linear operator $(I + \hat{W})^{-1}$ performs a re-allocation of the components of the data vector x.
In the more general situation where $\hat{w}_{0k} \ge 0$, it holds that $(I + \hat{W})\mathbf{1} = g$, where g is the d-vector whose jth element is $1 + \hat{w}_{0j}$. Hence $(I + \hat{W})^{-1} g = \mathbf{1}$, which states that the re-allocation mentioned above applies to the data $x_j/(1 + \hat{w}_{0j})$, shrunken toward zero. The above argument justifies our grouping method.
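A small numeric check of this re-allocation interpretation, with an arbitrary illustrative $\hat{W}$ and $\hat{w}_{0k} = 0$:

```python
import numpy as np

# An arbitrary W with one large pairwise weight (w_12 = 5) and w_0k = 0.
w12, w13, w23 = 5.0, 0.0, 0.0
W = np.array([[ w12 + w13, -w12,       -w13      ],
              [-w12,        w12 + w23, -w23      ],
              [-w13,       -w23,        w13 + w23]])

M = np.linalg.inv(np.eye(3) + W)
x = np.array([1.0, 3.0, 10.0])
theta_hat = M @ x  # components 1 and 2 are pulled toward each other
```

The large weight between components 1 and 2 shrinks their estimates toward a common value, while the unweighted third component is left untouched.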

Choice of initial estimate
The choice of initial estimate is in practice an important task. Theoretically, it should be root-n consistent (see Section 3). The simplest choice is the unpenalized full model estimator; however, when the number of parameters exceeds the number of samples, an automatic variable selection method is desirable, since the unpenalized estimator breaks down. The lasso [16] cannot appropriately handle grouped predictor variables, whereas the elastic net [26] can overcome this issue. Grouped predictor variables tend to appear in high-dimensional data. Consequently, we recommend using the elastic net as the initial estimate, in particular for high-dimensional data. The numerical experiments in Section 4 provide comparisons between the lasso and elastic net initial estimates.

Oracle property
In this section we analyze the theoretical properties of the proposed method with respect to the oracle property [6, 25, 20, 21, 11, 18, 4]. We assume regularity conditions under which the full model estimator based on the estimating equations $u(\theta) = 0$ is consistent and asymptotically normally distributed [e.g., 19, Ch. 5]. We use an initial estimator $\hat{\theta}^{\rm ini}$ that is root-n consistent for the true parameter vector $\theta^*$. Before stating the theoretical property, we define A and R as in Section 2.1 but based on the true $\theta^*$ instead of $\hat{\theta}^{\rm ini}$, and denote them by $A^*$ and $R^*$. Using $R^*$, we define the intrinsic vector $\alpha^*$ such that $\theta^* = R^* \alpha^*_{A^*}$. Full information on the parameterization is condensed into the matrix R or $R^*$, and these are used to evaluate model selection consistency in this context, termed the grouping consistency. The following theorem states the existence of tuning parameters that confer on the smooth-threshold estimator the same good performance as the oracle estimator.
Theorem 3.1. If $n^{1/2}\lambda \to 0$ and $n^{(1+\gamma)/2}\lambda \to \infty$ for a fixed $\gamma > 0$, then $P(R = R^*) \to 1$ as $n \to \infty$.
Provided that $R = R^*$ holds, the smooth-threshold estimating equation coincides with the oracle equation (A.2) given in the Appendix. The solution to (A.2), denoted by $\hat{\alpha}_{\lambda,\gamma,A^*}$, then possesses asymptotic normality.
Theorem 3.2. Under the same assumptions as in Theorem 3.1, $n^{1/2}(\hat{\alpha}_{\lambda,\gamma,A^*} - \alpha^*_{A^*})$ is asymptotically normally distributed with mean zero and the covariance matrix of the oracle estimator.
The next section discusses the choice of tuning parameters so as to maintain the grouping consistency even after tuning parameter selection. Noting that the theorem holds for an arbitrary fixed $\gamma > 0$, pre-specifying $\gamma$ is beneficial for saving computation [20]. We use $\gamma = 2$ throughout our numerical examples because, in our numerical studies, other choices and a data-adaptive choice based on the BIC resulted in comparable performance. Consequently, $\lambda$ is the only tuning parameter that we need to determine.

Choice of tuning parameters
We propose the following BIC-type criterion for selecting the tuning parameter $\lambda$ of the smooth-threshold estimator:
$${\rm BIC}(\lambda) = \ell(\hat{\theta}_\lambda) + {\rm df}_\lambda \log n,$$
where ${\rm df}_\lambda = |A|$ and $\ell$ is a loss function such as $-2 \times$ the log-likelihood. The selected $\lambda$ minimizes the BIC. To study the properties of the BIC, we prepare some notation. Denote by $\hat{\theta}_H$ the estimator based on the estimating function u over a parameter space $H \subset \Theta$, a d-dimensional vector. Likewise, define $\tilde{\theta}_H$ by $\ell(\tilde{\theta}_H) = \inf_{\theta \in H} \ell(\theta)$. Situations in which $\tilde{\theta}_H = \hat{\theta}_H$ occur, for example, when using the Wald-type loss $\ell_w(\theta) = (\theta - \hat{\theta}_{\rm full})^T \widehat{\rm var}(\hat{\theta}_{\rm full})^{-1} (\theta - \hat{\theta}_{\rm full})$ [10, 20].
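A minimal sketch of these two quantities, assuming the reconstructed forms above (in particular the inverse covariance in the Wald-type loss):

```python
import numpy as np

def bic(loss_value, df, n):
    """BIC-type criterion: loss plus df * log(n)."""
    return loss_value + df * np.log(n)

def wald_loss(theta, theta_full, cov_full):
    """Wald-type loss (theta - theta_full)^T cov^{-1} (theta - theta_full)."""
    diff = theta - theta_full
    return float(diff @ np.linalg.solve(cov_full, diff))

# Toy check: the loss vanishes at the full-model estimate.
theta_full = np.array([1.0, 2.0])
cov = np.array([[0.5, 0.1],
                [0.1, 0.4]])
```

The Wald-type loss attains its minimum of zero at $\hat{\theta}_{\rm full}$ itself, which is what makes the shortcut of the next subsection possible.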
The true parameter space is also represented by $\Theta_{R^*}$, which we write as $\Theta^*$ for brevity. In addition, we impose some assumptions on $\ell$ that are essentially the same as those given in [18].
In the following we give a justification for the use of the BIC. Define $\Lambda = \{\lambda : \lambda > 0\}$. Recalling that $\lambda \in \Lambda$ determines the matrix R, define the ideal tuning parameter set by $\Omega^* = \{\lambda \in \Lambda : \Theta^* = \Theta_R\}$. For sufficiently large n, $\Omega^*$ is not empty, because $\Omega^*$ includes $\lambda$ satisfying $n^{1/2}\lambda \to 0$ and $n^{(1+\gamma)/2}\lambda \to \infty$ by Theorem 3.1. The overfitted tuning parameter set is defined by $\Omega_O = \{\lambda \in \Lambda : \Theta^* \subset \Theta_R, R^* \neq R\}$ and the underfitted tuning parameter set by $\Omega_U = \{\lambda \in \Lambda : \Theta^* \not\subset \Theta_R\}$.
Theorem 3.3. Under Assumptions 3.1 and 3.2, for any $\lambda^* \in \Lambda$ such that $n^{1/2}\lambda^* \to 0$ and $n^{(1+\gamma)/2}\lambda^* \to \infty$ with a fixed $\gamma > 0$, as $n \to \infty$,
$$P\Big\{\inf_{\lambda \in \Omega_O \cup \Omega_U} {\rm BIC}(\lambda) > {\rm BIC}(\lambda^*)\Big\} \to 1.$$
The proof goes along much the same lines as [18] and is thus omitted. The theorem states that, for large n, the BIC selects tuning parameters such as those in Theorems 3.1 and 3.2 rather than those in $\Omega_O$ or $\Omega_U$. It follows from the same argument as in Wang, Li and Tsai [21, Section 3.3] that the BIC selects a tuning parameter yielding the true model, i.e., the minimizer of the BIC falls in $\Omega^*$. As a result, the BIC enables consistent model selection, which justifies its use as a tuning parameter selector. Even when a unique loss function is absent, typically when we resort to generalized estimating equations or the Buckley-James estimator, [10] justifies the Wald-type loss via a relationship to an approximate posterior probability conditional on the parameter estimates. Following [18], we prefer loss functions satisfying $\tilde{\theta}_H = \hat{\theta}_H$. Taking this into consideration, we can instead use an alternative loss function, the score-type loss $\ell_s(\theta) = u(\theta)^T \widehat{\rm var}\{u(\hat{\theta}_{\rm full})\}^{-1} u(\theta)$, which is asymptotically equivalent to the Wald-type loss because $u(\theta_H) \approx \nabla u(\hat{\theta}_{\rm full})(\theta_H - \hat{\theta}_{\rm full})$ for a sub-model H. The score-type loss $\ell_s$ is a loss function such that $\tilde{\theta}_H = \hat{\theta}_H$ for any given H.

Shortcut in model selection
Smooth-thresholding brings a further advantage in choosing tuning parameters if we use the BIC with a loss function $\ell$ yielding $\tilde{\theta}_H = \hat{\theta}_H$ for any given H. We then do not need to prepare candidate tuning parameters, typically through an arbitrary discretization, because the degrees of freedom of the parameters are known once the tuning parameters and initial estimates are specified. Given $\gamma$, arrange the $d(d-1)/2$ pairwise values $|\hat{\theta}^{\rm ini}_j - \hat{\theta}^{\rm ini}_k|^{\gamma+1}$ in ascending order, denote them by $q_1 \le \cdots \le q_K$ with $K = d(d-1)/2$, and define $q_0 = 0$. It is then evident that a change of the active or inactive set occurs only at $\lambda = q_i$ for $i = 0, \ldots, K$, which in turn implies that the degrees of freedom are constant for each $\lambda \in [q_i, q_{i+1})$. Therefore, minimizing the BIC is equivalent to minimizing $\ell(\hat{\theta}_\lambda)$ over $\lambda \in [q_i, q_{i+1})$ for each i. Since $\hat{w}_{jk}$ is monotone increasing in $\lambda$, a smaller $\lambda$ alleviates the extent of penalization; and because, by assumption, $\ell$ corresponds to the loss function used in obtaining $\hat{\theta}_\lambda$, which solves a weighted ridge penalized minimization with respect to the active parameters, $q_i$ is the minimizer of $\ell(\hat{\theta}_\lambda)$ for $\lambda \in [q_i, q_{i+1})$. Consequently, to seek the minimizer of the BIC in $\lambda$, it suffices to evaluate it only at $\lambda = q_i$ for $i = 0, \ldots, K$, so this shortcut procedure finds the exact minimizer of the BIC. The procedure is similar to that of [24].
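The shortcut can be sketched for the quadratic loss of Section 2.3, with the simplification that fused groups are simply averaged and the finite between-group weights are ignored (so this is illustrative, not the exact estimator):

```python
import numpy as np

def groups_at(x, lam, gamma):
    """Fuse j, k whenever |x_j - x_k|^(gamma+1) <= lam (transitive closure)."""
    d = len(x)
    parent = list(range(d))
    def find(j):
        while parent[j] != j:
            j = parent[j]
        return j
    for j in range(d):
        for k in range(j + 1, d):
            if abs(x[j] - x[k]) ** (gamma + 1.0) <= lam:
                rj, rk = find(j), find(k)
                if rj != rk:
                    parent[max(rj, rk)] = min(rj, rk)
    return [find(j) for j in range(d)]

def shortcut_bic(x, n, gamma=2.0):
    """Evaluate the BIC only at the breakpoints q_0 = 0 <= q_1 <= ... <= q_K."""
    d = len(x)
    qs = sorted({0.0} | {abs(x[j] - x[k]) ** (gamma + 1.0)
                         for j in range(d) for k in range(j + 1, d)})
    best = None
    for lam in qs:
        labels = groups_at(x, lam, gamma)
        theta = np.array([np.mean(x[[i for i in range(d)
                                     if labels[i] == labels[j]]])
                          for j in range(d)])  # group means
        df = len(set(labels))
        crit = np.sum((theta - x) ** 2) + df * np.log(n)
        if best is None or crit < best[0]:
            best = (crit, lam, theta, df)
    return best
```

Only K + 1 candidate values of $\lambda$ are visited, one per breakpoint, instead of a full grid search.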

Numerical experiments
In our simulation studies we evaluate the accuracy of grouping using the following measures. Let B be the index set of the group to which the jth parameter $\theta^*_j$ belongs, and define, for each j, the proportion of correctly fused (PCF) with the other members of its group and the proportion of incorrectly fused (PICF) with the members of other groups across N simulations by
$${\rm PCF}_j = \frac{1}{N|B \setminus \{j\}|} \sum_{r=1}^{N} \sum_{i \in B \setminus \{j\}} I(\hat{\theta}_{i,r} = \hat{\theta}_{j,r}), \qquad {\rm PICF}_j = \frac{1}{N|B^c|} \sum_{r=1}^{N} \sum_{i \in B^c} I(\hat{\theta}_{i,r} = \hat{\theta}_{j,r}),$$
where $\hat{\theta}_{i,r}$ denotes the ith parameter estimate in the rth simulation and $B^c$ is the complement of B. Here $I(\cdot)$ is the indicator function. If the obtained grouping is complete, the PCFs and PICFs are one and zero, respectively. For zero parameters, we also define the proportions of correctly and incorrectly setting to zero by
$${\rm PC}_0 = \frac{1}{N|M|} \sum_{r=1}^{N} \sum_{j \in M} I(\hat{\theta}_{j,r} = 0), \qquad {\rm PIC}_0 = \frac{1}{N|M^c|} \sum_{r=1}^{N} \sum_{j \in M^c} I(\hat{\theta}_{j,r} = 0),$$
where $M = \{j \in S_{\rm full} : \theta^*_j = 0\}$ and $M^c$ is its complement. Estimation performance, on the other hand, is evaluated by the average relative absolute error (ARAE) compared with the oracle estimator,
$${\rm ARAE}(\hat{\theta}) = \frac{1}{N} \sum_{r=1}^{N} \frac{\sum_{j} |\hat{\theta}_{j,r} - \theta^*_j|}{\sum_{j} |\hat{\theta}^{o}_{j,r} - \theta^*_j|},$$
where $\hat{\theta}^{o}_{j,r}$ denotes the oracle estimator of the jth parameter in the rth simulation, obtained with the genuine parameterization in hand. An estimator is better when its ARAE is closer to unity.
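Assuming the reconstructed definitions above, the measures might be computed as follows (exact equality of estimates is replaced by a numerical tolerance):

```python
import numpy as np

def grouping_metrics(est, truth, tol=1e-8):
    """PCF/PICF per index and PC0/PIC0 over N simulated estimates.

    est: (N, d) array of estimates; truth: length-d true vector whose equal
    entries define the groups B and whose zeros define the set M."""
    N, d = est.shape
    same_group = np.isclose(truth[:, None], truth[None, :])
    np.fill_diagonal(same_group, False)
    fused = np.isclose(est[:, :, None], est[:, None, :], atol=tol)
    pcf = np.full(d, np.nan)
    picf = np.full(d, np.nan)
    idx = np.arange(d)
    for j in range(d):
        B = idx[same_group[j]]                 # other members of j's group
        Bc = idx[~same_group[j] & (idx != j)]  # members of other groups
        if B.size:
            pcf[j] = fused[:, j, B].mean()
        if Bc.size:
            picf[j] = fused[:, j, Bc].mean()
    zero = np.isclose(est, 0.0, atol=tol)
    M = np.isclose(truth, 0.0)
    pc0 = zero[:, M].mean() if M.any() else np.nan
    pic0 = zero[:, ~M].mean() if (~M).any() else np.nan
    return pcf, picf, pc0, pic0
```

Indices without a group (as in model 2 below) produce an undefined PCF, matching the "−" entries reported in the tables.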

Logistic regression model
We consider an example with a binary response and 10 predictor variables $X = (X_1, \ldots, X_{10})$, of which the first two are quantitative and the others are dichotomous. The response $Y_i$ is sampled independently from a Bernoulli trial with success probability $1/\{1 + \exp(-X_i \beta^*)\}$. We consider two models whose coefficient vectors $\beta^*$ are given in (4.1), where the first two components of $\beta^*$ are assumed to be out of interest for grouping. Regarding the eight remaining parameters, model 1 has two intrinsic parameters with distinct coefficient values $(-4, 0, 4)$, while model 2 has eight different parameters including one zero parameter. Since model 2 has no group, the grouping procedure is redundant there and the initial estimate should be more efficient. The jth predictor variable $X_{ij}$ for the ith sample is generated as follows. First, define the latent variable $Z_i$, a 10-dimensional multivariate normal random variable with mean zero and covariance matrix whose (j, k)-entry is $\rho^{|j-k|}$. Then set $X_{i1} = Z_{i1}$, $X_{i2} = Z_{i2}$, and $X_{ij} = I(Z_{ij} > 1)$ for $j \ge 3$; the latter corresponds to an exposure rate of about 16%. The experiments cover all combinations of $n \in \{100, 200\}$ and $\rho \in \{0, 0.5\}$, with unpenalized logistic regression estimates used as the initial estimates. Tables 1 and 2 summarize the results of the numerical experiments, repeated 500 times. The oracle estimators for models 1 and 2 are the logistic regression estimators under the parameterizations $\beta = \beta(\alpha) = (\alpha_1, \alpha_2, 0, 0, 0, \alpha_3, \alpha_3, 0, \alpha_4, \alpha_4)^T$ and $\beta(\alpha) = (\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, 0, \alpha_6, \alpha_7, \alpha_8, \alpha_9)^T$, respectively. The third and fourth columns of Table 1 give ${\rm ARAE}(\hat{\theta})$ and ${\rm ARAE}_{\rm ini} = {\rm ARAE}(\hat{\theta}^{\rm ini})$, respectively. The fifth column provides the average model dimension (AMD), where the parameters out of interest are excluded; hence the ideal model dimensions are 2 and 7 for models 1 and 2, respectively.
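The predictor design can be sketched as follows (simulation only; the AR(1) latent covariance and the dichotomization follow the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 100, 10, 0.5

# Latent AR(1)-correlated normals: cov[j, k] = rho^|j-k|.
cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
Z = rng.multivariate_normal(np.zeros(d), cov, size=n)

# First two predictors stay quantitative; the rest are dichotomized at 1,
# so each has event rate P(Z > 1) = 1 - Phi(1), about 16%.
X = Z.copy()
X[:, 2:] = (Z[:, 2:] > 1.0).astype(float)
```

Thresholding the latent normals at 1 yields correlated binary exposures with the approximately 16% prevalence mentioned above.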
The sixth column presents three quantities, $p^c_0$, $p^o_0$, and $p^u_0$: $p^c_0$ is the proportion of replications in which all zero parameters are correctly set to zero and all nonzero parameters are set nonzero; $p^o_0$ is the proportion in which all nonzero parameters are correctly set nonzero but not all zero parameters are set to zero; and $p^u_0 = 1 - p^c_0 - p^o_0$ covers the remaining cases. The seventh column (PCM) presents the proportion of identifying the correct model, which accounts for whether the true zero parameters are set exactly to zero. Table 2 presents the PCFs and PICFs defined above for each coefficient, describing the grouping performance in detail. In addition, ${\rm PC}_0$ and ${\rm PIC}_0$ are shown.
For simulations under model 1, the fact that the ARAEs of the proposed grouping method in Table 1 are all smaller than those of the initial estimator demonstrates the improvement over the unpenalized estimator. Table 2 illustrates success both in zero parameter identification and in grouping, particularly for the larger sample size, which coincides with the theoretical results given in Section 3.
For model 2, on the other hand, the ${\rm PCF}_j$s are not defined since there is no group, and they are denoted by "−" in Table 2. The initial estimator works well in model 2, as shown in Table 1 through the ${\rm ARAE}_{\rm ini}$ values, which are close to unity. The ARAEs of the proposed grouping method are comparable to the ${\rm ARAE}_{\rm ini}$s, indicating that our method loses little even when no grouping structure is present. Table 2 also implies that the misclassification rate can be reduced by increasing the number of samples.

Generalized estimating equation
The second example applies to the generalized estimating equation, where each of the n samples has four repeated observations together with the same 10 predictor variables as in the previous example. The response vector $Y_i$ is independently generated from the 4-dimensional multivariate normal $N(X_i \beta^*, V)$, where V is the covariance matrix with (j, k)-element $0.667^{|j-k|}$, i.e., an AR(1) working correlation is specified. We use the two models given in equation (4.1).
The experiments cover all combinations of $n \in \{100, 200\}$ and $\rho \in \{0, 0.5\}$, with the unpenalized generalized estimating equation estimates, under a correctly specified working correlation, used as the initial estimates. Tables 3 and 4 summarize the numerical experiments, repeated 500 times. The simulations in this example result in more accurate grouping and estimation than in the logistic regression example. Although the score-type loss is preferable to the Wald-type loss, as stated in Section 3.2, the model selection results were almost the same; we therefore show the results for the score-type loss only. Tables 3 and 4 also report improved performance as the sample size increases.

Large p small n situation
In this section we consider a large p small n situation in the normal linear model, where n and p represent the numbers of samples and parameters, respectively. We use $p = 10n$ dichotomous predictor variables $X = (X_1, \ldots, X_{10n})$. As in the preceding examples, $X_{ij}$ is defined as $I(Z_{ij} > 1)$ for $j = 1, \ldots, 10n$, with latent variable $Z_i$ following a 10n-dimensional multivariate normal distribution. Continuous responses are sampled from the linear regression model $y_i = X_i \beta^* + \epsilon_i$ with independent and identically distributed errors $\epsilon_i \sim N(0, \sigma_0^2)$, where $\beta^*$ is a 10n-dimensional coefficient vector. The following two models are considered, where $0_{10n-4}$ denotes the zero vector of dimension $10n - 4$. Model 3 has three intrinsic parameters with values $(-4, 0, 4)$, while model 4 has five intrinsic parameters. Excluding zero parameters, the ideal model dimensions are 2 and 4, respectively. In both models we expect the $10n - 4$ zero components to be set exactly to zero. Since ordinary least squares breaks down in the large p small n situation, we use penalized regression methods, the lasso and the elastic net, for the initial estimates. As mentioned in Section 2.3, the elastic net is expected to work better in high-dimensional settings. Tables 5 and 6 show the results of numerical experiments repeated 200 times.

Table 3: Results in the generalized estimating equation example. ${\rm ARAE}(\hat{\theta})$, average relative absolute error of the proposed automatic grouping method; ${\rm ARAE}_{\rm ini}$, average relative absolute error of the initial estimator; AMD, average model dimension.

Table 4: Results in the generalized estimating equation example. ${\rm PCF}_j/{\rm PICF}_j$, proportion of correctly fused with the other members of its group / proportion of incorrectly fused with members of other groups (%), for the indices j to be grouped, with the corresponding coefficients under models 1 and 2 presented; ${\rm PC}_0/{\rm PIC}_0$, proportion of correctly / incorrectly setting to zero (%).
Notably, grouping with the elastic net initial estimate is better than grouping with the lasso initial estimate. Remarkably, since the ARAEs are smaller than those of the initial estimates, the automatic grouping is likely to improve estimation accuracy.

Table 6: Results in the large p small n example. $\hat{\theta}^{\rm ini}$, initial estimator used; ${\rm PCF}_j/{\rm PICF}_j$, proportion of correctly fused with the other members of its group / proportion of incorrectly fused with members of other groups (%), for the indices j to be grouped, with the corresponding (nonzero only) coefficients under models 3 and 4 presented; ${\rm PC}_0/{\rm PIC}_0$, proportion of correctly / incorrectly setting to zero (%).

Ohio wheeze data
We first analyze the Ohio data, a subset of the Six Cities Study, a longitudinal study of the health effects of air pollution [22]. The dataset contains complete records on 537 children from Steubenville, Ohio, each of whom was examined annually at ages 7 through 10. It was previously analyzed by Zeger, Liang and Albert [23] and Fitzmaurice and Laird [7]. The repeated binary response is the wheezing status (1 = yes, 0 = no) of a child at each occasion. Maternal smoking was coded 1 if the mother smoked regularly and 0 otherwise. Previous studies that treat age as a continuous variable found a weak effect of maternal smoking. We re-analyzed the data treating age as a qualitative variable with four categories, ages 7, 8, 9, and 10, as in Fitzmaurice and Laird [7]. This strategy creates eight interactions between age and smoking status. Applying generalized estimating equations with a binomial model and exchangeable working correlation produced the results given in Table 7, where the baseline is set to I(age = 9, smoke = 0). The P-values obtained imply weak differences from the baseline for I(age = 10, smoke = 0) and I(age = 8, smoke = 1). The variable selection of [18] selects the model having only these two variables. However, our automatic grouping additionally identifies the interaction I(age = 9, smoke = 1), which had not previously been specified. The resulting model implies the presence of smoking-by-age interaction effects that differ from the baseline. Both Wald- and score-type losses in the BIC led to identical conclusions. This example shows that grouping can detect variables that are missed by variable selection alone.

Credit scoring data
We apply the developed grouping method to the credit scoring data analyzed in [5]. The dataset consists of 1000 consumer credits from a southern German bank, and the aim is to model the probability that a client does not pay back the credit. The response variable is "creditability", which is dichotomous (y = 0 for creditworthy, y = 1 for not creditworthy), and 20 factors are available. We fit a logistic regression model. Following [5], we analyzed seven risk factors, including H 1 : running account (trichotomous, no/good/bad), H 3 : duration of credit in months (metrical), H 4 : amount of credit in DM (metrical), H 5 :

Table 8: Results for the credit scoring data. Survived predictors, interactions retained by the proposed automatic grouping; Group code, the groups created by the proposed method ("Not grouped" indicates predictors out of interest for grouping; group 0 means parameters set to zero); Coef AG , coefficients estimated by the proposed method, where duplicates within an identical group are omitted; Coef SLR , parameter estimates recalculated by standard logistic regression based on the grouped predictors obtained from our method; P SLR , the P-value corresponding to Coef SLR .