SUBSET : Best Subsets using Information Criteria

SUBSET, written in the matrix language Gauss, is a program that identifies optimal subsets of means or proportions based on independent groups. All possible configurations of ordered subsets of groups are identified and the best model is selected using either the AIC or BIC information criterion. For means, both homogeneous and heterogeneous variance cases are considered. SUBSET offers an alternative to traditional post hoc multiple-comparison procedures such as the Tukey test for pairwise comparisons. Its major advantages over traditional pairwise comparison procedures are that intransitive decisions are avoided and that issues related to type I error control, sample size and heterogeneity of variance do not arise.


Introduction
Researchers often use analysis of variance to investigate mean differences among several response groups. If the null hypothesis of equality of means is rejected, it is common practice to employ multiple comparison techniques to study the patterns of differences among the means. For example, Kirk (1995) describes 22 multiple comparison procedures, including pairwise comparisons such as the Tukey test. In general, these procedures depend upon interpreting multiple tests of significance. As detailed in Section 3, below, Dayton (1998) advocated replacing these procedures by a holistic model selection approach based on information criteria. The program, SUBSET, implements this information theoretic approach for comparisons among means or among proportions from independent samples. Section 2 presents a summary of the theory underlying the use of information criteria for model selection, while Sections 3 and 4 consider applications of this theory to sample means and sample proportions, respectively. Section 5 describes how to use the SUBSET program and exemplary applications are presented in Section 6.

Information Criteria
Akaike (1973, 1974) developed a decision-making strategy based on the Kullback-Leibler (1951) information measure, arguing that this measure provides a natural criterion for ordering alternative statistical models for data. Adapting the notation of Akaike (1987) for the case of univariate data, the Kullback-Leibler information for the true distribution, $g_t(x)$, of random variable $x$, relative to some other distribution, $g_o(x)$, can be written as:

$$I(g_t; g_o) = E[\log_e g_t(x)] - E[\log_e g_o(x)] \qquad (1)$$

where expectations are taken with respect to $g_t(x)$. In the context of maximum likelihood estimation, let $x = \{x_i\}$ be $N$ values of an iid random variable, $x$, with true density […]

When selecting among $M$ competing models, Akaike uses Equation (5) to calculate $AIC_m$, $m = 1,…,M$, for the models and then selects the model with $\min(AIC_m)$ as the preferred model. The conventional interpretation of AIC is as an estimate of the loss of precision (or, increase in information) that results when $\boldsymbol{\theta}_x$, the MLE, is substituted for the true parametric value, $\boldsymbol{\theta}_t$, in the likelihood function. Thus, by selecting the model with $\min(AIC_m)$, the (estimated) loss of precision is minimized.
As noted by Sclove (1987), AIC represents a penalized log-likelihood function that can be written in the general form:

$$-2\log_e(\text{likelihood}) + a(N)\,p$$

where $p$ is the number of independent parameters estimated and $a(N)$ is a penalty multiplier that may depend upon the total sample size, $N$; for AIC, $a(N) = 2$. Various adaptations of AIC have been suggested that, unlike AIC, make the penalty dependent upon sample size. In particular, the Schwarz (1978) BIC (or, SIC) statistic and the Bozdogan (1987) CAIC statistic use $a(N) = \log_e(N)$ and $a(N) = \log_e(N) + 1$, respectively. As noted by Bozdogan (1987), these latter procedures are asymptotically consistent in the sense that, when the null case is the true model, the probability of selecting the true model approaches one as the sample size increases, rather than being fixed by an arbitrary significance level as in conventional hypothesis testing procedures.
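The penalized log-likelihood form can be illustrated with a short Python sketch. The function name `information_criterion` and the numerical inputs are hypothetical, chosen only for illustration; the penalty multipliers are those given in the text for AIC, BIC and CAIC.

```python
import math

def information_criterion(log_lik, p, n_total, kind="AIC"):
    """Penalized log-likelihood: -2*log_lik + a(N)*p.

    a(N) = 2 for AIC, Log_e(N) for BIC and Log_e(N) + 1 for CAIC.
    """
    penalties = {
        "AIC": 2.0,
        "BIC": math.log(n_total),
        "CAIC": math.log(n_total) + 1.0,
    }
    return -2.0 * log_lik + penalties[kind] * p

# Hypothetical values (not from the source): a maximized log-likelihood,
# the number of estimated parameters and the total sample size.
log_lik, p, n_total = -120.0, 3, 50
for kind in ("AIC", "BIC", "CAIC"):
    print(kind, round(information_criterion(log_lik, p, n_total, kind), 3))
```

Because only the multiplier a(N) differs, the three criteria rank models identically whenever the competing models have the same number of parameters; they can disagree only when the models differ in p.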

Application of Information Criteria to the Paired-Comparisons of Means
Conventional pairwise-comparison procedures for means involve conducting a set of statistical tests. Often this is done subsequent to testing the omnibus hypothesis of equality of means for K independent groups (i.e., $\mu_k = \mu$ for $k = 1,…,K$) using analysis of variance techniques, although this is not technically required for many procedures. One popular approach, the Tukey HSD procedure, sets up q statistics for the $K(K-1)/2$ different pairs of means and refers these statistics to the appropriate null distribution of the studentized range statistic for a span of K means. Thus, $K(K-1)/2$ hypotheses of the form $\mu_k = \mu_{k'}$ for $k \neq k'$ are tested. Among the problems with such procedures cited by Dayton (1998) are: (1) some arbitrary technique is utilized to control the family-wise type I error rate for the set of correlated pairwise tests; (2) the issues of homogeneity of variance and differential sample size pose problems for many paired-comparison procedures; (3) intransitive decisions (e.g., outcomes suggesting mean 1 = mean 2, mean 2 = mean 3, but mean 1 < mean 3) are the rule rather than the exception with typical paired-comparison procedures, since they entail a series of discrete, pairwise significance tests; and (4) there exists a large variety of competing procedures that differ in how type I error is controlled and, consequently, in power (e.g., SPSS for Windows offers seven distinct procedures to choose among).

For means based on K independent groups, there is a total of $2^{K-1}$ patterns of ordered subsets with equal means within subsets. For example, with three groups for which the means are ranked and labeled 1, 2, 3, the $2^2 = 4$ distinct ordered subsets are {123}, {1,23}, {12,3} and {1,2,3}, where a comma is used to separate subsets that are unequal in mean value. Dayton (1998) proposed using model-selection criteria such as the AIC or BIC statistic for selecting the most appropriate ordering of subsets of means for purposes of interpretation. In particular, this approach was advocated as avoiding many of the objections that have been raised to conventional pairwise comparison procedures. The program, SUBSET, computes both the Akaike AIC and the Schwarz BIC statistics for all $2^{K-1}$ distinct ordered subsets. Since the number of ordered subsets can be quite large for practical problems (e.g., 512 for K = 10 groups but 524,288 for K = 20 groups), only the ordered subsets corresponding to the smallest AIC and BIC values, as specified by the user, are printed out.
Creating the patterns of ordered subsets of means within SUBSET is based on the recognition that digit inversions in the K-digit binary equivalents of the first $2^{K-1}$ non-negative integers (0 through $2^{K-1} - 1$) uniquely define these patterns: reading each binary number from left to right, a change from one digit to the next marks a boundary between adjacent subsets. For example, for K = 4 these eight binary equivalents are 0000, 0001, 0010, 0011, 0100, 0101, 0110 and 0111 and they correspond to the ordered subsets {1234}, {123,4}, {12,3,4}, {12,34}, {1,2,34}, {1,2,3,4}, {1,23,4} and {1,234}. In the program, SUBSET, once these binary equivalents are generated, the sub-matrix extraction and substitution features of the Gauss language are used to create the actual patterns of equivalent means (and variances, for the heterogeneous case). There is no limit to the number of groups that can be analyzed since the program only stores results for the S (specified by user) smallest AIC and BIC values at each iteration. Of course, execution time can become relatively long for large K. Typical execution times on a 266 MHz notebook computer are:

K = 4 groups, $2^3$ = 8 patterns: 0.06 seconds
K = 12 groups, $2^{11}$ = 2,048 patterns: 6.97 seconds
K = 20 groups, $2^{19}$ = 524,288 patterns: 3,049.74 seconds, or 50.83 minutes

Information criteria such as AIC or BIC are based on the log-likelihood of the data. In SUBSET, it is assumed that the observations arise from normal densities. Since the log-likelihood is maximized for any given model when variance estimates are computed using the sample size, $n$, rather than $n-1$, in the denominator, this conversion is made within the program. SUBSET calculates AIC and BIC based on the usual assumption of homogeneity of variance as well as based on a restricted heterogeneous variance model for which it is assumed that there is a unique population variance for each of the distinct subsets of means. For the homogeneous case, the conventional analysis of variance within-groups sum of squares, $SS_w$, is converted to a variance estimate, $SS_w/N$, where $N$ is the total sample size. For the restricted, heterogeneous variance case, an estimated variance for a subset of means can be obtained (a) by pooling the estimates from the separate groups or (b) by computing the sample variance for the combined sample. The latter approach is illustrated in Dayton (1998) and is the procedure incorporated into SUBSET.

For any given model, AIC is given by the expression $-2\log_e(\text{likelihood}) + 2p$, where $p$ is the number of independent parameters estimated in calculating the likelihood for the observed data. Similarly, BIC is given by $-2\log_e(\text{likelihood}) + \log_e(N)\,p$. For a model with $T$ subsets of means, $p$ equals $T+1$ for the homogeneous case ($T$ means and one variance) and $2T$ for the restricted heterogeneous case ($T$ means and $T$ variances). For example, for the ordered subset {1,2,34}, $T = 3$ and the values of $p$ are 4 and 6, respectively, for the homogeneous and heterogeneous cases. Since $\log_e(N) > 2$ for $N > 7$, AIC and BIC may, and often do, result in different orderings of subsets of means with, predictably, simpler models being favored by BIC. In a limited simulation with AIC and CAIC (the criterion suggested by Bozdogan (1987), which differs slightly from BIC in having penalty term $(\log_e(N) + 1)\,p$), Dayton (1998) found that: "Overall…the accuracy of CAIC is always approximately equal to or superior to Tukey HSD but tends to be lower than AIC when there are relatively many clusters of means, especially with smaller sample sizes." Accuracy, in this study, was stringently defined in terms of all-pairs power following Ramsey (1978).
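The pattern generation and the parameter counts described above can be sketched in Python. This is an illustrative sketch only, not the Gauss code used by SUBSET. The function names and the summary data are hypothetical. It also assumes that, in the homogeneous case, the common variance is re-estimated by maximum likelihood under each pattern (sum of squares about the subset means divided by N); SUBSET's own computation may differ in this detail. Variances are supplied with divisor n, i.e. (n-1)/n times the unbiased estimate.

```python
import math

def ordered_subsets(k):
    """Yield the 2**(k-1) ordered-subset patterns for k ranked groups.

    Each integer 0 .. 2**(k-1)-1 is written with k binary digits; a change
    between adjacent digits marks a boundary between subsets.
    """
    for i in range(2 ** (k - 1)):
        bits = format(i, "0{}b".format(k))
        subsets, current = [], [0]
        for g in range(1, k):
            if bits[g] != bits[g - 1]:   # digit inversion: start a new subset
                subsets.append(current)
                current = []
            current.append(g)
        subsets.append(current)
        yield subsets

def label(subsets):
    """Render a pattern such as [[0], [1, 2]] as '1,23' (readable for k < 10)."""
    return ",".join("".join(str(g + 1) for g in s) for s in subsets)

def criteria(subsets, n, mean, var_ml, heterogeneous=False):
    """Return (AIC, BIC) for one pattern under normal densities."""
    n_total = sum(n)
    loglik, ss_total = 0.0, 0.0
    for s in subsets:
        n_s = sum(n[g] for g in s)
        m_s = sum(n[g] * mean[g] for g in s) / n_s
        # Sum of squares about the subset mean for the combined sample.
        ss_s = sum(n[g] * (var_ml[g] + (mean[g] - m_s) ** 2) for g in s)
        if heterogeneous:                # one variance per subset
            loglik += -0.5 * n_s * (math.log(2 * math.pi * ss_s / n_s) + 1)
        ss_total += ss_s
    if not heterogeneous:                # one common variance
        loglik = -0.5 * n_total * (math.log(2 * math.pi * ss_total / n_total) + 1)
    p = 2 * len(subsets) if heterogeneous else len(subsets) + 1
    return -2 * loglik + 2 * p, -2 * loglik + math.log(n_total) * p

# Hypothetical summary data for three ranked groups (not from the source).
n = [10, 10, 10]
mean = [1.0, 1.2, 3.0]
var_ml = [1.0, 1.0, 1.0]

results = [(label(s),) + criteria(s, n, mean, var_ml) for s in ordered_subsets(len(n))]
for name, aic, bic in sorted(results, key=lambda r: r[1]):
    print("{:8s} AIC={:.2f} BIC={:.2f}".format("{" + name + "}", aic, bic))
```

The listing is sorted by AIC, so the first line printed is the pattern AIC prefers; sorting on the third element instead gives the BIC ordering.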

Application of Information Criteria to the Paired-Comparisons of Proportions
A simple extension of the approach presented above for sample means allows the identification of optimal subsets for data in the form of proportions. Consider K groups of sizes $n_1,…,n_K$ with sample proportions, $p_1,…,p_K$, respectively. Assuming independent Bernoulli trials, the log-likelihood for the $k$th (ordered) sample outcome is $n_k p_k \log_e(p_k) + n_k(1-p_k)\log_e(1-p_k)$ and the log-likelihood for all samples is found by summing across the K groups. Note that the sample proportion, $p_k$, is the MLE for the corresponding population proportion and that omitting the combinatorial constant to take into account unordered samples only omits a constant term from the log-likelihood. Unlike the situation for sample means, there is no need to consider homogeneous and heterogeneous cases since each Bernoulli process is based on a single parameter, $\pi_k$, say. Otherwise, model selection can be based on the same reasoning as for sample means. That is, there is a total of $2^{K-1}$ distinct patterns of subsets of proportions to evaluate. For each pattern, the log-likelihood is converted to AIC by the formula $-2\log_e(\text{likelihood}) + 2p$ and to BIC by the formula $-2\log_e(\text{likelihood}) + \log_e(N)\,p$, where $p = T$ for a model with $T$ subsets of proportions.
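A minimal Python sketch of the Bernoulli computation for a single pattern follows. The function names and the counts are hypothetical. The sketch also assumes that groups within a subset share one proportion, estimated by the pooled sample proportion for that subset; this is the natural analogue of the treatment of means, but it is not spelled out in the text above.

```python
import math

def xlogy(x, y):
    """x * Log_e(y), with the convention 0 * Log_e(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def proportion_criteria(subsets, n, successes):
    """Return (AIC, BIC) for one pattern of subsets of proportions.

    subsets   : list of lists of 0-based group indices
    n         : group sizes
    successes : number of successes per group (n_k * p_k)
    """
    n_total = sum(n)
    loglik = 0.0
    for s in subsets:
        n_s = sum(n[g] for g in s)
        r_s = sum(successes[g] for g in s)
        pi_s = r_s / n_s                 # pooled MLE for the subset (assumption)
        loglik += xlogy(r_s, pi_s) + xlogy(n_s - r_s, 1 - pi_s)
    p = len(subsets)                     # one parameter per subset, p = T
    return -2 * loglik + 2 * p, -2 * loglik + math.log(n_total) * p

# Hypothetical counts for two groups (not from the source).
n = [20, 20]
successes = [5, 15]
for pattern in ([[0, 1]], [[0], [1]]):
    aic, bic = proportion_criteria(pattern, n, successes)
    print(pattern, round(aic, 3), round(bic, 3))
```

For a pattern in which every group is its own subset, the pooled proportion reduces to the sample proportion $p_k$ and the log-likelihood reduces to the expression given in the text.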

Using the SUBSET Program
SUBSET is written in the microcomputer matrix programming language, Gauss for Windows NT/95 Version 3.2.32 (Aptech Systems, 1997). SUBSET is run in interpretive mode, which means that the Gauss system must be installed on the microcomputer. However, extensive knowledge of Gauss syntax is not required to run the program. The source code, SUBSET.E, as well as a compiled version, SUBSET.GCG, of the program are available, but note that the Gauss system is required to run either version. For general-purpose analysis, there is no other program that computes AIC and/or BIC for the models available in SUBSET. For a small number of groups (e.g., 5 or fewer), it is reasonably easy to program the computations in a spreadsheet, as was reported by Dayton (1998).

Data for analysis are imported into SUBSET from a spreadsheet or database program. The import routine in the Gauss program determines the nature of the spreadsheet/database from the file extension (e.g., file.XLS denotes a Microsoft Excel file whereas file.DB2 denotes a dBase II file). The general format for the spreadsheet/database file is: […]

It is conventional to code the groups with names, or 1, 2, etc., or A, B, etc., but SUBSET rearranges the groups in rank order of means {proportions}, from smallest to largest, and presents groups in ranked order, 1, 2, etc., in the output. Thus, in practice, it is most convenient to order the means {proportions} in this same manner in the spreadsheet/database prior to analysis. A sample data set for five groups clipped from a Microsoft Excel spreadsheet is shown in the Exemplary Output section, below.
The Gauss program can import data from spreadsheet formats such as Microsoft Excel, Lotus 1-2-3 or Quattro-Pro, from database programs such as dBase IV, Paradox or FoxPro, or from a Gauss dataset. There are restrictions on the nature of the spreadsheet or database that can be imported. These restrictions can be found by referring to the description of the Gauss "import" command in Gauss Help. For example, for Gauss for Windows NT/95 version 3.2.32, a Microsoft Excel spreadsheet must be saved as version 7.0 or earlier (but no earlier than 2.1). In particular, spreadsheets created by later versions of Excel, such as that found in Office 97, cannot be directly imported but must be saved as an earlier version (e.g., version 4.0). Data can also be input from a character-delimited ASCII file, but this is typically less convenient than using, for example, a spreadsheet.
To run the compiled version of SUBSET, follow these steps (assume SUBSET.GCG is located in the directory C:\Program): Open […]

Output is directed to the screen and to a default file named Subset.out in the directory in which the Gauss system is started. The output file can be changed by editing the appropriate line in the Gauss program. Note that only output from the current analysis is saved to the file.

Exemplary Output
Example 1: Assume the data below in cells A1:D6 of an Excel 4.0 spreadsheet (note that the groups have been sorted in ascending magnitude of means). The data are taken from the SPSS/PC+ manual (Norusis, 1986). The dependent variable is annual consumption of alcohol in pints by adult males as reported by Greeley et al. (1980) for the named ethnic groups.

Group   Count   Mean   Var(unbiased)

Number of AIC/BIC values to display (5 is recommended)? {provide an appropriate number}
The input to SUBSET and the output generated by SUBSET are:

(gauss) run c:\program\subset.gcg
Default file for all printed output is Subset.out in the current directory
Program SUBSET for Ordered Subsets of Means or Proportions
Prepared by: C. Mitchell Dayton
Department of Measurement & Statistics