Asymptotic Performance of the Location and Logistic Classification Rules for Multivariate Binary Variables

Abstract: This paper focuses on asymptotic classification procedures in two-group discriminant analysis with multivariate binary variables. Two data patterns were simulated using the R statistical software, version 2.15.3, and were subjected to two linear classification rules, namely the location and logistic models. To judge the performance of these models, the apparent error rates for each procedure were obtained for different sample sizes. The results show that the location model performed better than logistic discrimination, with the variation in the error rates being higher for the logistic discrimination rule.


Introduction
Discrimination is a decision support tool with a wide range of applications, including health, bankruptcy prediction, education planning, taxonomy and engineering. It is a multivariate statistical classification technique for separating distinct sets of objects and allocating a new object to a previously defined group. In the scientific literature, discriminant analysis has many synonyms, such as classification, pattern recognition and character recognition, depending on the scientific area in which it is used. The technique usually proceeds in the following manner: a sample of objects is drawn from a population, and the partition of this sample into groups is known. Each object within the population is described by several characters or measurements, which together form a feature vector belonging to a suitable feature space. Using the feature vectors and the individual labels of the sample, an allocation rule is established in order to classify other unlabeled objects from the same population. The technique of discriminant analysis, though fairly old, remains widely used for inference in applications.

In attempting to choose an appropriate analytical technique, we sometimes encounter a problem that involves a categorical dependent variable and several metric independent variables. If the dependent variable is metric, then multiple regression could undoubtedly be employed. A statistical technique that addresses the situation of a non-metric dependent variable is discriminant analysis. In this type of situation the researcher is interested in the prediction and explanation of the relationships that affect the category in which an object is located, such as why a person is or is not a customer, or whether a firm will succeed or fail.

The subject of discriminant analysis has been well dealt with over the years. The literature on logistic discrimination and the location model is reviewed in Krzanowski [1]. Some comparative studies on the location and logistic models have also been carried out with stringent data characteristics. The relative efficiency of these two statistical methods under different data conditions, however, has been an issue of debate (e.g. Baron [2], Dey & Austin [3], Baah [4]). A logical question is: how do the two techniques compare with each other? Research findings about the relative performance of the two methods appear to be inconsistent.
The basic assumption of discriminant analysis is that of normality. If the populations under study are normally distributed with homogeneous covariance matrices, a linear discriminant function is used. A quadratic discriminant function is used if the covariance matrices are not homogeneous. The studies of Fan and Wang [5] and Lei and Koehly [6] made use of the assumption of normality. They considered the cases of equal and unequal covariance structures for the two groups.
Kakai, Pelz, and Palm [10] carried out a Monte Carlo study to assess the relative efficiency of the linear classification rule in 2-, 3- and 5-group discriminant analysis. The simulation design took into account the number $p$ of variables (4, 6, 10 and 18) and the sample size $n$, chosen so that $n/p$ = 1.5, 2.5 and 5. Three values of the overlap $e$ of the populations were considered (0.05, 0.1, 0.15), and their common distribution was normal or chi-square with 12, 8 and 4 df. The degree of heteroscedasticity, $\Gamma$, was measured by the value of the power function of the homoscedasticity test related to $\Gamma$ (0.05, 0.4, 0.6, 0.8). For each combination of these factors, the empirically computed actual error rate was used to calculate the relative error of the rule. The results showed that for normal or homoscedastic populations, the efficiency of the rule improved as the number of groups increased. Non-normality or heteroscedasticity negatively affected the performance of the rule, whereas high values of the ratio $n/p$ and high overlap had a positive effect. The mean relative error of the rule became three times larger in moving from homoscedastic to heteroscedastic populations.
Bull and Donner [16] looked at the asymptotic relative estimated efficiency (ARE) of multiple logistic discrimination (LD) compared with multiple discriminant analysis (DA). Two cases were considered: strong correlations between populations and no correlation between populations. In the first case, LD exhibited a substantial increase in the ARE, while in the second case there was no substantial increase. It was also found that as the distance between populations increases, the discriminant procedure does relatively better, with the logistic procedure eventually producing infinite parameter estimates when there is no overlap between populations.
Lei and Koehly, Egbo and Onyeagu, and Egbo [6,8,9] performed Monte Carlo simulations to furnish information about the relative accuracy of linear discriminant analysis (LDA) and logistic discrimination (LD) under various commonly encountered and interacting conditions. The factors manipulated under multivariate normality were equality of covariance matrices, degree of group separation, sample size, and prior probabilities. They stated that the relative performance of the LDA and LD procedures depends on the interaction between model assumptions and population group distance. The degree of group separation was measured in terms of the squared Mahalanobis distance, $\Delta^2$, set at 2.68 (small) and 6.7 (large). They found that if total misclassification is of interest, the optimal cut-score is 0.5. With a cut-score of 0.5, LD and LDA with proportional or accurate prior specification perform similarly and best among the LDA specifications examined in the study, providing good to excellent classification accuracy for extreme population priors or large $\Delta^2$. In general, they observed that the misclassification rates were good for large $\Delta^2$.
In a study by Krzanowski [11], five different sets of data were used to compare the performance of the Location Model (LM) with Fisher's LDF, LD and a method in which all the continuous variables were converted to binary ones. The sample sizes for data sets one to five were, respectively: a total of 40 (20 from $\pi_1$ and 20 from $\pi_2$); 63 from $\pi_1$ and 30 from $\pi_2$; 38 from $\pi_1$ and 24 from $\pi_2$; a total of 186 (99 from $\pi_1$ and 87 from $\pi_2$); and a total of 137 (59 from $\pi_1$ and 78 from $\pi_2$). LM gave satisfactory results and, in the situations with relatively large sample sizes, gave much better results.
Efron [12] looked at the asymptotic relative efficiency of the normal discrimination procedure (LDA) and LD under multivariate normality, and found that this efficiency depends on $\Delta$, the Mahalanobis distance between the two normal populations, as well as on the number of individuals in each population. LD was shown to be between one-half and two-thirds as effective as LDA for statistically interesting values of the parameters. He stated that the LD procedure must be less efficient than LDA, at least asymptotically, as $n \to \infty$. He further stated that although LD is less efficient and also more difficult to calculate, it is more robust than LDA, at least theoretically.
Kakai and Pelz [10] performed a Monte Carlo study to assess the asymptotic error rate of the linear, quadratic and logistic rules in 2-, 3- and 5-group discriminant analyses. The simulation design took into account the overlap of the populations ($e$ = 0.05, 0.1, 0.15), their common distribution (normal, chi-square with 12, 8 and 4 df) and their degree of heteroscedasticity, $\Gamma$, measured by the value of the power function, $1 - \beta$, of the homoscedasticity test related to $\Gamma$ ($1 - \beta$ = 0.05, 0.4, 0.6, 0.8). For each combination of these factors, the asymptotic error of the three rules was computed using large samples of size 20,000. The efficiency parameter of the rules was their relative error with regard to the optimal error rate. The results showed the overall best performance of the quadratic rule for the normal heteroscedastic cases. The linear rule seemed to be more robust to an increased number of groups than the other two rules. The logistic rule was less affected by the distribution of the populations. For small samples, the three rules became less efficient.
In a study of prior probabilities, Krzanowski [12] specified a range of values of $p_1$ and $p_2$ between 0.1 and 0.9 in a Monte Carlo simulation comparing LM with Fisher's LDF. He also varied the number of binary variables $q$ between 2 and 4. It was observed that for equal priors, the error rates were constant for both models. However, the error rates were found to decrease as $p_2$ increased.
Also in a simulation study, Adebanji et al. [13] looked at the effects of the sample size ratio on the performance of the linear discriminant function under non-optimal conditions, with 4 variables in each group. They observed that for ratio combinations exceeding 1:2, the misclassification rates for the smaller group were much higher, and four times higher than those of the larger group when the ratio exceeded 1:3. As the disproportion in the representation of the sample groups increases, the performance of the classification rule deteriorates, and it could not be improved by an asymptotic increase in sample size.
The utility of an allocation rule can be assessed by the probabilities of misclassification, or error rates, that it gives rise to. When the parameters of the discriminant model are known, the error rates are given by the optimum error rates, since they indicate the best results possible with the model. When the parameters are unknown, various types of error rates may be distinguished. In particular, once an allocation rule has been derived in practice, it is essential to have a reliable method for estimating the error rates that it incurs, in order to have some measure of its utility and to be able to assess its performance relative to other allocation rules. Accordingly, we need to consider methods of estimating the error rates arising from the derived allocation rule. Lachenbruch and Mickey [24], [7], [8], [9] discussed some means of estimating error rates for a given discriminant function.

An object from $\pi_1$ may be misclassified into $\pi_2$, and an object from $\pi_2$ may be misclassified into $\pi_1$. If misclassification occurs, a loss is incurred. Let $c(i \mid j)$ be the cost of misclassifying an object from $\pi_j$ into $\pi_i$. The objective of the study is to find the best classification rule. "Best" here means the rule that minimizes the Expected Cost of Misclassification (ECM). Such a rule is referred to as the Optimal Classification Rule (OCR). In this study we want to find the OCR where $X$ is discrete and, to be more precise, Bernoulli.
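As an illustration of the minimum-ECM idea for Bernoulli data, the following is a minimal sketch rather than the authors' procedure. It assumes that the $r$ binary components are independent within each population (an assumption made here only to obtain a closed-form probability mass function) and uses the standard result that the ECM is minimized by allocating $x$ to $\pi_1$ when $c(2 \mid 1)\, p_1 f_1(x) \ge c(1 \mid 2)\, p_2 f_2(x)$. The function names `bernoulli_pmf` and `min_ecm_allocate` are hypothetical.

```python
from math import prod


def bernoulli_pmf(x, p):
    """Probability of the binary pattern x when component j is Bernoulli(p[j]).

    Assumption (not from the paper): components are independent.
    """
    return prod(pj if xj == 1 else 1.0 - pj for xj, pj in zip(x, p))


def min_ecm_allocate(x, p1, p2, prior1=0.5, prior2=0.5, c12=1.0, c21=1.0):
    """Return 1 or 2, the population that minimizes the expected cost.

    c12 stands for c(1|2), the cost of sending a pi_2 item to pi_1;
    c21 stands for c(2|1), the cost of sending a pi_1 item to pi_2.
    """
    f1 = bernoulli_pmf(x, p1)
    f2 = bernoulli_pmf(x, p2)
    # Allocate to pi_1 when c(2|1) * prior1 * f1(x) >= c(1|2) * prior2 * f2(x).
    return 1 if c21 * prior1 * f1 >= c12 * prior2 * f2 else 2


if __name__ == "__main__":
    p1, p2 = [0.3, 0.3, 0.3], [0.7, 0.7, 0.7]
    for pattern in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
        print(pattern, "->", min_ecm_allocate(pattern, p1, p2))
```

With equal costs and equal priors the rule reduces to comparing $f_1(x)$ with $f_2(x)$; raising `c21` makes the rule more reluctant to send an item to $\pi_2$.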

Location Model
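The classical discriminant analysis assumes that the discriminatory variable $v$ is continuous and normally distributed. Often in practice, the discriminatory variable is a mixture of continuous and discrete variables. Let $v$ denote a random vector of observations made on any individual, which is a mixture of $q$ discrete variables $x$ and $p$ continuous variables $y$. If the $i$th discrete variable has $s_i$ categories ($i = 1, \dots, q$), then the contingency table formed from $x$ has $s = s_1 \times s_2 \times \dots \times s_q$ locations; denote these locations by $z_1, z_2, \dots, z_s$. The location model (LM) as proposed by Krzanowski [14] then has the following distributional assumptions:

(1) The conditional distribution of $y$, given that $x$ falls in location $z_m$, is
$$y \mid (x \in z_m) \sim N_p\left(\mu^{(m)}, \Sigma\right).$$
(2) The marginal distribution of the locations is given by
$$P(x \in z_m) = p_m, \qquad m = 1, \dots, s, \qquad \sum_{m=1}^{s} p_m = 1.$$

Afifi and Elashoff [15] adapted the location model by allowing different values for the continuous variable location means $\mu_i^{(m)}$ and multinomial probabilities $p_{im}$ ($m = 1, \dots, s$) in the two populations ($i = 1, 2$), but constraining the conditional continuous variable dispersion matrix $\Sigma$ to be constant over all locations and over both populations. From the normality assumption of the model, the conditional probability density of $y$, given that the discrete variables locate the individual in cell $m$, is
$$f_i(y \mid m) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(m)}\right)' \Sigma^{-1} \left(y - \mu_i^{(m)}\right)\right\}$$
in $\pi_i$ ($i = 1, 2$). Thus the joint probability density of obtaining the individual in cell $m$ and observing the continuous variable values $y$ is
$$f_i(v) = p_{im}\, (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(m)}\right)' \Sigma^{-1} \left(y - \mu_i^{(m)}\right)\right\}$$
in $\pi_i$ ($i = 1, 2$). Inserting these two joint probability densities into the likelihood ratio rule, and tidying up the expression by algebraic manipulation, yields the allocation rule: allocate the individual $v' = (x', y')$ to $\pi_1$ if the discrete variables $x$ correspond to the $m$th multinomial cell and
$$\left(\mu_1^{(m)} - \mu_2^{(m)}\right)' \Sigma^{-1} \left\{y - \tfrac{1}{2}\left(\mu_1^{(m)} + \mu_2^{(m)}\right)\right\} \ge \ln\left(\frac{p_{2m}}{p_{1m}}\right);$$
otherwise allocate $v$ to $\pi_2$. Given that the LM is appropriate, the probabilities of misclassification from populations $\pi_1$ and $\pi_2$ are shown to be
$$P(2 \mid 1) = \sum_{m=1}^{s} p_{1m}\, \Phi\left\{\frac{\ln(p_{2m}/p_{1m}) - \tfrac{1}{2} D_m^2}{D_m}\right\}, \qquad P(1 \mid 2) = \sum_{m=1}^{s} p_{2m}\, \Phi\left\{\frac{\ln(p_{1m}/p_{2m}) - \tfrac{1}{2} D_m^2}{D_m}\right\},$$
where $D_m^2 = \left(\mu_1^{(m)} - \mu_2^{(m)}\right)' \Sigma^{-1} \left(\mu_1^{(m)} - \mu_2^{(m)}\right)$ and $\Phi$ is the standard normal distribution function.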
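The allocation rule for the LM can be derived from basic principles. Define the conditional probability density of $y$, the continuous variable, given that the discrete variables locate the individual in cell $m$, to be
$$f_i(y \mid m) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(m)}\right)' \Sigma^{-1} \left(y - \mu_i^{(m)}\right)\right\}.$$
Then the joint probability density of obtaining the individual in cell $m$ and observing the continuous variable values $y$ is
$$f_i(v) = p_{im}\, (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\left(y - \mu_i^{(m)}\right)' \Sigma^{-1} \left(y - \mu_i^{(m)}\right)\right\}$$
in $\pi_i$ ($i = 1, \dots, g$). Taking natural logs on both sides,
$$\ln f_i(v) = \ln p_{im} - \tfrac{1}{2} \ln\left((2\pi)^p |\Sigma|\right) - \tfrac{1}{2}\left(y - \mu_i^{(m)}\right)' \Sigma^{-1} \left(y - \mu_i^{(m)}\right) = q + \ln p_{im} + \mu_i^{(m)\prime} \Sigma^{-1} \left(y - \tfrac{1}{2}\mu_i^{(m)}\right),$$
where $q = -\tfrac{1}{2} \ln\left((2\pi)^p |\Sigma|\right) - \tfrac{1}{2}\, y' \Sigma^{-1} y$ (here $q$ denotes this constant term, not the number of discrete variables). Since $q$ has the same value for all populations $\pi_i$ in cell $m$, the allocation rule is: allocate $v' = (x', y')$ to the population $\pi_i$ in cell $m$ for which
$$\ln p_{im} + \mu_i^{(m)\prime} \Sigma^{-1} \left(y - \tfrac{1}{2}\mu_i^{(m)}\right) \tag{11}$$
is greatest.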
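The location-model rule, which allocates an individual to the population with the largest value of $\ln p_{im} + \mu_i^{(m)\prime} \Sigma^{-1}(y - \tfrac{1}{2}\mu_i^{(m)})$ within its cell $m$, can be sketched as follows. This is an illustrative sketch under stated assumptions: the parameters are taken as known (in practice they would be estimated), the inverse dispersion matrix is supplied directly as `sigma_inv`, every cell probability is positive, and the names `lm_score` and `lm_allocate` are hypothetical.

```python
from math import log


def lm_score(y, mu, sigma_inv, p_cell):
    """ln p_im + mu' Sigma^{-1} (y - mu / 2) for one population in one cell."""
    shifted = [yk - 0.5 * mk for yk, mk in zip(y, mu)]
    # Matrix-vector product Sigma^{-1} (y - mu / 2), written out for plain lists.
    w = [sum(row[k] * shifted[k] for k in range(len(shifted))) for row in sigma_inv]
    return log(p_cell) + sum(mk * wk for mk, wk in zip(mu, w))


def lm_allocate(y, cell, means, cell_probs, sigma_inv):
    """Return the 1-based index of the population with the greatest score.

    means[i][cell] is the mean vector of population i+1 in that cell and
    cell_probs[i][cell] the matching multinomial probability (must be > 0).
    """
    scores = [
        lm_score(y, means[i][cell], sigma_inv, cell_probs[i][cell])
        for i in range(len(means))
    ]
    return max(range(len(scores)), key=scores.__getitem__) + 1


if __name__ == "__main__":
    # Hypothetical parameters: one binary variable (2 cells), 2 continuous variables.
    means = [[[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0]]]
    cell_probs = [[0.5, 0.5], [0.5, 0.5]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(lm_allocate([0.1, 0.2], 0, means, cell_probs, identity))
    print(lm_allocate([2.1, 1.9], 0, means, cell_probs, identity))
```

For two populations this comparison of scores is equivalent to checking whether $(\mu_1^{(m)} - \mu_2^{(m)})' \Sigma^{-1}\{y - \tfrac{1}{2}(\mu_1^{(m)} + \mu_2^{(m)})\}$ exceeds $\ln(p_{2m}/p_{1m})$.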

Logistic Discrimination
So far we have been primarily concerned with discrimination and classification assuming a multivariate normal model for the variables in each group. However, one often finds that the variables in a study are not all continuous, but a mixture of categorical and continuous variables. If the variables are categorical, or a mixture of categorical and continuous variables, then logistic discrimination may be performed using logistic regression (Bull & Donner [16]). "Logistic discrimination can be viewed as a partially parametric approach as it is only the ratios of the densities ($f_i(v)/f_j(v)$, $i \neq j$) that are being modelled" (McLachlan [17]).
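The logistic approach to discrimination is postulated as an alternative for discrimination and classification by parametric specification of the posterior probabilities $P(\pi_1 \mid v)$ and $P(\pi_2 \mid v)$, where
$$P(\pi_1 \mid v) = \frac{\exp(\alpha_0^{*} + \alpha' v)}{1 + \exp(\alpha_0^{*} + \alpha' v)}, \qquad P(\pi_2 \mid v) = \frac{1}{1 + \exp(\alpha_0^{*} + \alpha' v)},$$
with $\alpha_0^{*} = \alpha_0 - k$, and $k$ is any of the forms discussed earlier. The fundamental assumption of the logistic approach to discrimination is that the log of the ratio of the group-conditional densities is linear, that is,
$$\ln\left\{\frac{f_1(v)}{f_2(v)}\right\} = \alpha_0 + \alpha' v.$$
The classification rule, therefore, is: assign $v$ to $\pi_1$ if
$$\alpha_0^{*} + \alpha' v \ge 0;$$
otherwise assign $v$ to $\pi_2$. Directly generalizing LD to the $g$-group case, the model for the posterior probabilities is given as
$$P(\pi_i \mid v) = \frac{\exp(\alpha_{0i} + \alpha_i' v)}{1 + \sum_{j=1}^{g-1} \exp(\alpha_{0j} + \alpha_j' v)}, \quad i = 1, \dots, g-1, \qquad P(\pi_g \mid v) = \frac{1}{1 + \sum_{j=1}^{g-1} \exp(\alpha_{0j} + \alpha_j' v)}.$$
We therefore assign $v$ to the group which has the greatest posterior probability; that is, allocate $v$ to the population $\pi_i$ for which $P(\pi_i \mid v)$ is greatest.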
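The allocation step of logistic discrimination, which assigns $v$ to the group with the greatest posterior probability, can be sketched as below. This is a minimal sketch assuming the usual multinomial-logit form with the last group as reference and with coefficients already available; the coefficient values in the usage example are made up for illustration and are not fitted to any data, and the function names are hypothetical.

```python
from math import exp


def logistic_posteriors(v, intercepts, coefs):
    """Posterior probabilities for g groups, with group g as the reference.

    intercepts holds g-1 intercepts and coefs holds g-1 coefficient vectors
    (assumed known here; in practice they are estimated by maximum likelihood).
    """
    etas = [
        a0 + sum(a * x for a, x in zip(a_vec, v))
        for a0, a_vec in zip(intercepts, coefs)
    ]
    etas.append(0.0)  # linear predictor of the reference group
    top = max(etas)   # subtract the maximum for numerical stability
    weights = [exp(e - top) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]


def ld_allocate(v, intercepts, coefs):
    """Return the 1-based index of the group with the greatest posterior."""
    post = logistic_posteriors(v, intercepts, coefs)
    return max(range(len(post)), key=post.__getitem__) + 1


if __name__ == "__main__":
    # Hypothetical two-group coefficients for two binary variables.
    print(logistic_posteriors((1, 0), [0.0], [[1.0, -1.0]]))
    print(ld_allocate((0, 1), [0.0], [[1.0, -1.0]]))
```

In the two-group case the comparison of posteriors reduces to checking the sign of the linear predictor, so no explicit threshold other than zero is needed.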

Testing Adequacy of Discriminant Coefficient
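Consider the discrimination problem between two multinormal populations with means $\mu_1$, $\mu_2$ and common covariance matrix $\Sigma$. The coefficients of the maximum likelihood discriminant function (MLDF) $\alpha' x$ are given by $\alpha = \Sigma^{-1} \delta$, where $\delta = \mu_1 - \mu_2$. In practice, of course, the parameters are estimated by $\bar{x}_1$, $\bar{x}_2$ and $S = m^{-1}\{(n_1 - 1) S_1 + (n_2 - 1) S_2\}$, where $m = n_1 + n_2 - 2$. Letting $d = \bar{x}_1 - \bar{x}_2$ and $W = mS$, the coefficients of the sample MLDF are given by $a = m W^{-1} d$. A test of the hypothesis $H_0: \alpha_i = 0$ for $i = k+1, \dots, p$, using the sample Mahalanobis distances $D_p^2 = m\, d' W^{-1} d$ and $D_1^2 = m\, d_1' W_{11}^{-1} d_1$ (the latter based on the first $k$ variables only), has been proposed by Rao (1965). The test uses the statistic
$$\frac{m - p + 1}{p - k} \cdot \frac{c^2 \left(D_p^2 - D_1^2\right)}{m + c^2 D_1^2},$$
where $c^2 = n_1 n_2 / (n_1 + n_2)$. Under the null hypothesis this statistic has the $F_{p-k,\, m-p+1}$ distribution, and we reject $H_0$ for large values of the statistic.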

Evaluation of Classification Function
One important way of judging the performance of any classification procedure is to calculate the error rates or misclassification probabilities (Richard and Dean [18]). When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease. Because parent populations are rarely known, we shall concentrate on the error rates associated with the sample classification functions. Once this classification function is constructed, a measure of its performance in future samples is of interest. The total probability of misclassification (TPM) is given as
$$\mathrm{TPM} = q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x),$$
where $q_1$ and $q_2$ are the prior probabilities of $\pi_1$ and $\pi_2$, $f_1$ and $f_2$ are their probability mass functions, and $R_1$ and $R_2$ are the regions of classification into $\pi_1$ and $\pi_2$. The smallest value of this quantity, obtained by a judicious choice of $R_1$ and $R_2$, is called the optimum error rate (OER).
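For binary data the optimum error rate can be computed exactly by enumerating all $2^r$ response patterns. The sketch below is illustrative only: it assumes independent Bernoulli components in each population (an assumption made here for simplicity), takes the regions as $R_1 = \{x : q_1 f_1(x) \ge q_2 f_2(x)\}$ and $R_2$ its complement, which is the choice that minimizes the total probability of misclassification, and uses the hypothetical names `pattern_prob` and `optimum_error_rate`.

```python
from itertools import product
from math import prod


def pattern_prob(x, p):
    """Probability of binary pattern x under independent Bernoulli(p[j]) components."""
    return prod(pj if xj == 1 else 1.0 - pj for xj, pj in zip(x, p))


def optimum_error_rate(p1, p2, q1=0.5, q2=0.5):
    """Total probability of misclassification of the rule with
    R1 = {x : q1 * f1(x) >= q2 * f2(x)}, summed over all 2^r patterns."""
    total = 0.0
    for x in product((0, 1), repeat=len(p1)):
        w1 = q1 * pattern_prob(x, p1)
        w2 = q2 * pattern_prob(x, p2)
        # A pattern in R1 misclassifies pi_2 items (adds q2 * f2(x));
        # a pattern in R2 misclassifies pi_1 items (adds q1 * f1(x)).
        total += w2 if w1 >= w2 else w1
    return total


if __name__ == "__main__":
    for r in (3, 4, 5):
        print(r, round(optimum_error_rate([0.3] * r, [0.7] * r), 4))
```

Because each pattern contributes the smaller of $q_1 f_1(x)$ and $q_2 f_2(x)$, no other partition of the pattern space can give a lower total.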

Probability of Misclassification
In constructing a procedure of classification, it is desired to minimize, on average, the bad effects of misclassification (Onyeagu [19], Richard and Dean [18], Oludare [20]). Suppose we have an item with response pattern $x$ from either $\pi_1$ or $\pi_2$. We think of an item as a point in an $r$-dimensional space. We partition the space $R$ into two regions $R_1$ and $R_2$ which are mutually exclusive. If the item falls in $R_1$, we classify it as coming from $\pi_1$, and if it falls in $R_2$ we classify it as coming from $\pi_2$. In following a given classification procedure, the researcher can make two kinds of errors in classification. If the item is actually from $\pi_1$, the researcher can classify it as coming from $\pi_2$. Also, the researcher can classify an item from $\pi_2$ as coming from $\pi_1$. We need to know the relative undesirability of these two kinds of errors in classification.

Let the prior probability that an observation comes from $\pi_1$ be $q_1$, and from $\pi_2$ be $q_2$. Let the probability mass function of $\pi_1$ be $f_1(x)$ and that of $\pi_2$ be $f_2(x)$. Let the region of classifying into $\pi_1$ be $R_1$ and into $\pi_2$ be $R_2$. Then the probability of correctly classifying an observation that is actually from $\pi_1$ into $\pi_1$ is
$$P(1 \mid 1) = \sum_{x \in R_1} f_1(x),$$
and the probability of misclassifying such an observation into $\pi_2$ is
$$P(2 \mid 1) = \sum_{x \in R_2} f_1(x).$$
Similarly, the probability of correctly classifying an observation from $\pi_2$ into $\pi_2$ is
$$P(2 \mid 2) = \sum_{x \in R_2} f_2(x),$$
and the probability of misclassifying an item from $\pi_2$ into $\pi_1$ is
$$P(1 \mid 2) = \sum_{x \in R_1} f_2(x).$$
The total probability of misclassification using the rule is
$$\mathrm{TPMC}(R) = q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x).$$
In order to determine the performance of a classification rule $R$ in the classification of future items, we compute the total probability of misclassification, known as the error rate. Lachenbruch [21] defined the following types of error rates.
(i) Error rate for the optimum classification rule, $R_{opt}$: when the parameters of the distributions are known, the error rate is $\mathrm{TPMC}(R) = q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x)$, which is optimum for this distribution.
(ii) Actual error rate: the error rate of the classification rule as it will perform in future samples.
(iii) Expected actual error rate: the expected error rate for classification rules based on samples of size $n_1$ from $\pi_1$ and $n_2$ from $\pi_2$.
(iv) Apparent error rate: the fraction of items in the initial sample which are misclassified by the classification rule.

Table 1 is called the confusion matrix, and the apparent error rate is given by
$$\mathrm{APER} = \frac{n_{12} + n_{21}}{n},$$
where $n_{ij}$ is the number of items from $\pi_i$ that are classified into $\pi_j$ and $n = n_1 + n_2$ is the total sample size.

Hills [22] called the second error rate the actual error rate and the third the expected actual error rate. Hills showed that the actual error rate is greater than the optimum error rate, which in turn is greater than the expectation of the plug-in estimate of the error rate. Martin and Bradley [23] proved a similar inequality. An algebraic expression for the exact bias of the apparent error rate of the sample multinomial discriminant rule was obtained by Goldstein and Wolf [24], who tabulated it under various combinations of the sample sizes $n_1$ and $n_2$, the number of multinomial cells and the cell probabilities. Their results demonstrated that the bound described above is generally loose.
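The apparent error rate is straightforward to compute from the actual and predicted group labels of the training sample. The following is a small sketch; the layout of the confusion matrix (rows for the actual population, columns for the predicted one) and the function names are assumptions made for this illustration.

```python
def confusion_matrix(actual, predicted):
    """Return 2x2 counts n[i][j]: items from pi_(i+1) classified into pi_(j+1).

    Labels are expected to be 1 or 2.
    """
    n = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        n[a - 1][p - 1] += 1
    return n


def apparent_error_rate(n):
    """Fraction of training items that fall in the off-diagonal cells."""
    total = sum(sum(row) for row in n)
    return (n[0][1] + n[1][0]) / total


if __name__ == "__main__":
    actual = [1, 1, 1, 2, 2, 2]
    predicted = [1, 2, 1, 2, 2, 1]
    table = confusion_matrix(actual, predicted)
    print(table, apparent_error_rate(table))
```

Since the same items are used both to build and to evaluate the rule, this estimate tends to be optimistic, which is the bias discussed above.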

Simulation Experiments and Results
The two classification procedures are evaluated at each of the 118 configurations of $n$, $r$ and $d$. The 118 configurations of $n$, $r$ and $d$ are all possible combinations of $n$ = 40, 60, 80, 100, 200, 300, 400, 600, 700, 800, 900, 1000, $r$ = 3, 4, 5 and $d$ = 0.1, 0.2, 0.3 and 0.4. A simulation experiment which generates the data and evaluates the procedures is now described.

A glance at Tables 1, 2 and 3 shows that the mean apparent error rates increase with the sample size under both classification rules, whereas the actual error rate decreases as the sample size increases. The Location Model ranked first, followed by the Logistic Regression Model. The results presented in this paper have practical implications for the decision of when to use each technique. Because of its demonstrated high levels of accuracy, the Location Model may be the method of choice when the researcher is most interested in recovering the highest percentage of correct classifications. However, as the results of both studies indicate, there are times when Logistic Modeling may provide a better alternative. In situations where the number of variables is large, Logistic Modeling has demonstrated higher levels of classification accuracy.
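The data-generation and plug-in steps of such a simulation can be sketched as follows. This is not the authors' R program: it is a minimal Python sketch that assumes independent Bernoulli components within each population, balanced samples of size $n/2$, and a full-multinomial plug-in rule that allocates each observed pattern to the population in which its relative frequency is larger. It does not fit the location or logistic models themselves, and the function names, seed and number of replications are hypothetical choices.

```python
import random
from collections import Counter


def sample_patterns(n, p, rng):
    """Draw n binary patterns whose jth component is Bernoulli(p[j])."""
    return [tuple(1 if rng.random() < pj else 0 for pj in p) for _ in range(n)]


def apparent_error_full_multinomial(sample1, sample2):
    """Apparent error rate of the plug-in full-multinomial rule.

    A pattern is allocated to pi_1 when its relative frequency in sample1
    is at least its relative frequency in sample2, otherwise to pi_2.
    """
    c1, c2 = Counter(sample1), Counter(sample2)
    n1, n2 = len(sample1), len(sample2)
    errors = 0
    for x in set(c1) | set(c2):
        to_pi1 = c1[x] / n1 >= c2[x] / n2
        # Items of the other population that share this pattern are misclassified.
        errors += c2[x] if to_pi1 else c1[x]
    return errors / (n1 + n2)


def mean_apparent_error(n, p1, p2, replications=100, seed=0):
    """Average apparent error rate over repeated balanced training samples."""
    rng = random.Random(seed)
    rates = []
    for _ in range(replications):
        s1 = sample_patterns(n // 2, p1, rng)
        s2 = sample_patterns(n // 2, p2, rng)
        rates.append(apparent_error_full_multinomial(s1, s2))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    for n in (40, 100, 400, 1000):
        print(n, round(mean_apparent_error(n, [0.3] * 3, [0.7] * 3), 4))
```

Fixing the seed makes a run reproducible; varying `n`, the number of components and the gap between `p1` and `p2` mimics the roles of $n$, $r$ and $d$ in the design above.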

Table 1.
Confusion matrix for the apparent error rate.

The simulation experiment proceeds as follows.
(i) A training data set of size $n$ is generated via an R program, where $n_1 = n/2$ observations are sampled from $\pi_1$, which has a multivariate Bernoulli distribution with input parameter $p_1$, and $n_2 = n/2$ observations are sampled from $\pi_2$, which is multivariate Bernoulli with input parameter $p_2$, for each of the $j = 1, \dots, r$ variables. These samples are used to construct the rule for each procedure, and an estimate of the probability of misclassification for each procedure is obtained by the plug-in rule or the confusion matrix in the sense of the full multinomial.
(ii) The likelihood ratios are used to define classification rules. The plug-in estimates of error rates are determined for each of the classification rules.

Table 2.
Effect of input parameters P1 and P2 on classification rules at various values of sample size and replications (mean apparent error rates).

Table 3.
Effect of input parameters P1 and P2 on classification rules at various values of sample size and replications (actual error rates).

Table 4.
Total ranks for performance on 21 population pairs by classification rules.