ON BALANCING BETWEEN OPTIMAL AND PROPORTIONAL CATEGORICAL PREDICTIONS

Abstract. A bias-variance dilemma in categorical data mining and analysis is the fact that a prediction method can aim at maximizing the overall point-hit accuracy either without constraint or with the constraint of minimizing the distribution bias. However, one can hardly achieve both at the same time. A scheme to balance these two prediction objectives is proposed in this article. An experiment with a real data set is conducted to demonstrate some of the scheme's characteristics. Some basic properties of the scheme are also discussed.


1. Introduction.
A bias-variance dilemma in categorical data mining and analysis is the fact that a prediction method can aim at maximizing the overall point-hit accuracy either without constraint or with the constraint of minimizing the distribution bias, but can hardly achieve both at the same time. The dilemma was identified, analyzed and illustrated by S. Geman et al. [9] in 1992. The origin of this dilemma is that a machine learning algorithm claiming to be distribution unbiased has to pay the price of high variance: for the prediction distribution to be as close as possible to the real target's distribution, one has to expect a high point-to-point prediction error, and vice versa. This issue has also been widely discussed from practical, technical points of view since then, sometimes under different terminologies. Yaniv and Foster [23] examined an "accuracy-informativeness" trade-off in three judgement estimation studies and proposed a trade-off model and a trade-off parameter to describe the penalty for lack of informativeness. Friedman [8] described the similar problem in classification of distribution bias versus variance, suggesting that a lower bias tends to increase variance and thus there is always a "bias-variance trade-off". It has been noticed that the Monte Carlo method, which is usually considered distribution unbiased, has a problem in point-hit accuracy compared to optimal estimation; Mark and Baram [21] thus suggested an improvement to increase the accuracy at a loss to unbiasedness. Yu et al. [24] extended the trade-off to a bias-variance-complexity trade-off framework and proposed a complex-system modeling approach that optimizes a model selection criterion of bias, variance and complexity. Zhou et al. [25] detailed a solution to the dilemma in a recommender system.
They linearly combine two methods, one favoring diversity and the other accuracy, and suggest that the solution is to define a utility function over the combinations and to optimize it by tuning the combination coefficient.
Most of the discussions above study the bias and variance of a numerical response variable. R. Tibshirani discussed their categorical equivalents in [22]. A generalized bias-variance formulation is discussed in [5] and [16].
To illustrate this dilemma, we consider only the purely categorical situation in this article: both explanatory and response variables are of (nominal) categorical type. We also consider a data set consisting of only two categorical variables X and Y, assume that Y is to a certain degree associated with X, and that Y has some unknown values to be estimated.
However, it should be noted that a data set in the practice of big data mining is usually high dimensional with mixed data types. It can nevertheless be viewed as two categorical variables after a few proper preprocessing steps. A numerical target variable Y can be categorized by unsupervised discretization methods; the same can be done for the numerical source variables by supervised discretization methods; a proper supervised feature selection can reduce the source variables to such a small yet powerful set that they can be viewed as one explanatory variable. One can refer to [4,11,14,19] for details regarding feature selection. Discussions of discretization can be found in [15].
To estimate the unknown values of Y for any given known value of X, we may estimate Y either by maximum likelihood, a.k.a. the conditional mode or the optimal prediction in [10, Section 5], or by expectation, the proportional prediction in [10, Section 9]. The former yields the highest point-hit accuracy rate without any consideration of distribution bias. The latter produces the highest point-hit accuracy under the constraint of the least distribution bias of Y. The point-hit accuracy rate achieved by the latter approach is in general lower than that of the former. Indeed, when the sample size is large enough and representative, and the unknown part of Y is random, the point-hit accuracy rate difference between the optimal and proportional predictions is, according to [10, Sections 5 and 9],

Σ_i p(x_i) [ max_j p(y_j | x_i) − Σ_j p(y_j | x_i)^2 ] ≥ 0,

where the equality holds if and only if Y is completely dependent on X.
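Under these large-sample assumptions, both expected accuracy rates can be computed directly from a joint contingency table. The sketch below is only an illustration (the function name and the counts are ours, not from the article): it computes the optimal rate Σ_i max_j p(x_i, y_j) and the proportional rate Σ_{i,j} p(x_i, y_j)^2 / p(x_i).

```python
def expected_accuracies(counts):
    """Expected point-hit accuracy of the optimal (conditional-mode) and the
    proportional prediction, from joint counts n[x][y]."""
    total = sum(sum(row.values()) for row in counts.values())
    a_opt = 0.0   # sum_i max_j p(x_i, y_j)
    a_prop = 0.0  # sum_{i,j} p(x_i, y_j)^2 / p(x_i)
    for row in counts.values():
        n_i = sum(row.values())
        a_opt += max(row.values()) / total
        a_prop += sum((n / total) ** 2 for n in row.values()) / (n_i / total)
    return a_opt, a_prop

# Illustrative joint counts for X in {x1, x2}, Y in {y1, y2}
opt, prop = expected_accuracies({"x1": {"y1": 40, "y2": 10},
                                 "x2": {"y1": 20, "y2": 30}})
# opt = 0.7, prop = 0.6: the optimal rate dominates, as the inequality states
```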
A very simple example of this dilemma can be described as follows. A table with 90 rows and 2 columns, A and B, has 10 unknown values in B to be estimated (or predicted), as shown in Table 1. Please note that NA represents an unknown value in the table. The mission is to estimate the unknowns with low distribution bias and high point-hit accuracy.
For simplicity, we assume that the unknown part has exactly the same conditional distribution as the known part, i.e., the proportion of b_1 to b_2 given a_1 is 3 : 1 and that of b_1 to b_2 given a_2 is 3 : 5. To minimize the imputation error, the prediction for A = a_1 has to be b_1 and the prediction for A = a_2 has to be b_2, which is an inevitably biased imputation; to reduce the level of bias, the ratio of b_1 to b_2 when A = a_1 needs to be 3 : 1, and the same ratio should be 3 : 5 when A = a_2. The expected accuracy rate of the first case is 0.6875 and that of the second case is only 0.578125.
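These two rates can be checked by direct arithmetic, assuming the 10 unknown rows split evenly between a_1 and a_2 (this split is our reading of Table 1; it is consistent with the rates quoted above):

```python
# Mode prediction: always b1 given a1 (hit rate 3/4), always b2 given a2 (5/8).
acc_mode = 0.5 * (3 / 4) + 0.5 * (5 / 8)
print(acc_mode)  # 0.6875

# Proportional prediction: the hit rate within each group is the sum of
# squared conditional probabilities.
acc_prop = 0.5 * ((3 / 4) ** 2 + (1 / 4) ** 2) + 0.5 * ((3 / 8) ** 2 + (5 / 8) ** 2)
print(acc_prop)  # 0.578125
```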
In general, for a given conditional distribution {p(y_j | x_i)}, j = 1, ..., k, the optimal prediction for X = x_i is the conditional mode, argmax_{1≤j≤k} p(y_j | x_i), which yields the expected maximum accuracy. The overall accuracy rate is then

Σ_i p(x_i) max_j p(y_j | x_i) = Σ_i max_j p(x_i, y_j).

It is equivalent to the Goodman-Kruskal λ [10, Section 5], denoted by λ_{Y|X},

λ_{Y|X} = ( Σ_i max_j p(x_i, y_j) − max_j p(y_j) ) / ( 1 − max_j p(y_j) ).

On the other hand, the least distribution-biased prediction, or the prediction with the maximum expectation (the proportional prediction [10, Section 9]), is to predict Y by the exact conditional probability of Y on X, i.e., to predict Y = y_j with probability p(y_j | x_i). The expected accuracy rate is

Σ_i Σ_j p(x_i) p(y_j | x_i)^2 = Σ_{i,j} p(x_i, y_j)^2 / p(x_i).

This accuracy rate is linked to the Goodman-Kruskal τ [10, Section 9] (the GK-tau, denoted by τ_{Y|X}) as follows:

τ_{Y|X} = ( Σ_{i,j} p(x_i, y_j)^2 / p(x_i) − Σ_j p(y_j)^2 ) / ( 1 − Σ_j p(y_j)^2 ),

where p(y_j) = Σ_i p(x_i, y_j). It can be proven ([14]) that the accuracy rate of the proportional prediction is the highest point-hit accuracy rate under the constraint that the estimated part of Y has the same distribution as the known part of Y.
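The two association measures can be computed from a joint probability table. Below is a minimal sketch (function and variable names are ours, not the article's), applied to the 3 : 1 and 3 : 5 conditional ratios of the Table 1 example:

```python
from collections import defaultdict

def goodman_kruskal(p_xy):
    """Goodman-Kruskal lambda_{Y|X} and tau_{Y|X} from a joint probability
    table p_xy[x][y]; each row sums to the marginal p(x)."""
    p_y = defaultdict(float)
    for row in p_xy.values():
        for y, p in row.items():
            p_y[y] += p
    # lambda: relative reduction of the modal-guess error when X is known
    lam = (sum(max(row.values()) for row in p_xy.values()) - max(p_y.values())) \
          / (1 - max(p_y.values()))
    # tau: relative reduction of the proportional-guess error when X is known
    e_y = 1 - sum(p * p for p in p_y.values())
    e_y_given_x = 1 - sum(sum(p * p for p in row.values()) / sum(row.values())
                          for row in p_xy.values())
    tau = (e_y - e_y_given_x) / e_y
    return lam, tau

# Joint table with the 3:1 and 3:5 conditional ratios, equal marginals for a1, a2
lam, tau = goodman_kruskal({"a1": {"b1": 0.375, "b2": 0.125},
                            "a2": {"b1": 0.1875, "b2": 0.3125}})
```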

WENXUE HUANG AND YUANYI PAN
Thus, if all variables are categorical and the sample is representative and large enough, either the highest observable (realistic) point-hit accuracy rate with the lowest distribution bias or the highest point-hit accuracy with no regard for response distribution bias can be achieved via an appropriate feature selection based on the corresponding association measures, as discussed above. But generally the two optimizations cannot be realized at the same time, hence one may want to achieve a certain balance between the two. This is exactly what we propose in this article: a scheme to balance the objective of maximizing the prediction accuracy against that of minimizing the prediction bias. Some experiments with the real data set Famex96 are conducted to demonstrate the characteristics of this scheme. Basic mathematical properties of the scheme are also discussed. Please note that these experiments are designed to estimate the unknown values in a table with a response variable. To focus on this subject, we ignore all other important issues in high-dimensional, mixed-type data prediction such as discretization, feature selection, model selection, etc.
The definition of the scheme is given in Section 2 along with the prediction strategy. The experiments are discussed in Section 3, where the relationship between the parameter introduced in the scheme and the prediction performance is also studied. The last section contains the concluding remarks and future work.

2. The balancing scheme.
Our discussion is about a framework balancing the expected point-hit accuracy under distribution faithfulness against the likely maximum point-hit accuracy. The variable with unknown values is considered the response (or dependent) variable, while the others are the explanatory (or independent) variables. The data set is divided into two parts by the values of the response variable: all rows with known values in the response variable go to the learning part and the others go to the prediction part. The response variable in the prediction part will be predicted using the values of its independent variables and the information learned from the learning part. Please note that all variables in both parts are considered categorical.
Assume that the response variable Y in the learning part has k distinct values: y_1, y_2, ..., y_k. To simplify the discussion, we assume that there is only one source variable X in both parts and that X in the learning part has n distinct values: x_1, x_2, ..., x_n. A threshold θ is then defined as follows.
θ = (1 − α) · max_{1≤i≤n} max_{1≤j≤k} p(y_j | x_i) + 0.5 · α · min_{1≤i≤n} max_{1≤j≤k} p(y_j | x_i),

where α ∈ [0, 1] and p(·) is the probability of its argument. Apparently, θ is a point between half of the minimum maximum conditional probability and the maximum maximum conditional probability. The prediction method for a given X = x_i can then be described as follows. If its maximum conditional probability max_j p(y_j | x_i) is greater than the predefined threshold θ, its prediction is made in favor of increasing point-hit accuracy; otherwise its prediction is made in favor of lowering bias. The underlying idea of this scheme is that how the unknowns are predicted depends on the tradeoff level, or balancing rate, α between the lowest bias and the highest accuracy.
Please note that the coefficient 0.5 is just a choice of convenience. Any positive number less than 1 would play the same role, which is to ensure that all predictions are conditional-mode based when α = 1.
Our prediction to increase the point-hit accuracy predicts the unknowns by the conditional mode. To lower bias, Monte Carlo simulation is used: y_j is randomly picked according to the conditional distribution p(Y = y_j | X = x_i).
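Putting the threshold and the two branches together, a minimal sketch of the scheme might look as follows. All names are ours, and the threshold formula is our reading of the definition above (sliding from the maximum maximum conditional probability at α = 0 down to half the minimum maximum conditional probability at α = 1):

```python
import random

def balanced_predict(cond, alpha, x, rng=random):
    """Predict Y for X = x under the balancing scheme.
    cond maps each x_i to its conditional distribution {y_j: p(y_j|x_i)};
    alpha in [0, 1] is the balancing rate."""
    maxima = [max(dist.values()) for dist in cond.values()]
    # theta slides from max_i max_j p(y_j|x_i) at alpha = 0 (all Monte Carlo)
    # down to 0.5 * min_i max_j p(y_j|x_i) at alpha = 1 (all conditional mode)
    theta = (1 - alpha) * max(maxima) + 0.5 * alpha * min(maxima)
    dist = cond[x]
    if max(dist.values()) > theta:
        return max(dist, key=dist.get)   # accuracy branch: conditional mode
    ys, ps = zip(*dist.items())          # bias branch: Monte Carlo draw
    return rng.choices(ys, weights=ps, k=1)[0]

# Conditional distributions with the 3:1 and 3:5 ratios of the Table 1 example
cond = {"a1": {"b1": 0.75, "b2": 0.25}, "a2": {"b1": 0.375, "b2": 0.625}}
```

With alpha = 1 every x is predicted by its conditional mode; with alpha = 0 every prediction is a Monte Carlo draw from p(Y | X = x).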
3. Empirical experiment and discussion. The data set that we use in this experiment is The Survey of Family Expenditure conducted by Statistics Canada in 1996 (Famex96) [6]. This data set has 10,417 rows and 239 columns. We specifically choose some of its categorical variables to investigate how the prediction accuracy and bias are affected by the balancing rate introduced in the last section and by the unknown proportion of the response variable. To focus on this subject, only two categorical variables are included in each experiment, one as the independent variable and the other as the dependent variable. We also randomly generate unknown values only in the response variable for the same reason.
It should be mentioned that there are various types of unknown (or missing) values. Three types were introduced in [3,18]: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). [1] classifies missing values into four types: missing by definition of the subpopulation, missing completely at random (MCAR), missing at random (MAR), and nonignorable (NI) missing values. Each type usually requires a different processing method. Here the missing values are generated completely at random for the sake of simplicity.
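MCAR masking of a response column can be sketched as below; this is a generic illustration with names of our choosing, not the article's own code.

```python
import random

def mask_mcar(values, rate, rng=random):
    """Return a copy of the response column where each entry is independently
    replaced by None (NA) with probability `rate` -- MCAR missingness."""
    return [None if rng.random() < rate else v for v in values]

column = ["b1", "b2", "b1", "b1", "b2", "b2", "b1", "b2"]
masked = mask_mcar(column, 0.25, random.Random(42))  # each entry has a 25% chance of becoming None
```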
The first experiment uses type of dwelling (HSGTYPE) as the independent variable and household type categories (HHTYPE) as the dependent variable. When the missing rate, denoted r, is 0.05, the learning part has 9,899 rows and the prediction part has 518 rows. The contingency table for the learning part is listed in Table 2. Observe that the bold entries in this table represent the maximal conditional probabilities, which will be the result of a prediction by (conditional) modes. Table 3 and Table 4 are the prediction results for the balancing rates α = 0 and α = 1 respectively. The simple match rate is used to measure the point-hit accuracy, which gives us an accuracy rate of 0.41 when α = 1 and an accuracy of 0.3 when α = 0. The distribution bias is evaluated by a measure Δ inspired by the Kullback-Leibler divergence [17].
As with the K-L divergence, Δ is smaller when the prediction's distribution is closer to the real one. There are also two advantages of Δ over the K-L divergence: (1) Δ does not overestimate the case of a category missing in the prediction; (2) Δ has a fixed range of [0, 1]. By this definition of bias, α = 1 gives a bias of 0.3 and α = 0 gives a bias of 0.14, which supports our claims about the balancing rate's property regarding bias.
When α varies from 0 to 1, Figure 1 shows the expected increase of both accuracy and bias.
Finally, Figure 3 shows that the effect of the missing rate on the prediction performance is negligible.

4. Discussion and future work. In conclusion, sacrifices in point-hit accuracy have to be made to achieve the least bias in prediction and vice versa. To address this tradeoff, we define a balancing scheme so that the prediction accuracy can be reduced to a certain level to tune down the prediction distribution bias. We introduce a balancing rate, a parameter α with 0 ≤ α ≤ 1, to measure this tradeoff level. When a categorical independent value's maximum conditional probability is less than a threshold calculated from this rate, it is considered less important in contributing to the accuracy rate and is thus predicted to minimize the bias, i.e., by a Monte Carlo simulation. Otherwise, it is better predicted by the conditional mode to achieve the best accuracy. Experiments show how the balancing rate affects the prediction performance and how the tradeoff effect changes along with it. In the future we will focus on how this scheme can be extended to other predictive methods such as neural networks, clustering and decision trees.