Weighted Mean Squared Deviation Feature Screening for Binary Features

In this study, we propose a novel model-free feature screening method, called weighted mean squared deviation (WMSD), for ultrahigh dimensional binary features in binary classification. Compared with the Chi-square statistic and mutual information, WMSD provides more opportunities to binary features with probabilities near 0.5. In addition, the asymptotic properties of the proposed method are theoretically investigated under the assumption log p = o(n). The number of features is selected in practice by a Pearson correlation coefficient method based on a property of the power-law distribution. Lastly, an empirical study of Chinese text classification illustrates that the proposed method performs well when the number of selected features is relatively small.


Introduction
Feature screening is a practical and powerful tool in the data analysis and statistical modeling of ultrahigh dimensional data, such as genomes, biomedical images, and text data. In supervised learning, the features often satisfy a sparsity assumption, i.e., only a small number of features among a large collection are relevant to the response. Accordingly, Fan and Lv [1] proposed a sure independence screening method based on correlation learning for the linear model and theoretically proved its screening consistency. Subsequently, a series of model-free feature screening methods were proposed that do not require model specification [2][3][4][5][6][7]. These methods learn the marginal relationships between the response and the features, and filter out the features with weak relationships to the response.
In this study, we focus on feature screening for binary classification with ultrahigh dimensional binary features. The purpose of feature screening in classification is to filter out the large number of irrelevant features that are unhelpful for discriminating the class labels, while taking both computational speed and classification accuracy into account. For categorical features, statistical tests (e.g., the Chi-square test) [8,9], information-theoretic measures (e.g., information gain, mutual information, cross entropy) [10][11][12][13], and Bayesian methods [14,15] are commonly used for feature screening, especially in the field of text classification. In this study, we propose a novel model-free feature screening method called weighted mean squared deviation (WMSD), which can be considered a simplified version of the Chi-square statistic and mutual information. Next, based on a property of the power-law distribution [16,17], a Pearson correlation coefficient method is developed to select the number of relevant features. Lastly, the proposed method is applied to Chinese text classification, where it outperforms the Chi-square statistic and mutual information when a small number of words are selected.
The rest of this article is organized as follows. In Section 2.1, we introduce the weighted mean squared deviation feature screening method and investigate its asymptotic properties. In Section 2.2, a Pearson correlation coefficient method is developed for model selection based on a property of the power-law distribution. In Section 2.3, the relationships between the Chi-square statistic, mutual information, and WMSD are discussed. In Section 3, the performance of the proposed method is numerically confirmed on both simulated and empirical datasets. Lastly, conclusions are given in Section 4. Derivations and theoretical proofs are given in Appendices A and B.

Weighted Mean Squared Deviation
For a general classification task, let (X_i, Y_i), 1 ≤ i ≤ n, be n independent and identically distributed observations. For the i-th observation, X_i = (X_i1, · · · , X_ip) ∈ {0, 1}^p is the associated p-dimensional binary feature vector, and Y_i ∈ {0, 1} is the corresponding binary class label. Denote the necessary parameters as π = P(Y_i = 1), θ_j = P(X_ij = 1), θ_1j = P(X_ij = 1 | Y_i = 1), θ_0j = P(X_ij = 1 | Y_i = 0), μ_1j = P(X_ij = 1, Y_i = 1), and μ_0j = P(X_ij = 1, Y_i = 0). Under the model-free feature screening framework, we need to filter out the features that are irrelevant to (i.e., independent of) the class label, namely those with

ω_j = π(θ_1j − θ_j)^2 + (1 − π)(θ_0j − θ_j)^2 = 0. (1)

Note that the probabilities of the two classes serve as weights in ω_j. Conversely, the j-th feature is relevant if and only if ω_j ≠ 0. We then define the true model as T = {j : ω_j ≠ 0, 1 ≤ j ≤ p} with model size |T| = d_0 and the full model as F = {1, · · · , p}.
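Assuming the weight takes the weighted-deviation form ω_j = π(θ_1j − θ_j)^2 + (1 − π)(θ_0j − θ_j)^2, and using the identity θ_j = πθ_1j + (1 − π)θ_0j, the weight collapses to a compact product form that makes the irrelevance criterion transparent:

```latex
\begin{aligned}
\omega_j &= \pi(\theta_{1j}-\theta_j)^2 + (1-\pi)(\theta_{0j}-\theta_j)^2 \\
         &= \pi\{(1-\pi)(\theta_{1j}-\theta_{0j})\}^2
            + (1-\pi)\{\pi(\theta_{1j}-\theta_{0j})\}^2 \\
         &= \pi(1-\pi)(\theta_{1j}-\theta_{0j})^2 .
\end{aligned}
```

Hence ω_j = 0 if and only if θ_1j = θ_0j, i.e., X_ij is independent of Y_i, and under conditions of the form ε ≤ π ≤ 1 − ε and |θ_1j − θ_0j| ≥ ε the relevant features satisfy ω_j ≥ ε^3(1 − ε).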
Next, the Laplace smoothing method [18] is adopted for parameter estimation, so that all estimators are bounded away from 0 and 1. The parameter estimators are

π̂ = (Σ_{i=1}^n Y_i + 1)/(n + 2), μ̂_1j = (Σ_{i=1}^n X_ij Y_i + 1)/(n + 2), μ̂_0j = (Σ_{i=1}^n X_ij(1 − Y_i) + 1)/(n + 2).

It is easy to verify that θ̂_1j = μ̂_1j/π̂, θ̂_0j = μ̂_0j/(1 − π̂), and θ̂_j = μ̂_1j + μ̂_0j, for 1 ≤ j ≤ p. Then a model-free feature screening statistic, called the weighted mean squared deviation (WMSD), is constructed as

ω̂_j = π̂(θ̂_1j − θ̂_j)^2 + (1 − π̂)(θ̂_0j − θ̂_j)^2, (2)

which is an estimator of ω_j. Features far from independence should be selected: intuitively, features with larger ω̂_j values are more likely to be relevant, and those with smaller ω̂_j values less so. Consequently, an estimated model is defined as M̂ = {j : ω̂_j > c, j ∈ F}, where c is a positive critical value. The following theorem provides the asymptotic properties of the WMSD method under the ultrahigh dimensional assumption.

Theorem 1. Assume log p = o(n) and that there exists a positive constant ε < 1/3 such that ε ≤ π ≤ 1 − ε, ε ≤ θ_kj ≤ 1 − ε for any k ∈ {0, 1} and j ∈ F, and |θ_1j − θ_0j| ≥ ε for j ∈ T. Then: (1) max_{1≤j≤p} |ω̂_j − ω_j| → 0 in probability; (2) P(M̂ = T) → 1 for any critical value c ∈ (0, ε^3(1 − ε)).

Note that the conditions ε ≤ π ≤ 1 − ε and ε ≤ θ_kj ≤ 1 − ε imply that all parameters are bounded away from 0 and 1, and the condition |θ_1j − θ_0j| ≥ ε implies P(X_ij = 1 | Y_i = 1) ≠ P(X_ij = 1 | Y_i = 0) for j ∈ T. Theorem 1 states that (1) ω̂_j is a consistent estimator of ω_j and (2) M̂ is a consistent estimator of T as long as the critical value c lies between 0 and ε^3(1 − ε), which establishes the strong screening consistency of WMSD. However, this lower bound is unknown in real applications. To this end, a practicable method is proposed in the next section. The proof of this theorem is given in Appendix A.
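The estimation steps above can be sketched in a few lines of Python (a minimal sketch; the function name is ours, and the smoothing is assumed to take the (count + 1)/(n + 2) form):

```python
import numpy as np

def wmsd(X, y):
    """WMSD screening statistics for binary features X (n x p) and labels y.

    A sketch of the screening step, assuming Laplace-smoothed frequencies
    of the form (count + 1) / (n + 2)."""
    n, p = X.shape
    pi = (y.sum() + 1) / (n + 2)                 # estimate of P(Y = 1)
    mu1 = (X[y == 1].sum(axis=0) + 1) / (n + 2)  # estimate of P(X_j = 1, Y = 1)
    mu0 = (X[y == 0].sum(axis=0) + 1) / (n + 2)  # estimate of P(X_j = 1, Y = 0)
    theta1 = mu1 / pi                            # estimate of P(X_j = 1 | Y = 1)
    theta0 = mu0 / (1 - pi)                      # estimate of P(X_j = 1 | Y = 0)
    theta = mu1 + mu0                            # estimate of P(X_j = 1)
    # weighted mean squared deviation of class-conditional from marginal
    return pi * (theta1 - theta) ** 2 + (1 - pi) * (theta0 - theta) ** 2
```

A feature that tracks the label closely receives a larger statistic than one that is independent of it.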

Feature Selection Via Pearson Correlation Coefficient
While the true model T can in theory be recovered by Theorem 1, the result depends strongly on the critical value c, which is not given beforehand in empirical studies and always varies with the data. To solve this problem, the following strategy is developed for feature selection. First, without loss of generality, assume the features have been reordered such that ω̂_1 > ω̂_2 > · · · > ω̂_p. Then all candidate models are given by M = {M^(d) : M^(d) = {1, · · · , d}, 1 ≤ d ≤ p}, a finite set of p nested candidate models. Thus, the original problem of determining a critical value c from (0, +∞) is converted into a model selection problem over the model set M. Next, to the best of our knowledge of text classification, the relatively large ω̂_j values of irrelevant features approximately follow a power-law distribution, whereas the ω̂_j values of relevant features and the relatively small ω̂_j values of irrelevant features do not fit a power-law distribution well. The density function of a power-law distribution can be represented as p(x) = {(α − 1)/x_0}(x/x_0)^{−α} for x ≥ x_0, where the power parameter α > 1 and the lower bound parameter x_0 > 0. A typical property of the power-law distribution is that it obeys log p(x) = −α log x + C, i.e., it follows a straight line on a doubly logarithmic plot, where C is a constant depending on α and x_0. Therefore, a common way to probe for power-law behavior is to construct the frequency distribution histogram of the data and plot it on doubly logarithmic axes; if the doubly logarithmic histogram approximately falls on a straight line, the data can be considered to follow a power-law distribution [16]. This inspires us to use the Pearson correlation coefficient of the doubly logarithmic plot of the ω̂_j values to find an optimal model in M. The Pearson correlation coefficient of the sequences {log j}_{1≤j≤m} and {log ω̂_j}_{d≤j≤d+m−1} can be represented as

r_d = Σ_{j=1}^m (log j − ā)(log ω̂_{d+j−1} − b̄_d) / [{Σ_{j=1}^m (log j − ā)^2}^{1/2} {Σ_{j=1}^m (log ω̂_{d+j−1} − b̄_d)^2}^{1/2}], (3)

where ā = m^{−1} Σ_{j=1}^m log j, b̄_d = m^{−1} Σ_{j=1}^m log ω̂_{d+j−1}, and m is the number of points used when calculating the Pearson correlation coefficient.
Obviously, the absolute value of r_d measures how closely the sequence {ω̂_j}_{d≤j≤d+m−1} approximates a power-law distribution. Thus, the best model is selected as M̂ = M^(d̂), with

d̂ = argmax_{d_min ≤ d ≤ d_max} |r_{d+1}|, (4)

where d_min and d_max are the smallest and largest true model sizes to be considered. In other words, if the sequence {ω̂_j}_{d̂+1≤j≤d̂+m} fits the power-law distribution best over all candidate contiguous subsequences of {ω̂_j}_{1≤j≤p}, then the features with d̂ + 1 ≤ j ≤ d̂ + m are more likely to be irrelevant and the features with 1 ≤ j ≤ d̂ are more likely to be relevant. As a result, the Pearson correlation coefficient method is adopted to determine the model size estimated by WMSD.
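The selection rule can be sketched as follows (the function name and implementation details are ours; m, d_min and d_max must be supplied by the user):

```python
import numpy as np

def select_model_size(omega, m=100, d_min=10, d_max=100):
    """Pick the model size d maximizing |r_{d+1}|: the Pearson correlation
    between {log j} and the doubly logarithmic tail of the sorted statistics.
    A sketch of the Pearson correlation coefficient method."""
    order = np.argsort(omega)[::-1]        # feature indices, decreasing omega
    w = np.sort(omega)[::-1]               # omega_1 > omega_2 > ... > omega_p
    log_j = np.log(np.arange(1, m + 1))
    best_d, best_r = d_min, -np.inf
    for d in range(d_min, d_max + 1):
        seg = np.log(w[d : d + m])         # omega_{d+1}, ..., omega_{d+m}
        r = np.corrcoef(log_j, seg)[0, 1]  # Pearson correlation r_{d+1}
        if abs(r) > best_r:
            best_d, best_r = d, abs(r)
    return best_d, order[:best_d]
```

When the tail of the sorted statistics beyond some d follows a power law exactly, the doubly logarithmic correlation there is ±1, so that d is selected.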
In numerical studies, the parameters m, d_min and d_max must be specified beforehand based on empirical experience. The numerical results suggest that this feature selection method works quite well on both simulated and empirical data.

The Relationships between Chi-Square Statistic, Mutual Information and WMSD
As is well known, the Chi-square statistic and mutual information are two popularly used feature screening methods for discrete features. Next, the relationships between these two methods and WMSD are investigated. According to the definitions of the parameter estimators above, the Chi-square statistic can be represented as

χ̂²_j = n ω̂_j / {θ̂_j(1 − θ̂_j)}, (5)

which shows the relationship between the Chi-square statistic and WMSD (see Appendix B.1 for a detailed derivation). Thus, WMSD can be considered a simplified version of the Chi-square statistic: it omits the factor {θ̂_j(1 − θ̂_j)}^{−1}, which is smallest when θ̂_j is near 0.5, and therefore gives features with probabilities near 0.5 more opportunity to be selected.
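The algebraic link between the two statistics can be checked numerically. The sketch below assumes the product identity ω̂_j = π̂(1 − π̂)(θ̂_1j − θ̂_0j)², which follows from θ̂_j = π̂θ̂_1j + (1 − π̂)θ̂_0j, and the Chi-square-type rescaling by n/{θ̂_j(1 − θ̂_j)}; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
y = rng.integers(0, 2, n)
X = rng.integers(0, 2, (n, p))

# Laplace-smoothed estimators, as in the screening step
pi = (y.sum() + 1) / (n + 2)
mu1 = (X[y == 1].sum(axis=0) + 1) / (n + 2)
mu0 = (X[y == 0].sum(axis=0) + 1) / (n + 2)
t1, t0, t = mu1 / pi, mu0 / (1 - pi), mu1 + mu0

# WMSD in its defining form and in the equivalent product form
omega = pi * (t1 - t) ** 2 + (1 - pi) * (t0 - t) ** 2
omega_alt = pi * (1 - pi) * (t1 - t0) ** 2
assert np.allclose(omega, omega_alt)

# Rescaling by n / {t(1 - t)} yields a Chi-square-type statistic;
# the denominator is what penalizes features with t near 0.5.
chi2_like = n * omega / (t * (1 - t))
```

The denominator t(1 − t) is maximized at t = 0.5, so dropping it (as WMSD does) relatively favors features whose marginal probability is near 0.5.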

Simulation Study
To evaluate the finite sample performance of the WMSD feature screening method for binary classification with binary features, two standard feature screening methods are considered as competitors, i.e., the Chi-square statistic (Chi2) and mutual information (MI). In addition, to investigate the robustness of the proposed method under different classifiers, two popularly used classification methods are considered, i.e., naive Bayes (NB) and logistic regression (LR). To generate the simulated data, a multivariate Bernoulli model [19] with both relevant and irrelevant binary features is considered. Moreover, different sample sizes of the training set (i.e., n = 1000, 2000, 5000), different feature dimensions (i.e., p = 500, 1000), and different true model sizes (i.e., d_0 = 20, 50) are considered in the parameter setup. For each fixed parameter setting, a total of 1000 simulation replications are conducted. For each simulated dataset, the three feature screening methods are applied, i.e., Chi2, MI and WMSD. Subsequently, the false positive rate (FPR) of WMSD, defined as FPR = |T \ M̂|/|T|, is calculated. In the same way, the false negative rate (FNR) of WMSD, defined as FNR = |(F \ T) ∩ M̂|/|F \ T|, is also calculated. Average FPR and FNR values over the 1000 replications are reported. Lastly, in order to evaluate classification performance, another 1000 independent observations are generated as the testing sample for each replication. Then, the area under the receiver operating characteristic curve (AUC) is adopted to evaluate out-of-sample prediction accuracy. The AUC values of NB and LR on the three estimated models (separately selected by Chi2, MI and WMSD) are calculated on the testing sample and averaged over the 1000 replications.
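The two screening error rates, exactly as defined above (note the study's own naming convention), can be computed directly from index sets; the function name is ours:

```python
def screening_rates(true_set, selected_set, p):
    """FPR and FNR as defined in the simulation study:
    FPR = |T minus M| / |T|          (relevant features missed),
    FNR = |(F minus T) and M| / |F minus T|  (irrelevant features kept),
    where F = {1, ..., p} is the full model."""
    F = set(range(1, p + 1))
    T, M = set(true_set), set(selected_set)
    fpr = len(T - M) / len(T)
    fnr = len((F - T) & M) / len(F - T)
    return fpr, fnr
```

For example, with T = {1, ..., 5}, a selected model {1, 2, 3, 6} and p = 10, two of five relevant features are missed and one of five irrelevant features is kept.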
For the given simulation model and parameter setup, the simulated data are generated as follows. First, generate the class label Y_i ∈ {0, 1} with probability P(Y_i = 1) = π = 0.5 for the balanced case and π = 0.8 for the unbalanced case. Next, given Y_i, the j-th binary feature X_ij is generated from a multivariate Bernoulli model with probability P(X_ij = 1 | Y_i = 1) = θ_1j = 0.05{j^{−0.2} p^{0.2} I(j ∈ T) + I(j ∉ T)}, j ∈ {1, · · · , p}, where I(·) is the indicator function. Note that, without loss of generality, we set T = {1, · · · , d_0}, that is, the first d_0 features are relevant. Moreover, in this simulation, the parameters in Formulas (3) and (4) are set to m = 100, d_min = 10 and d_max = 100.
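The generating mechanism above can be sketched as follows; the class-conditional probability vectors are left as inputs rather than hard-coding the study's exact θ_1j specification:

```python
import numpy as np

def simulate(n, theta1, theta0, pi=0.5, rng=None):
    """Draw (X, y) from the multivariate Bernoulli model:
    y_i ~ Bernoulli(pi), and X_ij | y_i ~ Bernoulli(theta1[j]) if y_i = 1,
    Bernoulli(theta0[j]) otherwise.  theta1/theta0 are user-supplied."""
    if rng is None:
        rng = np.random.default_rng()
    p = len(theta1)
    y = (rng.random(n) < pi).astype(int)
    # pick the class-conditional probability row-wise, then threshold
    probs = np.where(y[:, None] == 1, theta1, theta0)
    X = (rng.random((n, p)) < probs).astype(int)
    return X, y
```

A relevant feature is then one with theta1[j] ≠ theta0[j], matching the condition |θ_1j − θ_0j| ≥ ε for j ∈ T.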
The detailed simulation results are given in Table 1. In the balanced case (i.e., π = 0.5), the following results can be observed. First, if both p and n are fixed, a larger true model size d_0 leads to a larger AUC, because the more relevant features are involved, the better we can predict. Second, if both d_0 and n are fixed, a larger feature dimension p leads to worse performance in terms of AUC. This is reasonable because a larger feature dimension makes feature selection more challenging and thus worsens prediction. Third, if both p and d_0 are fixed, a larger sample size n leads to a larger AUC and a smaller FPR. This is expected because a larger sample size yields more accurate estimators and thus better prediction. Fourth, in almost all parameter settings, the AUC values of WMSD are larger than those of Chi2 and MI, which indicates that WMSD performs better than the other two methods on the simulated data. Last, for all parameter settings, the FNR values are relatively small, which indicates that WMSD can filter out most irrelevant features. The results of the unbalanced case (i.e., π = 0.8) are similar to those of the balanced case, except that for every parameter setting the FPR values are larger than in the balanced case, which implies that feature selection is harder in the unbalanced case. Table 1. Results of the simulation study. The averaged area under the receiver operating characteristic curve (AUC) values of naive Bayes (NB) and logistic regression (LR) based on the three estimated models (Chi-square statistic (Chi2), mutual information (MI), and weighted mean squared deviation (WMSD)) are reported, together with the averaged false positive rate (FPR) and false negative rate (FNR) values of WMSD, over 1000 replications.

An Application in Chinese Text Classification
The dataset is downloaded from CNKI (www.cnki.net), one of the largest academic literature platforms in China. It contains n = 14,473 abstracts of articles published in CSSCI (Chinese Social Sciences Citation Index) journals in the fields of economics and management in 2018. The abstracts are composed of p = 2385 Chinese words (words with frequencies less than 10 are ignored). Our purpose is to classify the articles into the two fields (economics or management) according to their abstracts, and to select a small number of feature words that are helpful for classification. Either economics or management is taken as class 1 (i.e., Y_i = 1), with the other as class 0 (i.e., Y_i = 0). In total, there are 8570 abstracts from economics and 5903 from management. Naive Bayes and logistic regression are both considered as standard classification methods, and the Chi-square statistic, mutual information, and WMSD are compared as feature screening methods based on these two classifiers. Note that the results of these feature screening methods are unchanged when class 1 and class 0 are exchanged.
Next, we randomly sample 10,000 abstracts as the training set and use the remainder as the testing set. To compare the feature screening methods, different numbers of selected words d (from 10 to 100 in steps of 10) are considered, and the AUC values of the two classification methods with different numbers of selected words are calculated. For each setting, a total of 200 random replications are conducted. The averaged AUC values of the two classifiers (i.e., NB and LR) over 200 replications for the three feature screening methods (i.e., Chi2, MI and WMSD) with different numbers of selected words, when economics and management are considered as class 1 respectively, are reported in Figure 1. Panel (1) of Figure 1 shows that, when the naive Bayes classifier is applied and economics is considered as class 1, the AUC values based on the three estimated models (separately selected by Chi2, MI and WMSD) increase as d becomes larger. Clearly, WMSD far outperforms the other methods when d < 50, and the methods perform similarly when d ≥ 50. Panel (2) shows a similar result when logistic regression is applied. Panels (3) and (4) of Figure 1 show that WMSD also far outperforms Chi2 and MI when d < 50 if the classes are exchanged. Furthermore, the Pearson correlation coefficient method is used to determine the estimated model size of WMSD. To calculate d̂, the parameters in Formulas (3) and (4) are set to m = 100, d_min = 20 and d_max = 100. The averaged d̂ over 200 replications is 25.86. In each replication, for the same d̂, the AUC values of NB and LR based on the three estimated models selected by Chi2, MI and WMSD are calculated separately. Figure 2 shows the boxplots of AUC for the six situations (i.e., NB+Chi2, NB+MI, NB+WMSD, LR+Chi2, LR+MI and LR+WMSD) over 200 replications.
It can be observed that, when the estimated model size is relatively small (the averaged d̂ is 25.86), WMSD performs more accurately and robustly than Chi2 and MI in terms of AUC, whether economics or management is considered as class 1. Lastly, the marginal probabilities of the top 10 words ranked by the three feature screening methods are calculated separately, based on all n = 14,473 abstracts. It can be seen from Table 2 that the probabilities of the top 10 words ranked by WMSD are larger than those of the other two methods, indicating that WMSD provides more opportunities to high-frequency words (with probabilities near 0.5): because the frequencies of almost all words are below 0.5, the frequencies of high-frequency words are the ones closest to 0.5. This validates the property of WMSD discussed in Section 2.3.

Conclusions
In this study, a novel model-free feature screening method called weighted mean squared deviation is proposed for ultrahigh dimensional binary features in binary classification; it measures the dependence between each feature and the class label. WMSD can be considered a simplified version of the Chi-square statistic and mutual information, and it provides more opportunities to features with probabilities near 0.5. Furthermore, the strong screening consistency of WMSD is investigated theoretically, the number of features is determined in practice by a Pearson correlation coefficient method, and the performance of WMSD is numerically confirmed both on simulated data and on a real example of Chinese text classification. Three potential directions are proposed for future studies. First, for multi-class classification with categorical features, the corresponding WMSD statistics need to be theoretically and numerically investigated. Second, the feature selection method via the Pearson correlation coefficient has not been theoretically verified, which is an important problem to be solved. Last, to further confirm the performance of WMSD in empirical research, it may make sense to investigate specifically the observations for which other methods give a probability near 0.5 (i.e., whose class labels are hard to predict).