Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents

The contiguous sequences of terms (N-Grams) in documents are symmetrically distributed among different classes. This symmetrical distribution raises uncertainty about the class to which an N-Gram belongs. In this paper, we focus on selecting the most discriminating N-Grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method, the symmetrical strength of the N-Grams (SSNG), is proposed using a two pass filtering based feature selection (TPF) approach. In the first pass of the TPF, the SSNG method chooses various informative N-Grams from all the N-Grams extracted from the corpus. In the second pass, the well-known Chi Square (χ2) method is used to select the few most informative N-Grams. To classify the documents, two standard classifiers, Multinomial Naive Bayes and Linear Support Vector Machine, are applied to ten standard text data sets. On most of the data sets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach are superior to those of the state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ2.

N-Gram-common, rare or sparse along with their symmetrical uncertainty towards the classes. The symmetrical information of the N-Gram NG_i ∈ X associated with class C_j ∈ Y can be represented by Fig. 1. In Fig. 1, the area contained by both circles is the joint entropy H(X, Y). The circle on the left (red and violet) is the individual entropy H(X), with the red region being the conditional entropy H(X|Y). The circle on the right (blue and violet) is H(Y), with the blue region being H(Y|X). The violet region is the symmetrical information I(X; Y).
The representation of the terms of the corpus is the basis for determining how informative the terms are for classifying text documents automatically. The Bag of Words (BOW) model is the basic model for representing terms. It is a simplified representation used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its individual words, disregarding grammar and word order but keeping multiplicity. The BOW model uses the frequency of occurrence of the terms as the basic criterion to discriminate the terms of the class documents. Its major drawback is that the order in which terms occur is ignored; only the frequency of each term is considered.
The N-gram language (NGL) model (Duoqian et al. 2009) solves this problem to some extent by considering the order in which terms occur in the sentences of the documents of each class. An N-Gram is a contiguous sequence of n terms in a given text. In the NGL model, the combinations of terms that occur together in the sentences of the documents are collected as a set. For example, suppose we have to classify the sentence "I do not like the story of the movie" as positive or negative. Since this document contains the N-Gram "like", the conventional BOW model may misclassify it as a positive document. In such cases, we need a combination of two or more terms, such as "not like" or "do not like"; such combinations are known as N-Gram words.
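The sentiment example above can be made concrete with a minimal sketch. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; they are not the preprocessing used in this study.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-term sequences of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I do not like the story of the movie"
tokens = sentence.lower().split()  # naive whitespace tokenization (assumption)

bag_of_words = Counter(tokens)     # BOW: word order is discarded, counts are kept
bigrams = ngrams(tokens, 2)        # order-preserving pairs of terms
trigrams = ngrams(tokens, 3)       # order-preserving triplets of terms

print(bag_of_words["like"])        # BOW only sees the isolated term "like"
print("not like" in bigrams)       # the Bi-Gram keeps the negation
print("do not like" in trigrams)   # so does the Tri-Gram
```

The BOW counter records "like" without its context, while the Bi-Gram and Tri-Gram lists retain the negating terms that precede it.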
This article investigates the barriers in automatic text document classification (ATDC). The contiguous sequences of terms (N-Grams) in documents are symmetrically distributed among different classes, which raises uncertainty about the class to which an N-Gram belongs. Under this symmetrical distribution, an N-Gram may be common, rare or sparse. Common N-Grams are distributed equally over all the classes, whereas rare N-Grams occur in most of the documents of one specific class. Sparse N-Grams occur infrequently in the documents of a class, and their presence or absence is not important for deciding the class label of a document. When an N-Gram is distributed over more than one class, the symmetrical information associated with all the classes has to be computed for that N-Gram. In this paper, we focus on selecting the most discriminating N-Grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method, the symmetrical strength of N-Grams (SSNG), is proposed using a two pass filtering based feature selection (TPF) approach.
The observation that two levels of filtering give better results in day-to-day problems motivated us to develop an approach that filters text document features at two levels. In the first pass, the SSNG chooses various informative N-Grams as a set NG from all the N-Grams extracted from the corpus (D), such that NG ∈ D. In the second pass, the benchmark χ2 method (Manning et al. 2008) is used to select the few most informative N-Grams (say NG[k] ∈ NG) from the set NG. The SSNG computes the symmetrical strength of the N-Grams based on four criteria: symmetrical uncertainty, membership, strength, and the nature of the N-Gram. To evaluate the performance of the SSNG using the TPF approach, we conducted a substantial number of experiments on the movie review (Pang and Lee 2004), ACL IMDB (Maas et al. 2011), Reuters13 (Forman 2003), 20Newsgroup (Joachims 1996), Ohsumed5, Ohsumed10, Ohsumed23 (Joachims 1998) and Pubmed9 data sets using two standard classifiers, Multinomial Naive Bayes (MNB) and linear Support Vector Machine (LSVM). On most of the data sets, the performance and success rate of the proposed SSNG method using the TPF approach are superior to those of the state-of-the-art methods, viz. MI, IG, OR, DFS, and χ2.
The remaining part of the paper is organized as follows: The preliminary concepts are discussed in "Preliminary concept" section. The related works are described in "Related works" section. "Proposed work" section describes the proposed work. "Results and discussions" section illustrates results and discussion. The paper is concluded in the "Conclusion" section.

Preliminary concept
The preliminary concept is discussed in this section to explain the contribution part of this study. The preliminary notations are described in Table 1.

Term representation
In this paper, we adopted the NGL model to represent the terms as a single set of N-Grams, NG, by combining the sets of Uni-, Bi-, and Tri-Grams (see Fig. 2). The set NG and its subsets NG[k] and NG[s] are generated by the Apriori algorithm.
To find the frequent terms that occur together in the sentences of the documents of each class, a two-step process, join and prune, is employed.
1. The join step: This step generates a new list of terms L_k by joining the set L_{k−1} with itself, i.e., L_{k−1} ⋈ L_{k−1}. E.g., if L_k is the set of Bi-Grams, L_k = {t_1 t_2, ..., t_{m−1} t_m}, it is generated by forming the ordered pairs (t_{m−1}, t_m), where t_{m−1}, t_m ∈ L_{k−1}, from the Uni-Gram set L_{k−1} = {t_1, t_2, ..., t_m}. Similarly, the set of Tri-Grams L_{k+1} consists of the ordered triplets of terms of L_{k−1}, i.e., L_{k+1} = {t_1 t_2 t_3, ..., t_{m−2} t_{m−1} t_m}. Finally, the set NG is generated by taking the union of the Uni-, Bi-, and Tri-Gram sets, i.e., NG = L_{k+1} ∪ L_k ∪ L_{k−1}.
2. The prune step: This step eliminates some of the unimportant N-Grams from the set NG by using a threshold value. Here, the elimination is based on the weight of the N-Gram. The proposed SSNG + χ2 method is used to select the most informative N-Gram set NG[k], such that NG[k] ⊂ NG.
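The join and prune steps can be sketched as follows. This is only an illustration under stated assumptions: the N-Grams are enumerated directly as contiguous sequences rather than through a literal self-join, raw frequency stands in for the N-Gram weight (the study uses SSNG + χ2 weights), and the function names, sentences and threshold are hypothetical.

```python
from collections import Counter

def contiguous_ngrams(tokens, n):
    """Contiguous n-term sequences of one tokenized sentence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ng(sentences, max_n=3):
    """Join step (sketch): count Uni-, Bi-, and Tri-Grams and take their union."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            counts.update(contiguous_ngrams(tokens, n))
    return counts

def prune(weights, threshold):
    """Prune step (sketch): keep N-Grams whose weight reaches the threshold."""
    return {ng for ng, weight in weights.items() if weight >= threshold}

# Hypothetical tokenized sentences, for illustration only.
sentences = [
    "penalty shootout decided the match".split(),
    "the penalty shootout was tense".split(),
]
ng_counts = build_ng(sentences)           # NG with frequency as a stand-in weight
frequent = prune(ng_counts, threshold=2)  # N-Grams seen at least twice
print(sorted(frequent))
```

Any other weighting function can be passed to `prune` in place of the raw counts, since it only expects a mapping from N-Gram to weight.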

Related works
In the literature, many researchers have contributed significantly in this direction and compared their core contributions with state-of-the-art methods, viz. MI, IG, OR, DFS, χ2 and TF-IDF. We briefly describe these methods in this section.
The Mutual Information (MI) concept (Manning et al. 2008; Joachims 1998) is borrowed from information theory, where it measures the dependency between random variables; here it is used to measure the information contained in an N-Gram NG_i ∈ NG (see Eq. 1). It is strongly influenced by the marginal probabilities of the N-Grams and assigns higher weight to rare N-Grams than to common and sparse N-Grams. Therefore, the weights are not comparable for N-Grams with widely differing frequencies (Wang et al. 2014; Yang and Pedersen 1997).
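The bias of MI towards rare N-Grams can be seen in a small sketch. Eq. 1 is not reproduced in this text, so the textbook pointwise form log(P(NG, C) / (P(NG) P(C))) is assumed here; the function name, the argument names and the counts are hypothetical.

```python
import math

def pointwise_mi(a, b, c, d):
    """Pointwise MI from 2x2 document counts.

    a: documents of the class that contain the N-Gram
    b: documents of other classes that contain the N-Gram
    c: documents of the class without the N-Gram
    d: documents of other classes without the N-Gram
    """
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the N-Gram never co-occurs with the class
    p_joint = a / n
    p_ngram = (a + b) / n
    p_class = (a + c) / n
    return math.log2(p_joint / (p_ngram * p_class))

# Hypothetical counts over 12 documents, 6 per class.
rare = pointwise_mi(a=3, b=0, c=3, d=6)    # occurs in one class only
common = pointwise_mi(a=6, b=6, c=0, d=0)  # occurs in every document
print(rare, common)
```

The N-Gram confined to one class receives a positive score, while the N-Gram present in every document scores zero, which matches the behaviour described above.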
Table 1 Preliminary notations (only the description column survives)
The count of other N-Grams N̄G_i that occur in the documents of class C_j
The count of other N-Grams t̄_i that occur in the documents of other classes C̄_j
The probability of the N-Gram NG_i when it co-occurs with class C_j
The probability of other N-Grams t̄_i when they co-occur with class C_j
The probability of the N-Gram NG_i when it co-occurs with other classes C̄_j
The probability of other N-Grams t̄_i when they co-occur with other classes C̄_j
The probability of class C_j when the N-Gram NG_i co-occurs with class C_j
The probability of class C_j when other N-Grams N̄G_i co-occur with class C_j
The probability of other classes C̄_j when the N-Gram NG_i co-occurs with other classes C̄_j
The probability of other classes C̄_j when other N-Grams N̄G_i co-occur with other classes C̄_j

The Information Gain (IG) is a measure of the reduction in entropy for the N-Grams when they are separated into different classes. The IG assigns higher weight to common N-Grams distributed in many categories than to rare N-Grams. The IG is also known as average MI. Its computation includes the estimation of the conditional probabilities of a category given an N-Gram and the corresponding entropy (see Eq. 2). It is the difference between the original information requirement (i.e., based on the proportion of classes) and the new requirement (i.e., obtained after partitioning on the N-Gram NG_i) (Wang et al. 2014; Uysal and Gunal 2012; Forman 2003; Yang and Pedersen 1997; Lewis and Ringuette 1994).
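As a rough illustration of the entropy difference described above, the following sketch computes IG for a two-way class split from a 2x2 count table. Eq. 2 is not reproduced in this text, so the textbook form H(C) − H(C | N-Gram present or absent) is assumed; the function names and counts are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(a, b, c, d):
    """IG from 2x2 document counts.

    a: in class, N-Gram present     b: other classes, N-Gram present
    c: in class, N-Gram absent      d: other classes, N-Gram absent
    """
    n = a + b + c + d
    h_class = entropy([(a + c) / n, (b + d) / n])  # original requirement
    h_cond = 0.0                                   # requirement after partitioning
    for in_cls, out_cls in ((a, b), (c, d)):       # N-Gram present, then absent
        total = in_cls + out_cls
        if total:
            h_cond += (total / n) * entropy([in_cls / total, out_cls / total])
    return h_class - h_cond

# Hypothetical counts over 12 documents, 6 per class.
perfect = information_gain(6, 0, 0, 6)  # presence fully determines the class
useless = information_gain(3, 3, 3, 3)  # presence says nothing about the class
print(perfect, useless)
```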

Fig. 2 The most informative frequent N-Grams mining
The Odds ratio (OR) was originally proposed by Rijsbergen (1979) to select N-Grams for relevance feedback. The OR method is a one-sided local feature selection method (Uysal 2016). It is the ratio of the odds of an N-Gram NG_i occurring in a class C_j to its odds of occurring in the other classes C̄_j (see Eq. (3)). It is based on the assumption that the distribution of features in relevant documents differs from that in non-relevant documents. Mladenic and Grobelnik (1999) used the OR method and achieved the highest F1-measure using the MNB classifier. Uysal and Gunal (2012)
Mathematically, the Chi-square test (Manning et al. 2008) is used to determine the independence of the term NG_i and the class C_j during feature selection (see Eq. 5). The χ2 method assigns higher weight to common N-Grams than to rare N-Grams. It is better than MI because it assigns normalized weights to the terms; therefore, χ2-weighted terms are comparable within the same category. However, this normalization breaks down for low-frequency terms, for which χ2 is not reliable (Wang et al. 2014; Yang and Pedersen 1997).
Guo et al. (2009) achieved 83.0 % F1 using a self-switching classifier, compared with 67.7 and 74.7 % F1 using SVM and MNB, on the 20Newsgroup dataset (10 categories were taken). On the Ohsumed15 dataset, this self-switching classifier attains 73.9 % F1, compared with 70.2 and 70.9 % using SVM and MNB. Rehman et al. (2015) achieved a peak macro F1 of 21.07 % (for 1500 features) using LSVM on the Ohsumed23 dataset. On the 20Newsgroup dataset, their proposed method attains 74.38 % macro F1 and 75.54 % micro F1 using LSVM, and 72.99 % macro F1 and 73.10 % micro F1 using MNB.
Uysal (2016) proposed an improved global feature selection scheme for text classification. It is an ensemble method that combines the power of two filter-based methods, a global one and a one-sided local one. By incorporating both, the feature set represents the classes almost equally. This method outperforms the individual feature selection methods.
Sharma and Dey (2012) extensively reviewed the sentiment classification problem and described the year-wise research findings, models and accuracies reported on review datasets. The maximum accuracy achieved by the authors on the movie review dataset was 95 %.
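The χ2 independence test described in this section can be sketched from a 2x2 count table. Eq. 5 is not reproduced in this text, so the widely used closed form N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)) is assumed; the function name and the counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of an N-Gram and a class from 2x2 document counts.

    a: in class, N-Gram present     b: other classes, N-Gram present
    c: in class, N-Gram absent      d: other classes, N-Gram absent
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # a marginal is empty, so no dependence can be measured
    return n * (a * d - c * b) ** 2 / denominator

# Hypothetical counts over 12 documents, 6 per class.
dependent = chi_square(6, 0, 0, 6)    # N-Gram occurs exactly in the class
independent = chi_square(3, 3, 3, 3)  # N-Gram is spread evenly
print(dependent, independent)
```

A score of zero means that the N-Gram and the class look independent; larger scores indicate stronger dependence.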

The SSNG method
The symmetrical strength of the N-Gram (NG_SSNG) is based on four criteria: symmetrical uncertainty (NG_SU), membership (NG_Mem), strength (NG_Strength), and the nature of the terms (NG_RCST).

The Symmetrical Uncertainty of the N-Grams (NG_SU)
The ratio of the information gain of the ith N-Gram NG_i for the class C_j to the sum of the probabilities of NG_i and class C_j reduces the symmetrical uncertainty of the N-Gram. If the information gain of NG_i is very high because of the high frequency of a common or sparse N-Gram, dividing it by the sum of the probabilities of the N-Gram and the class reduces it to a smaller value (see Eq. (7)).

The Membership of the N-Gram in a class (NG_Mem)
The degree to which an N-Gram belongs to a specific class is referred to as the membership of the N-Gram. A probabilistic ratio of success to failure is computed to evaluate whether or not the N-Gram belongs to a specific class (see Eq. (8)).
According to the criteria used by Uysal and Gunal (2012), an N-Gram present in only one class is more important than others. The minimum frequency of such an N-Gram in a class is zero, so for such N-Grams the division in Eq. (8) produces an undefined number. Therefore, a very small number ε, close to zero but not zero (0 < ε ≤ 0.5), has been added to the numerator and the denominator of Eq. (8) to avoid the division-by-zero error.
Eq. (8) for computing the membership of NG_i in a class C_j is similar to the OR (see Eq. (3)). In two-class problems, the OR assigns equal positive and negative weights to the N-Gram NG_i for the class C_j and the other class C̄_j, owing to its one-sided weight computation. In multi-class problems, the weights assigned by the OR are not equal for all the classes, but because of its one-sided nature the positive and negative weights of the N-Gram for different classes have less discriminating power. The extra ε has been added in the OR method before taking the logarithm to boost the score of N-Grams that are present in only one class.
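The role of ε can be illustrated with a small sketch. Eq. (8) is not reproduced in this text, so the form below, a log odds ratio with ε added to both the numerator and the denominator, is an assumption based on the description above; the function name, the argument names and the value ε = 0.01 are hypothetical.

```python
import math

def membership(p_in_class, p_in_others, eps=0.01):
    """Smoothed log odds ratio (assumed form, not the paper's exact Eq. 8).

    p_in_class:  probability of the N-Gram in the class of interest
    p_in_others: probability of the N-Gram in the other classes
    eps:         small constant, 0 < eps <= 0.5, that avoids division by zero
    """
    numerator = p_in_class * (1.0 - p_in_others) + eps
    denominator = (1.0 - p_in_class) * p_in_others + eps
    return math.log(numerator / denominator)

only_in_class = membership(0.5, 0.0)   # absent from the other classes
only_in_others = membership(0.0, 0.5)  # absent from the class of interest
evenly_spread = membership(0.3, 0.3)   # equally likely everywhere
print(only_in_class, only_in_others, evenly_spread)
```

Without ε the first two calls would divide by zero or take the logarithm of zero; with it they return finite scores of opposite sign, and the evenly spread N-Gram scores zero.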

The Strength of an N-Gram (NG_Strength)
It is an improvement of the standard mutual information method (Forman 2003) (see Eq. 1; the notation is given in Table 1). In the computation of NG_Strength of the term NG_i, each logarithmic quantity is multiplied by the total occurrence of NG_i in the documents of class C_j and of the other classes C̄_j (see Eq. 9).

The nature of the N-Gram (NG_RCST)
The absolute difference between the probabilities of the class C_j and of the other classes C̄_j when the ith N-Gram NG_i is present determines whether the N-Gram is rare, common, or sparse (see Eq. (10)).
If the NG_RCST value of the ith N-Gram NG_i is zero or very small, then NG_i occurs either equally or infrequently in the documents of all the classes, i.e., the N-Gram is either common or sparse. If the NG_RCST value is high, then NG_i occurs more in one category than in the others. Common and sparse N-Grams have a low membership value for a specific class and contribute less to discriminating the class of a document, whereas rare N-Grams have a high membership value for a specific class and contribute more. We observed from an extensive number of experiments that the cube of (NG_SU + NG_Mem + NG_Strength), rather than the square or the fourth power, gives maximum accuracy. The fourth power of NG_RCST reduces the weight of common and sparse N-Grams to nearly zero, whereas it makes the weight of rare N-Grams very high in comparison with the benchmark methods. Therefore, the most informative rare N-Grams are selected and the uninformative common and sparse N-Grams are eliminated, provided the threshold retains only the top-ranked N-Grams. The concept is explained further in the "Illustration of the SSNG using example datasets" section using the two example datasets shown in Tables 2 and 5.

Illustration of the SSNG using example datasets
To further illustrate this concept, consider the example dataset shown in Table 2. We illustrate the weight calculation of the SSNG method for the four N-Grams {"penalty corner", "penalty shootout", "beautifully", "play"} of this example dataset. We assume that the N-Grams are contained in twelve documents of a balanced dataset with two classes, each class having six documents (see Table 3). Table 4 shows the confusion matrix of the N-Gram "penalty shootout" for its presence or absence in class C_1 or C_2.
Table 2 (only the class C2 columns survive)
"penalty shootout"  1 2 0 0 0 1
"penalty corner"    0 0 0 0 1 0
"beautifully"       1 0 1 2 0 0
"play"              0 0 2 0 1 1
The computation of the weight for the N-Gram "penalty shootout" is as follows:
1. The symmetrical uncertainty is computed using Eq. (7).
2. The strength of the N-Gram for class C1 and the other class C2 is computed using Eq. (9).
3. The membership of the N-Gram for class C1 and C2 is computed using Eq. (8).
4. The nature of the N-Gram for class C1 and C2 is computed using Eq. (10).
6. Finally, we compute the total contribution of the N-Gram in the classification of text documents.
In this study, we have two main objectives. The first is to assign the highest weight to rare N-Grams like "penalty shootout", which appears only in class "C2", and "penalty corner", which appears in 4 documents of class "C1" and only once in a document of class "C2". The second is to assign very little weight to common N-Grams like "beautifully" and "play". Here "beautifully" is more informative than "play", because the document frequency of "beautifully" in class "C1" is 6 whereas that of "play" is only 4; the document frequencies of both N-Grams in class "C2" are equal to 3. The SSNG method also assigns very little weight to sparse N-Grams. The SSNG method assigns the highest weight, 1337.6302, to the N-Gram "penalty shootout". The other feature selection methods also give a high score to this N-Gram, but the weight computed by the SSNG is much higher. The same calculation gives the SSNG scores "penalty corner" = 20.7158, "play" = 0.0004, and "beautifully" = 0.3527 (see Table 4). This example dataset is not normalized because it is very small and contains only four N-Grams in the 12 documents of the two classes. For real datasets, the term weights are normalized using the TF-IDF weight before further processing.
The main aim of taking the cube of (NG_SU + NG_Mem + NG_Strength) is clear from the computational process of the SSNG. The power of this quantity should be an odd number (i.e., 1, 3, 5, …), because an even power would make the weight of an N-Gram positive for those classes where it is currently assigned a negative value, i.e., the classes for which its discriminating power is low. A combination of positive and negative weights captures the discriminating power of an N-Gram more appropriately than positive weights alone. For example, if a rare N-Gram is present in a specific class C_j and absent from the other classes, a positive value for the other classes C̄_j creates ambiguity and diminishes its discriminating power. Further, if we choose the power as one, our objectives are not fulfilled and the weights are similar to those computed by the state-of-the-art methods. If we select a power greater than three, the weights of rare N-Grams become excessively high, as they are already high with a power of three.
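The sign argument above can be checked with a few lines. The value −2.0 is a hypothetical stand-in for a negative sum (NG_SU + NG_Mem + NG_Strength) of an N-Gram in a class it does not support, and the function name is likewise hypothetical.

```python
def raise_weight(combined, power):
    """Raise the combined criterion sum to the given power."""
    return combined ** power

negative_evidence = -2.0  # hypothetical sum for a class the N-Gram does not support
for power in (1, 2, 3, 4):
    # Odd powers keep the negative sign; even powers turn it positive.
    print(power, raise_weight(negative_evidence, power))
```

Only the odd powers preserve the negative sign, and the cube additionally widens the gap between small and large magnitudes compared with a power of one.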
Similarly, (NG_RCST)^4 measures the representation ability of the N-Gram for a class compared with the other classes. It assigns the highest weight to rare, less weight to common, and very little weight to sparse N-Grams. Suppose we have four N-Grams NG_i, NG_j, NG_k and NG_l of an example dataset shown in Table 5. The nature of NG_i is common, and the other N-Grams NG_j, NG_k and NG_l are rare, very rare, and sparse, respectively. The representation ability of NG_i for class C_1 is 2.3 and for the other classes C̄_1 is 2.25 (see Table 5). The absolute difference between the representation ability of NG_i for a class C_j and for the other classes C̄_j is computed to identify the discriminating nature of NG_i in ATDC. In this particular case, the absolute difference is |2.3 − 2.25| = 0.05. The fourth power (0.05)^4 is very small in comparison with (0.05)^1, (0.05)^2, and (0.05)^3. The fourth power has reduced the weight of common and sparse N-Grams to nearly zero, whereas it has increased the weight of the rare N-Grams four times (see Table 5). Therefore, to fulfill our objectives of assigning very little weight to common and sparse N-Grams and the highest weight to rare N-Grams, we have taken the power in (NG_RCST)^4 as four.
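The effect of the fourth power can be reproduced numerically. The values 2.3 and 2.25 come from the text above; the second pair (3.0 and 1.0) is a hypothetical wider gap added for contrast, and the function name is an assumption.

```python
def rcst_factor(rep_class, rep_others, power=4):
    """Absolute difference in representation ability, raised to a power."""
    return abs(rep_class - rep_others) ** power

# Common N-Gram from the text: |2.3 - 2.25| = 0.05.
for power in (1, 2, 3, 4):
    print(power, rcst_factor(2.3, 2.25, power))

# Hypothetical N-Gram with a wide gap between the class and the others.
print(rcst_factor(3.0, 1.0))
```

A difference below one shrinks rapidly as the power grows, while a difference above one is amplified, which is the separation between common and rare N-Grams that the fourth power is meant to produce.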
We observed that the weight assignment of MI, IG, DFS, and χ2 behaves as described in the literature. The MI gives the highest weight to rare N-Grams like "penalty shootout" and "penalty corner", but very little weight (near zero) to the common N-Grams "beautifully" and "play", which is the cause of its low performance. Similarly, the IG assigns the highest weight to "penalty corner" instead of "penalty shootout" and gives more weight to "play" than to "beautifully", owing to its bias towards terms distributed in many categories. Although it performs considerably better than MI, it performs slightly worse than SSNG and χ2.
The DFS assigns the highest weight to the rarest N-Grams and the minimum weight to common N-Grams, in the range from 0.5 to 1. This method is best suited to document frequency based weight computations, but does not perform well with term frequency based weight computations. The weight assignment of the χ2 method based on term frequency is similar to that of the SSNG (see Table 4). This is the main reason for selecting the χ2 method to filter the SSNG-weighted terms at the second stage.

The TPF approach
To measure the importance of an N-Gram, the SSNG method is applied using the TPF approach. The TPF approach is given in Algorithm 1, which works as follows:
1. The corpus D is divided into two subsets, D_train and D_test, in line 1.

Data set
In this study, we experimented with ten standard text data sets, including movie reviews, 20Newsgroup, Reuters13, Ohsumed23 and Ohsumed10. We also worked on the Pubmed9 dataset, which consists of nine categories. A detailed summary of the data sets used in the study is given in Table 6. The movie reviews dataset (http://www.nltk.org/%24nltk%5fdata%24/) was prepared by Pang and Lee (2004) and contains movie reviews collected from http://www.imbdb.com (Internet Movie Database; see http://www.cs.cornell.edu/People/pabo/movie-review-data/). This dataset has been used as a benchmark by many researchers, and it is also known as the polarity dataset v2.0 or the Cornell Movie Review Dataset. It contains 1000 positive and 1000 negative reviews and poses a two-class problem (Sharma and Dey 2012; Pang and Lee 2004). The ACL IMDB movie review dataset is a very large dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing (Maas et al. 2011).
The 20Newsgroups (20ng) dataset contains newsgroup documents from 20 different classes (Joachims 1996). The original owner of this dataset was Mitchell (1997). It is known for its large size and balanced classes, and consists of 20,000 messages taken from 20 newsgroups. The Reuters dataset is the most widely used dataset for text classification. Reuters13 is a subset of the Reuters dataset as used by Forman (2003). It consists of 13 of the 90 classes of the original Reuters dataset.
The Ohsumed dataset is challenging because of its very high sparsity (Joachims 1998). It has 23 classes of documents, which are combinations of titles and abstracts taken from Pubmed. We partitioned this dataset into four sub data sets, Ohsumed5, Ohsumed10, Ohsumed15, and Ohsumed23, which contain 5, 10, 15 and 23 classes of articles, respectively.
The Pubmed9 dataset used in the experimental study is similar in structure to the Ohsumed dataset. It contains documents of nine classes. Each document is a combination of an abstract with its title. All the documents were automatically extracted from the Pubmed website using the Entrez software utilities in the R environment. The nine classes of documents in this data set are bird flu, swine flu, proteins, cancer, Bacterial Pneumonia, Fungal Pneumonia, Viral Pneumonia, Idiopathic interstitial pneumonia, and Legionnaires. Each class contains 5000 documents.
The BBC dataset consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from the years 2004-2005. It has 5 class labels, viz. business, entertainment, politics, sport, and tech (Greene and Cunningham 2006).
The BBC_Sports dataset (Greene and Cunningham 2006) consists of 737 documents from the BBC Sport website, corresponding to sports news articles in five topical areas from the years 2004-2005. There are 5 class labels in this dataset, viz. athletics, cricket, football, rugby, and tennis.

Experimental setup
All the experiments were carried out on a machine with a Core i7 2.4 GHz processor and 8 GB RAM, running Ubuntu 14.04 (64-bit). We used R 3.1.2 to automatically extract articles from the Pubmed website, and MySQL 5.6 to store the information related to the articles in a database.
The ATDC process, comprising tokenization, preprocessing of the words of the corpus (T), feature extraction (NG ⊃ T), feature selection (NG[k] ⊂ NG and NG[s] ⊂ NG[k]), and statistical analysis, was performed in Python 2.7 with the nltk, scipy, numpy, ipython notebook, scikit-learn and matplotlib packages, among others. To prepare the Pubmed9 dataset, we used the Entrez software utility to fetch the PubMed articles from the NCBI web page.
We experimented on ten standard datasets along with the Pubmed9 dataset. The Apriori algorithm based TPF approach was used to select the most informative N-Grams. Initially, the corpus D was divided into two subsets, training (D_train) and test (D_test); the sentences of the documents were tokenized into tokens (t_p); and web links, punctuation marks, stop words, and white spaces were removed. The set of N-Grams NG was then generated. Next, we chose k informative N-Grams (NG[k] ⊂ NG). In the first pass of the TPF approach, we chose k as 500, 1000, 2000, 3000, 5000, 10,000, 15,000, and 20,000, and the feature selection methods, viz. MI, IG, OR, DFS, χ2 and SSNG, were applied to select the k informative N-Grams. In the second pass, we applied the χ2 method, which further filters the 500, 1000, 2000, 3000, 5000, 10,000, 15,000, and 20,000 N-Grams and selects the most informative N-Grams (NG[s] ⊂ NG[k]), based on the maximum accuracy attained by the MNB and LSVM classifiers.
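The two passes described above can be sketched as a pair of top-k selections. This is a simplified illustration, not Algorithm 1 itself: the function names are hypothetical, the scores are made-up numbers standing in for SSNG and χ2 weights, and s is fixed by hand here rather than chosen by classifier accuracy.

```python
def top_k(scores, k):
    """Return the k highest-scoring N-Grams (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [ng for ng, _ in ranked[:k]]

def two_pass_filter(first_pass_scores, second_pass_scorer, k, s):
    """First pass keeps NG[k]; second pass rescores it and keeps NG[s], s <= k."""
    ng_k = top_k(first_pass_scores, k)
    second_scores = {ng: second_pass_scorer(ng) for ng in ng_k}
    return top_k(second_scores, s)

# Hypothetical weights, for illustration only.
ssng_like = {"penalty shootout": 9.0, "penalty corner": 5.0,
             "beautifully": 0.4, "play": 0.1}
chi2_like = {"penalty shootout": 12.0, "penalty corner": 3.0,
             "beautifully": 1.0, "play": 4.0}

selected = two_pass_filter(ssng_like, chi2_like.get, k=3, s=2)
print(selected)
```

Note that "play" has a high second-pass score in this toy data but cannot be selected, because the second pass only sees what survived the first.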

Results and discussions
The experimental results are compared using the maximum accuracy achieved by the MNB and LSVM classifiers, based on the most informative N-Grams (NG[s] ⊂ NG[k] ⊂ NG) selected using MI + χ2, IG + χ2, OR + χ2, DFS + χ2, χ2 + χ2, and SSNG + χ2. We performed eight experimental trials for each of the two classifiers, one for each number of selected N-Grams: 500, 1000, 2000, 3000, 5000, 10,000, 15,000, and 20,000. There are therefore sixteen experimental trials in total for each dataset. The success rate of the classifiers on each dataset is based on these experimental trials.
On the movie review dataset, the accuracy of the MNB classifier depends on the number of features: it peaks at 98.4 % for 10,000 features (see Table 7), then decreases and remains constant (see Fig. 3). With LSVM, the SSNG attains its highest accuracy, 95.8 %, for 3000 and 5000 features (see Table 7), then decreases and remains constant (see Fig. 4). The success rate of the SSNG based on the TPF approach on the movie review dataset is 56.25 %, because the SSNG + χ2 method outperformed the other methods in 9 of the 16 experiments.
On the ACL IMDB dataset, the accuracy of the MNB classifier peaks at 89.81 % for 20,000 features (see Table 7), then decreases and remains constant (see Fig. 5). With LSVM, the SSNG attains its highest accuracy, 89.94 %, for 15,000 features (see Table 7), then decreases and remains constant (see Fig. 6). The success rate of the SSNG on the ACL IMDB large movie review dataset is 68.75 %, because the SSNG + χ2 method outperformed the other methods in 11 of the 16 experiments.
On the Ohsumed5 dataset, the accuracy of the MNB classifier peaks at 84.03 % for 1000 features (see Table 7), then decreases and remains constant (see Fig. 7). With LSVM, the SSNG attains its highest accuracy, 86.24 %, for 3000 and 10,000 features (see Table 7), then decreases and remains constant (see Fig. 8). The success rate of the SSNG on the Ohsumed5 dataset is 93.75 %, because the SSNG + χ2 method outperformed the other methods in 15 of the 16 experiments.
On the Ohsumed10 dataset, the accuracy of the MNB classifier peaks at 67.32 % for 2000 features (see Table 7), then decreases and remains constant (see Fig. 9). With LSVM, the SSNG attains its highest accuracy, 70.18 %, for 15,000 features (see Table 7), then decreases and remains constant (see Fig. 10). The success rate of the SSNG method based on the TPF approach on the Ohsumed10 dataset is 87.5 %, because the SSNG + χ2 method outperformed the other methods in 14 of the 16 experiments.
On the Ohsumed15 dataset, the accuracy of the MNB classifier peaks at 43.91 % for 2000 features (see Table 7), then decreases and remains constant (see Fig. 11). With LSVM, the SSNG attains its highest accuracy, 65.75 %, for 10,000 features (see Table 7), then decreases and remains constant (see Fig. 12). The success rate of the SSNG on the Ohsumed15 dataset is 93.75 %, because the SSNG + χ2 method outperformed the other methods in 15 of the 16 experiments.
On the Ohsumed23 dataset, the accuracy of the MNB classifier peaks at 43.91 % for 2000 features (see Table 7), then decreases and remains constant (see Fig. 13). With LSVM, the SSNG attains its highest accuracy, 48 %, for 15,000 features (see Table 7), then decreases and remains constant (see Fig. 14). The success rate of the SSNG on the Ohsumed23 dataset is 93.75 %, because the SSNG + χ2 method outperformed the other methods in 15 of the 16 experiments.
On the Pubmed9 dataset, the accuracy of the MNB classifier peaks at 73.84 % for 5000 features (see Table 7), then decreases and remains constant (see Fig. 15). With LSVM, the peak accuracy of the SSNG is reported in Table 7, after which the accuracy decreases and remains constant (see Fig. 16). The success rate of the SSNG on the Pubmed9 dataset is 68.75 %, because the SSNG + χ2 method outperformed the other methods in 11 of the 16 experiments.
On the 20Newsgroup dataset, the accuracy of the MNB classifier peaks at 95.6 % for 500 features (see Table 7), then decreases and remains constant for more than 500 features (see Fig. 17). With LSVM, the SSNG attains its highest accuracy, 95.8 %, for 3000 and 5000 features (see Table 7), then decreases and remains constant (see Fig. 18). The success rate of the SSNG method on the 20Newsgroup dataset is 75 %, because the SSNG + χ2 method outperformed the other methods in 12 of the 16 experiments.
On the Reuters13 dataset, the accuracy of the MNB classifier peaks at 71.59 % for 500 features (see Table 7), then decreases and remains constant (see Fig. 19). With LSVM, the SSNG attains its highest accuracy, 78.52 %, for 2000 features (see Table 7), then decreases and remains constant (see Fig. 20). The success rate of the SSNG on the Reuters13 dataset is 62.5 %, because the SSNG + χ2 method outperformed the other methods in 10 of the 16 experiments.
On the BBC dataset, the accuracy of the MNB classifier peaks at 99.28 % for 1000 and 5000 features (see Table 7), then decreases and remains constant (see Fig. 21). With LSVM, the SSNG attains its highest accuracy, 99.64 %, for 10,000 features (see Table 7), then decreases and remains constant (see Fig. 22). The success rate of the SSNG on the BBC dataset is 68.75 %, because the SSNG + χ2 method outperformed the other methods in 11 of the 16 experiments.
On the BBC_Sports dataset, the accuracy of the MNB classifier peaks at 98.39 % for 500, 1000, and 2000 features (see Table 7), then decreases and remains constant (see Fig. 23). With LSVM, the SSNG attains its highest accuracy, 100 %, for 500, 1000, and 3000 features (see Table 7), then decreases and remains constant (see Fig. 24). The success rate of the SSNG on the BBC_Sports dataset is 87.5 %, because the SSNG + χ2 method outperformed the other methods in 14 of the 16 experiments.
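The success rates quoted above follow directly from the number of trials won out of sixteen. The sketch below recomputes them from the win counts stated in the text; the function and variable names are hypothetical.

```python
def success_rate(wins, trials=16):
    """Percentage of experimental trials in which SSNG + chi-square performed best."""
    return 100.0 * wins / trials

# Win counts as stated in the text (out of 16 trials per dataset).
wins_by_dataset = {
    "movie review": 9, "ACL IMDB": 11, "Ohsumed5": 15, "Ohsumed10": 14,
    "Ohsumed15": 15, "Ohsumed23": 15, "Pubmed9": 11, "20Newsgroup": 12,
    "Reuters13": 10, "BBC": 11, "BBC_Sports": 14,
}
rates = {name: success_rate(wins) for name, wins in wins_by_dataset.items()}
for name, rate in rates.items():
    print(f"{name}: {rate} %")
```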
In the experimental study, we observed that:
1. The accuracy of the classifiers was found to be optimal when the power of (NG_SU + NG_Mem + NG_Strength) was three and the power of NG_RCST was four.
2. As can be observed from Table 7, the proposed TPF-based SSNG + χ2 gave the highest accuracy on nine datasets (movie review, ACL IMDB, Ohsumed5, Ohsumed10, Ohsumed15, Ohsumed23, Pubmed9, BBC, and BBC_Sports), while on the other two datasets, 20Newsgroup and Reuters13, χ2 + χ2 gave the highest accuracy using MNB.

Conclusion
In this paper, a new text feature selection method, the symmetrical strength of N-Grams (SSNG), has been introduced. It improves the performance of the classifiers by assigning the highest weight to the most informative N-Grams and the least weight to the non-informative N-Grams. The SSNG computes the weight of the N-Grams based on four probabilistic criteria: the symmetrical uncertainty, membership, strength, and the nature of the N-Grams. Further, the two pass filtering (TPF) based feature selection approach has been used to reduce the high dimensionality of the text data. In addition, we have discussed the problem related to the representation of terms using the well-known BOW model, and followed the NGL model to generate the N-Grams to solve this problem. The NGL model initially extracts a larger number of features; however, this is essential to achieve high performance in terms of accuracy and F1-measure. The Apriori algorithm has been applied to prune the non-informative N-Grams.