Comparative Analysis of Document-Level Text Classification Algorithms Using R

Over the past few decades, tremendous volumes of data, both structured and unstructured, have become available on the Internet. This exponential growth of information creates an urgent need for text classifiers. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. To handle this situation, a wide range of supervised learning algorithms has been introduced. Among these, K-Nearest Neighbor (KNN) is one of the simplest and most efficient classifiers in the text classification family, but it suffers from imbalanced class distributions and noisy term features. To cope with this challenge, we use document-based centroid dimensionality reduction (CentroidDR) implemented in R. By combining these two text classification techniques, the KNN and Centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN, which performs substantially better than CenKNN.


II. Pre-processing using R
Text mining plays a vital role in this fast-growing research area. Preprocessing techniques are applied to the data set to reduce its size, which increases effectiveness. Although preprocessing is a somewhat time-consuming process, it ultimately plays a crucial role in the quality of the analysis. R is open-source software originally developed at Bell Labs; many statistical functions are built in, the core is implemented in C and Fortran, and the language was inspired by the S environment and is extended via packages. Once all documents are loaded properly, they can be processed using the tm package in R. The tm package allows you to remove punctuation, capitalization, common words, and numbers, which shrinks the text data and improves the efficiency of the system.
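As a minimal sketch of the loading step, assuming the plain-text documents sit in a local folder (the path "./texts" here is hypothetical):

library(tm)
# Read every text file in the (hypothetical) folder ./texts into a corpus
docs <- Corpus(DirSource("./texts", encoding = "UTF-8"))
inspect(docs)  # summary of the loaded documents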

Converting to lowercase:
If we want a word to be treated identically every time it appears, we need to convert everything to lowercase using the following command:

docs <- tm_map(docs, tolower)

Removing stop words:
"Stop words" are common words that usually have no analytic value. Every document contains plenty of these words (a, and, also, the, etc.); they are frequent by their nature and will distort the analysis if they remain in the text.
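A minimal sketch of the whole cleaning pipeline described above; note that recent versions of tm require wrapping base functions such as tolower in content_transformer:

docs <- tm_map(docs, content_transformer(tolower))       # lowercase everything
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words
docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeNumbers)                      # strip numbers
docs <- tm_map(docs, stripWhitespace)                    # collapse extra spaces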

III. CentroidDR
Nowadays, dimensionality reduction is a challenging task when handling massive amounts of data. Due to the rapid growth of the World Wide Web, several classification techniques aim to extract features by projecting the original real-world high-dimensional data into a lower-dimensional space through algebraic transformations, but these techniques require a huge amount of memory and CPU resources. So we use a linear technique called CentroidDR [6], document-based Centroid Dimensionality Reduction. CentroidDR [2] is one of the most efficient and scalable techniques; it projects high-dimensional documents into a low-dimensional space spanned by document centroids. CentroidDR [7] mainly uses document centroids to reduce the dimensionality of documents. Details of CentroidDR are given as follows:

Algorithm 1 (CentroidDR)
Input: Target set of training documents D.
Output: Projected data D* = {(x̂_1, y_1), (x̂_2, y_2), ..., (x̂_N, y_N)}, where x̂_i ∈ R^l and y_i ∈ {C_1, C_2, ..., C_l} for i = 1, 2, ..., N.
Step 1: CentroidDR first computes the centroids of all documents. A centroid is the mean representation vector of the documents in each class, as detailed in the formula below, which can be generated very efficiently:

\[ \mathrm{centroid}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x, \qquad i = 1, 2, \ldots, l \]

Here centroid_i denotes the centroid of the documents in class C_i, |C_i| denotes the number of documents in class C_i, and l is the number of classes.
Step 2: Project each training document x ∈ R^n onto the space spanned by the l centroids and normalize it, yielding x̂ ∈ R^l whose j-th component measures the similarity between x and centroid_j.
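As a minimal R sketch of Algorithm 1 (the names dtm and labels are illustrative; dtm is assumed to be an ordinary numeric matrix, e.g., as.matrix(DocumentTermMatrix(corpus)), with one row per document, and labels a factor of class labels):

# Project documents onto the space spanned by the class centroids
project_centroids <- function(dtm, labels) {
  classes <- levels(labels)
  # One centroid per class: the mean of that class's document vectors
  centroids <- t(sapply(classes, function(cl)
    colMeans(dtm[labels == cl, , drop = FALSE])))
  # Inner product of each document with each centroid gives the projection
  proj <- dtm %*% t(centroids)
  # Normalize each projected row to unit length
  proj / sqrt(rowSums(proj^2))
}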

IV. MCenKNN
KNN is an instance-based learning method, a form of supervised learning that has been used in many fields such as machine learning [7], image processing, and pattern recognition. KNN has notable disadvantages: the value of the parameter K must be pre-specified, and the computational cost is high because the distance from each query instance to all training documents must be computed. KNN cannot handle irrelevant attributes or noisy features, and it [1] also suffers from class imbalance problems [3]. To handle this situation, this paper combines the best features of KNN and document-based centroid dimensionality reduction (CentroidDR) into a classifier named MCenKNN [4], which substantially reduces computation time and improves efficiency compared with CenKNN. To classify documents, the proposed algorithm uses a k-d tree, similar in essence to a binary search tree, to store the projected data and search for the K nearest neighbors. After the training documents are projected and indexed in the k-d tree, a test document is classified as follows:

(4) For the test document d_t ∈ R^n, project it onto the document-centroid-based space and normalize it; we then obtain x̂_t ∈ R^l.
(5) Search for the K nearest neighbors of x̂_t over the k-d tree.
(6) Classify x̂_t based on the KNN decision rule, as detailed in the formula below:

\[ C(\hat{x}_t) = \arg\max_{C_j} \sum_{\hat{x}_i \in \mathrm{KNN}(\hat{x}_t)} I(\hat{x}_i, C_j) \]

where KNN(x̂_t) denotes the set of K nearest neighbors of x̂_t under the Euclidean distance dist(x̂_t, x̂_i), and I(x̂_i, C_j) is the indicator function, which is 1 when x̂_i belongs to C_j and 0 otherwise.
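A minimal sketch of this test phase using the FNN package, whose get.knnx function performs k-d tree nearest-neighbor search (proj_train, proj_test, and train_labels are illustrative names, e.g., outputs of the project_centroids sketch above):

library(FNN)

classify_mcenknn <- function(proj_train, train_labels, proj_test, K = 5) {
  # k-d tree search for the K nearest training rows to each test row
  nn <- get.knnx(proj_train, proj_test, k = K, algorithm = "kd_tree")
  # Majority vote over the neighbors' labels: the KNN decision rule
  apply(nn$nn.index, 1, function(idx)
    names(which.max(table(train_labels[idx]))))
}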

V. Experimental Results
To analyze the performance of document-level text classification, this paper evaluates the proposed MCenKNN on a small real-world text data set. A great way of applying text analysis to real data is to compute the TF-IDF of each word used. TF-IDF stands for Term Frequency - Inverse Document Frequency; it is a statistical measure used to evaluate how important a word is to a document.

Term frequency is the number of times a term or word occurs in a document. Consider three text documents (docs1, docs2, docs3) and one query document (query). The query document contains "0. It has new new times. /|@.<,", the first document docs1 contains "1. It has New York times /|@.<,", the second document docs2 contains "2. /|@.<, It has New York post", and the third document docs3 contains "3. It has Los Angeles /|@.<, times". For the training data, assign the class labels (docs1, docs2) -> New York and docs3 -> Los Angeles, with query -> ?.

After loading all the target documents in R, words such as numbers, capitalized variants, common words, and punctuation that do not contribute to distinguishing between the documents are discarded. Then the term frequency is calculated as

\[ \mathrm{tf}(t, d) = f_{t,d}, \]

i.e., the number of times term t occurs in document d. For all documents, we calculate the tf scores of every term against the query document. The inverse document frequency is

\[ \mathrm{idf}(t, D) = \log \frac{N}{|\{ d \in D : t \in d \}|}, \]

where N is the number of documents and |{d ∈ D : t ∈ d}| is the number of documents in which a particular term appears.

We now compute the centroids of all classes, which is the first step of CentroidDR (Centroid Dimensionality Reduction): project the high-dimensional documents D onto a document-centroid-based space. A centroid is the mean representation vector of the documents in each class, as given by the centroid formula in Section III.
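A minimal sketch of this TF-IDF computation for the four example documents, using tm's built-in weightTfIdf weighting (object names are illustrative):

library(tm)

texts <- c(query = "0. It has new new times. /|@.<,",
           docs1 = "1. It has New York times /|@.<,",
           docs2 = "2. /|@.<, It has New York post",
           docs3 = "3. It has Los Angeles /|@.<, times")
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# TF-IDF weighted document-term matrix
tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
round(as.matrix(tfidf), 3)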
The projected documents are stored in a k-d tree, and the query is classified by the KNN decision rule of Section IV, where KNN(x̂_t) denotes the set of K nearest neighbors of x̂_t, dist(x̂_t, x̂_i) denotes the Euclidean distance between x̂_t and x̂_i, and I(x̂_i, C_j) is the indicator function, which is 1 when x̂_i belongs to C_j and 0 otherwise.

[Fig. 5: class label of the test document predicted using the k-d tree]

From Fig. 5, it is clear that the query belongs to the class "New York", i.e., it is most similar to docs1, which is also what the KNN decision rule predicts. Compared with CenKNN, MCenKNN uses centroid values spanned by the documents rather than class centroids, predicts the class label of the test document accurately, and takes less time. Generally, humans are strong at visual analytics; that is part of the reason why word-frequency plots and word clouds have become so popular.

This example shows that, out of the three documents, docs1 and docs2 belong to the "New York" class and docs3 belongs to the "Los Angeles" class. Now suppose we search for sentences belonging to "Los Angeles", but most of the retrieved documents belong to the "New York" class: out of 100,000 items, only about 10 are relevant while roughly 99,990 are irrelevant. A classifier that simply rejects everything then achieves

Accuracy = 99,990 / 100,000 = 99.99%,

so accuracy alone is misleading. We therefore use precision and recall, two measures that better characterize a test's quality. Precision is the percentage of selected items that are correct; recall is the percentage of correct items that are selected. For the example above,

Precision = 1 / (1 + 1) = 50% and Recall = 1 / (1 + 1) = 50%.

In this small example the two are equal, but when tested on a large data set recall is typically higher than precision, and it is noticeable that as recall increases, precision decreases. A measure that combines both is their harmonic mean, the traditional F-measure or balanced F-score:

\[ F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \]

which is close to the precision and recall when they are close, and is more generally the square of the geometric mean divided by the arithmetic mean.
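A minimal sketch of these measures in R, assuming predicted and actual are vectors of class labels and "New York" is treated as the positive class (all names are illustrative):

evaluate <- function(predicted, actual, positive) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean
  c(precision = precision, recall = recall, f1 = f1)
}

# With one true positive, one false positive, and one false negative,
# this reproduces the 50%/50% values from the worked example:
evaluate(predicted = c("New York", "New York", "Los Angeles"),
         actual    = c("New York", "Los Angeles", "New York"),
         positive  = "New York")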

VI. Conclusion
This paper proposes a modified document text classifier, MCenKNN, using R to eliminate imbalanced class distribution problems and noisy term features. Our experimental results show the benefit of combining the strengths of two text classification algorithms, the Centroid classifier and KNN. The proposed centroid-based text classification algorithm can handle high-dimensional data spanned by document centroids as well as skewed, i.e., non-separable, data by taking advantage of a non-linear text classifier, K-Nearest Neighbor. MCenKNN, which uses centroids spanned by document centroids, classifies the class label of a test document more accurately than the class-spanned centroids used in CenKNN. Therefore, the experimental results obtained using MCenKNN are more accurate than those of CenKNN.