Knowledge-Based Systems

Volume 49, September 2013, Pages 179-189

Projected-prototype based classifier for text categorization

https://doi.org/10.1016/j.knosys.2013.05.013

Abstract

The explosive increase of data stimulates a greater demand for text categorization. The existing prototype-based classifiers, including k-NN, kNNModel and the Centroid classifier, are receiving wide interest from the text mining community because of their simplicity and efficiency. However, they usually perform less effectively on document data sets owing to the high dimensionality and complex class structures that these sets involve. In most cases a single document category actually contains multiple subtopics, indicating that the documents in the same class may comprise multiple subclasses, each associated with its individual term subspace. In this paper, a novel projected-prototype based classifier is proposed for text categorization, in which a document category is represented by a set of prototypes, each combining a representative for the documents in a subclass with its corresponding term subspace. In the classifier’s training process, the number of prototypes and the prototypes themselves are learned using a newly developed feature-weighting algorithm, in order to ensure that the documents belonging to different subclasses are separated as much as possible when projected onto their own subspaces. Then, in the testing process, each test document is classified in terms of its weighted distances from the different prototypes. Experimental results on the Reuters-21578 and 20-Newsgroups corpora show that the proposed classifier, based on the multi-representative-dependent projection method, can achieve higher classification accuracy at a lower computational cost than the conventional prototype-based classifiers, especially for data sets that include overlapping document categories.

Introduction

With the exponential growth of online information, finding ways of organizing document data efficiently and effectively has become an important issue. As the task of assigning unlabeled documents to predefined categories, text categorization has drawn considerable attention and has been widely used in many applications, such as customer relationship management [1], spam email filtering [2], web page classification [3], and text sentiment classification [4]. In the field of machine learning, which involves the development of computer programs that improve their performance through the experience they gain over sets of prepared data known as training data sets, numerous classification techniques have been used in recent years for text categorization. These include decision tree classifiers [5], [6], [7], Bayes classifiers [8], [9], [10], support vector machines (SVM) [11], [12], [13], neural network classifiers [14], [15], [16], boosting methods [17] and prototype-based classifiers [18].

The last of these techniques, the prototype-based classifier, begins by describing the data set by a (usually small) set of prototypes, each representing a particular subset of the training documents. Classification of a test document is then carried out by finding the closest prototype, typically by means of a distance measure. For example, in a k-nearest-neighbor (k-NN) classifier [19], [20], [21], the test document is classified by the majority vote of its k nearest neighbors, where each training document is considered as its own prototype. In the centroid-based classifiers (“Centroid classifiers” for short) [22], [23], [24], [25], classification relies on the closeness or similarity of the test document to the centroid (i.e., prototype) of each document category. The main advantage of the prototype-based classifiers is that they provide an intuitive summarization of the given data in a small number of prototypes and thus lead to plausible and interpretable class structures and classification schemes [26].
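To make the prototype-based scheme concrete, the following is a minimal sketch of a centroid-based classifier (our own illustration, not the exact formulation of [25]): it builds one prototype per class and assigns a test document to the class whose centroid is most similar under the cosine measure.

```python
import numpy as np

def train_centroids(X, y):
    """One prototype per class: the mean of that class's training vectors."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def centroid_classify(x, classes, centroids):
    """Assign x to the class whose centroid is closest by cosine similarity."""
    sims = (centroids @ x) / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x) + 1e-12)
    return classes[int(np.argmax(sims))]
```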

In text categorization, however, the design of an effective classifier depends heavily on factors such as the dimensionality of the document data, the size of the document set, the class structure, and the presence of outliers (noisy documents) [27]. For the work described here, we are interested in two issues: the high dimensionality and the complex class structures of document data. First, the “bag-of-words” representation commonly used for documents [28], in which each keyword is treated as a feature, results in high dimensionality of the data. In the conventional prototype-based classifiers, the commonly used measures (such as the Lp distance and the cosine function) tend to be invalid in high-dimensional spaces due to the curse of dimensionality [29], which means they cannot be directly applied to classifying such high-dimensional data. Second, the documents in different classes are on different topics but typically share some common terms, which results in overlapping classes. It is difficult for the existing prototype-based classifiers to characterize document data containing such complex class structures. For instance, a well-known drawback of k-NN is its inefficiency in searching for neighbors, which it does over the whole training data set because each document is considered as a prototype; the Centroid classifier [25], by contrast, constructs a single prototype for each class. Neither technique takes class overlap into consideration. Furthermore, documents in the same class usually comprise different subclasses, each of which relates to a document subtopic. For example, documents on the topic Sports may contain two different subtopics: Basketball and Soccer. In such cases, one document category should be modeled by multiple prototypes, one for each subclass.

To address these problems, one can resort to a feature selection technique, such as feature dependency, information gain or the gain ratio [30], [31], to reduce the data dimensionality and make the conventional classifiers applicable in the resulting low-dimensional space. In practice, however, feature selection techniques usually suffer from high computational complexity: an exhaustive search for the optimal feature subset is generally infeasible, because the number of admissible subsets grows exponentially with the data dimensionality. Recently, the feature-weighting technique, which assigns a continuous weighting value to each feature, has been applied to soft feature selection. For example, in the weighted discrete k-NN (WD-kNN) [32], each feature is assigned a weight indicating its contribution to class prediction; the weights are then used in a weighted distance measure to improve the conventional k-NN. In effect, this sort of feature weighting equates to projecting each class onto a subspace, in which the test documents are classified. However, such feature weighting is generally conducted by a global approach; that is, each document category is assumed to be associated with a unique term subspace. As discussed previously, the documents in different classes (and in the different subclasses of the same document category) typically exist in different subspaces; therefore, a local feature-weighting technique is required for effective text categorization.
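To illustrate feature weighting as soft feature selection (a hedged sketch of the general idea, not the WD-kNN algorithm of [32] itself), the weighted distance below scales each feature's contribution by a learned weight; a zero weight is equivalent to discarding the feature, i.e., projecting onto a subspace.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance; w[d] = 0 effectively drops feature d."""
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

# Two 4-term documents; the weights keep only the first two terms.
x = np.array([3.0, 1.0, 5.0, 2.0])
y = np.array([2.0, 1.0, 0.0, 9.0])
w = np.array([1.0, 1.0, 0.0, 0.0])
print(weighted_distance(x, y, w))  # 1.0 -- measured in the 2-term subspace
```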

In this paper, the novel classifier PPC (projected-prototype based classifier) is proposed for text categorization, using multi-representative-dependent projection. In the training process, PPC learns multiple representatives for the documents of each category, and the corresponding projected subspaces, via a newly developed local feature-weighting scheme and the projection technique. A set of prototypes existing in individual subspaces is then constructed for each document category. Unlike the existing prototype-based methods, our approach makes it possible to discover the hierarchical class structure in the projected subspaces. In the testing process, the test document is projected onto each subspace and is classified by measuring its weighted Euclidean distance from each of the projected prototypes (a sketch of this decision rule follows the list below). The main contributions of our work can be summarized as follows:

  • We propose a way of modeling each document category by multiple prototypes, each of which represents one subtopic of the category. Under this model, a hierarchical class structure can therefore be constructed in the projected subspaces (corresponding to the particular term subspaces in which the subclasses exist).

  • We optimize each prototype in the corresponding term subspace by an efficient algorithm with a newly designed objective function, in which each document is weighted to reduce the impact of noisy samples on the resulting prototypes. Moreover, we propose a local feature-weighting scheme to optimize the different contributions of a term to class prediction.

  • We propose an adaptive method to automatically determine the number of prototypes, while simultaneously evaluating the quality of each prototype. Therefore, in our model, each document category is modeled by the smallest number of optimal prototypes.
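The sketch below illustrates the testing rule referenced above, with hypothetical data structures of our own (the paper's exact formulation may differ): each class keeps several prototypes, each pairing a representative vector with a term-weight vector defining its subspace, and a test document takes the label of the nearest prototype under the weighted Euclidean distance.

```python
import numpy as np

def weighted_dist(x, rep, w):
    """Distance measured only along the prototype's own weighted term subspace."""
    return float(np.sqrt(np.sum(w * (x - rep) ** 2)))

def ppc_classify(x, prototypes):
    """prototypes: list of (class_label, representative, term_weights) triples."""
    label, _, _ = min(prototypes, key=lambda p: weighted_dist(x, p[1], p[2]))
    return label
```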

The remainder of this paper is organized as follows: our projected prototypes are described in Section 2. Section 3 presents the new method PPC. In Section 4 the experimental results and evaluation of PPC are reported. Finally, the paper concludes with a short discussion and an indication of further work in Section 5.

Section snippets

Projected prototypes of document categories

In what follows, documents are represented using the vector space model (VSM) [33], [34]. In this representation, documents are presented as a set of vectors {x1, x2, …, xN}, where N denotes the total number of documents in the training set Tr. In the simplest form [35], each document x is represented by a term frequency (TF) vector, x = 〈tf1, tf2, …, tfD〉, where tfd is the frequency of the dth term in the document. Here the terms are considered as the features of the vector space while D
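For concreteness, here is a minimal sketch of the TF representation just described, mapping a tokenized document to its term-frequency vector over a fixed vocabulary (the vocabulary and tokenization below are simplified assumptions of ours):

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Term-frequency vector: tf[d] is the count of the d-th vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["basketball", "soccer", "market", "stock"]
doc = "soccer soccer basketball market".split()
print(tf_vector(doc, vocab))  # [1, 2, 1, 0]
```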

The proposed method PPC

This section describes both the training algorithm and the testing algorithm of our proposed method, PPC. In the training process, PPC uses the training samples to learn a set of optimal prototypes for the individual classes, taking advantage of representatives derived from a new local feature-weighting algorithm, Representatives Learning. The testing algorithm then operates on the classification model obtained in the training process. We will begin by describing the testing algorithm.
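The snippet above only names the Representatives Learning algorithm; as a loose, k-means-style approximation of what per-class training could look like (our assumption only — the paper's algorithm additionally weights documents, optimizes a purpose-built objective, and chooses the number of prototypes adaptively), one might cluster each class into subclasses and derive term weights from within-cluster variance.

```python
import numpy as np

def learn_representatives(X_c, k, iters=20, seed=0):
    """Cluster one class's documents into k subclasses; weight each term
    inversely to its within-cluster variance (low variance -> relevant term)."""
    rng = np.random.default_rng(seed)
    reps = X_c[rng.choice(len(X_c), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every document to its nearest representative.
        labels = np.argmin(((X_c[:, None, :] - reps[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                reps[j] = X_c[labels == j].mean(axis=0)
    weights = [1.0 / (X_c[labels == j].var(axis=0) + 1e-6)
               if np.any(labels == j) else np.ones(X_c.shape[1])
               for j in range(k)]
    return reps, weights
```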

Experiments

In order to evaluate the performance of PPC, two common benchmark data sets for text categorization were used. We also experimentally compared PPC with some mainstream classifiers. All the experiments were run on a computer with a Celeron(R) D CPU at 3.06 GHz and 2 GB of RAM, under the Microsoft Windows XP operating system.

Concluding remarks

In this paper, we first discuss the problems that the conventional prototype-based classifiers, which learn prototypes in the entire data space, encounter in text categorization. The difficulty arises because the documents in different classes are associated with different term subspaces and, often, the documents are related to different subtopics (subclasses of the document category) despite belonging to the same class. We propose a new definition for the prototypes in text categorization,

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61175123 and the Key Industry-University Collaboration Project of Fujian Province of China under Grant No. 2010H6007.

References (49)

  • M. Dash et al., Feature selection for classification, Intelligent Data Analysis (1997)
  • H. Uguz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge-Based Systems (2011)
  • L. Chen et al., Class-dependent projection based method for text categorization, Pattern Recognition Letters (2011)
  • Y. Gao et al., Edited AdaBoost by weighted kNN, Neurocomputing (2010)
  • B. Zhou, Y. Yao, J. Luo, A three-way decision approach to email spam filtering, in: Proceedings of the 23rd Canadian...
  • S. Lakshminarayana, Categorization of web pages – performance enhancement to search engine, Knowledge-Based Systems (2009)
  • X. Lin, C. Guo, F. Chang, Classifying textual components of bilingual documents with decision-tree support vector...
  • P. Vateekul, M. Kubat, Fast induction of multiple decision trees in text categorization from large scale, imbalanced,...
  • S. Hassan, M. Rafi, S.M. Shaikh, Comparing SVM and naive Bayes classifiers for text categorization with wikitology as...
  • N. Bouguila, D. Ziou, Nonparametric Bayesian learning model: application to text and image categorization, in:...
  • S.B. Kim, H.C. Rim, D. Yook, H. Lim, Effective methods for improving naive Bayes text classifier, in: Proceedings of...
  • T. Guernine, K. Zeroual, A new fuzzy hierarchical classification based on SVM for text categorization, in: Proceedings...
  • F. Harrag et al., Improving Arabic text categorization using neural network with SVD, Journal of Digital Information Management (2010)
  • R. Schapire et al., BoosTexter: a boosting-based system for text categorization, Machine Learning (2000)