Projected-prototype based classifier for text categorization
Introduction
With the exponential growth of online information, finding ways to organize document data efficiently and effectively has become an important issue. As the task of assigning unlabeled documents to predefined categories, text categorization has drawn considerable attention and has been widely used in many applications, such as customer relationship management [1], spam email filtering [2], web page classification [3], and text sentiment classification [4]. In the field of machine learning, which involves the development of computer programs that improve their performance through experience gained over sets of prepared data known as training data sets, numerous classification techniques have been applied to text categorization in recent years. These include decision tree classifiers [5], [6], [7], Bayes classifiers [8], [9], [10], support vector machines (SVMs) [11], [12], [13], neural network classifiers [14], [15], [16], boosting methods [17] and prototype-based classifiers [18].
The last of these techniques, the prototype-based classifier, begins by describing the data set by a (usually small) set of prototypes, each representing a particular subset of the training documents. Classification of a test document is then carried out by finding the closest prototype, typically by means of a distance measure. For example, in a k-nearest-neighbor (k-NN) classifier [19], [20], [21], the test document is classified by the majority vote of its k nearest neighbors, where each training document is considered as its own prototype. In centroid-based classifiers (“Centroid classifiers” for short) [22], [23], [24], [25], classification relies on the closeness or similarity of the test document to the centroid (i.e., prototype) of each document category. The main advantage of prototype-based classifiers is that they provide an intuitive summarization of the given data in a small number of prototypes and thus lead to plausible and interpretable class structures and classification schemes [26].
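As a minimal illustration of the centroid-based scheme described above, the following sketch builds one centroid per class and classifies a test document by cosine similarity to each centroid. The document-term matrix and labels are a toy example invented for illustration, not data from this paper.

```python
import numpy as np

def train_centroids(X, y):
    """Compute one centroid (prototype) per class from TF vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, centroids):
    """Assign x to the class whose centroid is most similar (cosine)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))

# Toy document-term matrix: rows are documents, columns are term frequencies.
X = np.array([[3., 0., 1.], [2., 1., 0.],   # class 0: one topic's terms
              [0., 2., 3.], [1., 3., 2.]])  # class 1: another topic's terms
y = np.array([0, 0, 1, 1])
centroids = train_centroids(X, y)
print(classify(np.array([2., 0., 1.]), centroids))  # → 0
```

Each class is summarized by a single mean vector, which is what makes the model compact and interpretable but, as discussed below, too coarse for categories with several subtopics.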
In text categorization, however, the design of an effective classifier depends heavily on factors such as the dimensionality of the document data, the size of the document set, the class structures, and the presence of outliers (noisy documents) [27]. For the work described here, we are interested in two issues: high dimensionality and complex class structures of the document data. First, the “bag-of-words” representation commonly used for documents [28], in which each keyword is treated as a feature, results in high dimensionality of the data. In conventional prototype-based classifiers, the commonly used measures (such as the Lp distance and the cosine function) tend to become invalid in high-dimensional spaces due to the curse of dimensionality [29], which means they cannot be applied directly to classify such high-dimensional data. Second, documents in different classes are on different topics but typically share some common terms, which results in overlapping classes. It is difficult for the existing prototype-based classifiers to characterize document data containing such complex class structures. For instance, one well-known drawback of k-NN is its inefficiency in searching for neighbors, which it does over the whole training data set because each document is considered as a prototype; the Centroid classifier [25], by contrast, constructs a single prototype for each class. Neither technique takes class overlap into consideration. Furthermore, documents in the same class usually comprise different subclasses, each of which relates to a document subtopic. For example, documents on the topic Sports may contain two different subtopics: Basketball and Soccer. This indicates that one document category should be modeled by multiple prototypes, one per subclass.
To address these problems, one can resort to a feature selection technique, such as feature dependency, information gain or the gain ratio [30], [31], to reduce the data dimensionality and apply conventional classifiers in the reduced, low-dimensional space. In practice, however, feature selection techniques usually suffer from high computational complexity: an exhaustive search for the optimal feature subset is generally infeasible, because the number of admissible subsets grows exponentially with the data dimensionality. Recently, the feature-weighting technique, which assigns a continuous weighting value to each feature, has been applied as a form of soft feature selection. For example, in the weighted discrete k-NN (WD-kNN) [32], each feature is assigned a weight indicating its contribution to class prediction; the weights are then used in a weighted distance measure to improve the conventional k-NN. In effect, this sort of feature weighting equates to projecting each class onto a subspace, in which the test documents are then classified. However, such feature weighting is generally conducted globally; that is, each document category is assumed to be associated with a unique term subspace. As discussed previously, the documents in different classes (and in different subclasses of the same document category) typically exist in different subspaces; a local feature-weighting technique is therefore required for effective text categorization.
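The effect of such feature weighting on a distance measure can be shown with a small sketch. The weight and document vectors below are invented for illustration (they are not the learned weights of WD-kNN [32]): when the weights concentrate on a discriminative term, the weighted distance can reverse the decision that the unweighted distance would make.

```python
import numpy as np

def weighted_euclidean(x, p, w):
    """Euclidean distance with a per-term weight vector w."""
    return np.sqrt(np.sum(w * (x - p) ** 2))

# Hypothetical weights: term 0 is discriminative, terms 1 and 2 are noise.
w  = np.array([0.8, 0.1, 0.1])
x  = np.array([1.0, 5.0, 5.0])
p1 = np.array([1.0, 0.0, 0.0])  # agrees with x on the discriminative term
p2 = np.array([4.0, 5.0, 5.0])  # agrees with x only on the noisy terms

# The unweighted distance prefers p2; the weighted distance prefers p1.
print(np.linalg.norm(x - p1), np.linalg.norm(x - p2))              # ≈ 7.07, 3.0
print(weighted_euclidean(x, p1, w), weighted_euclidean(x, p2, w))  # ≈ 2.24, 2.68
```

Down-weighting the noisy terms effectively projects the comparison onto the subspace of informative terms, which is the intuition behind the projected prototypes introduced next.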
In this paper, a novel classifier, PPC (short for projected-prototype-based classifier), is proposed for text categorization, using multi-representative-dependent projection. In the training process, PPC learns multiple representatives for the documents of each category, together with the corresponding projected subspaces, via a newly developed local feature-weighting scheme and a projection technique. A set of prototypes, each existing in its own subspace, is then constructed for each document category. Unlike the existing prototype-based methods, our approach makes it possible to discover the hierarchical class structure in the projected subspaces. In the testing process, the test document is projected onto each subspace and classified by measuring its weighted Euclidean distance from each of the projected prototypes. The main contributions of our work can be summarized as follows:
- We propose a way of modeling each document category by multiple prototypes, each of which represents one subtopic of the category. Our model can therefore construct a hierarchical class structure in the projected subspaces (corresponding to the particular term subspaces in which the subclasses exist).
- We optimize each prototype in the corresponding term subspace by an efficient algorithm based on a newly designed objective function, in which each document is weighted to reduce the impact of noisy samples on the resulting prototypes. Moreover, we propose a local feature-weighting scheme to capture the different contributions of a term to class prediction.
- We propose an adaptive method to determine the number of prototypes automatically, while simultaneously evaluating the quality of each prototype. In our model, each document category is thus modeled by the smallest number of optimal prototypes.
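The testing process outlined above can be sketched as follows. Each prototype carries its own term-weight vector, so terms with near-zero weight are effectively projected out for that subtopic. All prototype centers, weights, and labels below are purely illustrative values, not prototypes learned by PPC.

```python
import numpy as np

# Each prototype: (class label, center vector, per-term weight vector).
# Near-zero weights project out terms irrelevant to that subtopic.
prototypes = [
    ("Sports",   np.array([5., 4., 0., 0.]), np.array([.5, .5, 0., 0.])),  # Basketball-like subtopic
    ("Sports",   np.array([0., 4., 5., 0.]), np.array([0., .5, .5, 0.])),  # Soccer-like subtopic
    ("Politics", np.array([0., 0., 1., 6.]), np.array([0., 0., .1, .9])),
]

def classify(x, prototypes):
    """Label of the prototype nearest under its own weighted distance."""
    dists = [(np.sqrt(np.sum(w * (x - p) ** 2)), label)
             for label, p, w in prototypes]
    return min(dists)[1]

print(classify(np.array([4., 5., 1., 0.]), prototypes))  # → Sports
```

Because each class may contribute several prototypes, the two Sports subtopics here are matched in different term subspaces, while a single-centroid model would blur them together.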
The remainder of this paper is organized as follows: our projected prototypes are described in Section 2. Section 3 presents the new method PPC. In Section 4 the experimental results and evaluation of PPC are reported. Finally, the paper concludes with a short discussion and an indication of further work in Section 5.
Projected prototypes of document categories
In what follows, documents are represented using the vector space model (VSM) [33], [34]. In this representation, documents are presented as a set of vectors {x1, x2, … , xN}, where N denotes the total number of documents in the training set Tr. In the simplest form [35], each document x is represented by a term frequency (TF) vector, x = 〈tf1, tf2, … , tfD〉, where tfd is the frequency of the dth term in the document. Here the terms are considered as the features of the vector space while D
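Constructing such a TF vector is straightforward; a minimal sketch over a toy vocabulary of D = 4 terms (the vocabulary and document are invented for illustration):

```python
from collections import Counter

def tf_vector(doc, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary of D terms."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in vocabulary]

vocabulary = ["ball", "team", "vote", "law"]  # toy term space, D = 4
doc = "the team passed the ball and the team scored"
print(tf_vector(doc, vocabulary))  # → [1, 2, 0, 0]
```

With a realistic vocabulary, D runs into the tens of thousands and most entries are zero, which is exactly the high-dimensional, sparse setting that motivates the projected prototypes.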
The proposed method PPC
This section describes both the training algorithm and the testing algorithm of our proposed method, PPC. In the training process, PPC uses the training samples to learn a set of optimal prototypes for the individual classes, taking advantage of representatives derived from a new locally feature-weighting algorithm, Representatives Learning. Based on the classification model obtained in the training process, the testing algorithm then operates. We will begin by describing the testing algorithm.
Experiments
In order to evaluate the performance of PPC, two common benchmark data sets for text categorization were used. We also experimentally compared PPC with several mainstream classifiers. All experiments were run on a computer with a Celeron(R) D CPU at 3.06 GHz and 2 GB of RAM, running the Microsoft Windows XP operating system.
Concluding remarks
In this paper, we first discuss the problems encountered in text categorization by conventional prototype-based classifiers, which learn their prototypes in the entire data space. The difficulty arises because the documents in different classes are associated with different term subspaces and, often, documents belonging to the same class relate to different subtopics (subclasses of the document category). We propose a new definition for the prototypes in text categorization,
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61175123 and the Key Industry-University Collaboration Project of Fujian Province of China under Grant No. 2010H6007.
References (49)
- et al., Integrating the voice of customers through call center emails into a decision support system for churn prediction, Information Management (2008)
- et al., A feature selection method based on improved Fisher’s discriminant ratio for text sentiment classification, Expert Systems With Applications (2011)
- et al., Combining decision tree and naive Bayes for classification, Knowledge-Based Systems (2006)
- et al., A comparison study on multiple binary-class SVM methods for unilabel text categorization, Pattern Recognition Letters (2010)
- et al., Text classification based on multi-word with support vector machine, Knowledge-Based Systems (2008)
- et al., Latent semantic analysis for text categorization using neural network, Knowledge-Based Systems (2008)
- et al., Development of a patent document classification and search platform using a back-propagation network, Expert Systems With Applications (2006)
- et al., A prototype classifier based on gravitational search algorithm, Applied Soft Computing (2012)
- et al., An improved k-nearest-neighbor algorithm for text categorization, Expert Systems With Applications (2012)
- An improved centroid classifier for text categorization, Expert Systems With Applications (2008)