Terms-based discriminative information space for robust text classification
Introduction
Text classification has attracted growing interest in recent years. This is due to the availability of digitized text such as Web pages, e-mails, blogs, digital libraries, social media, online advertisements, corporate documents, and product reviews [22]. Many applications based on these different data sources can be posed as text classification problems. In these problems, documents need to be categorized into predefined classes representing different semantic groups (e.g. spam and non-spam, topics, sentiments).
Text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high dimensional term space (easily in hundreds of thousands for user generated content on the Web), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to rely upon small training sets or collect training data from sources different from the target domain. This results in a distribution shift between training and test data. Thirdly, documents are of varying quality, languages and lengths, making a uniform knowledge-based approach inefficient or infeasible. For example, an important domain for text classification which embodies these challenges is that of e-mail spam filtering: vocabulary of terms can be huge; users’ preferences for spam and non-spam often differ; non-generic labeled collections are often not available; e-mails come in a wide variety of languages and qualities. Addressing these challenges demands a corpus-based, statistically robust, and computationally efficient text classification method.
There are numerous classification techniques available today that can be applied to text classification. The naive Bayes classifier, a probabilistic generative method, and the support vector machine, a statistical discriminative method, are generally considered effective for text classification. However, the former only performs better than the latter for very small training data [61], while the latter is very sensitive to distribution shift between the training and test data [20]. A different approach to enhanced text classification is through feature engineering and semantic representations. These approaches can be corpus-based, like latent semantic indexing (LSI) [45], or knowledge-based, like WordNet-based semantic enrichment [56]. However, engineering new features or incorporating information from external knowledge bases adds to the computational complexity of these methods. In addition, external knowledge may not be conveniently available for some domains, e.g., legal proceedings, limiting the general applicability of such approaches.
Intuitively, robust classification is obtained when the classes are well separated and invariant to noise in the feature space. Constructing such a feature space is therefore a significant step. For text classification, it is desirable to have a feature space that is readily interpretable through the terms in the document collection. In Fisher linear discriminant analysis, a popular feature extraction approach, the score of a document in the feature space is a non-intuitive combination of terms. Intuitive interpretation also implies that the feature space is low-dimensional, as opposed to the high-dimensional spaces generated by some kernel transformations.
In this paper, we present a terms-based discriminative information space for robust text classification (DIST). This feature space is constructed from discriminative term weights and linear opinion pooling. The discriminative term weights are supervised term weights that quantify the discriminative power of each term. A linear opinion pool aggregates these term-level discriminative powers to yield discriminative information scores for each document; these scores quantify how strongly a document favors one class over the others. A standard discriminative classifier is then learned in this feature space for the final classification.
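A minimal sketch of the pooling step may help fix ideas. The function names and the exact pooling rule below are illustrative assumptions, not the paper's definitions: each document's term counts are combined with per-class term weights via a frequency-weighted linear opinion pool, giving two pooled scores that serve as the document's coordinates in the low-dimensional feature space.

```python
def document_scores(term_counts, weights_pos, weights_neg):
    """Pool per-term discrimination weights into two document-level
    scores, one favoring each class (a linear opinion pool weighted
    by term frequency). Unseen terms contribute zero weight."""
    total = sum(term_counts.values()) or 1
    s_pos = sum(c * weights_pos.get(t, 0.0)
                for t, c in term_counts.items()) / total
    s_neg = sum(c * weights_neg.get(t, 0.0)
                for t, c in term_counts.items()) / total
    # The pair (s_pos, s_neg) is the document's point in the 2-D space.
    return (s_pos, s_neg)
```

A standard linear classifier (e.g., a linear SVM) can then be trained on these two-dimensional points.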
Specifically, we make the following contributions in this paper:
1. We present a new text classification methodology, named DIST. It combines feature space construction and linear classifier learning for robust performance in applications involving distribution shift between training and test data.
2. We propose and evaluate five supervised term weighting measures for quantifying each term's discriminative power. These measures (relative risk, log relative risk, odds, log odds, and Kullback-Leibler divergence) can be computed efficiently from the labeled training data.
3. We show that our methodology is a generalization of common generative, discriminative, and hybrid classification methods. In particular, we relate our methodology to the naive Bayes classifier and support vector machines.
4. We evaluate our methodology on six data sets belonging to different application areas. We also demonstrate the effectiveness on data sets having different training and test distributions. The results are compared with four common text classification methods. We further conduct statistical significance tests that demonstrate the overall effectiveness of our methodology with improved classification accuracies and area under the curve (AUC) values.
5. We show that our methodology is computationally efficient and suitable for modern text classification problems, especially those having (a) distribution shift between training and test data, (b) high-dimensional input (term) space, and (c) time and memory limitations.
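The five weighting measures in the second contribution can all be derived from simple per-class document-frequency counts. The sketch below uses the standard textbook definitions with Laplace-style smoothing; the paper's exact (possibly differently smoothed) variants may differ, so treat these formulas as assumptions:

```python
import math

def term_weights(n_pos, N_pos, n_neg, N_neg, smooth=0.5):
    """Supervised term weights for one term and one target class.

    n_pos / N_pos: documents of the target class containing the term,
                   out of all documents of that class.
    n_neg / N_neg: the same counts over the remaining classes.
    The additive smoothing avoids division by zero and log(0).
    """
    p = (n_pos + smooth) / (N_pos + 2 * smooth)   # P(term | class)
    q = (n_neg + smooth) / (N_neg + 2 * smooth)   # P(term | rest)
    rr = p / q                                     # relative risk
    odds = (p * (1 - q)) / ((1 - p) * q)           # odds ratio
    # Binary KL divergence between the two term-occurrence distributions
    kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return {"rr": rr, "log_rr": math.log(rr),
            "odds": odds, "log_odds": math.log(odds), "kl": kl}
```

A term that appears much more often in the target class than in the rest gets rr and odds above 1 (positive logs), while a non-discriminative term scores near rr = 1 and kl = 0.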
The rest of the paper is organized as follows. We discuss the related work in Section 2. Our text classification method, DIST, is described in Section 3. Comparison of DIST with popular generative and discriminative approaches follows in Section 4. We describe the data sets, evaluation setup, and experimental results in Section 5. This section compares our results under varying distribution shift, term weighting measures, and feature selection thresholds. Section 6 analyzes the scalability of DIST, its generalization to a classifier framework, and statistical significance tests of the results. We conclude and state some promising future directions in Section 7.
Section snippets
Related work and motivation
Relevant related work on discriminative term weighting and text classifiers is presented in the following subsections.
DIST: Discriminative information space for text classification
In this section, we describe our text classification methodology, DIST, based on terms-based discriminative information space construction and linear classification. DIST addresses the key issues of high dimensionality, term weighting and selection, and feature construction faced by supervised text classification methods. It uses statistical, information theoretic and probabilistic techniques in a hybrid generative-discriminative model of the classification problem. It is efficient, robust,
Interpretations and comparisons
In this section, we provide a broader interpretation of DIST by comparing it with generative, discriminative, and hybrid classifiers.
Evaluation setup
We evaluate DIST on six commonly used text classification data sets: personalized spam filtering (ECML), movie reviews (Movies), 20 Newsgroups (20NG), Simulated Real Auto Aviation (SRAA), ECUE, and the PU e-mail data set. Three of these data sets have more than one train-test data pair. Not only are the data sets from varied domains, they also have different underlying characteristics: some exhibit distribution shift while others do not, and some are two-class problems while others are multi-class.
Classifier properties and generalizations
Here we analyze the scalability of DIST by discussing its asymptotic time and space complexity. We then perform statistical significance tests to demonstrate that the performance of DIST is significantly better than that of the compared algorithms. Finally, we elaborate on a classifier framework that results as a generalization of the DIST classifier, followed by some limitations of DIST.
Conclusion and future direction
In this paper, we present a new text classification methodology, named DIST, based on discriminative information space construction and discrimination information pooling. Each term in the classification problem is assigned a weight that quantifies the discrimination information it provides for category k over the rest. These discriminative term weights are then used to transform the input term space into a new two-dimensional feature space. The transformation is based on a statistical model of
Acknowledgment
This work is supported in part by the ICT R&D program of MSIP/IITP [B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis], South Korea; Lahore University of Management Sciences (LUMS); University of Management and Technology (UMT); and the Higher Education Commission (HEC) of Pakistan.
References (73)
- et al., Learning semantic relatedness from term discrimination information, Expert Syst. Appl. (2009)
- StatApriori: an efficient algorithm for searching statistically significant association rules, Knowl. Inf. Syst. (2010)
- et al., Decision combination in multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell. (1994)
- et al., NRC-Canada: building the state-of-the-art in sentiment analysis of tweets, Second Joint Conference on Lexical and Computational Semantics (*SEM) (2013)
- et al., Using maximum entropy for text classification, IJCAI-99 Workshop on Machine Learning for Information Filtering (1999)
- et al., Selecting the right objective measure for association analysis, Inf. Syst. (2004)
- et al., Semi-supervised structured output learning based on a hybrid generative and discriminative approach, ACL-07: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (2007)
- et al., Building semantic kernels for text classification using Wikipedia, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008)
- et al., A survey of text classification algorithms, Mining Text Data (2012)
- et al., On robustness and domain adaptation using SVD for word sense disambiguation, COLING-08: Proceedings of the 22nd International Conference on Computational Linguistics (2008)
- A hybrid Markov/semi-Markov conditional random field for sequence segmentation, EMNLP-06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
- An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages, SIGIR-00: Proceedings of the 23rd Conference on Research and Development in Information Retrieval
- Analysis of representations for domain adaptation, NIPS-07: Advances in Neural Information Processing Systems
- ECML-PKDD discovery challenge 2006 overview, Proceedings of the ECML-PKDD Discovery Challenge
- The trade-off between generative and discriminative classifiers, IASC-04: 16th Symposium of IASC, Proceedings in Computational Statistics
- An Introduction to Information Retrieval
- Single-pass online learning: performance, voting schemes and online feature selection, KDD-06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- A corpus-based approach to comparative evaluation of statistical term association measures, J. Am. Soc. Inf. Sci. Technol.
- AUC optimization vs. error rate minimization, Adv. Neural Inf. Process. Syst.
- Mistake driven learning in text categorization, EMNLP-97: Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing
- Feature selection methods for text classification, KDD-07: Proceedings of the 13th International Conference on Knowledge Discovery and Data Mining
- Multiclass-multilabel classification with more classes than examples, International Conference on Artificial Intelligence and Statistics
- An assessment of case-based reasoning for spam filtering, Artif. Intell. Rev.
- Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res.
- Semi-supervised classification with hybrid generative/discriminative methods, KDD-07: Proceedings of the 13th Conference on Knowledge Discovery and Data Mining
- An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res.
- Learning from little: comparison of classifiers given little training, Knowledge Discovery in Databases: PKDD 2004
- Detection of opinion spam with character n-grams, Computational Linguistics and Intelligent Text Processing
- A survey of text mining techniques and applications, J. Emerging Technol. Web Intell.
- GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes, J. Clin. Epidemiol.
- Clustering and understanding documents via discrimination information maximization, PAKDD-12: Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining
- CDIM: document clustering by discrimination information maximization, Inf. Sci.
- Estimation of response probabilities from augmented retrospective observations, J. Am. Stat. Assoc.
- Text document preprocessing with the Bayes formula for classification using the support vector machine, IEEE Trans. Knowl. Data Eng.
- Exploiting generative models in discriminative classifiers, NIPS-98: Advances in Neural Information Processing Systems
- Methods for combining experts' probability assessments, Neural Comput.