Information Sciences

Volume 372, 1 December 2016, Pages 518-538
Terms-based discriminative information space for robust text classification

https://doi.org/10.1016/j.ins.2016.08.073

Abstract

With the popularity of Web 2.0, there has been a phenomenal increase in the utility of text classification in applications like document filtering and sentiment categorization. Many of these applications demand that the classification method be efficient and robust, yet produce accurate categorizations using only the terms in the documents. In this paper, we propose a novel and efficient method that uses a terms-based discriminative information space for robust text classification. Terms in the documents are assigned weights according to the discrimination information they provide for one category over the others. These weights also serve to partition the terms into category sets. A linear opinion pool is adopted for combining the discrimination information provided by each set of terms to yield a feature space (the discriminative information space) with dimensions equal to the number of classes. Subsequently, a discriminant function is learned to categorize the documents in this feature space. This classification methodology relies upon corpus information only, and is robust to distribution shifts and noise. We develop theoretical parallels of our methodology with generative, discriminative, and hybrid classifiers. We evaluate our methodology extensively with five discriminative term weighting schemes on six data sets from different application areas, and give a side-by-side comparison with four well-known text classification techniques. The results show that our methodology consistently outperforms the rest, especially when there is a distribution shift from training to test sets. Moreover, our methodology is simple and effective for different application domains and training set sizes. It is also fast, with a small and tunable memory footprint.

Introduction

Text classification has witnessed growing interest in recent years, owing to the availability of digitized text such as Web pages, e-mails, blogs, digital libraries, social media, online advertisements, corporate documents, and product reviews [22]. Many applications based on these data sources can be posed as text classification problems, in which documents must be categorized into predefined classes representing different semantic groups (e.g., spam and non-spam, topics, sentiments).

Text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high-dimensional term space (easily in the hundreds of thousands for user-generated content on the Web), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to rely upon small training sets or to collect training data from sources different from the target domain, resulting in a distribution shift between training and test data. Thirdly, documents vary in quality, language, and length, making a uniform knowledge-based approach inefficient or infeasible. E-mail spam filtering is an important domain for text classification that embodies all of these challenges: the vocabulary of terms can be huge; users’ preferences for spam and non-spam often differ; non-generic labeled collections are often unavailable; and e-mails come in a wide variety of languages and qualities. Addressing these challenges demands a corpus-based, statistically robust, and computationally efficient text classification method.

There are numerous classification techniques available today that can be utilized for text classification. The naive Bayes classifier, a probabilistic generative method, and the support vector machine, a statistical discriminative method, are generally considered effective for text classification. However, the former outperforms the latter only on very small training sets [61], while the latter is very sensitive to distribution shift between the training and test data [20]. A different route to enhanced text classification is through feature engineering and semantic representations. These approaches can be corpus-based, like latent semantic indexing (LSI) [45], or knowledge-based, like WordNet-based semantic enrichment [56]. However, engineering new features or incorporating information from external knowledge bases adds to the computational complexity of these methods. In addition, external knowledge may not be conveniently available for some domains (e.g., legal proceedings), limiting the general applicability of such approaches.

Intuitively, robust classification is obtained when the classes are well separated in the feature space and the separation is insensitive to noise. Constructing such a feature space is therefore a significant step. For text classification, it is desirable to have a feature space that is readily interpretable through the terms in the document collection. In Fisher linear discriminant analysis, a popular feature extraction approach, the score of a document in the feature space is a non-intuitive combination of terms. Intuitive interpretation also implies that the feature space is low-dimensional, as opposed to the high-dimensional spaces that some kernel transformations generate.

In this paper, we present a terms-based discriminative information space for robust text classification (DIST). This feature space is constructed from discriminative term weights and linear opinion pooling. The discriminative term weights are supervised term weights that quantify the discriminative power of each term. A linear opinion pool aggregates the discriminative powers of the terms to yield discriminative information scores for each document; these scores quantify the suitability of each document for a class over the others. A standard discriminative classifier is then learned in this feature space for the final classification.
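As a concrete illustration, the following minimal Python sketch maps one document into the K-dimensional discriminative information space via a linear opinion pool. The function name pool_scores and the term-frequency normalization are our assumptions for illustration, not the paper's exact formulation.

    import numpy as np

    def pool_scores(term_counts, term_weights):
        # term_counts:  (V,) term-frequency vector of one document
        # term_weights: (V, K) discrimination weight of each term
        #               for each of the K classes
        tf = term_counts.astype(float)
        total = tf.sum()
        if total == 0:
            return np.zeros(term_weights.shape[1])
        # Linear opinion pool: a convex combination of the per-term
        # "opinions", weighted by each term's share of the document.
        return (tf / total) @ term_weights

A standard linear classifier can then be trained on these K-dimensional score vectors instead of the raw V-dimensional term vectors.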

Specifically, we make the following contributions in this paper:

  1. We present a new text classification methodology, named DIST. It combines feature space construction and linear classifier learning for robust performance in applications involving distribution shift between training and test data.

  2. We propose and evaluate five supervised term weighting measures for quantifying each term’s discriminative power. These measures (relative risk, log relative risk, odds, log odds, and Kullback-Leibler divergence) can be computed efficiently from the labeled training data; a sketch of their computation follows this list.

  3. We show that our methodology is a generalization of common generative, discriminative, and hybrid classification methods. In particular, we relate our methodology to the naive Bayes classifier and support vector machines.

  4. We evaluate our methodology on six data sets belonging to different application areas, and demonstrate its effectiveness on data sets having different training and test distributions. The results are compared with four common text classification methods. We further conduct statistical significance tests that demonstrate the overall effectiveness of our methodology, with improved classification accuracies and area under the curve (AUC) values.

  5. We show that our methodology is computationally efficient and suitable for modern text classification problems, especially those having (a) distribution shift between training and test data, (b) high-dimensional input (term) space, and (c) time and memory limitations.
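The sketch referenced in contribution 2: one plausible way to estimate the five discriminative weights for a single term in a one-vs-rest setting. The document-frequency estimates, the Laplace-style smoothing, and the exact form of the KL term are assumptions on our part; the paper's estimators may differ.

    import numpy as np

    def discriminative_weights(df_k, n_k, df_rest, n_rest, eps=1.0):
        # df_k / df_rest: number of documents in class k / in the
        # remaining classes that contain the term; n_k / n_rest are
        # the corresponding collection sizes. `eps` smooths the
        # estimates to avoid division by zero and log(0).
        p = (df_k + eps) / (n_k + 2 * eps)        # P(term | class k)
        q = (df_rest + eps) / (n_rest + 2 * eps)  # P(term | rest)
        rr = p / q                                # relative risk
        log_rr = np.log(rr)                       # log relative risk
        odds = (p / (1 - p)) / (q / (1 - q))      # odds (ratio form)
        log_odds = np.log(odds)                   # log odds
        kl = p * np.log(p / q)                    # pointwise KL term
        return {"rr": rr, "log_rr": log_rr, "odds": odds,
                "log_odds": log_odds, "kl": kl}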

The rest of the paper is organized as follows. We discuss the related work in Section 2. Our text classification method, DIST, is described in Section 3. Comparison of DIST with popular generative and discriminative approaches follows in Section 4. We describe the data sets, evaluation setup, and experimental results in Section 5. This section compares our results under varying distribution shift, term weighting measures, and feature selection thresholds. Section 6 analyzes the scalability of DIST, its generalization to a classifier framework, and statistical significance tests of the results. We conclude and state some promising future directions in Section 7.

Section snippets

Related work and motivation

Relevant related work on discriminative term weighting and text classifiers is presented in the following subsections.

DIST: Discriminative information space for text classification

In this section, we describe our text classification methodology, DIST, based on the construction of a terms-based discriminative information space and linear classification. DIST addresses the key issues of high dimensionality, term weighting and selection, and feature construction faced by supervised text classification methods. It uses statistical, information-theoretic, and probabilistic techniques in a hybrid generative-discriminative model of the classification problem. It is efficient, robust, …
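The abstract notes that the term weights also partition the terms into category sets. A hedged sketch of how such a partition, together with a simple feature-selection cut, might be realized; the arg-max assignment rule and the threshold are our assumptions for illustration.

    import numpy as np

    def partition_terms(W, threshold=0.0):
        # W: (V, K) matrix of discriminative term weights.
        # Assign each term to the category for which its weight is
        # largest, and discard terms whose best weight does not
        # exceed `threshold` (a simple feature-selection cut).
        best_class = W.argmax(axis=1)
        best_weight = W.max(axis=1)
        keep = best_weight > threshold
        return [np.flatnonzero(keep & (best_class == k))
                for k in range(W.shape[1])]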

Interpretations and comparisons

In this section, we provide a broader interpretation of DIST by comparing it with generative, discriminative, and hybrid classifiers.

Evaluation setup

We evaluate DIST on six commonly-used text classification data sets: personalized spam filtering (ECML), movie reviews (Movies), 20 Newsgroups (20NG), Simulated Real Auto Aviation (SRAA), ECUE, and the PU e-mail data set. Three of these data sets have more than one train–test pair. Not only are the data sets from varied domains, but they also have different underlying characteristics: some exhibit distribution shift while others do not, and some are two-class problems while others are multi-class …

Classifier properties and generalizations

Here we analyze the scalability of DIST by discussing its asymptotic time and space complexity. We then perform statistical significance tests to demonstrate that DIST performs significantly better than the compared algorithms. Finally, we elaborate on a classifier framework that results as a generalization of the DIST classifier, and discuss some limitations of DIST.
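For comparing classifiers over multiple data sets, Demsar's methodology (cited in the reference list below) recommends the Friedman test followed by a post-hoc analysis. A minimal sketch using SciPy, with placeholder accuracies rather than results from the paper:

    from scipy.stats import friedmanchisquare

    # Accuracy of each classifier across six data sets; the values
    # are placeholders, not results reported in the paper.
    dist = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93]
    nb   = [0.85, 0.80, 0.92, 0.84, 0.82, 0.88]
    svm  = [0.88, 0.83, 0.94, 0.86, 0.80, 0.90]

    stat, p = friedmanchisquare(dist, nb, svm)
    print("Friedman chi-square = %.3f, p = %.4f" % (stat, p))
    # A small p rejects the hypothesis that all classifiers perform
    # equally; pairwise post-hoc tests (e.g., Nemenyi) then follow.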

Conclusion and future direction

In this paper, we present a new text classification methodology, named DIST, based on discriminative information space construction and discrimination information pooling. Each term in the classification problem is assigned a weight that quantifies the discrimination information it provides for category k over the rest. These discriminative term weights are then used to transform the input term space into a new low-dimensional feature space with one dimension per category. The transformation is based on a statistical model of …

Acknowledgment

This work is supported in part by the ICT R&D program of MSIP/IITP [B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis], South Korea; the Lahore University of Management Sciences (LUMS); the University of Management and Technology (UMT); and the Higher Education Commission (HEC) of Pakistan.

References (73)

  • G. Andrew

A hybrid Markov/semi-Markov conditional random field for sequence segmentation

EMNLP-06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

    (2006)
  • I. Androutsopoulos et al.

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

    SIGIR-00: Proceedings of the 23rd Conference on Research and Development in Information Retrieval

    (2000)
  • S. Ben-David et al.

    Analysis of representations for domain adaptation

    NIPS-07: Advances in Neural Information Processing Systems

    (2007)
  • S. Bickel

ECML-PKDD Discovery Challenge 2006 overview

    Proceedings of ECML-PKDD Discovery Challenge

    (2006)
  • G. Bouchard et al.

    The trade-off between generative and discriminative classifiers

    IASC 04: 16th Symposium of IASC, Proceedings in Computational Statistics

    (2004)
  • P. Raghavan et al.

    An Introduction to Information Retrieval

    (2009)
  • V.R. Carvalho et al.

    Single-pass online learning: performance, voting schemes and online feature selection

    KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2006)
Y.M. Chung et al.

    A corpus-based approach to comparative evaluation of statistical term association measures

    J. Am. Soc. Inf. Sci. Technol.

    (2001)
  • C. Cortes et al.

AUC optimization vs. error rate minimization

    Adv. Neural Inf. Process. Syst.

    (2004)
  • I. Dagan et al.

Mistake-driven learning in text categorization

    EMNLP-97: Proceedings of 2nd Conference on Empirical Methods in Natural Language Processing

    (1997)
  • A. Dasgupta et al.

    Feature selection methods for text classification

    KDD-07: Proceedings of 13th International Conference on Knowledge Discovery and Data Mining

    (2007)
  • O. Dekel et al.

    Multiclass-multilabel classification with more classes than examples

    International Conference on Artificial Intelligence and Statistics

    (2010)
  • S.J. Delany et al.

    An assessment of case-based reasoning for spam filtering

Artif. Intell. Rev.

    (2005)
  • J. Demsar

    Statistical comparisons of classifiers over multiple data sets

    J. Mach. Learn. Res.

    (2006)
  • G. Druck et al.

    Semi-supervised classification with hybrid generative/discriminative methods

    KDD-07: Proceedings of 13th Conference on Knowledge Discovery and Data Mining

    (2007)
  • G. Forman

    An extensive empirical study of feature selection metrics for text classification

J. Mach. Learn. Res.

    (2003)
  • G. Forman et al.

    Learning from little: comparison of classifiers given little training

    Knowledge Discovery in Databases: PKDD 2004

    (2004)
  • D.H. Fusilier et al.

    Detection of opinion spam with character n-grams

    Computational Linguistics and Intelligent Text Processing

    (2015)
  • V. Gupta et al.

    A survey of text mining techniques and applications

    J. Emerging Technol. Web Intell.

    (2009)
  • G. Guyatt et al.

GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes

    J. Clinical Epidemiol.

    (2013)
  • M.T. Hassan et al.

    Clustering and understanding documents via discrimination information maximization

PAKDD-12: Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining

    (2012)
  • M.T. Hassan et al.

CDIM: document clustering by discrimination information maximization

    Inf. Sci.

    (2015)
  • D.A. Hsieh et al.

    Estimation of response probabilities from augmented retrospective observations

    J. Am. Stat. Assoc.

    (1985)
  • D. Isa et al.

Text document preprocessing with the Bayes formula for classification using the support vector machine

    IEEE Trans. Knowl. Data Eng.

    (2008)
  • T.S. Jaakkola et al.

    Exploiting generative models in discriminative classifiers

NIPS-98: Advances in Neural Information Processing Systems

    (1998)
  • R.A. Jacobs

    Methods for combining experts’ probability assessments

    Neural Comput.

    (1995)