Information Sciences

Volume 372, 1 December 2016, Pages 518-538
Terms-based discriminative information space for robust text classification

https://doi.org/10.1016/j.ins.2016.08.073

Abstract

With the popularity of Web 2.0, there has been a phenomenal increase in the utility of text classification in applications like document filtering and sentiment categorization. Many of these applications demand that the classification method be efficient and robust, yet produce accurate categorizations using only the terms in the documents. In this paper, we propose a novel and efficient method that uses a terms-based discriminative information space for robust text classification. Terms in the documents are assigned weights according to the discrimination information they provide for one category over the others. These weights also serve to partition the terms into category sets. A linear opinion pool is adopted for combining the discrimination information provided by each set of terms to yield a feature space (the discriminative information space) with dimensions equal to the number of classes. Subsequently, a discriminant function is learned to categorize the documents in this feature space. This classification methodology relies upon corpus information only, and is robust to distribution shifts and noise. We develop theoretical parallels of our methodology with generative, discriminative, and hybrid classifiers. We evaluate our methodology extensively with five discriminative term weighting schemes on six data sets from different application areas, and give a side-by-side comparison with four well-known text classification techniques. The results show that our methodology consistently outperforms the rest, especially when there is a distribution shift from training to test sets. Moreover, our methodology is simple and effective for different application domains and training set sizes. It is also fast, with a small and tunable memory footprint.

Introduction

Text classification has witnessed growing interest in recent years, owing to the availability of digitized text such as Web pages, e-mails, blogs, digital libraries, social media, online advertisements, corporate documents, and product reviews [22]. Many applications based on these data sources can be posed as text classification problems, in which documents must be categorized into predefined classes representing different semantic groups (e.g., spam and non-spam, topics, sentiments).

Text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high-dimensional term space (easily in the hundreds of thousands for user-generated content on the Web), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to rely upon small training sets or to collect training data from sources different from the target domain, resulting in a distribution shift between training and test data. Thirdly, documents vary in quality, language, and length, making a uniform knowledge-based approach inefficient or infeasible. E-mail spam filtering is an important domain for text classification that embodies all of these challenges: the vocabulary of terms can be huge; users’ preferences for spam and non-spam often differ; non-generic labeled collections are often unavailable; and e-mails come in a wide variety of languages and qualities. Addressing these challenges demands a corpus-based, statistically robust, and computationally efficient text classification method.

There are numerous classification techniques available today that can be utilized for text classification. The naive Bayes classifier, a probabilistic generative method, and the support vector machine, a statistical discriminative method, are generally considered effective for text classification. However, the former outperforms the latter only on very small training sets [61], while the latter is very sensitive to distribution shift between the training and test data [20]. A different route to enhanced text classification is through feature engineering and semantic representations. These approaches can be corpus-based, like latent semantic indexing (LSI) [45], or knowledge-based, like WordNet-based semantic enrichment [56]. However, engineering new features or incorporating information from external knowledge bases adds to the computational complexity of these methods. In addition, external knowledge may not be conveniently available for some domains (e.g., legal proceedings), limiting the general applicability of such approaches.

Intuitively, robust classification is obtained when the classes are well separated in the feature space and the separation is insensitive to noise. Constructing such a feature space is therefore a significant step. For text classification, it is desirable to have a feature space that is readily interpretable through the terms in the document collection. In Fisher linear discriminant analysis, a popular feature extraction approach, the score of a document in the feature space is a non-intuitive combination of terms. Intuitive interpretation also implies that the feature space is low-dimensional, as opposed to the high-dimensional spaces that some kernel transformations generate.

In this paper, we present a terms-based discriminative information space for robust text classification (DIST). This feature space is constructed from discriminative term weights and linear opinion pooling. The discriminative term weights are supervised term weights that quantify the discriminative power of each term. A linear opinion pool aggregates the discriminative powers of the terms to yield discriminative information scores for each document; these scores quantify the suitability of each document for a class over the others. A standard discriminative classifier is then learned in this feature space for the final classification.
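As a concrete illustration, the following minimal Python sketch maps one document into the K-dimensional discriminative information space via a linear opinion pool. The function name pool_scores and the term-frequency normalization are our assumptions for illustration, not the paper's exact formulation.

    import numpy as np

    def pool_scores(term_counts, term_weights):
        # term_counts:  (V,) term-frequency vector of one document
        # term_weights: (V, K) discrimination weight of each term
        #               for each of the K classes
        tf = term_counts.astype(float)
        total = tf.sum()
        if total == 0:
            return np.zeros(term_weights.shape[1])
        # Linear opinion pool: a convex combination of the per-term
        # "opinions", weighted by each term's share of the document.
        return (tf / total) @ term_weights

A standard linear classifier can then be trained on these K-dimensional score vectors instead of the raw V-dimensional term vectors.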

Specifically, we make the following contributions in this paper:

  1. We present a new text classification methodology, named DIST. It combines feature space construction and linear classifier learning for robust performance in applications involving distribution shift between training and test data.

  2. We propose and evaluate five supervised term weighting measures for quantifying each term’s discriminative power. These measures (relative risk, log relative risk, odds, log odds, and Kullback-Leibler divergence) can be computed efficiently from the labeled training data; a sketch of their computation follows this list.

  3. We show that our methodology is a generalization of common generative, discriminative, and hybrid classification methods. In particular, we relate our methodology to the naive Bayes classifier and support vector machines.

  4. We evaluate our methodology on six data sets belonging to different application areas, and demonstrate its effectiveness on data sets having different training and test distributions. The results are compared with four common text classification methods. We further conduct statistical significance tests that demonstrate the overall effectiveness of our methodology, with improved classification accuracies and area under the curve (AUC) values.

  5. We show that our methodology is computationally efficient and suitable for modern text classification problems, especially those having (a) distribution shift between training and test data, (b) high-dimensional input (term) space, and (c) time and memory limitations.
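The sketch referenced in contribution 2: one plausible way to estimate the five discriminative weights for a single term in a one-vs-rest setting. The document-frequency estimates, the Laplace-style smoothing, and the exact form of the KL term are assumptions on our part; the paper's estimators may differ.

    import numpy as np

    def discriminative_weights(df_k, n_k, df_rest, n_rest, eps=1.0):
        # df_k / df_rest: number of documents in class k / in the
        # remaining classes that contain the term; n_k / n_rest are
        # the corresponding collection sizes. `eps` smooths the
        # estimates to avoid division by zero and log(0).
        p = (df_k + eps) / (n_k + 2 * eps)        # P(term | class k)
        q = (df_rest + eps) / (n_rest + 2 * eps)  # P(term | rest)
        rr = p / q                                # relative risk
        log_rr = np.log(rr)                       # log relative risk
        odds = (p / (1 - p)) / (q / (1 - q))      # odds (ratio form)
        log_odds = np.log(odds)                   # log odds
        kl = p * np.log(p / q)                    # pointwise KL term
        return {"rr": rr, "log_rr": log_rr, "odds": odds,
                "log_odds": log_odds, "kl": kl}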

The rest of the paper is organized as follows. We discuss the related work in Section 2. Our text classification method, DIST, is described in Section 3. Comparison of DIST with popular generative and discriminative approaches follows in Section 4. We describe the data sets, evaluation setup, and experimental results in Section 5. This section compares our results under varying distribution shift, term weighting measures, and feature selection thresholds. Section 6 analyzes the scalability of DIST, its generalization to a classifier framework, and statistical significance tests of the results. We conclude and state some promising future directions in Section 7.

Section snippets

Related work and motivation

Relevant related work on discriminative term weighting and text classifiers is presented in the following subsections.

DIST: Discriminative information space for text classification

In this section, we describe our text classification methodology, DIST, based on the construction of a terms-based discriminative information space and linear classification. DIST addresses the key issues of high dimensionality, term weighting and selection, and feature construction faced by supervised text classification methods. It uses statistical, information-theoretic, and probabilistic techniques in a hybrid generative-discriminative model of the classification problem. It is efficient, robust, …
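The abstract notes that the term weights also partition the terms into category sets. A hedged sketch of how such a partition, together with a simple feature-selection cut, might be realized; the arg-max assignment rule and the threshold are our assumptions for illustration.

    import numpy as np

    def partition_terms(W, threshold=0.0):
        # W: (V, K) matrix of discriminative term weights.
        # Assign each term to the category for which its weight is
        # largest, and discard terms whose best weight does not
        # exceed `threshold` (a simple feature-selection cut).
        best_class = W.argmax(axis=1)
        best_weight = W.max(axis=1)
        keep = best_weight > threshold
        return [np.flatnonzero(keep & (best_class == k))
                for k in range(W.shape[1])]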

Interpretations and comparisons

In this section, we provide a broader interpretation of DIST by comparing it with generative, discriminative, and hybrid classifiers.

Evaluation setup

We evaluate DIST on six commonly-used text classification data sets: personalized spam filtering (ECML), movie reviews (Movies), 20 Newsgroups (20NG), Simulated Real Auto Aviation (SRAA), ECUE, and the PU e-mail data set. Three of these data sets have more than one train–test pair. Not only are the data sets from varied domains, but they also have different underlying characteristics: some exhibit distribution shift while others do not, and some are two-class problems while others are multi-class …

Classifier properties and generalizations

Here we analyze the scalability of DIST by discussing its asymptotic time and space complexity. We then perform statistical significance tests to demonstrate that DIST performs significantly better than the compared algorithms. Finally, we elaborate on a classifier framework that results as a generalization of the DIST classifier, and discuss some limitations of DIST.
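For comparing classifiers over multiple data sets, Demsar's methodology (cited in the reference list below) recommends the Friedman test followed by a post-hoc analysis. A minimal sketch using SciPy, with placeholder accuracies rather than results from the paper:

    from scipy.stats import friedmanchisquare

    # Accuracy of each classifier across six data sets; the values
    # are placeholders, not results reported in the paper.
    dist = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93]
    nb   = [0.85, 0.80, 0.92, 0.84, 0.82, 0.88]
    svm  = [0.88, 0.83, 0.94, 0.86, 0.80, 0.90]

    stat, p = friedmanchisquare(dist, nb, svm)
    print("Friedman chi-square = %.3f, p = %.4f" % (stat, p))
    # A small p rejects the hypothesis that all classifiers perform
    # equally; pairwise post-hoc tests (e.g., Nemenyi) then follow.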

Conclusion and future direction

In this paper, we present a new text classification methodology, named DIST, based on discriminative information space construction and discrimination information pooling. Each term in the classification problem is assigned a weight that quantifies the discrimination information it provides for category k over the rest. These discriminative term weights are then used to transform the input term space into a new low-dimensional feature space with one dimension per category. The transformation is based on a statistical model of …

Acknowledgment

This work is supported in part by the ICT R&D program of MSIP/IITP [B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis], South Korea; the Lahore University of Management Sciences (LUMS); the University of Management and Technology (UMT); and the Higher Education Commission (HEC) of Pakistan.

References (73)

  • G. Andrew

A hybrid Markov/semi-Markov conditional random field for sequence segmentation

EMNLP-06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

    (2006)
  • I. Androutsopoulos et al.

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

    SIGIR-00: Proceedings of the 23rd Conference on Research and Development in Information Retrieval

    (2000)
  • S. Ben-David et al.

    Analysis of representations for domain adaptation

    NIPS-07: Advances in Neural Information Processing Systems

    (2007)
  • S. Bickel

ECML-PKDD Discovery Challenge 2006 overview

    Proceedings of ECML-PKDD Discovery Challenge

    (2006)
  • G. Bouchard et al.

    The trade-off between generative and discriminative classifiers

    IASC 04: 16th Symposium of IASC, Proceedings in Computational Statistics

    (2004)
  • P. Raghavan et al.

    An Introduction to Information Retrieval

    (2009)
  • V.R. Carvalho et al.

    Single-pass online learning: performance, voting schemes and online feature selection

    KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2006)
Y.M. Chung et al.

    A corpus-based approach to comparative evaluation of statistical term association measures

    J. Am. Soc. Inf. Sci. Technol.

    (2001)
  • C. Cortes et al.

AUC optimization vs. error rate minimization

    Adv. Neural Inf. Process. Syst.

    (2004)
  • I. Dagan et al.

Mistake-driven learning in text categorization

    EMNLP-97: Proceedings of 2nd Conference on Empirical Methods in Natural Language Processing

    (1997)
  • A. Dasgupta et al.

    Feature selection methods for text classification

    KDD-07: Proceedings of 13th International Conference on Knowledge Discovery and Data Mining

    (2007)
  • O. Dekel et al.

    Multiclass-multilabel classification with more classes than examples

    International Conference on Artificial Intelligence and Statistics

    (2010)
  • S.J. Delany et al.

    An assessment of case-based reasoning for spam filtering

Artif. Intell. Rev.

    (2005)
  • J. Demsar

    Statistical comparisons of classifiers over multiple data sets

    J. Mach. Learn. Res.

    (2006)
  • G. Druck et al.

    Semi-supervised classification with hybrid generative/discriminative methods

    KDD-07: Proceedings of 13th Conference on Knowledge Discovery and Data Mining

    (2007)
  • G. Forman

    An extensive empirical study of feature selection metrics for text classification

J. Mach. Learn. Res.

    (2003)
  • G. Forman et al.

    Learning from little: comparison of classifiers given little training

    Knowledge Discovery in Databases: PKDD 2004

    (2004)
  • D.H. Fusilier et al.

    Detection of opinion spam with character n-grams

    Computational Linguistics and Intelligent Text Processing

    (2015)
  • V. Gupta et al.

    A survey of text mining techniques and applications

    J. Emerging Technol. Web Intell.

    (2009)
  • G. Guyatt et al.

GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes

    J. Clinical Epidemiol.

    (2013)
  • M.T. Hassan et al.

    Clustering and understanding documents via discrimination information maximization

PAKDD-12: Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining

    (2012)
  • M.T. Hassan et al.

CDIM: document clustering by discrimination information maximization

    Inf. Sci.

    (2015)
  • D.A. Hsieh et al.

    Estimation of response probabilities from augmented retrospective observations

    J. Am. Stat. Assoc.

    (1985)
  • D. Isa et al.

Text document preprocessing with the Bayes formula for classification using the support vector machine

    IEEE Trans. Knowl. Data Eng.

    (2008)
  • T.S. Jaakkola et al.

    Exploiting generative models in discriminative classifiers

NIPS-98: Advances in Neural Information Processing Systems

    (1998)
  • R.A. Jacobs

    Methods for combining experts’ probability assessments

    Neural Comput.

    (1995)