Utility-based feature selection for text classification

  • Regular Paper
Knowledge and Information Systems

Abstract

Feature selection is an important step before classification, used to reduce excessive computational costs and enhance classification performance. This paper presents a novel feature selection method based on the concept of utility, which is grounded in economics theory. In particular, we focus on a utility-based feature selection method for enhancing text classification. Unlike existing feature selection methods, the proposed method selects discriminative semantic terms according to how authors use terms to express the main ideas in textual documents, i.e., the “utility of terms,” a criterion that measures the usefulness of terms in expressing authors’ main ideas. To the best of our knowledge, our work represents a successful application of economics theory to the development of a semantically rich feature selection method for improving text classification. Our empirical tests on six UCI benchmark datasets confirm that the proposed method often outperforms other state-of-the-art feature selection methods in text classification. Moreover, our method provides an economics-based explanation of term weighting for information retrieval and semantic information acquisition in textual documents.



Acknowledgements

This research was supported by a project of the National Natural Science Foundation of China (Grant No. 71731006) and by the Natural Science Foundation of Guangdong Province (Grant No. 2018A030313795). Lau’s work was supported by grants from the Research Grants Council of the Hong Kong SAR (Projects: CityU 11502115 and CityU 11525716), the NSFC Basic Research Program (Project: 71671155), and the Shenzhen Municipal Science and Technology Innovation Fund (Project: JCYJ20160229165300897).

Author information

Corresponding author

Correspondence to Heyong Wang.

About this article

Cite this article

Wang, H., Hong, M. & Lau, R.Y.K. Utility-based feature selection for text classification. Knowl Inf Syst 61, 197–226 (2019). https://doi.org/10.1007/s10115-018-1281-z

  • DOI: https://doi.org/10.1007/s10115-018-1281-z
