A comparative study of evolving fuzzy grammar and machine learning techniques for text categorization

Sharef, Nurfadhlina Mohd; Martin, Trevor; Kasmiran, Khairul Azhar; Mustapha, Aida; Sulaiman, Md. Nasir; Azmi-Murad, Masrah Azrifah

doi:10.1007/s00500-014-1358-x

Methodologies and Application
Published: 06 July 2014

Volume 19, pages 1701–1714, (2015)
Soft Computing

Nurfadhlina Mohd Sharef¹,
Trevor Martin¹,
Khairul Azhar Kasmiran¹,
Aida Mustapha¹,
Md. Nasir Sulaiman¹ &
Masrah Azrifah Azmi-Murad¹


Abstract

Several methods have been studied for text categorization, and most are inspired by the statistical distribution of features in the texts, as in Machine Learning (ML) methods. However, no available work compares the performance of ML-based methods against a text expression-based method, especially for incident and medical case categorization. Meanwhile, these two domains are becoming ever more popular, owing to growing interest in automation for security intelligence and health services. This paper presents a text expression-based method called Evolving Fuzzy Grammar (EFG) and evaluates its performance against the conventional ML methods of Naïve Bayes, support vector machine, \(k\)-nearest neighbor, adaptive boosting, and decision tree. The incident dataset is a real dataset taken from the World Incidents Tracking System, while ImageCLEF 2009 was the source of the radiology case reports. Evaluated with standard measures (recall, precision, and \(F\)-measure), the results showed that each method had different strengths and weaknesses in the two categorization tasks. In both domains, the SMO (support vector machine) and IBk (\(k\)-nearest neighbor) methods were the best, while AdaBoost was the worst. The medical dataset was also easier to categorize than the incident dataset. Although EFG was ranked second lowest overall, it obtained the highest precision in the bombing category and the highest recall in the armed attack category, and on average it ranked in the top three for medical case categorization. The text expression-based representation used in EFG was also the most verbose and expressive of the methods compared. This indicates that EFG is a viable method for text categorization and may serve as an alternative approach to such a task.
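As a rough illustration of the evaluation measures named in the abstract, the sketch below computes per-category precision, recall, and \(F\)-measure for single-label predictions. It is a minimal sketch, not the authors' evaluation code: the function name `per_class_scores` and the toy labels are assumptions made for this example, and the balanced \(F_1\) variant of the \(F\)-measure is assumed.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for category g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true category but was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels invented for illustration only (not data from the study).
gold = ["bombing", "bombing", "armed_attack", "armed_attack", "bombing"]
pred = ["bombing", "armed_attack", "armed_attack", "armed_attack", "bombing"]
scores = per_class_scores(gold, pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Reporting the three values per category, rather than a single accuracy figure, is what allows a method to rank low overall while still leading on precision or recall in an individual category.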

Notes

Available at http://wits-classic.nctc.gov/. Last accessed on 25 March 2008.

Acknowledgments

This project is part of ongoing work under a research grant funded by the University of Putra Malaysia Research University Grant Scheme.

Author information

Authors and Affiliations

Faculty of Computer Science and Information Technology, University of Putra Malaysia, 43400 Serdang, Selangor, Malaysia
Nurfadhlina Mohd Sharef, Trevor Martin, Khairul Azhar Kasmiran, Aida Mustapha, Md. Nasir Sulaiman & Masrah Azrifah Azmi-Murad

Corresponding author

Correspondence to Nurfadhlina Mohd Sharef.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Sharef, N.M., Martin, T., Kasmiran, K.A. et al. A comparative study of evolving fuzzy grammar and machine learning techniques for text categorization. Soft Comput 19, 1701–1714 (2015). https://doi.org/10.1007/s00500-014-1358-x

Published: 06 July 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s00500-014-1358-x
