A comparative study of evolving fuzzy grammar and machine learning techniques for text categorization

  • Methodologies and Application
Soft Computing

Abstract

Most text categorization methods studied to date, such as Machine Learning (ML) methods, draw on the statistical distribution of features in the texts. However, no available work investigates the performance of ML-based methods against a text expression-based method, especially for incident and medical case categorization. Meanwhile, these two domains are becoming ever more popular, owing to growing interest in automation for security intelligence and health services. This paper presents a text expression-based method called Evolving Fuzzy Grammar (EFG) and evaluates its performance against the conventional ML methods of Naïve Bayes, support vector machine, \(k\)-nearest neighbor, adaptive boosting, and decision tree. The incident dataset is a real dataset taken from the World Incidents Tracking System, while ImageCLEF 2009 was the source for the radiology case reports. Evaluated with standard measures (i.e., recall, precision, and \(F\)-measure), the results showed that each method has different strengths and weaknesses in the two categorization tasks. In both domains, the SMO (support vector machine) and IBk (\(k\)-nearest neighbor) methods were the best, while AdaBoost was the worst. The medical dataset was also easier to categorize than the incident dataset. Although EFG was ranked second lowest overall, it obtained the highest precision in the bombing categorization and the highest recall in the armed attack categorization, and on average it ranked in the top three for the medical case categorization. The text expression-based method used in EFG was also the most verbose and expressive when compared to the ML methods. This indicates that EFG is a viable method for text categorization and may serve as an alternative approach to such a task.
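The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the balanced \(F\)-measure (\(F_1\), the harmonic mean of precision and recall) computed per category; the function name `precision_recall_f1` and the toy labels are hypothetical illustrations and are not taken from the paper's data or results.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Per-category precision, recall, and balanced F-measure (F1).

    Assumption: F-measure here means F1; the paper may weight it differently.
    """
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: documents of this category that were assigned to it.
    tp = sum(1 for t, p in pairs if t == category and p == category)
    # False positives: documents of other categories assigned to it.
    fp = sum(1 for t, p in pairs if t != category and p == category)
    # False negatives: documents of this category assigned elsewhere.
    fn = sum(1 for t, p in pairs if t == category and p != category)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels for illustration only (not from the incident dataset).
true_labels = ["bombing", "bombing", "armed attack", "bombing", "armed attack"]
predicted = ["bombing", "armed attack", "armed attack", "bombing", "bombing"]

p, r, f = precision_recall_f1(true_labels, predicted, "bombing")
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

A method can therefore score high on precision for one category while ranking low overall, which is the pattern the abstract reports for EFG on the bombing category.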

Notes

  1. Available at http://wits-classic.nctc.gov/. Last accessed on 25 March 2008.

Acknowledgments

This project is part of a research grant funded under the Universiti Putra Malaysia Research University Grant Scheme.

Author information

Corresponding author

Correspondence to Nurfadhlina Mohd Sharef.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Sharef, N.M., Martin, T., Kasmiran, K.A. et al. A comparative study of evolving fuzzy grammar and machine learning techniques for text categorization. Soft Comput 19, 1701–1714 (2015). https://doi.org/10.1007/s00500-014-1358-x

  • DOI: https://doi.org/10.1007/s00500-014-1358-x
