Abstract
Many text categorization methods have been studied, and most are inspired by the statistical distribution of features in texts, as in Machine Learning (ML) methods. However, no available work compares the performance of ML-based methods against a text expression-based method, especially for incident and medical case categorization. These two domains are becoming increasingly important because of growing interest in automation for security intelligence and health services. This paper presents a text expression-based method called Evolving Fuzzy Grammar (EFG) and evaluates its performance against the conventional ML methods of Naïve Bayes, support vector machine, \(k\)-nearest neighbor, adaptive boosting, and decision tree. The incident dataset is a real dataset taken from the World Incidents Tracking System, while ImageCLEF 2009 was the source of the radiology case reports. Evaluated with standard measures (recall, precision, and \(F\)-measure), the results showed that each method has different strengths and weaknesses in the two categorization tasks. In both domains, the SMO (support vector machine) and IBk (\(k\)-nearest neighbor) methods performed best, while AdaBoost performed worst. The medical dataset was also easier to categorize than the incident dataset. Although EFG was ranked second lowest, it obtained the highest precision in bombing categorization and the highest recall in armed attack categorization, and on average it ranked in the top three for medical case categorization. The text expression-based method used in EFG was also the most verbose and expressive of the methods compared. This indicates that EFG is a viable method for text categorization and may serve as an alternative approach to the task.
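The evaluation measures named in the abstract (recall, precision, and \(F\)-measure) can be illustrated with a minimal sketch. It assumes single-label categorization and the balanced \(F_1\) form of the \(F\)-measure; the function name `per_class_scores` and the toy labels are hypothetical and are not taken from the paper's data or results.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct: true positive for the gold label
        else:
            fp[p] += 1      # predicted label gains a false positive
            fn[g] += 1      # gold label gains a false negative

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical toy labels, for illustration only.
gold = ["bombing", "bombing", "armed_attack", "armed_attack", "bombing"]
predicted = ["bombing", "armed_attack", "armed_attack", "armed_attack", "bombing"]
scores = per_class_scores(gold, predicted)

for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Reporting the scores per category, rather than as a single accuracy figure, is what allows a method to rank low overall while still leading on precision or recall for one category.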
Notes
Available at http://wits-classic.nctc.gov/. Last accessed on 25 March 2008.
Acknowledgments
This project is part of the work carried out under a research grant funded by the University of Putra Malaysia Research University Grant Scheme.
Additional information
Communicated by V. Loia.
Cite this article
Sharef, N.M., Martin, T., Kasmiran, K.A. et al. A comparative study of evolving fuzzy grammar and machine learning techniques for text categorization. Soft Comput 19, 1701–1714 (2015). https://doi.org/10.1007/s00500-014-1358-x
DOI: https://doi.org/10.1007/s00500-014-1358-x