Abstract
Text classification is the automated assignment of predefined labels or categories to textual data. A review of the existing literature on Arabic text classification (ATC) shows that most research concentrates on methodologies and approaches, with no thorough evaluation of ATC as a whole. This systematic review therefore aims to offer a comprehensive understanding of the state of the art in ATC, illuminate the present challenges, and discuss prominent trends in large-scale research. From a collection of 2875 studies, 60 satisfied the eligibility criteria and were rigorously analyzed. The selected studies were divided into three categories: topic areas, tasks/applications, and ATC phases. The topic areas were classified into six primary sectors: healthcare; legal; security and cybersecurity; history, culture, and religion; social media; and agriculture. The ATC tasks/applications were classified into nine groups: gender identification, author identification, disease detection, threat and spam detection, dialect identification, hierarchical categorization, news article classification, web page clustering, and question classification. The ATC phases were organized into five categories: corpus creation, preprocessing (stemming and tokenization), feature selection, feature extraction, and classifiers/approaches. The review emphasizes the solutions proposed in each ATC study and offers insights for future research. It also underscores the potential applications of ATC in addressing current challenges across various industries and highlights the significance of developing a benchmark dataset for ATC to facilitate model comparison. The review concludes by proposing areas where further research is required, such as addressing the unbalanced dataset issue, enhancing the preprocessing phase, and exploring the role of human factors in utilizing ATC systems.
Data availability
The data presented in this study are available on request from the authors.
Funding
No funding was received for this study.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Wahdan, A., Al-Emran, M. & Shaalan, K. A systematic review of Arabic text classification: areas, applications, and future directions. Soft Comput 28, 1545–1566 (2024). https://doi.org/10.1007/s00500-023-08384-6
DOI: https://doi.org/10.1007/s00500-023-08384-6