Abstract
This paper describes the Persianp Toolbox, an integrated Persian text processing system that can easily be used in other software applications. The toolbox provides the fundamental Persian text processing steps through several modules. Some modules, namely the normalizer, tokenizer, sentencizer, stop word detector, and part-of-speech (POS) tagger, build on previous studies. The others, the Persian lemmatizer and the NP chunker, introduce new ideas for preparing the required training data and/or apply new techniques. Experimental results show strong performance in each part: the accuracies of the tokenizer, the POS tagger, the lemmatizer, and the NP chunker are 97%, 95.6%, 97%, and 97.2%, respectively.
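The abstract names the processing stages but not the toolbox's interface. As a rough illustration only, the first four stages (normalizer, sentencizer, tokenizer, stop word detector) could be chained as in the sketch below. Every function name, the two-entry character map, and the five-word stop list are hypothetical choices made for this example; none of it is Persianp's actual API or method.

```python
import re

# Assumption: map two Arabic code points that often appear in Persian text
# (Arabic yeh, Arabic kaf) to their Persian counterparts.
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Assumption: a tiny illustrative stop word list, not the toolbox's list.
STOP_WORDS = {"و", "در", "به", "از", "که"}


def normalize(text):
    """Unify character variants and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(CHAR_MAP)).strip()


def split_sentences(text):
    """Split on whitespace that follows '.', '!', '?' or the Persian '؟'."""
    return [s for s in re.split(r"(?<=[.!?\u061F])\s+", text) if s]


def tokenize(sentence):
    """Return word runs (keeping zero-width non-joiners inside words)
    and individual punctuation marks as separate tokens."""
    return re.findall(r"[\w\u200C]+|[^\w\s\u200C]", sentence)


def process(text):
    """Run the stages in order; each token is paired with a stop word flag."""
    sentences = split_sentences(normalize(text))
    return [[(tok, tok in STOP_WORDS) for tok in tokenize(s)] for s in sentences]


if __name__ == "__main__":
    for sentence in process("كتاب در خانه است. او رفت؟"):
        print(sentence)
```

Returning a flag per token, rather than dropping stop words, keeps the token sequence intact for later stages such as POS tagging and chunking, which typically need the full sentence.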
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Mohseni, M., Ghofrani, J., Faili, H. (2018). Persianp: A Persian Text Processing Toolbox. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_4
DOI: https://doi.org/10.1007/978-3-319-75477-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2
eBook Packages: Computer Science, Computer Science (R0)