Persianp: A Persian Text Processing Toolbox

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9623)

Abstract

This paper describes the Persianp Toolbox, an integrated Persian text processing system that can easily be used in other software applications. The toolbox comprises several modules that provide the fundamental steps of Persian text processing. Some modules, namely the normalizer, tokenizer, sentencizer, stop word detector, and part-of-speech (POS) tagger, build on previous studies. For the others, i.e. the Persian lemmatizer and the NP chunker, new ideas for preparing the required training data and/or new techniques are presented. Experimental results show strong performance in each part: the accuracies of the tokenizer, the POS tagger, the lemmatizer, and the NP chunker are 97%, 95.6%, 97%, and 97.2%, respectively.
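
The modular design described in the abstract can be illustrated with a minimal Python sketch of how a normalizer, sentencizer, tokenizer, and stop word detector might be chained. This is not the Persianp API: every function name, the character mappings, the regular expressions, and the tiny stop list are assumptions made for illustration only. The lemmatizer, POS tagger, and NP chunker are left out because they depend on trained models and data that this page does not describe.

```python
import re

# Hypothetical illustration only; none of these names come from Persianp.
# Map Arabic yeh and kaf to the Persian code points.
ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Tiny illustrative stop list (assumed, not the toolbox's list).
STOP_WORDS = {"و", "در", "به", "از", "که"}


def normalize(text: str) -> str:
    """Unify Arabic/Persian letter variants and collapse runs of spaces."""
    text = text.translate(ARABIC_TO_PERSIAN)
    return re.sub(r"[ \t]+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!', '?' or the Arabic question mark (U+061F)."""
    parts = re.split(r"(?<=[.!?\u061F])\s+", text)
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Separate words from punctuation; ZWNJ (U+200C) stays inside a word."""
    return re.findall(r"[\w\u200C]+|[^\w\s]", sentence)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def process(text: str) -> list[list[str]]:
    """Run the modules in sequence: one list of tokens per sentence."""
    return [
        remove_stop_words(tokenize(sentence))
        for sentence in split_sentences(normalize(text))
    ]
```

Keeping each step a plain function of text or token lists means a calling application can use a single module or the whole chain, which is one simple way to realize the "easily used in other software applications" goal stated above.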

Corresponding author

Correspondence to Mahdi Mohseni.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Mohseni, M., Ghofrani, J., Faili, H. (2018). Persianp: A Persian Text Processing Toolbox. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_4

  • DOI: https://doi.org/10.1007/978-3-319-75477-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75476-5

  • Online ISBN: 978-3-319-75477-2

  • eBook Packages: Computer Science (R0)
