Maximum entropy modeling for diacritization of Arabic text

Sarikaya, Ruhi; Emam, Ossama; Zitouni, Imed; Gao, Yuqing

doi:10.21437/Interspeech.2006-37

We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling, where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is defined exactly by 13 diacritics and a null diacritic, and covers all the diacritics used in any Arabic text. The state emission probabilities are estimated using maximum entropy (MaxEnt) models. The diacritization process is formulated as a search problem in which the most likely diacritization is assigned to a given sentence. We also propose a diacritization parse tree (DPT) for Arabic that allows joint representation of diacritics, graphemes, words, word contexts, morphologically analyzed units, syntactic and semantic parse trees, part-of-speech tags, and possibly other information sources. The features used to train the MaxEnt models are obtained from the DPT. In our evaluation, the proposed framework obtained a 7.8% diacritization error rate (DER) and a 17.3% word diacritization error rate (WDER) on dialectal Arabic data.
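
The search formulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything concrete in it is an assumption made for illustration: the three-label set `LABELS` (the paper uses 13 diacritics plus a null diacritic), the two feature templates in `features` (the paper derives its features from the DPT), the hand-set `weights`, the conditioning on the previous diacritic, and the use of Viterbi search as the decoding algorithm.

```python
import math

# Hypothetical label set: "" stands for the null diacritic. The paper's
# space has 13 diacritics plus the null diacritic; three are used here.
LABELS = ["", "a", "i"]


def features(graphemes, i, prev_label, label):
    """Hypothetical binary feature names for one grapheme position."""
    return [
        f"cur={graphemes[i]}|y={label}",
        f"prev_y={prev_label}|y={label}",
    ]


def maxent_probs(weights, graphemes, i, prev_label):
    """MaxEnt distribution: normalized exponential of summed feature weights."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in features(graphemes, i, prev_label, y))
        for y in LABELS
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def decode(weights, graphemes):
    """Viterbi search for the most likely diacritic sequence (assumed decoder)."""
    if not graphemes:
        return []
    start = maxent_probs(weights, graphemes, 0, "<s>")
    # best[y] = (log-probability of the best path ending in y, that path)
    best = {y: (math.log(start[y]), [y]) for y in LABELS}
    for i in range(1, len(graphemes)):
        new_best = {}
        for prev, (logp, path) in best.items():
            probs = maxent_probs(weights, graphemes, i, prev)
            for y in LABELS:
                cand = logp + math.log(probs[y])
                if y not in new_best or cand > new_best[y][0]:
                    new_best[y] = (cand, path + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hand-set weights and Latin placeholder graphemes, purely illustrative.
weights = {"cur=k|y=a": 2.0, "cur=t|y=i": 2.0, "cur=b|y=": 2.0}
result = decode(weights, ["k", "t", "b"])
```

In this toy setting each grapheme has one strongly weighted label, so the decoded sequence simply follows those weights. In a trained model the weights would be estimated from data, and features involving the previous diacritic would let neighboring decisions influence one another.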

Cite as: Sarikaya, R., Emam, O., Zitouni, I., Gao, Y. (2006) Maximum entropy modeling for diacritization of Arabic text. Proc. Interspeech 2006, paper 1418-Mon1BuP.11, doi: 10.21437/Interspeech.2006-37

@inproceedings{sarikaya06_interspeech,
  author={Ruhi Sarikaya and Ossama Emam and Imed Zitouni and Yuqing Gao},
  title={{Maximum entropy modeling for diacritization of Arabic text}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1418-Mon1BuP.11},
  doi={10.21437/Interspeech.2006-37}
}