Morphological Disambiguation of Classical Sanskrit

Hellwig, Oliver

doi:10.1007/978-3-319-23980-4_3

Conference paper
First Online: 09 December 2015

Part of the book series: Communications in Computer and Information Science (CCIS, volume 537)

Abstract

Sanskrit, the “sacred language” of Ancient India, is a morphologically rich Indo-Iranian language that has received some attention in NLP during the last decade. This paper describes a system for the tokenization and morphosyntactic analysis of Sanskrit. The system combines a fixed morphological rule base with statistical selection of the most probable analysis of an input text. After an introduction to the research history and to the linguistic peculiarities of Sanskrit that are relevant to the task, the paper describes the present architecture of the system and new extensions that increase its accuracy when analyzing morphologically ambiguous forms. The algorithms are tested on a gold-annotated data set of 3,587,000 words.

Notes

1.
Bloch [2] gives an introduction to the linguistic history of Sanskrit. More details about the Vedic layer are found in Witzel [34].
2.
In a recent research project, the Aṣṭādhyāyī has been fully annotated on the morphological, lexical and word-semantic levels to make it more easily accessible to Western researchers without knowledge of Sanskrit [25]. A web platform that gives access to this database is available at http://panini.phil-fak.uni-duesseldorf.de/panini/.
3.
The rules of the Aṣṭādhyāyī are not given in the order in which they need to be applied for generating a valid Sanskrit word. Instead, it is generally assumed that their order minimizes the resulting rule base. The Indian grammar uses the concept of anuvṛtti (“following”) rules for regulating the order in which rules and their elements are applied. These rules are not part of the text of the Aṣṭādhyāyī, but are recorded – and heavily discussed – in the commentary literature; refer to [4, 187ff.] for details about rule order in the Aṣṭādhyāyī, and to [26] for the proof of minimality in a subset of Pāṇinian rules.
4.
Refer to Subsect. 3.3 for the phonological phenomenon of Sandhi.
5.
The GRETIL web repository (http://gretil.sub.uni-goettingen.de/) contains fewer than 20 million strings. Several of the texts are not usable for automatic processing due to excessive formatting by their editors, as described in Sect. 3.6.
6.
The following abbreviations are used in this paper: Nom.: nominative; Acc.: accusative; Ins.: instrumental; Dat.: dative; Gen.: genitive; Loc.: locative; Voc.: vocative; Co.: compound; Sg.: singular; Du.: dual; Pl.: plural; Msc.: masculine; Fem.: feminine; Neu.: neuter; Ind.: indeclinable; Pres.: present; Impf.: imperfect; Perf.: perfect tense; Proh.: prohibitive (a kind of imperative that is only used in negated phrases); PastPart.: past participle, frequently with a passive sense; PresPart.: present participle.
Ambiguities in a morphological analysis are expressed by a regex-style notation, with | denoting the operator OR and round brackets enclosing a set of options. For example, (Nom.|Acc.|Voc.) Pl. Neu. means that a form is a neuter plural in the nominative, accusative, or vocative.
The plus operator + is used to separate elements of compounds, the ampersand sign & to indicate Sandhi at word boundaries (Sect. 3.3).
Further abbreviations: tri: trigram based model for morphological disambiguation; crf: Conditional Random Fields; me: Maximum Entropy.
7.
Note that the word bahuvrīhi is itself an example of a bahuvrīhi compound. In its “default interpretation” as a so-called tatpuruṣa (“his man”, an instance of relational compounding) compound, it simply means “much rice.”
8.
From a purely grammatical point of view, the sentence can also be translated as “... destroyed by these bad actions.” Numerous occurrences of the bahuvrīhi solution with unambiguous case endings (e.g., in Nom. Pl. Msc.) make the proposed interpretation much more plausible.
9.
Though slightly outdated, the grammar of Stenzler still provides a good introduction to Sanskrit Sandhi rules [33, 3ff.].
10.
Refer to [24, 1ff.] for a detailed linguistic description with several examples. Brockington locates the epics, especially the Mahābhārata, in a continuum “of dialects and language registers from classical or Pāṇinian Sanskrit at one end to colloquial MIA [Middle Indo-Aryan] at the other” [3, 83] and makes this linguistic situation responsible for the irregular application of Sandhi in epic texts.
11.
Emeneau describes the basic parameters of the interaction between Indo-Iranian and Dravidian languages [5]. A quantitative overview of the major influences that is based on Mayrhofer’s etymological dictionary [20] is given in [9].
12.
A quantitative evaluation of the reuse of Pāṇinian vocabulary is presented in [11].
13.
A member of a low caste.
14.
http://opencyc.org/.
15.
As these data have been checked by only one annotator and have not been adjudicated, they should rather be called semi-gold annotations.
16.
The Mahābhārata and the Rāmāyaṇa are the two central epic texts written in Sanskrit. The term Purāṇa (“old (story)”) denotes a group of works dealing with virtually everything; refer to Rocher for an introduction [28].
17.
The TTRs found in the third column of Table 1 are obtained by calculating the TTRs for each text, and then averaging these values over the topic levels. Because text lengths have not been used as normalizing factors, the TTRs of underrepresented topic levels such as śruti or Buddhist literature are most probably too high.
18.
The one-solution case predicts the correct morphological category in about 99.8% of all cases. The errors are caused by irregular word forms.
19.
The parameter 3 for the window size was chosen after comparing disambiguation results for window sizes between 1 and 7. Window sizes above 3 did not consistently increase the accuracy, but required higher training times.
20.
The final Sandhi ṃ has been transformed into the pausa form m.
21.
Used in the Java implementation of the OpenNLP package; settings: smoothing factor: 0.001, 100 iterations.
22.
Used in the C++ implementation from http://www.chokkan.org/software/crfsuite/; settings: L2 regularization: 2.0, one-dimensional architecture.
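
The regex-style ambiguity notation defined in Note 6 can be unfolded mechanically into the individual readings it stands for. The following is a minimal sketch under stated assumptions: the function name `expand_readings` is hypothetical, the code only handles non-nested bracket groups, and it is not taken from the system described in the paper.

```python
import itertools
import re


def expand_readings(notation: str) -> list[str]:
    """Expand every (A|B|C) group into the list of unambiguous readings.

    Assumption: bracket groups are not nested.
    """
    # Split into literal text and bracketed option groups; the capture
    # group makes re.split keep the bracketed parts in the result.
    parts = re.split(r"(\([^()]*\))", notation)
    choices = []
    for part in parts:
        if part.startswith("(") and part.endswith(")"):
            # An option group: one alternative per |-separated item.
            choices.append(part[1:-1].split("|"))
        else:
            # Literal text: exactly one "alternative".
            choices.append([part])
    # The Cartesian product enumerates one reading per combination.
    return ["".join(combo) for combo in itertools.product(*choices)]


for reading in expand_readings("(Nom.|Acc.|Voc.) Pl. Neu."):
    print(reading)
```

A tag without brackets is returned unchanged as a single reading, so the same function can be applied to ambiguous and unambiguous analyses alike.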

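The averaging procedure described in Note 17 (one type-token ratio per text, then an unweighted mean per topic level) can be sketched as follows. The function names and the toy data are hypothetical, and the sketch assumes that each text is already available as a list of tokens; how types are counted in the paper is not stated here.

```python
from collections import defaultdict
from statistics import mean


def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


def ttr_by_topic(texts):
    """Average per-text TTRs within each topic level.

    `texts` is an iterable of (topic, tokens) pairs. Text length is
    deliberately NOT used as a weight, mirroring Note 17.
    """
    per_topic = defaultdict(list)
    for topic, tokens in texts:
        if tokens:  # skip empty texts to avoid division by zero
            per_topic[topic].append(type_token_ratio(tokens))
    return {topic: mean(values) for topic, values in per_topic.items()}


# Toy data (hypothetical): two texts in one topic level, one in another.
sample = [
    ("epic", ["a", "b", "a", "c"]),
    ("epic", ["a", "a"]),
    ("śruti", ["x", "y"]),
]
result = ttr_by_topic(sample)
print(result)
```

Because short texts tend to have high TTRs and the mean is unweighted, a topic level represented by a few short texts ends up with an inflated value, which is the effect Note 17 warns about.
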
References

[1] Adler, M., Elhadad, M.: An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In: Proceedings of the 21st International Conference on Computational Linguistics, pp. 665–672 (2006)
[2] Bloch, J.: Indo-Aryan from the Vedas to Modern Times. Librairie d’Amérique et d’Orient, Paris (1965)
[3] Brockington, J.: The Sanskrit Epics. Brill, Leiden (1998)
[4] Cardona, G.: Pāṇini. A Survey of Research. Mouton, The Hague - Paris (1976)
[5] Emeneau, M.: Dravidian and Indo-Aryan: the Indian linguistic area. In: Emeneau, M.B. (ed.) Language and Linguistic Area, pp. 167–196. Stanford University Press, Stanford (1980)
[6] Gillon, B.S.: Review of “Natural Language Processing: A Paninian Perspective” by A. Bharati, V. Chaitanya, and R. Sangal. Prentice-Hall of India 1995. Computational Linguistics 21(3), 419–421 (1995)
[7] Gillon, B.S.: Word order in classical Sanskrit. Indian Linguist. 57(1–4), 1–35 (1996)
[8] Hellwig, O.: SanskritTagger: a stochastic lexical and POS tagger for Sanskrit. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) Sanskrit CL 2007/2008. LNCS, vol. 5402, pp. 266–277. Springer, Heidelberg (2009)
[9] Hellwig, O.: Etymological trends in the Sanskrit vocabulary. Literary Linguist. Comput. 25(1), 105–118 (2010)
[10] Hellwig, O.: Performance of a lexical and POS tagger for Sanskrit. In: Jha, G.N. (ed.) SCL. LNCS, vol. 6465, pp. 162–172. Springer, Heidelberg (2010)
[11] Hellwig, O., Petersen, W.: What’s Pāṇini got to do with it? The use of gaṇa-headers from the Aṣṭādhyāyī in Sanskrit literature from the perspective of corpus linguistics. In: Proceedings of the WCS 2015 (forthcoming)
[12] Huet, G.: A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. J. Funct. Program. 15(04), 573–614 (2005)
[13] Kiparsky, P.: On the architecture of Pāṇini’s grammar. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) SCL. Lecture Notes in Computer Science, vol. 5402, pp. 33–94. Springer, Heidelberg (2009)
[14] Knauth, J., Alfter, D.: A dictionary data processing environment and its application in algorithmic processing of Pali dictionary data for future NLP tasks. In: Proceedings of the 5th Workshop on South and Southeast Asian NLP, pp. 65–73 (2014)
[15] Kneser, R., Ney, H.: Improved backing-off for m-gram language modeling. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 181–184 (1995)
[16] Kulkarni, A., Shukla, D.: Sanskrit morphological analyser: some issues. Indian Linguist. 70(1–4), 169–177 (2009)
[17] Kulkarni, M.: Phonological overgeneration in Paninian system. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) SCL. LNCS, vol. 5402, pp. 306–319. Springer, Heidelberg (2009)
[18] Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
[19] Lee, J., Naradowsky, J., Smith, D.A.: A discriminative model for joint morphological disambiguation and dependency parsing. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 885–894 (2011)
[20] Mayrhofer, M.: Kurzgefaßtes etymologisches Wörterbuch des Altindischen. Carl Winter Universitätsverlag, Heidelberg (1982)
[21] Mishra, A.: Simulating the Pāṇinian system of Sanskrit grammar. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) SCL. LNCS, vol. 5402. Springer, Heidelberg (2009)
[22] Mittal, V.: Automatic Sanskrit segmentizer using finite state transducers. In: Proceedings of the ACL 2010 Student Research Workshop, pp. 85–90. Association for Computational Linguistics, Stroudsburg (2010)
[23] Monier-Williams, M.: Sanskrit-English Dictionary, 3rd edn. Munshiram Manoharlal Publishers Pvt. Ltd., New Delhi (1988)
[24] Oberlies, T.: A Grammar of Epic Sanskrit. De Gruyter (2003)
[25] Petersen, W., Soubusta, S.: Structure and implementation of a digital edition of the Aṣṭādhyāyī. In: Kulkarni, M. (ed.) Recent Researches in Sanskrit Computational Linguistics, pp. 84–103. D.K. Printworld, New Delhi (2013)
[26] Petersen, W.: Zur Minimalität von Pāṇinis Śivasūtras: eine Untersuchung mit Methoden der formalen Begriffsanalyse. Ph.D. thesis, Universität Düsseldorf (2008)
[27] Ratnaparkhi, A.: Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania (1998)
[28] Rocher, L.: The Purāṇas, A History of Indian Literature, vol. II, Fasc. 3. Otto Harrassowitz, Wiesbaden (1986)
[29] Scharfe, H.: Grammatical Literature. A History of Indian Literature, vol. 5, Fasc. 2. Otto Harrassowitz, Wiesbaden (1977)
[30] Shacham, D., Wintner, S.: Morphological disambiguation of Hebrew: a case study in classifier combination. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 439–447. Association for Computational Linguistics, Prague (2007)
[31] Shukla, P., Kulkarni, A., Shukl, D.: Geeta: Gold standard annotated data, analysis and its application. In: Proceedings of ICON (2013)
[32] Staal, J.: Word Order in Sanskrit and Universal Grammar. Foundations of Language, Supplementary Series, vol. 5. D. Reidel Publishing Company, Dordrecht (1967)
[33] Stenzler, A.F.: Elementarbuch der Sanskrit-Sprache. Max Mälzer, Breslau (1872)
[34] Witzel, M.: Early Indian history: linguistic and textual parametres. In: Erdosy, G. (ed.) The Indo-Aryans of Ancient South Asia. Language, Material Culture and Ethnicity, vol. 1, pp. 85–125. Walter de Gruyter, Berlin (1995)
[35] Yuret, D., Türe, F.: Learning morphological disambiguation rules for Turkish. In: Proceedings of HLT-NAACL (2006)

Author information

Authors and Affiliations

University of Düsseldorf, Düsseldorf, Germany
Oliver Hellwig

Corresponding author

Correspondence to Oliver Hellwig.

Editor information

Editors and Affiliations

Institut für Deutsche Sprache, Mannheim, Germany
Cerstin Mahlow
Leibniz Institute of European History, Mainz, Germany
Michael Piotrowski

Cite this paper

Hellwig, O. (2015). Morphological Disambiguation of Classical Sanskrit. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2015. Communications in Computer and Information Science, vol 537. Springer, Cham. https://doi.org/10.1007/978-3-319-23980-4_3

DOI: https://doi.org/10.1007/978-3-319-23980-4_3
Published: 09 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23978-1
Online ISBN: 978-3-319-23980-4
eBook Packages: Computer Science, Computer Science (R0)