Some Challenges of the West Circassian Polysynthetic Corpus
21 Pages Posted: 30 Dec 2015
Date Written: December 29, 2015
Abstract
Although comprehensive morphologically annotated corpora exist for many morphologically rich languages, no such corpus has so far been available for any polysynthetic language. Polysynthetic languages raise a variety of theoretical and practical challenges for corpus linguistics. Some of these challenges have been partly addressed in the development of corpora for, e.g., Turkic or Uralic languages, while others are unique to languages of this kind. Our paper identifies the most prominent challenges that we face in developing a West Circassian (Adyghe) corpus and offers possible solutions. These challenges include the tokenization problem, which involves delimiting morphology from syntax, problems with lemmatization and part-of-speech tagging, and a number of glossing and search problems.
Keywords: language corpora, polysynthesis, West Circassian
JEL Classification: Z