Tocharian and Historical Sociolinguistics: Evidence from a Fragmentary Corpus

Abstract
The two extinct languages Tocharian A and Tocharian B comprise a small text corpus that consists almost exclusively of fragments. Nevertheless, the corpus shows linguistic variation. This paper will argue that there are three reasons for genuine linguistic variation in the Tocharian corpus: diachronic variation, dialectal variation and sociolectal variation. Accordingly, applying sociolinguistic methods to a fragmentary corpus is not only possible but essential. Furthermore, the Tocharian material confirms patterns of sound change known from other sources and can therefore offer insights into the principles of language change.


Introduction
Tocharian is a small branch of the Indo-European language family, consisting of only two languages commonly called Tocharian A and Tocharian B. Both languages are extinct today. Evidence for Tocharian comes exclusively from original manuscripts (Chinese paper, wooden tablets) and mural inscriptions mainly written in a specially adapted variety of the Indic Brahmi abugida. All texts were found in the Tarim Basin in Chinese Turkestan, today's Xinjiang Uyghur Autonomous Region. The earliest manuscripts can be dated to the late 4th or early 5th century CE (Malzahn 2007: 277); Tocharian document writing stopped sometime after the turn of the millennium. Almost all literary documents are Buddhist in nature, but there are also some profane writings such as caravan travel passes, accounts and commercial letters. Most manuscripts were discovered or purchased during the Turfan expeditions of the 19th and early 20th century, either randomly or as part of (often unsystematic) excavations. Most documents are the remnants of what must once have been large Buddhist libraries consisting of hundreds of books with several hundred pages. However, not a single complete Tocharian book has been preserved; even complete pages or complete letters are rare, and the average Tocharian text document is a fragment of a page or letter. The number of known Tocharian fragments amounts to approximately 10,000 pieces, of which only approximately 2,000 are large enough to be of greater philological relevance. The Tocharian A corpus is the smaller of the two, consisting of 1,150 fragments. Today about 75% of the excavated Tocharian language material is accessible, approximately 25% is edited, but only approximately 10% is translated.
Most Tocharian language material can be easily accessed online at the CEToM project hosted by the University of Vienna in cooperation with the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW), the International Dunhuang Project at the British Library (IDP) and the Thesaurus Indogermanischer Text- und Sprachmaterialien (TITUS).
The fragmentary state of the corpus makes philological work difficult; the first philological studies and editions therefore neither paid much attention to linguistic variation nor allowed for scribal or copying errors. Today, however, it is certain that variation found in the Tocharian B (TB) corpus mirrors linguistic variation due to various chronological stages, geographical variation and sociolects. It can further be shown that sociolinguistic variation in the TB corpus conforms to general principles of socio-dialectal variation on the one hand and gives insight into the use of literacy in the Buddhist monasteries on the other hand.

Method
Many Tocharian fragments can be localized because their find spots were recorded by the excavators. Systematic correspondence of linguistic variants and certain find spots points to the existence of three dialectal regions inside Tocharian B: a Western dialect around the former political center Kuča, a Central dialect around the site Šorčuq and an Eastern dialect in the geographically remote Turfan region, as first indicated by Winter (1955, 1958). Tocharian is written in a variety of the Indic Brahmi abugida because the Buddhist religious centers in the Tarim Basin imported manuscripts from India and Gandhara, which were then locally copied on Chinese paper. Subsequently, speakers of Tocharian used the same script type to write down their own vernacular, and Sanskrit manuscripts are regularly found beside Tocharian ones at the same find spots. Accordingly, the development of the Tocharian script type can be dated in comparison with Sanskrit texts (Sander 1968). It can further be shown that there is a systematic correspondence of certain linguistic variants with script types, i.e. that there is a diachronic development of the language. Based on the characteristics of the script type and on linguistic features, three main chronological stages of Tocharian B can be distinguished (see, e.g., Malzahn 2007, Peyrot 2008, Tamai 2011, Sander 2013): Archaic TB (5th century), Classical TB (6th-7th centuries), and Late TB (after the 8th century); all Tocharian A documents are written in the late script type.
Thirdly, one can compare linguistic features and text genres. It can be shown that business documents such as letters and graffiti show linguistic features rarely found in literary texts, i.e. that there is a colloquial TB language (as per Schmidt 1986).

Discussion and results
It has been discussed for some time how the chronological, dialectal and sociolectal variation in the TB corpus can be systematized and which features can be attributed to which development. The question has been complicated by the fact that colloquial texts are attested only from the Late TB period and mostly in the Western region. However, colloquial linguistic features found in these texts are also found in literary texts, mostly in those from the Eastern region of TB; only very rarely can colloquial variants be seen in literary texts from the other two dialect areas. In his study of evidence of linguistic variation in the TB corpus, Peyrot (2008: 15-21) argued that there is indeed a chronological development of the TB language (i.e., an archaic, classical and late stage of TB), that the classical language is a type of artificial literary language taught by the literate elite and that the colloquial texts in the late period show a progressive development of the spoken language. Literacy as such spread from the Western region to the geographically remote Turfan region only after the classical literary language had developed there (see also Malzahn 2007: 289-29). Therefore, we do not have "dialectal" variation in the usual sense, but rather sociolectal variation that has broader acceptance in the Eastern region and is less accepted in the Western region.
In the following study, I will discuss an example that will show that the phonological development of the Late/Colloquial TB variety conforms to general principles of phonological linguistic variation and that the sound changes involved can be explained by the concept of register borrowing or as a lexical diffusion phenomenon (as per Labov 1994).
The types of sound changes outlined above can be characterized as progressive or so-called minor sound change phenomena that are often connected with sociolinguistic variation ('dialect or register borrowing'), as per Labov (1994: esp. 444-71). Furthermore, these progressive sound change phenomena are often connected with certain word classes, which means that they tend to occur earlier in non-lexical words than in lexical words (Dressler 1980: 117-118); finally, they are connected to frequency (Bybee 2007; Phillips 2006). As a case study, I am going to concentrate on the raising of the TB low central vowel /ǝ/ to /i/ in palatal environment. Stumpf (1990: 138) noted that this sound change is attested earlier and more broadly with the pronoun ñǝś / ñiś 'I, me' than with other examples. This would confirm that non-lexical and very frequent forms do indeed show a minor sound change earlier than lexical words and less frequently attested word classes do.
The following survey illustrates the distribution of the TB pronoun variants ñǝś / ñiś and a comparable test sample of lexical words in the TB corpus. The data is taken from the CEToM database (retrieved 21.12.2016). Figure 1 shows the distribution of the TB pronoun variants ñǝś 'I, me' and ñiś in archaic, classical and late texts, Figure 2 the attestation of a comparable sample of lexical words that show the same sound change. These latter examples are forms of the TB noun pudñǝkte and its variant pañǝkte 'Buddha', the adjective/adjectival compound °ñǝktǝññe / °ñǝktiññe 'divine', and inflected forms of the noun yakne 'way, manner' (yikne-/ yǝkne-). The precise attestations can be found in the data section below. The figures show that the distribution of raised and non-raised forms of the pronoun on the one hand and of other forms on the other hand is uneven. The raised variant ñiś of the pronoun is already much more common in classical texts than the non-raised variant ñǝś; there is one attestation of ñiś in THT 3597 b 3, a text that is otherwise archaic. Figures 3 and 4 show the same data according to regional distribution instead of chronological stage. These figures confirm that the sound law is not simply a dialect phenomenon.
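The kind of tabulation that underlies such a survey can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure actually used for the figures: the function names (tabulate, raised_share) are hypothetical, the input format of (stage, form) pairs is an assumption, and the sample records are invented placeholders, not counts from the CEToM database.

```python
import unicodedata
from collections import defaultdict


def nfc(form):
    """Normalize a transliterated form so that composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", form)


# Pronoun variants discussed in the text.
RAISED = {nfc("ñiś")}
NON_RAISED = {nfc("ñǝś")}


def tabulate(attestations):
    """Count raised and non-raised pronoun variants per chronological stage."""
    table = defaultdict(lambda: {"raised": 0, "non_raised": 0})
    for stage, form in attestations:
        form = nfc(form)
        if form in RAISED:
            table[stage]["raised"] += 1
        elif form in NON_RAISED:
            table[stage]["non_raised"] += 1
        # Any other form is ignored in this sketch.
    return dict(table)


def raised_share(counts):
    """Proportion of raised variants, or None if the stage has no attestations."""
    total = counts["raised"] + counts["non_raised"]
    return counts["raised"] / total if total else None


# Invented placeholder records for illustration only (NOT real corpus data).
sample = [
    ("archaic", "ñǝś"), ("archaic", "ñǝś"), ("archaic", "ñiś"),
    ("classical", "ñiś"), ("classical", "ñǝś"), ("classical", "ñiś"),
    ("late", "ñiś"), ("late", "ñiś"),
]

table = tabulate(sample)
for stage in ("archaic", "classical", "late"):
    print(stage, table[stage], raised_share(table[stage]))
```

Grouping by find region instead of chronological stage would only require passing (region, form) pairs; comparing the pronoun with the lexical sample would require a second pair of variant sets.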
The attestation of raised ñiś in THT 3597 is diagnostic. The language of the text is still archaic (see the discussion in Peyrot 2010: 161-166), but the ductus lacks the most diagnostic shibboleth features of the archaic script type (Peyrot 2010: 144-145 and p. 161), and Peyrot therefore classifies the script as "early standard". Note that there is a parallel text in THT 239, which is written in classical script and has standard classical language. The same line in THT 239 a 7 shows, as expected, the non-raised variant ñǝś. Peyrot (2010: 163) correctly states that the attestation of ñiś in THT 3597 b 3 "proves that the text was copied at a time when the later variant ñiś has already come about". Peyrot (2010: 164) further notes that THT 3597 has another possible attestation of a colloquial-language feature in line b4 (mokauśka 'she-monkey' for expected mokoṃśka). Since there is also a clear misspelling in line b6 (śekasta; see Peyrot 2010: 165), one has to conclude that this manuscript was copied from a still archaic manuscript during the transition period between archaic and early standard script by an untrained scribe who was prone to introduce phonological features of his own vernacular, i.e. the colloquial variety of TB. Accordingly, we can date the existence of the raised variant of the pronoun ñiś (and also of monophthongization) at least to the period between the archaic and classical language, i.e. to the early 6th century.
However, the raising of /ǝ/ to /i/ in palatal environment is still very rare (but not nonexistent) in the classical period and becomes predominant only in late texts; but even in the later period, there are still many spellings with non-raised /ǝ/, no doubt a consequence of the influence of the classical literary standard.
The distribution of forms with non-raised /ǝ/ and raised /i/ in palatal environment in the TB corpus can be explained as a lexical diffusion phenomenon: it is attested earlier in a high-frequency, non-lexical word and only subsequently spreads to other word classes. One can further show that this sound change is stylistically marked. It occurs more commonly in business documents or in literary texts from the Eastern area. This is not merely due to the fact that both kinds of texts are mostly attested in the Late TB period; it is more likely that there was a higher acceptance of colloquial forms in non-literary texts and in the periphery, i.e. less pressure from a literary standard. Lack of such pressure also explains an adjective form with /i/ for /ǝ/ (pañǝktiñe 'pertaining to the Buddha' for regular pañǝktǝññe) in a classical text from the Central area in Šorčuq. The form is found in a TB colophon to the Sanskrit text SHT 436.
The mixture of a high-prestige standard and a low-prestige/colloquial variant usually also generates hypercorrect forms, and this can indeed be shown in the Tocharian corpus as well. For instance, one finds non-etymological diphthongs (such as TB oraucce in the classical text PK AS 8C b 4 instead of regular orocce 'great' or TB alyaik in the classical text PK NS 34 b 4 instead of regular alyek 'other'). There is also an attestation of a non-etymological /ǝ/ for regular /i/ precisely in palatal environment, i.e. a hypercorrect spelling (peñǝyai in THT 237 a 3 for regular peñiyai, acc.sg. of peñiya 'splendor'; the same manuscript shows examples of raising /ǝ/ > /i/; see Peyrot 2008: 55, fn. 44).
Finally, I would suggest that the occurrence of the spelling ñaś 'I, me' in some archaic texts (THT 241, THT 291.a, THT 1192, THT 1340.b and THT 1540.g) and in THT 205 (which is a classical copy of an archaic text) is also a hypercorrect form and not an accented variant or mere orthographic variant of the pronoun (as per Peyrot 2008: 56). In my opinion, the a-vocalism is precisely due to a lento pronunciation of ñǝś at a time when progressive ñiś had already started to enter the language. There are no hypercorrect ñaś forms in classical texts because by this time, ñiś was already gaining acceptance even in the literate style. If this is correct, this example can be added to similar lento-style phenomena as per Dressler (1980: 117-118).
Acknowledgements: This paper is based on a talk given to the Historical Sociolinguistics Circle of the Austrian Academy of Sciences in April 2016; I would like to thank the audience for their suggestions and comments; the usual disclaimer applies.
The Tocharian data has been retrieved from the CEToM database; the CEToM project is generously funded by the START Program of the Austrian Science Fund (FWF) (project number Y492) and is available via open access at http://www.univie.ac.at/tocharian (retrieved 21.12.2016).