Where corpus methods hit their limits: the case of separable adjectives in Bambara

Valentin Vydrin

Separable adjectives represent a morphosyntactic subcategory of the part of speech of adjectives in Bambara (< Manding < Mande < Niger-Congo, Mali, West Africa). A separable adjective is a compound lexeme consisting of a noun root (most often designating a body part), a qualitative verb root and a connector -la- ~ -lan- or -ma- ~ -man-. When used predicatively, the final component of a separable adjective (the qualitative verb root) is split from the rest of the form by the auxiliary word ka or man. Separable adjectives express mainly human qualities (moral or physical), and their semantics are very often idiomatic. The productivity of this subclass is limited.
In order to establish an inventory of the separable adjectives, two approaches have been followed: elicitation and a search in the Bambara Reference Corpus (which included roughly 4,110,000 words at the time of this study). The number of theoretically possible lexemes of this type is 570 (15 noun roots × 19 qualitative verb roots × 2 connectors). Elicitation provided 75 separable adjectives, and the corpus study 25, of which 3 are absent from the elicited list. This experiment shows that in studies of derivational morphology, when a linguist needs to fill out a matrix, elicitation cannot simply be replaced by a corpus study. On the other hand, the corpus provides invaluable supplementary data that cannot be obtained through elicitation.
Key words: elicitation, corpus study, adjective, qualitative verbs, Bambara, Mande languages.

Introduction
In works on field linguistics methodology and in current linguistic practice, elicitation is sometimes regarded as a rather inappropriate way of acquiring language data, and elicited data is viewed as second-rate. Some colleagues tend to reject elicitation altogether, in a more or less explicit way, and the following quotation is representative of this trend:¹ "Interview fieldwork is justified if there is nothing else to be done. It is a very poor option if a speech community is available - but some researchers opt to concentrate on interview fieldwork with a few speakers conveniently placed in a city or in a township. A grammar of a language spoken by a few million people which is based on the work with one consultant in an urban environment could be interesting, but is unlikely to be comprehensive and fully reliable" [Aikhenvald, 2007, p. 5].
According to this approach, only natural texts, only spontaneous data can be regarded as reliable, and any use of elicitation is just a desecration of linguistic fieldwork.
This negativity is a natural reaction to an inappropriate and maladroit application of elicitation methods. The position of other authors may be less categorical. Most often, they recognize the usefulness of elicitation for obtaining certain types of language data; cf. a detailed analysis of various elicitation methods in [Chelliah, De Reuse, 2011].
The value of the data from natural texts cannot be contested (although here too, one should beware of ungrammatical forms spontaneously used by speakers). There are, however, some aspects of language structure where one can hardly attain satisfactory results without elicitation. Such areas include word formation and derivation, verbal lability, and fine-grained syntactic studies; in fact, any research topic where an exhaustive checking of numerous options in a matrix is necessary.
In this connection, I recall Vladimir Nedjalkov's (p.c.) position on the utility of elicitation in studies of verbal derivation: "If you check a matrix with your language consultant, and if even in 20% of cases he produces wrong answers or fails to answer, you obtain 80% of correct data. And if you work exclusively with natural texts, you are lucky if you obtain data for 30 or 40% of positions in the matrix".
I heard this argument from Vladimir Nedjalkov some 25 years ago, when big electronic text corpora were rare. Since that time, the progress in language documentation has been impressive. After the world's biggest languages, many mid-size languages were provided with multimillion-word text corpora accessible online, and now more and more corpora are becoming available for minor languages and for those without official status in their countries.
Could it be that the easy access to great amounts of searchable natural texts has made Vladimir Nedjalkov's stance obsolete?
In this study, I attempt to answer this question drawing on Bambara data for a moderately productive word-compounding model for adjectives.

The Bambara language
Bambara (< Manding < Western Mande < Mande < Niger-Congo) is the biggest language of Mali (West Africa). It is spoken, mainly in Mali (but also in its diaspora), by some 4 million L1 speakers and at least 10 million L2 speakers. Bambara has some written literature and periodicals; it is widely used in literacy programs and, to some extent, in primary and secondary education. Bambara is a relatively well-described language: there is a reference grammar [Dumestre, 2003], a number of university courses and textbooks [Bird, Kante, 1976; Bird, Hutchison, Kante, 1977; Kastenholz, 1989; Bailleul, 2000; Vydrin, 2008], big dictionaries [Vydrine, 1999a; Bailleul, 2007; Bailleul et al., 2011; Dumestre, 2011]; many dozens (or even hundreds) of research articles have been published.
Since 2011, there has been an electronic annotated corpus of Bambara texts freely accessible online [Vydrin, Maslinsky, Méric, 2011]. In 2011, it contained about 1,100,000 tokens (of these, about 28,000 tokens in the disambiguated subcorpus); in November 2018, it reached the size of 9,146,875 tokens (of these, 1,122,416 tokens in the disambiguated subcorpus), and it continues to grow. A language corpus of some 9 million tokens may seem unimpressive when compared with corpora of big and even mid-sized European or Asian languages, which comprise hundreds of millions of words; however, for an African language almost unrepresented on the Internet, this amount of data represents a revolutionary breakthrough and opens bright perspectives for language studies. Since its publication in 2011, the Bambara Reference Corpus has been widely used in Bambara grammar studies, lexicographic research and language teaching.
Bambara is a tonal language with two tones at the underlying level, low and high.² The basic word order is S AUX (O) V X, where S is a subject, V is a verb, AUX is an auxiliary word expressing the grammatical meanings of tense, aspect, mood and polarity (in the Mandeist tradition, AUX are called "predicative markers"), O is a direct object (whose absence makes the verb intransitive), and X is an oblique (indirect object or adjunct), most often represented by a postpositional phrase. The word order in the NP is N2 - N1 (N1 is the head noun, N2 is the dependent noun) and N - Adj (the adjectival modifier follows the head noun).

Adjectives and qualitative verbs
Adjectives in Bambara represent a part of speech of their own [Vydrine, 1999b; Tröbs, 2008].³ An adjective follows the modified noun (1a); the tonal article (designated by a suspended acute diacritic) and the plural marker /-u/ (represented by -w in the standard orthography) follow the adjective (1b).⁴
(1) a. só jέ
house white\art⁵
'white house'
b. só jέ-w
house white\art-pl
'white houses'
There are several morphological subclasses of adjectives; some of these are tonally compact with the modified nouns (i.e., they are prosodically non-autonomous), other subclasses are non-compact.
There is another class of lexemes in Bambara specialized in the expression of property values: qualitative verbs.⁶ They can be regarded as a separate part of speech or as a subclass of verbs. Their syntactic behavior is similar to that of "dynamic verbs" (i.e. all the other verbs of the language), but
- they can only be intransitive (while dynamic verbs are very often labile, and intransitiva tantum are relatively rare among the dynamic verbs),
- they can be accompanied by only two predicative markers, ka affirmative and man negative, which express no tense, aspect or modal semantics.
By contrast, the dynamic verbs can appear with a whole set of predicative markers expressing various TAM meanings, but not with ka and man.
The qualitative verbs represent an unproductive closed class of about 60 lexemes (of these, some 40 are frequently used, the others are more or less rare). Many qualitative verbs produce adjectives by means of conversion.
³ The part-of-speech status of adjectives in Bambara has been amply discussed in the special literature, and it is hardly appropriate to resume this discussion here.
⁴ The inverse word order, adjective-noun, appears in the inversive construction [Dumestre, 1987, p. 249-259; Vydrin, in press, leçon 30]. The article and the plural marker follow the noun immediately when the adjective is used as secondary predicate [Vydrin, in press, leçon 31]. Both these constructions are relatively rare in texts and are of no special interest for the topic of the current study.

Morphosyntax of separable adjectives
Separable adjectives represent one of the tonally non-compact classes of adjectives. They are formed according to the following formula: N + Conn + QV, where:
- N is a nominal stem, most often the name of a body part (or a term belonging to a semantically adjacent lexicon);
- Conn is a connector la (variants: lan, na, nan) or ma (variant: man); both connectors stem from locative postpositions;
- QV is a stem of a qualitative verb.
The connector forms -ma- and -man-, and likewise -la- and -lan-, are in fact phonetic variants (-na- and -nan- are in complementary distribution with -la- and -lan-: they appear when the preceding component ends in a nasal vowel); there are sometimes doublets, as for example bólomandɔgɔ and bólomadɔgɔ 'poor, weak'. Nor does there seem to be any evident semantic difference between the connectors -la- and -ma-; their distribution is lexical.
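The formula N + Conn + QV and the nasal alternation of the connector can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the description above: a stem-final orthographic n is taken to mark a nasal vowel, the longer variants -lan-, -nan- and -man- are ignored, and the function names connector_form and candidate_forms are hypothetical. The strings it produces are only candidates to be checked with speakers or in a corpus, not attested lexemes.

```python
from itertools import product

CONNECTORS = ("la", "ma")  # base forms of the two connectors


def connector_form(noun_stem: str, connector: str) -> str:
    """Return the connector variant used after the given noun stem.

    Assumption: a stem-final orthographic "n" stands for a nasal vowel,
    after which -la- is realized as -na-. The alternation is described
    only for -la-/-lan-, so -ma- is left unchanged here.
    """
    if connector == "la" and noun_stem.endswith("n"):
        return "na"
    return connector


def candidate_forms(noun_stems, qv_stems, connectors=CONNECTORS):
    """Build every N + Conn + QV string, i.e. the matrix to be checked."""
    return [
        noun + connector_form(noun, conn) + qv
        for noun, qv, conn in product(noun_stems, qv_stems, connectors)
    ]


# Stems taken from example (2): dá 'mouth', fɛgɛn 'light'.
print(candidate_forms(["dá"], ["fɛgɛn"]))
```

Of the two candidates generated for these stems, only dálafɛgɛn is cited in this paper; the sketch makes no claim about the other one.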
These adjectives express mainly human properties (physical or moral); their meanings are often (but not always) idiomatic (2a). They are very easily converted into nouns (2b) designating persons endowed with the quality in question.
(2) a. mɔ̀gɔ dá-la-fɛgɛn
human mouth-in-light
'indiscreet person'
b. dálafɛgɛn
'indiscreet person'
A peculiar morphosyntactic feature of the separable adjectives (not attested for other word classes) is that, when used predicatively, their final component is separated from the rest by the predicative marker of qualitative verbs ka or man, so that the initial bicomponent constituent appears as a part of the subject NP, but it carries no tonal article (even in contexts where one would normally expect the article). Cf. an adjectival attributive use (3a) and a predicative use (3b).
Лингвистика ISSN 2500-2953
The element sènna cannot be used in any other syntactic context as an autonomous word; it is a quasi-lexeme.
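The predicative split described in this section can be sketched schematically in Python. The function name predicative_split and the tuple representation are hypothetical, tones and the tonal article are ignored, and the output is a segmentation of the components rather than an attested clause.

```python
def predicative_split(noun_stem: str, connector: str, qv_stem: str,
                      negative: bool = False) -> tuple[str, str, str]:
    """Split a separable adjective for predicative use.

    Returns (N+Conn, predicative marker, QV): the first element stays
    with the subject NP, the marker is ka (affirmative) or man
    (negative), and the qualitative verb stem follows the marker.
    """
    marker = "man" if negative else "ka"
    return (noun_stem + connector, marker, qv_stem)


# Components of dá-la-fɛgɛn from example (2).
print(predicative_split("dá", "la", "fɛgɛn"))
print(predicative_split("dá", "la", "fɛgɛn", negative=True))
```

The first element of the tuple corresponds to what is called a quasi-lexeme above: it is not expected to occur as an autonomous word outside this construction.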

Inventorization of separable adjectives
The separable adjectives were first described by Gérard Dumestre, who attempted to inventory the lexemes of this class [Dumestre, 1987, p. 235-248]. According to him, 15 nominal roots and 11 qualitative verb roots take part in word formation of this type. Since then, more items have been found (mainly through elicitation), and currently [Vydrin, in press, leçon 29], we have 15 nominal roots and 19 qualitative verbs in play.
The number of theoretically possible combinations of these components is 285 (15 × 19). This number should be multiplied by 2 (the number of connectors), which brings us to the figure of 570. However, in reality, most of the theoretically possible separable adjectives do not exist. The task is to establish the inventory of the items that actually exist.
Elicitation [Dumestre, 1987, p. 239; Vydrin, in press, leçon 29] has produced 75 separable adjectives. This study was carried out with several native speakers during different periods of time; different informants produced similar results, which can therefore be regarded as reliable. Here is the full list of the lexemes.

Elicitation and corpus study: Comparison of the results
Let us compare the results of both approaches, cf. Table 1. The noun roots are in vertical columns, the qualitative verb stems are in horizontal rows; the connectors appear in the cells where the combinations of N and QV are attested. The connector is black if the form has been obtained through elicitation, and it is red if the form has been found in the Corpus; in the latter case, the number of occurrences is indicated.
In this study, elicitation has proved to be three times more effective (with respect to inventorying the lexemes in question) than the corpus study: 75 elicited lexemes vs. 25 found in the Corpus; and if we take into account only reliable lexemes (5 occurrences or more), elicitation turns out to be five times more effective.
Another advantage of elicitation is that during an elicitation session, semantic information about the lexeme is normally produced at the same time as the form. In the case of a corpus study, the semantics of a form (especially if we have just a single occurrence) may be obscure. This may be especially true for the separable adjectives, whose sense is often idiomatic and not directly derivable from the meanings of the components.
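The arithmetic behind these figures can be checked with a short Python sketch. All counts are the ones reported in this paper (the 3 corpus-only lexemes come from the abstract, the 14 reliable corpus lexemes from the corpus results); the variable names are hypothetical, and the coverage shares and the combined inventory are derived here for illustration rather than quoted from the text.

```python
NOUN_ROOTS = 15
QV_ROOTS = 19
CONNECTORS = 2

ELICITED = 75         # lexemes obtained through elicitation
CORPUS_ALL = 25       # lexemes found in the Corpus
CORPUS_RELIABLE = 14  # corpus lexemes treated as reliable
CORPUS_ONLY = 3       # corpus lexemes absent from the elicited list

combinations = NOUN_ROOTS * QV_ROOTS     # N x QV pairs
matrix_size = combinations * CONNECTORS  # all theoretically possible lexemes

ratio_all = ELICITED / CORPUS_ALL            # elicitation vs. whole corpus yield
ratio_reliable = ELICITED / CORPUS_RELIABLE  # elicitation vs. reliable corpus yield
combined_inventory = ELICITED + CORPUS_ONLY  # union of both methods

print(f"matrix: {combinations} pairs, {matrix_size} candidate lexemes")
print(f"elicitation fills {ELICITED / matrix_size:.1%} of the matrix")
print(f"the corpus fills {CORPUS_ALL / matrix_size:.1%} of the matrix")
print(f"yield ratios: {ratio_all:.1f} (all), {ratio_reliable:.1f} (reliable only)")
print(f"combined inventory: {combined_inventory} lexemes")
```

The ratios correspond to the "three times" and "five times" quoted in the comparison; the combined inventory shows what the corpus adds on top of elicitation.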

Table 1. Separable adjectives in Bamana through elicitation
Qualitative verb stems: cáman, dí, dɔ́gɔ, jέ, júgu, fárin, fέgɛn, fìn, gὲlɛn, gírin, gógòni, kálan, kègun, kólon, kúnan, mìsɛn, súma, téli(n)
According to statistical standards, such potential lexemes should be regarded as unreliable. This leaves us with 19 lexemes; of these, only 14 occur more than 5 times, which can be regarded as a reliable sample size for our Corpus: