Chinese Sentence-Initial Indefinites: What Corpora Reveal

While the sentence-initial position in Chinese is generally associated with givenness/definiteness, instances of informationally new or indefinite sentence-initial NPs are found in actual language use. This paper systematically explores the phenomenon of sentence-initial indefinites (SIIs), their statistical relevance, and their interaction with features typically connected to linear order, such as animacy and locatability. Results of a quantitative and qualitative analysis conducted on three large general corpora show that SIIs in Chinese are not only possible, but also statistically relevant. Animacy and locatability are found to play a key role in increasing the acceptability of SIIs. Finally, the data reveal a new pattern featuring SIIs with proper nouns.


1
Introduction

The sentence-initial position in Chinese is generally associated with, and often defined in terms of, a specific information status, i.e. that of givenness/identifiability and, consequently, definiteness. This association is widely accepted in the literature (Xu 1995) and is supported by the fact that bare nouns in Chinese receive a definite reading when preverbal (1a). Furthermore, it is often maintained that indefinite NPs cannot occur in the sentence-initial position (1b): to be first introduced, indefinites should be preceded by an existential or presentational verb, and then predicated upon, hence the construction in (1c).

In Li and Thompson's grammar, the sentence-initial position is the position for the topic, which "always refers either to something that the hearer already knows about - that is, it is definite - or to a class of entities - that is, it is generic" (1981, 85). Newly introduced referents cannot be topics, hence they "must follow the main verb of the presentative sentence" (1981, 509), as in (1c). Most subsequent literature on topic-comment structures and word order makes similar observations (Chu 2006; Li 2005; Shyu 2016; Tsao 1977, 1989; Xu 1995; Xu, Liu 2007; Zhu 1982, among others); Ho (1993) holds that the fact that the sentence-initial position should be occupied by a definite el-
This challenges the widely accepted association of the sentence-initial position with topichood, givenness, and definiteness, as well as analyses that postulate a definiteness restriction on the sentence-initial position. However, several aspects of sentence-initial indefinites (henceforth SIIs) in Chinese have not yet been fully explored: how widespread is this phenomenon? How does it interact with other features typically connected to the sentence-initial position (such as animacy and locatability)? Crucially, corpus-based studies on the topic remain the minority and are usually conducted on relatively small, genre-specific corpora.
This paper adopts corpus methodologies and tools to investigate SIIs, with a particular focus on determining (i) the statistical relevance of SIIs of the type '一 yī CLF N' in large corpora and (ii) their interaction with the semantic feature of animacy and, secondarily, with the referential property of locatability. To this end, it presents the results of a large-scale quantitative and qualitative analysis conducted on three large general corpora, namely the PKU CCL corpus (Centre for Chinese Linguistics, Peking University, 470 million characters, henceforth PKU), the BCC corpus of Modern Chinese (Beijing Language and Culture University, 15 billion characters, henceforth BCC), and the Sketch Engine ZHTenTen (Stanford Tagger) simplified Chinese corpus (13.5 billion characters, henceforth ZHTenTen (ST)). A corpus approach is chosen because it grounds the analysis in empirical, naturally occurring data: corpora keep the description close to real language in use and may help reveal new patterns or phenomena, thus contributing to deeper and more complete linguistic descriptions, even for extensively described languages like Chinese.
The rest of the article is organised as follows: § 2 provides an overview of the literature on Chinese SIIs and their characteristics. § 3 presents the study, its research questions, methodology, and linguistic data. §§ 4 and 5 discuss the findings of the quantitative and qualitative analyses, respectively. § 6 draws the conclusions and briefly discusses the implications of these findings for theoretical accounts of Chinese sentence structure and for the teaching of Chinese as a second/foreign language.

(In)definiteness and the Sentence-Initial Position in the Literature
The term 'definiteness' denotes a grammatical category featuring a formal distinction that marks an NP as identifiable: 2 this formal distinction may consist of a variety of grammatical means, "including phonological, lexical, morphological, and word order" (Chen 2015, 408). Among the first linguists to associate definiteness with word order in Chinese is Chao, who claims that the encoding of definite/indefinite reference is not much connected to grammatical functions (subject/object): rather, it is the "position in an earlier or later part of the sentence that makes the difference" (1968). Crucially, Chao himself proposes a counterexample in the form of an SII, of the type of a thetic judgement (3a), commenting that it is a less preferred pattern compared to the definite>verb>indefinite pattern displayed by (3b).

2 Identifiability is an addressee-oriented notion relating to the speaker's assumptions as to whether the addressee "is able to identify the particular entity in question among other entities of the same or different class in the context" (Chen 2015, 408).

Li and Thompson (1981, 167-8) also identify exceptions to their abovementioned definiteness restriction on the preverbal position, which they illustrate with the sentences in (4a)-(4d). All four sentences feature sentence-initial NPs of the type '一 yī CLF N'; however, Li and Thompson hold that such exceptions are only apparent: all sentence-initial NPs in (4) are indeed formally indefinite, but according to them they all receive a definite reading. In (4a), 一 yī refers to a specific "absolute quantity" and is therefore definite; in (4b), 一 yī in fact means "each", hence it is not indefinite; in (4c)-(4d), they maintain, 一 yī introduces "something that is part of an entity already known by the hearer" (i.e. the leg of a known person, the peasants of a known village) and "can therefore be considered a definite noun phrase":

4. a. 一个人就够了。
yí ge rén jiù gòu le
one clf person then (be).enough pfv/cos
'One person will be enough'.
Indeed, the examples above show that not all sentence-initial NPs of the type '一 yī CLF N' are true indefinites. They may emphasise the quantity (4a) or receive a distributive reading (4b) (see also Lu, Pan 2009). Other readings are possible, e.g. generic reference (to a specific class), as in (5).

However, the underlined NPs in (4c)-(4d) can hardly be labelled as definite. In (4c), the implicit body-part (or possession/containment etc.) relationship might enable the hearer to identify the referent the leg belongs to; however, which specific leg is broken (left/right?) is not identifiable. Similarly, in (4d), 一个农夫 yí ge nóngfū 'a peasant' might be assumed to be specific (known by the speaker) but can hardly be considered identifiable by the hearer, especially with no context. On the other hand, the context of these utterances may render the referent locatable (Morbiato 2018; Wu 1998), i.e. located within a given/identifiable set (i.e. the two legs) or setting (i.e. the village where the peasant lives; the notion of locatability will be discussed in more depth below). Moreover, none of Li and Thompson's explanations accounts for Chao's example in (3), an SII tout court.
Some scholars put forward a more nuanced view of the association between definiteness and the preverbal position: Chen (2015, 410), for example, talks about definiteness- and indefiniteness-inclined positions, holding that preverbal NPs are overwhelmingly, but not exclusively, definite. Hole (2012, 61-2), after commenting on (1) that "subject DPs in Chinese must be interpreted as definite", adds that indefinite subjects are barred from the sentence-initial position in non-thetic sentences (thetic sentences being all-focus and topicless), thus implying that SIIs may occur in thetic judgements. However, the examples of thetic sentences he includes, such as 一张床睡三个人 yì zhāng chuáng shuì sān ge rén 'one bed accommodates three people', do not display an indefinite reading, but rather a distributive one. Lu, Zhang and Bisang (2015) and Bisang (2016) go one step further, arguing that subjects, unlike topics, may be indefinite (they see indefiniteness as a subjecthood test): in thetic sentences, they claim, "preverbal indefinite subjects are acceptable" (Bisang 2016, 356):

6. 一个杯子被我打碎了。 (Bisang 2016, 356)
yí ge bēizi bèi wǒ dǎ-suì-le
one clf cup bei 1sg hit-break-pfv/cos
'A cup was broken by me'.
Major contributions to the literature on SIIs come from Chinese scholars. In his influential paper, Fan (1985) notes that SIIs are not only possible, but also rather common in some genres such as news reports: sentences with indefinite subject NPs, he claims, do constitute a sentence pattern in Chinese; they are neither uncommon nor peculiar. Since then, a number of studies have followed (Fang 2019; Fu 2013; Liu 2018; Liu, Zhang 2004; Lu, Pan 2009; Tang 2011; Wang 2003; Xu 1997, 1999; Zhang 2007; Zhou, Chen 2013, among others), mostly focusing on the semantic and syntactic characteristics that license or increase the acceptability of SIIs. Generally, these regard: (i) the type of predicate: highly transitive, dynamic, and stage-level predicates are preferred over low-transitive, stative, and individual-level ones; (ii) the referential characteristics of the SII: the more information is provided that increases the referent's identifiability, the higher the SII's acceptability; and (iii) information structure: thetic sentences may host SIIs, especially when the referent is locatable in clear spatio-temporal frames. In what follows, the main contributions will be briefly presented, with particular reference to corpus-based studies.
Several scholars have focused on singling out the properties of SIIs and the related licensing conditions. Tang (2005) holds that SIIs are acceptable only in highly transitive sentences. Zhang (2007) concludes that SIIs occur in topicless (非主题判断 fēi zhǔtí pànduàn), i.e. thetic, judgements, whereby the entire clause is a single unit conveying new information. Lu and Pan (2009) elaborate on this and claim that SIIs occur (a) in thetic sentences, where the whole predicate is projected into the core domain and is constrained by an existence operator, and (b) with stage-level predicates (expressing an event), but not with individual-level predicates (which express some judgement). Chen (2015) also remarks that SIIs are more acceptable with dynamic predicates but hardly occur as subjects with stative ones (7):

7. *一个人很聪明。 (Chen 2015, 410)
yí ge rén hěn cōngming
one clf person very smart
'One person is very smart'.
With reference to the above considerations, Wang (2003), Huang (2004), Wei and Chu (2007), and Lu and Pan (2009), among others, put forward a number of corollary licensing conditions for SIIs, e.g. SIIs cannot occur with modal verbs, negative adverbs, and tense. However, corpus studies found that most of these conditions are only tendencies, as counterexamples can be found for each parameter. Specifically, Zhou and Chen (2013) measured the descriptive accuracy of such licensing conditions with the method of parameter setting and measurement against a relatively small test corpus (i.e. a 1,000-sentence subcorpus of the PKU). From their analysis, it appears that all factors do contribute, through a complex interplay, to increasing the identifiability of SIIs, and hence their acceptability rate, but none constitutes an absolute restriction.
A widely accepted generalisation on SIIs is that the greater the amount of information on the referent (e.g. by means of longer nominal modifiers), the higher its degree of identifiability and, hence, its acceptability (Xu 1999). Wang (2003), for example, talks about degree of (cognitive) accessibility (可及度 kějídù) and of identifiability (个体化程度 gètǐhuà chéngdù). Indeed, the acceptability difference between (8a) and (8b) lies in the long, informationally rich modifier (complex relative clause plus noun) of the SII:

8. a. *一种方法最近问世。 (Zhou, Chen 2013, 373)
yì

A very interesting perspective is provided by Fu's (2013) corpus-based, diachronic study, which reveals that SIIs very likely originated during the Song Dynasty (960-1279) and evolved from earlier constructions whereby an indefinite NP is the subject of the sentence following a perceptual verb, like 见 jiàn 'see'. Early instances of 'see' + indefinite NP patterns, e.g. (9) from Zhuangzi, also specify the witness of the scene (the <seer>, in this case King Wen). Later, the construction became impersonal, by means of markers that express the idea of 'seeing', such as 只见 zhǐjiàn and 则见 zéjiàn: sentences like (10) are interpreted as if the witness were an omniscient narrator. Later, these markers disappeared (11).

Locatability. From the data in the literature analysed so far, an important feature of SIIs, which scholars nevertheless never explicitly mention, seems to be locatability, understood as identifiability of the referent's setting rather than identifiability of the referent itself. An example of a non-identifiable, locatable referent is the sentence-initial NP in a person in the airplane started shouting: the hearer (and even the speaker) might not know who this person is, but they are definitely able to locate the referent within the group of people on that specific airplane. In other words, the referent itself is not identifiable: what can be identified is the scene/setting/set/frame where the referent is located.
Locatability is typically granted by the presence of a phrase that expresses a temporal or spatial frame for the utterance, which is an inherent characteristic of Chinese topics (Chafe 1976; Her 1991; Morbiato 2018; Paul 2015) and is the property Li and Thompson tried to invoke with respect to (4c)-(4d): the referents are not identifiable/definite, but rather locatable within a known set (one of the two legs of an individual in (4c)) or a temporal/spatial setting (one of the peasants of a known village in (4d)). This also suggests that locatability, rather than givenness and identifiability, is a more accurate restriction on the preverbal position in Chinese (see Morbiato 2018, 2020 for discussion). This is confirmed by Liu and Zhang's (2004) corpus investigation of eight novels and children's stories: most of the SIIs they detected (although no statistics are provided) feature a temporal or spatial reference occurring before the indefinite NP. Such temporal or spatial reference situates the referent within identifiable spatio-temporal coordinates. It may be either a phrase (12) or a sentence/clause (13). Other sentences may feature no explicit temporal reference, but according to Liu and Zhang (2004, 99)

An account in terms of locatability also explains Xiong's (2008) claim that the admissibility of SIIs depends on the presence of a specific component that meets the topic's needs: what Xiong actually means is that some contextual element is needed that renders the topic referent locatable; such an element may be a temporal/locative phrase, even an implicit one (stage topic). It also sheds light on Liu's (2003) observation that the role of SIIs within the narration is to create a plot transition: in this case, the new topic also involves a shift of setting (for example, a new scene or a new time reference, with different spatio-temporal coordinates).
All the above studies highlight significant features of SIIs. However, they reveal little about their statistical relevance, as most corpus-based studies are qualitative and/or conducted on small corpora. Furthermore, little is said about another rather significant cross-linguistic feature of the sentence-initial position, i.e. animacy: does this semantic trait interact at all with SIIs in Chinese?

6 Given an utterance, stage topics are its implicit spatio-temporal coordinates that allow the assessment of its truth value. This captures the fact that a sentence like it is snowing! is true and informative only with reference to the temporal and spatial setting of its discourse. According to Erteschik-Shir, "thetic sentences are viewed as having implicit 'stage' topics indicating the spatio-temporal parameters of the sentence (here-and-now of the discourse). These are contextually defined" (2007, 16).

3
The Study. What Corpora Tell on SIIs

As said earlier, this study adopts a corpus approach, with the aim of grounding the analysis in empirical, naturally occurring data. Specifically, corpora contribute towards: (i) verifiability and reproducibility as monitoring mechanisms for a given analysis, as results can be checked by repeating the same query; and (ii) highlighting facts, data, or details that had not been observed before and have not yet been integrated into linguistic descriptions. Let us now turn to corpus data: a simple query with the string '。 一位' (. yí wèi) in the PKU corpus gives 5,751 results; the first 5 occurrences are reported in table 1.

Such very preliminary data have little statistical relevance but open up interesting perspectives. First, SIIs do exist and are not statistically insignificant: results in all corpora are of the order of thousands; moreover, five out of five sentences in table 1 present sentence-initial NPs that receive a true indefinite reading. Second, corpora are tools that must be used cum grano salis: in table 2, the first four NPs are in fact generic, while only the fifth is a true indefinite. Hence, quantitative data will need to be filtered through a subsequent qualitative examination, to assess the extent to which sentence-initial NPs of the type '一 yī CLF N' are true indefinites. Third, a striking difference emerges between a very common, generic classifier like 个 ge 'unit' and the highly specific classifier 位 wèi, i.e. the polite classifier for people: although 个 ge is much more frequent in absolute terms (its total occurrences as a classifier in the ZHTenTen (ST) corpus are 9,265,680, as compared to 1,007,191 for 位 wèi, see table 3 below), the former occurs just over twice as often as the latter in the '。 一 yī CLF' pattern. This, together with the different ratio of true SIIs (100% for 位 wèi vs 20% for 个 ge), suggests that the semantics of the classifier (e.g. the trait ±animate/±human) might also be relevant with respect to the degree of acceptability/statistical relevance of SIIs. This hypothesis is supported by the cross-linguistic tendency of animate NPs to occur sentence-initially, regardless of their semantic role, syntactic function, and information status (non-agent, non-subject, and non-given animates still display this tendency). An experimental study carried out by Verhoeven on a sample of heterogeneous languages (German, Greek, Turkish, and Chinese) shows that "animate-first effects occur across languages" (2014, 129). This, according to Verhoeven, is an expected result under the view that "these effects come from asymmetries in the mental representation of the referents", which are independent of language-specific characteristics (2014, 129); see also Van Bergen (2011) for a cross-linguistic overview of animacy and word order and Iemmolo and Arcodia (2014) for Chinese.
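The frequency comparison between 个 ge and 位 wèi made earlier in this section can be checked with simple arithmetic. The sketch below uses only the two ZHTenTen (ST) totals quoted in the text; the function name is an illustrative assumption and is not part of the study's tooling.

```python
# Overall classifier frequencies in the ZHTenTen (ST) corpus, as quoted in the text.
GE_TOTAL = 9_265_680    # 个 ge as classifier
WEI_TOTAL = 1_007_191   # 位 wèi as classifier


def frequency_ratio(count_a: int, count_b: int) -> float:
    """Return how many times larger count_a is than count_b."""
    return count_a / count_b


overall = frequency_ratio(GE_TOTAL, WEI_TOTAL)
print(f"Overall, 个 ge is about {overall:.1f} times as frequent as 位 wèi")
```

Overall, 个 ge is roughly nine times as frequent as 位 wèi, whereas in the sentence-initial '。 一 yī CLF' pattern it is reported to be only a little over twice as frequent; this gap is what motivates the animacy hypothesis.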

Research Questions and Scope
Against the background laid out so far, this study aims to answer the following research questions:

RQ1: How significant is the phenomenon of SIIs from a quantitative/statistical perspective?
RQ2: Does the trait of animacy play a role in the phenomenon?
The study focuses on indefinite NPs marked through the major indefiniteness encoding means in Chinese (Chen 2015, 409), i.e. a noun phrase containing the string 一 yī 'one' + classifier (CLF), 7 that occurs sentence-initially. Indefiniteness may in fact be conveyed, more generally, by the string numeral + classifier (Li 1997, 18, among many others); however, indefinite NPs with numerals other than 一 yī 'one' (e.g. 三/几个学生 sān/jǐ ge xuéshēng 'three/some students') are excluded from the study, for two main reasons. First, the study would be more complex in terms of corpus queries and would rely more heavily on the accuracy of the tagging, which is not always high (see discussion in § 6) and differs across corpora (e.g. the PKU is not POS-tagged), thus preventing a comparison between the three corpora. Second, numerals other than 'one' often emphasise the quantity or receive a distributive reading, as discussed by Li and Thompson with reference to (4a)-(4b) above, while the focus here is mainly on true indefinite readings. This implies that this study only accounts for singular indefinite NPs of the type '一 yī CLF (N)' and that the number of SIIs identified in this study is smaller than the number actually existing in the corpora.
Possible indefinite NPs may consist of simple patterns of the type '一 yī CLF (N)', where the head noun may be overt or omitted. In some cases, the classifier may also be omitted; however, these cases are comparatively rarer and harder to detect, and thus will not be considered. Again, this implies that the number of SIIs identified in this study is smaller than the number existing in the corpora. Indefinite NPs may also include modifiers (nouns, adjectives, verbs, relative clauses etc.). These generally occur in two positions: between the classifier and the noun (14b) and to the left of the '一 yī CLF N' string (14c); the former suggests a descriptive reading, the latter a restrictive one, see e.g. Chao (1968, 286-

Below are examples of the SII types above. For pattern (14c), modifiers may include nouns/adjectives (15c), but also verbal elements occurring, for example, within a relative clause (15c'). Finally, other elements, such as time/location phrases, may occur to the left of the NP, see e.g. (12) above.

7 Indefinite NPs in Chinese may take two forms: nouns modified by a number + classifier structure and bare nouns, when postverbal (Li 1997, 18). Since the present article investigates the sentence-initial position, it focuses on the pattern '一 yī CLF N'.

Methodology and Data
Quantitative analysis. Identifying SIIs as described above involves examination of complex strings, including punctuation and sentence boundaries. Hence, for the quantitative analysis, three large general corpora were chosen that allow such queries: the PKU corpus (470 million characters), the BCC corpus (15 billion characters), and the ZHTenTen simplified Chinese corpus hosted on Sketch Engine (Stanford Tagger subcorpus, 1.73 billion characters). Each corpus involves a different query system, and only the BCC and the ZHTenTen (Stanford Tagger, henceforth ST) 9 are POS-tagged; hence, the results are more or less fine-grained depending on the corpus. Specifically, while the BCC and the ZHTenTen (ST) corpora also allow queries through the POS tag for classifiers (q and M, respectively), in the PKU corpus the number of occurrences needs to be collected for each single classifier.

8 https://bit.ly/37wXhFe.

9 The ZHTenTen Stanford Tagger is POS-tagged following the Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank. The corpus allows rather detailed interrogation and lends itself to concordancing, collocation, and term extraction (Xu 2015).

To this end, Sketch Engine's wordlist tool was used to obtain a frequency list of the nominal classifiers listed in the 汉语量词词典 Hanyu liangci cidian (Chen et al. 1988): a total of 36 classifiers with more than 20 thousand occurrences as classifier in the ZHTenTen (ST) were identified. Units of measure, e.g. 元 yuán (RMB), 分 fēn (unit of length/area/money/time), 吨 dūn (ton), 亩 mǔ (unit of area), 公里 gōnglǐ (km), were excluded, in that they are mainly used to express specific quantities rather than indefiniteness. To tackle RQ2 (§ 3.1), particular attention was devoted to classifiers denoting animate nouns, marked as +A(nimate), including 名 míng, 位 wèi, 只 zhī and 头 tóu (for animals), and 伙 huǒ (collective). Other classifiers used with people but also with inanimate nouns (±A), such as 个 ge, 行 háng (row), 家 jiā (for families and for shops), and 排 pái (line), were treated separately, as it is not possible to verify whether their frequency is connected with the occurrence of animate nouns. The classifier 对 duì 'couple', while compatible with both animates and inanimates, was marked as +A, in that a cursory examination of 150 random tokens of sentence-initial '一对 yí duì' NPs in all three corpora reveals that 90% of tokens introduce animate nouns. Table 3 shows the resulting list of examined classifiers, along with their frequency:

一"][word="CLF"] in Sketch Engine). On the other hand, detection of pattern (c), where the modifier(s) occur(s) between the punctuation mark and the '一 yī CLF' string, is more complex and, in some cases, problematic. Specifically, modifiers such as relative clauses cannot be detected, as queries including verbs before the '一 yī CLF' string may identify SIIs, as in (15c'), but also postverbal indefinites, as in the following example:

16. 刚来了一位天津大厨
gāng lái-le yí wèi Tiānjīn dàchú
just arrive-pfv one clf Tianjin cook
'A cook from Tianjin has just arrived'

To avoid that, the queries exclude verbal elements, but include adjectival and nominal modifiers (e.g. <s>[tag="JJ"][tag="N.*"]{1,7} [word="一"][word="CLF"&tag="M"], in the ZHTenTen (ST)). Finally, SIIs with leftmost time/location phrases separated by commas, as in (12), are hard to identify quantitatively and are not considered either. Again, this implies that the number of SIIs identified in the quantitative analysis does not include all possible patterns.
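The query logic described above for untagged text, i.e. a sentence boundary immediately followed by '一 yī CLF', can be approximated with a regular expression. The following is a minimal sketch and not the query system of any of the three corpora: the function name, the short classifier list, and the set of boundary marks are assumptions made for illustration, and the first and last clauses of the sample text are invented (only the middle clause is example (16)).

```python
import re

# Illustrative subset of classifiers (assumption; the study examines 36).
CLASSIFIERS = ["位", "名", "个", "只"]
# Sentence-final punctuation treated as a boundary (assumption).
BOUNDARY = "。？；！"


def find_sentence_initial(text, classifiers=CLASSIFIERS):
    """Return (index of 一, classifier) for each sentence-initial 一 + CLF."""
    clf = "|".join(re.escape(c) for c in classifiers)
    # Match at the start of the text or right after a boundary mark.
    pattern = re.compile(rf"(?:^|(?<=[{BOUNDARY}]))\s*一({clf})")
    return [(m.start(1) - 1, m.group(1)) for m in pattern.finditer(text)]


sample = "一位老人走了进来。刚来了一位天津大厨。一个杯子被我打碎了。"
print(find_sentence_initial(sample))
```

In this sketch the postverbal indefinite in the middle clause is skipped because 一 is not adjacent to a sentence boundary, which mirrors the reason why verbal material is excluded from the queries.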

Qualitative analysis.
As noted in § 2, while the string '一 yī CLF' is the most common formal marker for Chinese indefinite NPs, it does not always involve a true indefinite meaning, as the NP may display a quantitative (4a), distributive (4b), or generic (5) reading. The quantitative analysis as described above necessarily identifies all types, as they are formally identical. To determine the average ratio of true indefinites, as well as of NPs receiving a quantitative, distributive, or generic reading, a qualitative analysis was conducted on a random sample of sentences from the ZHTenTen (ST) corpus, collected 10 with the following query: <s>[tag="JJ|N.*"]{0,7}[word="一"][word="CLF1|CLF2|…"]. The sample consists of 100 sentences for each subtype of classifier (+A, ±A, -A), for a total of 300 sentences, a number that preserves the representativeness of the sample.
10 With the Sketch Engine function 'get a random sample', the same number of lines generated from a given concordance produces the same concordance lines: thus, the search can be easily repeated and reproduced.
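The reproducibility property described in footnote 10 can be illustrated by analogy with seeded random sampling. The sketch below is not Sketch Engine's implementation; the function name, the seed value, and the placeholder concordance lines are assumptions made purely for illustration.

```python
import random


def reproducible_sample(lines, k, seed=0):
    """Draw k distinct lines; the same seed always yields the same sample."""
    rng = random.Random(seed)  # local generator, independent of global state
    return rng.sample(lines, k)


# Placeholder concordance (hypothetical data standing in for query results).
concordance = [f"concordance line {i}" for i in range(1000)]

first = reproducible_sample(concordance, 100, seed=42)
second = reproducible_sample(concordance, 100, seed=42)
print(first == second)
```

Repeating the draw with the same input and seed returns identical lines, which is the behaviour that allows the 100-sentence samples to be checked by other researchers.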

Quantitative Results
The tables below show results for each corpus. In the paper, 'clf' denotes each specific classifier, while 'CLF' indicates the word class. S.I. stands for 'sentence-initial', while de corresponds to the Chinese noun modifier marker 的 de, which may but need not be present. Orange, blue, and green mark +A, ±A, and -A classifiers, respectively (see § 3.2). Columns for pattern (c) as shown in (14) report figures for different modifier patterns; the type and number of detectable patterns depend on the tools and queries each corpus offers. The last column (ratio) shows the percentage of sentence-initial occurrences of each classifier in the pattern '一 yī CLF' over all occurrences of the pattern in any position in the sentence; in other words, it captures how often an indefinite noun phrase with a specific classifier occurs sentence-initially.

Thanks to the Corpus Query Language (CQL) option, the ZHTenTen (ST) is the corpus that allowed extraction of the most detailed data. Table 4 presents the number of occurrences for each classifier for patterns (14a)-(14b) (column 3) and some possible patterns for (14c), distinguishing different modifier types (adjective, noun, or both, and with or without 的 de); modifiers are up to 7 characters long. Columns 10 and 11 show the total number of detected S.I. '一 yī CLF' patterns that occur without and with 的 de, respectively, while column 12 (total detected) provides the sum of the two. The classifier with the highest total occurrences in the three patterns identified in (14) is 个 ge (116,021), followed by 位 wèi (36,157, about one third). However, an inverse tendency is observable in the last column: the classifier with by far the highest ratio is 位 wèi (more than 10%); other +A classifiers are all around 3%, followed by 个 ge, which drops to 2.78%.

In the BCC corpus [tab. 5], it is more difficult to design a query that includes longer leftmost nominal or adjectival modifiers. Hence, detected modifiers are up to 2 characters long; furthermore, composite queries to detect multiple patterns (as in columns 9-10 of table 4) are not possible. This implies that the number of undetected tokens is higher than in the ZHTenTen (ST) corpus, which is reflected in the considerably lower figures. The classifier with the highest ratio in the last column is still 位 wèi, although the ratio is lower (5.67%), about half the ratio in the ZHTenTen (ST) corpus.
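The ratio reported in the last column of the tables is a simple proportion: sentence-initial occurrences of '一 yī CLF' over all occurrences of the same string. A minimal sketch of the computation follows; the function name is an assumption, and the counts in the usage line are hypothetical round numbers, not figures from the tables.

```python
def sentence_initial_ratio(si_count: int, all_count: int) -> float:
    """Percentage of '一 CLF' tokens that are sentence-initial."""
    if all_count <= 0:
        raise ValueError("all_count must be positive")
    return 100 * si_count / all_count


# Hypothetical counts: 50 sentence-initial tokens out of 1,000 overall.
print(sentence_initial_ratio(50, 1000))
```

Because the ratio is normalised by each classifier's overall frequency, a less frequent classifier such as 位 wèi can show a higher ratio than 个 ge even though its absolute count is lower.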

Sinica venetiana 6 77
Corpus-Based Research on Chinese Language and Linguistics, 57-90

Since the PKU corpus is not tagged, complex queries involving nominal or adjectival modifiers as used in the previous corpora (pattern (14c)) are not possible [tab. 4]; however, the query (。| ? | ; | !) $2的一clf was used to single out one/two-character modifiers (columns 4, 9). Such a query singles out, for example, modifiers such as the one in (17).
Such a limited interval minimises the statistical likelihood of including verbal items and, hence, postverbal indefinites (see discussion in § 3.2). However, this means that SIIs with longer modifiers, as in (15c'), are missing from the total count, hence the remarkably lower figures in table 4.

Discussion.
Overall, results show that all examined classifiers occur with 一 yī in the sentence-initial position. Figures for pattern (14c) are higher in the ZHTenTen (ST) corpus, but this does not come as a surprise, as leftmost modifiers detected in the ZHTenTen (ST) are up to 7 characters long, while in the other two corpora they are up to two characters long (see § 3.2). Let us focus on the two classifiers 位 wèi and 个 ge: the former's total occurrences in the (14a-b-c) patterns are 36,157 in the ZHTenTen (ST), 1,717 in the BCC, and 6,412 in the PKU; the latter's are 116,021 in the ZHTenTen (ST), 14,734 in the BCC, and 16,611 in the PKU. Crucially, ratio-wise 位 wèi significantly outranks 个 ge (10.2% vs 2.78% in the ZHTenTen (ST)): in other words, while the string '一位' yí wèi occurs far less often overall than '一个' yí ge, a much larger share of its occurrences is sentence-initial. Other classifiers with a relatively high ratio (last column), especially in the ZHTenTen (ST) corpus, include +A classifiers in general and ±A classifiers like 组 zǔ 'group' and 班 bān 'class' (highly compatible with +A nouns); almost all show a ratio above 3% in the ZHTenTen (ST). Relatively high ratios are also displayed by some -A classifiers, such as 级 jí 'level' (3.59%), 期 qī 'period' (7.11%), 部 bù 'part' (3.43%), 句 jù 'line' (3.89%), and 首 shǒu 'piece' (e.g. of poetry/lyrics, 3.99%). Indefinite noun phrases with the first three classifiers (级 jí 'level', 期 qī 'period', 部 bù 'part') display an interesting common semantic trait related to partitivity: the referent may denote a part of a given whole, a level of a given multi-layered structure, a step of a given path, or else a phase of a given plan or project (see examples in sections below).
The relatively high frequency of such NPs in the sentence-initial position might then be connected to the fact that the referent, although not identifiable, is at least locatable in a given set/whole/container that is comprehensible thanks to the semantics of each classifier (e.g. one level of a specific hierarchy, one step of a specific procedure etc.); it may also be specified in the previous context or, otherwise, be implicit (stage topics; see discussion for sentence (4c)). This point will be examined in the qualitative analysis below. Conversely, the high ratios of 句 jù and 首 shǒu (classifiers for lines/quotes, and for songs/poems, respectively) are rather unexpected. We will look further into these classifiers through the qualitative analysis.
Let us now have a closer look at aggregated data with respect to the animacy trait (+A, ±A, and -A) in the ZHTenTen (ST) corpus. A total of 232,682 sentence-initial NPs introduced by '一 yī CLF' were detected in the corpus. As discussed, this total includes neither NPs modified by relative clauses nor NPs preceded by modifiers that are longer than 7 characters and separated by commas (e.g. temporal/locative frame topics). Interestingly, almost 8% of animate NPs introduced by '一 yī CLF' are sentence-initial, while the ratio drops to 2.88% for ±A classifiers, and to 2.65% for -A classifiers. The charts below represent the percentage of '一 yī CLF' tokens over the total number of tokens in all positions [chart 1] and in the sentence-initial position [chart 2], divided per animacy trait: as can be seen, the percentage of +A tokens is significantly higher (more than double) in the sentence-initial position (8.8% vs 20.9%).
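As a rough check on how the ratios in this section relate to the raw counts, the arithmetic can be sketched as follows. The sketch assumes that each ratio is the number of sentence-initial occurrences divided by the number of occurrences in all positions; the implied totals it derives are back-of-the-envelope approximations affected by rounding in the reported percentages, not figures taken from the tables. All function and variable names are hypothetical.

```python
# Figures quoted in the discussion for the ZHTenTen (ST) corpus:
# sentence-initial occurrences and sentence-initial ratio per classifier.
reported = {
    "位": {"initial": 36_157, "ratio": 0.102},
    "个": {"initial": 116_021, "ratio": 0.0278},
}

def implied_total(initial, ratio):
    """Approximate number of occurrences in all positions,
    assuming ratio = initial / total (an assumption, see lead-in)."""
    return round(initial / ratio)

def overrepresentation(share_initial, share_all):
    """How many times larger a category's share is sentence-initially
    than across all positions."""
    return share_initial / share_all

totals = {clf: implied_total(row["initial"], row["ratio"])
          for clf, row in reported.items()}

# Share of +A tokens: 8.8% in all positions vs 20.9% sentence-initially.
animate_factor = overrepresentation(20.9, 8.8)

for clf, total in totals.items():
    print(f"一{clf}: roughly {total:,} occurrences in all positions")
print(f"+A tokens are about {animate_factor:.2f} times more frequent sentence-initially")
```

Under these assumptions, '一个' is implied to be roughly an order of magnitude more frequent overall than '一位', while the sentence-initial share of +A tokens is a little more than double their overall share, in line with the description above.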

Qualitative Results
As discussed in § 3.  Let us first focus on SIIs: strikingly, 94% of +A tokens display an indefinite reading and hence are true SIIs. In other categories, conversely, the percentage of true SIIs drops to 34% for ±A and 28% for -A tokens. If we assume that the above figures are statistically relevant (although this would benefit from more tests conducted on different samples), we could consider these three percentages as coefficients that allow us to estimate the true number of SIIs from the quantitative data presented in § 4. For data from the ZHTenTen (ST) corpus, results would be as follows:

Interestingly, a reverse tendency can be observed with respect to +A tokens within the ±A subcorpus: only 12 (34%) are true SIIs (as compared to 94% in the +A subcorpus). Moreover, getting back to the comparison between 个 ge and 位 wèi, in the qualitative analysis, +animate (and +human) tokens introduced by 位 wèi tend to be referential/specific SIIs; conversely, for those introduced by 个 ge, generic NPs are twice as many as specific SIIs. This is very likely connected to their semantics: 位 wèi implies respect or courtesy and likely involves that the speaker knows the referent (specific indefi-

If we further split ±A SIIs into +A and -A and add this data to percentages indicated in table 11, we obtain the following figures:

Some +A SIIs are not specific; however, the context makes them at least locatable (see discussion in § 2). This is the case of (22): the referent of 一位父亲 yí wèi fùqīn 'a father' is not identifiable, but rather locatable within the temporal and spatial settings previously specified in the article, namely a dancing event at the Huazhong Agricultural University (cf. context).
Similarly, in (23) the context makes it clear that the referent of 一位坐在最后一排的演员 yí wèi zuò zài zuìhòu yì pái de yǎnyuán 'an actor sitting in the last row' cannot be identified, but rather located, within the given venue/group of 160 meeting participants:

Other 'locatable' SIIs bear a partitive or whole-part relationship with previous sentences, as in (24). A partitive relationship is particularly frequent in occurrences of inanimate classifiers with an inherent partitive meaning (as hypothesised in § 4), e.g. 级 jí 'level' and 期 qī 'period, phase'.14 In most cases, these receive a definite/numeral reading, e.g. 'the first phase' in (25).

14 Qualitative data also reveal that the high frequency of patterns like '一级' yì jí is also connected to their frequency in tables (tabs are also counted as sentence boundaries (<s>) in the ZHTenTen (ST) and are hard to rule out from the search).

Anna Morbiato
A very interesting subtype found among -A tokens is that of referential SIIs, which come in three types: the first type (26) features a modifier that renders the referent uniquely identifiable, such as 最后 zuìhòu 'the last' or 最初 zuìchū 'the first'. The second type (27), also common in other languages (including English), is a sort of cross-clausal apposition linked to a referent mentioned in the previous context:

The third type (28)-(29) interestingly features a proper noun rather than a common noun introduced by '一 yī CLF'. Classifiers occurring in this (not rare) pattern include 句 jù and 首 shǒu, which explains these classifiers' high sentence-initial ratios observed in table 4. This pattern had not been identified in our preliminary discussion, which confirms that corpora may help single out new phenomena or patterns in a given language:

de rén cóngcǐ biànchéng lìshǐ shū de dúzhě
sp people from.now.on become history book sp reader
'A (the) book "Those Things Happened in the Ming Dynasty" may make many people who never read about history become readers of history books'.
We found an example of this pattern in table 1 above, reported in (30) below. In this case, the pattern occurs postverbally, but still features a proper noun (here, a title) introduced by the indefinite marker '一 yī CLF'. From the perspective of its meaning, the pattern seems to introduce unique referents that are generally referred to with a proper name (such as book titles or pieces of poetry): in particular, while the speaker knows about the referent, (s)he might not be sure whether the interlocutor has any knowledge of it. Nonetheless, this point would benefit from further research.

Conclusions and Limitations
The present study was designed to determine the statistical significance of SIIs in Chinese as well as the interconnections with features such as animacy and locatability. The quantitative and qualitative analyses discussed so far support our initial hypotheses. Specifically, with reference to our initial research questions, this study shows that: (RQ1) first, SIIs do exist in Chinese; statistically, their number is not negligible. Statistical data and the analysis laid out so far suggest that, in the ZHTenTen (ST) corpus, more than 100,000 true SIIs (i.e. sentence-initial '一 yī CLF' forms with a true indefinite reading) can be detected. Compared to the total number of '一 yī CLF' occurrences in the ZHTenTen (ST) corpus, SIIs account for 1.48%. Crucially, this analysis was not able to detect all SIIs (e.g. those introduced by numerals other than 一 yī, those with longer modifiers, or those modified by restrictive relative clauses as in (15c)): hence, the true number of SIIs in the corpus is very likely to be higher. This has important implications: a theoretically sound account of the Chinese language and its word order should consider and discuss the existence and characteristics of this pattern. Similarly, SIIs should be introduced in Chinese grammars and teaching materials as well, with an explanation of their peculiarities, tendencies, and restrictions. Of course, specific (cross-sectional or longitudinal) studies should be conducted to determine at what stage/proficiency level SIIs should be taught.
(RQ2) Animacy is indeed a factor that has a significant impact on SIIs: the study shows that almost 8% of animate NPs introduced by '一 yī CLF' are sentence-initial, a percentage that drops to 2.6% for non-animate NPs. Furthermore, roughly 6 SIIs out of 10 are animate. Again, this is in line with other cross-linguistic studies on animacy and the sentence-initial position. Animacy was found to be a relevant factor in determining the order of event participants cross-linguistically. Studies conducted on different languages, including Spanish, Italian, Greek, Japanese, German, Dutch, Odawa (North America), and Yucatec, reveal that animate referents tend to occur before inanimate ones, regardless of their role in the event (see Van Bergen 2011 for an overview). When animate participants play the role of patients, speakers tend to produce passive sentences or to place the animate patient at the beginning of the sentence as a topic.
Finally, the above results confirm that corpora indeed contribute towards a better understanding of languages, even on topics with an established scholarship such as Chinese word order and referentiality, and make it possible to find previously unobserved or underdescribed patterns in the language: the study has revealed a new reading for seemingly indefinite patterns of the type '一 yī CLF N', i.e. those featuring a proper noun, as in (28) and (29).
On the other hand, the study has also highlighted some limitations of corpus tools. First, in this case a qualitative, sentence-by-sentence check was essential to refine, interpret, and validate quantitative results. Second, corpus design and POS tagging are not 100% reliable. For example, the query "[。; ? !]n一对" in the BCC corpus, which should retrieve only nominal modifiers, also identified the following (postverbal) token:

shì yí duì fū-fù
if neg be one clf husband-wife
'If they weren't a married couple […]'

All in all, the study clearly shows that SIIs are not only possible, but also do not constitute isolated exceptions, and that animacy and locatability indeed play a crucial role in increasing the acceptability of SIIs.