Research Methods in Applied Linguistics

This article demonstrates that, counter to current practice, (i) corpus-linguistic studies should provide uncertainty/interval estimates for all corpus-linguistic statistics, even for basic/fundamental ones such as frequencies, dispersions, or association measures, and (ii) these statistics should be based on text-/file-based bootstrapping and confidence/data ellipses covering two or more dimensions of information. Four small case studies – three more programmatic and one more applied – are offered to exemplify the logic and method. The first case study shows how parametric confidence intervals or confidence intervals from word-based bootstrapping can be inappropriate; the second case study exemplifies the computation of frequency-cum-dispersion intervals; the third does the same for collocational/collostructional data (the ditransitive); and the last case study exemplifies the use of these methods in a diachronic statutory-interpretation context.


Introduction
Nearly all corpus-linguistic studies at some point report some basic statistical results such as

• the frequency/ies of (co-)occurrence of some element(s) in a (part of a) corpus;
• the dispersion(s) of some element(s) in (parts of) a corpus;
• association measures quantifying (dis)preferences of co-occurrence of two (kinds of) elements in (parts of) a corpus.

As a trivial example, one might state that the word give occurs 444 times in the British component of the International Corpus of English (the 1-million-word ICE-GB), and many studies restrict themselves to reporting observed frequencies (sometimes normalized to per-million-words (pmw)). However, reporting the frequency, while still the standard, is sub-optimal because it neglects the fact that, in nearly all corpus-linguistic studies, the corpus being studied is a sample of a population it is supposed to represent. This means that, as corpus linguists, we need to be constantly aware of sampling variation, but too often we do not seem to be. Many experimental studies in linguistics (Second Language Acquisition (SLA), psycholinguistics, or applied linguistics) report descriptive statistics such as means and standard deviations for relevant experimental groups, but corpus studies – studies in learner corpus research, historical linguistics, descriptive synchronic research, etc. – rarely do so. They normally do not provide corpus frequencies together with estimates/indications of uncertainty/variability, namely a measure of dispersion (in either the statistical or the corpus-linguistic sense). […] While giving any indication of the statistical dispersion of give is unfortunately rare and, thus, would be laudable, the numbers provided are still problematic. To explain how and why, consider the following scenario: you do a corpus-linguistic analysis of something where, for your analysis to be supported by the corpus data, give would need to be more frequent than given.
You look up the frequencies of the two words in the ICE-GB, and you find that give and given occur 444 and 365 times, respectively. You write it up and send it off for publication. However, some pesky reviewer asks:

But how does the author know that the difference between 444 and 365 is "significant" or "robust enough" (i.e., not just a sampling error or something like that)? I would like to see the authors demonstrate that we can place enough trust in this frequency difference to consider their research hypothesis as supported.

The editor agrees with the reviewer's comment. Hence, you decide to provide the above kind of confidence intervals for the two frequencies: all it takes are two lines of R code, and you are happy to report that the 95% CIs of give and given do not overlap, meaning that you have the significant result you hoped for:

• 95% CI of give: [404.6, 487.3];
• 95% CI of given: [329.4, 404.4].

However, just before you send this off for publication, another review comes in late, which agrees with the idea of requiring you to provide CIs, but recommends a bootstrapping approach, an approach that involves repeated sampling with replacement from one's data in order to estimate how variable the results obtained from the data are (and with how big a grain of salt the results might have to be taken); see Gries (2006) for an early application in corpus linguistics and Egbert and Plonsky (2020) for treatments in the context of applied linguistics. Accordingly, you write a small script in R that does the following 1000 times: you take a random sample of words from the corpus with replacement and, then, you count the occurrences of give and given in it and store those frequencies in two vectors, one for each verb form. Next, you use those 1000-element vectors to compute the 2.5% and 97.5% quantiles for both give and given to obtain the confidence intervals of the bootstrapped frequencies.
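In R, the "two lines of code" would be calls to prop.test/binom.test; the interval prop.test reports is (a version of) the Wilson score interval. As a cross-check, here is a minimal Python sketch of the plain (uncorrected) Wilson interval, using the counts 444 and 365 out of 1,000,000 tokens from the scenario above; it reproduces the CIs just reported:

```python
import math

def wilson_ci(k, n, z=1.959964):
    """95% Wilson score interval for a count k out of n tokens,
    returned on the count scale (i.e., multiplied back by n)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return n * (center - half), n * (center + half)

print(wilson_ci(444, 1_000_000))  # ≈ (404.6, 487.3): the CI of give
print(wilson_ci(365, 1_000_000))  # ≈ (329.4, 404.4): the CI of given
```

Note the slight asymmetry of the intervals around the observed counts, which the Wilson interval exhibits (unlike the simple normal approximation).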
The analysis reveals that your findings are very similar to the results of R functions such as prop.test/binom.test and, thus, again support your hypothesis:

• 95% CI (bootstrapped) of give: [405, 484];
• 95% CI (bootstrapped) of given: [327, 401].

However, the problem still persists because the CIs computed as above assume a binomial distribution and/or a complete independence of data points (i.e., the computation assumes the so-called bag-of-words model, according to which corpora are "unstructured bags of words"); this implies that this approach does not consider the division of the corpus into, here, 500 parts/files (or 13 sub-registers or 5 registers), which means that this approach does not consider that the probability for any word to show up more than once in one text is higher than the probability of the word showing up more than once in the corpus as a whole (see the fitting sub-title of Church's famous 2000 paper: "The chance of two Noriegas is closer to p/2 than p²"). Why does this matter? It matters because this systematically distorts the computation of the CIs, and the difference between the kinds of CIs just mentioned – the first, parametric ones from the functions prop.test/binom.test; the second ones from the bag-of-words-based bootstrapping; and, third, the more appropriate text-/file-based bootstrapping ones that 'respect' the division of the corpus into parts – can 'make or break' an analysis (depending on one's hypotheses and significance thresholds, obviously). Consider Fig. 1.

However, as may have already become clear from the introduction, this paper extends some of the arguments of Gries (2006) to make the point that most fundamental corpus statistics – not only frequencies, dispersions, and associations, but also others – benefit from having their uncertainty/variability quantified with bootstrapping approaches (or simulation-based approaches more generally).
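The contrast between the bag-of-words bootstrap and the text-/file-based bootstrap can be made concrete with a toy sketch (Python rather than the paper's R; the ten 'files' and the clumped target word are invented for illustration – clumpiness into few texts is exactly what the bag-of-words model ignores):

```python
import random

random.seed(1)

# toy corpus: 10 'files' of 100 tokens; 'give' is clumped into 3 files
files = [["give"] * 40 + ["other"] * 60 if i < 3 else ["other"] * 100
         for i in range(10)]
words = [w for f in files for w in f]  # the flattened bag of words

def ci(v):
    """2.5% and 97.5% quantiles of a vector of bootstrap replicates."""
    v = sorted(v)
    return v[int(0.025 * len(v))], v[int(0.975 * len(v))]

# word-based bootstrapping: resample individual tokens with replacement
word_boot = [sum(random.choice(words) == "give" for _ in words)
             for _ in range(1000)]

# file-based bootstrapping: resample whole files with replacement,
# 'respecting' the division of the corpus into parts
file_boot = [sum(f.count("give") for f in random.choices(files, k=len(files)))
             for _ in range(1000)]

print("word-based CI:", ci(word_boot))
print("file-based CI:", ci(file_boot))
# the file-based CI is much wider, because the word is clumped into 3 files
```

With a perfectly evenly dispersed word, the two CIs would be similar; the more clumped the word, the more the word-based CI understates the real uncertainty.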
Put differently, simulation-based approaches such as bootstrapping do not just help with t-test type statistics, but they already help at an earlier stage, namely when we generate the kind of data for t-tests. Section 2 of this paper exemplifies the use of file-/part-based bootstrapping to quantify the uncertainty/variability that accompanies the combination of (i) observed […] using it as a technical term, one that is related to, yet distinct from, mere frequency of occurrence. A word w is 'common' in the most prototypical way if most or all of the following conditions hold:

² The notion of "Zipfian distribution" for corpus data refers to the fact that the frequencies of words in corpora in general, or the frequencies of words in, say, lexically, grammatically, or constructionally defined slots in particular, exhibit a distribution such that (i) a very small number of word types are highly frequent and (ii) a very large number of word types are extremely infrequent or even hapaxes. For example, of all the word forms between angular brackets in the ICE-GB, (i) the 30 most frequent word types (out of all 58,309) already account for 38.8% of all tokens in this 1-million-word corpus and (ii) ≈58% of all word types each occur only a single time in the corpus. In other words, a Zipfian distribution is a power-law function; see Manning & Schuetze (1999:20–29).

(1) an average speaker of the language is extremely likely to know and use w;
(2) an average speaker of the language has known w for a long time (they learned w at an early age);
(3) an average native speaker of the language is extremely likely to encounter w regularly, which means
    a. w is sufficiently frequent;
    b. w is sufficiently dispersed in the language; and
(4) a non-native speaker of the language is likely to learn w early in their learning/acquisition process.

With this list, I am not implying that these are new criteria – they are not: criteria 1 and 3a are operationalizable by frequency of occurrence in general corpora; criterion 3b is operationalizable by dispersion in general corpora; criterion 2 is essentially age of acquisition; and criterion 4 is operationalizable by frequency of occurrence and dispersion in learner corpora. I am merely introducing 'commonness' here as a notion that is intuitively straightforward to grasp and yet is defined here with a variety of generally well-[…] "[i]t is important to control for word frequency in psycholinguistic experiments because this variable has subtle effects, emerging not only between highly frequent and highly infrequent words, but even between frequent and slightly less frequent words." Similarly, Baayen et al. (2016) study the role of word frequencies in psycholinguistic work, and their following statement also implies that word frequencies are widely used in psycholinguistic work: "An assumption that lies behind the use of corpora in much psycholinguistic work is that a suitably representative corpus of, say, English can serve to represent (or control for) subjects' prior lexical experience in accounting for various aspects of linguistic behavior" (p. 6); they then point to an additional important aspect to be considered when word frequencies are used, namely that "in using frequency counts for the study of specific aspects of lexical processing it is important to consider the communicative goals of the texts sampled by a given corpus and the specific demands imposed by a given task probing aspects of lexical processing."

With all this information, we can now compute the dispersion of each word form. The Kullback-Leibler divergence (D_KL) is a directional measure of how much a probability distribution P diverges from another probability distribution Q, both defined over the same n outcomes (i.e., it is not a symmetric distance metric). In applications of D_KL,

• the probability distribution P is usually the one that reflects a posterior distribution and/or an observed distribution, i.e., data;
• the probability distribution Q is usually the one that reflects a prior distribution and/or a theoretical or an expected distribution.

In our application here, Q corresponds to the file sizes, the numeric vector we called file.sizes.rel above, which states for each file […] If we plot dispersion against frequency, we obtain the distribution that is characteristic of most general corpora and most dispersion measures, as shown in Fig. 2.

For the exemplification of the bootstrapping approach, I will compute bootstrapped frequencies and dispersions for all 179 verb types in a moderate frequency range (frequencies between 100 and 10,000 in the ICE-GB), but I will focus the more qualitative discussion on six verb lemmas from that range: COME, GIVE, KNOW, LOOK, MAKE, and TAKE. I define the number of bootstrapping iterations as 1000 and perform the following steps for each of these iterations:

• Sample 500 file names with replacement.

• Retrieve the words and the file names for these 500 files (with repetitions as needed) and create a term-document matrix. […]

This demonstrates that quantifying the uncertainty of our corpus statistics becomes more important as the words in question become less frequent and less evenly distributed. For the most frequent and/or evenly distributed word forms, there is not much variability, but the variability increases as the frequencies/dispersions of the word forms decrease.

Let us do some plots of six lemmas of interest. For interpretability's sake, this will be done in three plots with two lemmas each. Each plot shows all forms in grey and then highlights the results for the two lemmas of interest in blue and red. […]

The results are clear. For each of the verb forms investigated, we find that the exact frequency and dispersion values computed from the corpus exhibit a sizable amount of variability on both dimensions of frequency and dispersion. A study whose implications rested on how the forms of a lemma were ranked in terms of their commonness (or even just frequency and dispersion separately) would have to concede that, even if the results for the whole corpus supported every hypothesis, they show such a high amount of variability that many of the form-by-form rankings must rather be seen as consisting of forms that are not significantly different for both frequency and dispersion; this is particularly obvious for the forms of KNOW included here.

The proposed approach has several appealing characteristics. First, we can evaluate frequency and dispersion differences […]

An alternative approach would be for researchers to use bootstrapping to collect many resamples of texts from the corpus, storing a word list each time. The simplest way of creating a word list using these resampled word lists is to include any word that occurred in at least N% of the resampled lists.
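That resampled-word-list idea can be sketched as follows (a Python sketch rather than R; the toy corpus, the threshold of N = 90%, and the number of resamples are invented settings for illustration):

```python
import random

random.seed(0)

# toy corpus: 'the' and 'of' occur in every file, 'rare' in only one
files = [["the", "of", "word%d" % i] for i in range(10)]
files[0].append("rare")

def bootstrap_word_list(files, n_boot=500, threshold=0.9):
    """Keep any word that occurs in at least `threshold` (here 90%) of the
    word lists derived from file-based resamples of the corpus."""
    hits = {}
    for _ in range(n_boot):
        resample = random.choices(files, k=len(files))  # files, not words
        for w in set().union(*resample):
            hits[w] = hits.get(w, 0) + 1
    return {w for w, h in hits.items() if h / n_boot >= threshold}

core = bootstrap_word_list(files)
print(core)  # 'the' and 'of' always survive; 'rare' virtually never does
```

A word that occurs in every file survives every resample and is always retained, whereas a word confined to a single file is absent from roughly a third of the resamples here and falls below the 90% threshold.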
The resampled-word-list approach combines the measurement of frequency and dispersion into a single step, and it is a much more robust method that is less dependent on the design and contents of the original corpus sample. However, I submit that the present approach goes beyond this suggestion in two ways by

• not just using range as a measure of dispersion, as their proposal implies, but a measure that also takes the frequencies of words in the corpus parts into account; and
• not just using the bootstrapping result as a way to identify a cutoff point (although that is also a possible application) but also representing the degree of uncertainty/variability to inform cutoff points as well as rankings or other research decisions.

Finally, all this is achieved on the basis of a linguistically meaningful sampling unit, namely the file, which in this corpus represents the text or linguistic interaction. This is important because not respecting such linguistically meaningful divisions in our corpora – for instance, by dividing corpora artificially into n equally large parts even if that cuts across texts and/or subregisters – has been shown to lead to suboptimal results (Burch et al., 2017). Here, however, the bootstrapping was done on files (in Gries (2006), it was done on files, sub-registers, or registers), which preserves the integrity of the sampling process by not sampling across text boundaries. Let us now apply the same logic to the study of association measures.
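Before doing so, the core computation of this section – resampling files with replacement and recomputing a word's frequency together with its Kullback-Leibler-based dispersion against the relative file sizes – can be sketched as follows (Python rather than the paper's R; the toy corpus and the omission of any normalization of D_KL are simplifications of mine):

```python
import math
import random

random.seed(42)

# toy 'corpus': file name -> list of word tokens
corpus = {f"file{i:03d}": random.choices(["give", "take", "other"],
                                         weights=[2, 1, 17], k=200)
          for i in range(50)}

def dkl(p, q):
    # D_KL(P || Q) = sum_i p_i * log2(p_i / q_i); zero-p_i terms contribute 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def freq_and_dispersion(word, files):
    """Frequency of `word` plus its D_KL dispersion: P is how the word's
    tokens are distributed over files, Q the relative file sizes."""
    sizes = [len(toks) for toks in files.values()]
    counts = [toks.count(word) for toks in files.values()]
    total = sum(counts)
    p = [c / total for c in counts]       # observed distribution of the word
    q = [s / sum(sizes) for s in sizes]   # expected from relative file sizes
    return total, dkl(p, q)

# one bootstrap iteration: resample file names with replacement, recompute
names = random.choices(list(corpus), k=len(corpus))
resample = {f"{n}#{i}": corpus[n] for i, n in enumerate(names)}  # keep repeats
print(freq_and_dispersion("give", resample))
```

Repeating the last two lines 1000 times and collecting the (frequency, dispersion) pairs yields the clouds of replicates that the data ellipses below summarize.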

To exemplify the text-/file-based bootstrapping approach in the domain of research on co-occurrence/association, I will use a collostructional case study asking how much words are attracted to a slot in a construction. As an example, I will use the ditransitive construction, simply because it is well known to make for a good test/validation case. As mentioned above, and slightly tweaking Baayen's (2011) suggestion, I will use D_KLnorm as an association measure, which has two advantages:

• It is less strongly correlated with the raw frequencies of the elements involved than some other widely used association measures, such as G² or t, which really mostly reflect co-occurrence frequency rather than association (see Gries, to appear a, for empirical evidence to that effect).

• It is a directional measure, allowing us to focus on one direction of association rather than consider only mutual association. The direction of association I will use is from the verb to the construction (v2c).

[…] We also see that D_KLnorm is, as desired, much less dependent on frequency of co-occurrence: with the exception of forms occurring more than 2⁶ = 64 times in the ditransitive, the attraction of the forms to the construction is not obviously predictable from the co-occurrence frequency, which means that each dimension contributes something largely unique to the analysis.

To exemplify bootstrapping, we will proceed in a way that is analogous to the one above. We perform the following 1000 times:

• Sample 500 file names with replacement.
• Cross-tabulate, for all verb form tokens, when they are used in a ditransitive and when they are not.
• Compute D_KLnorm (v2c) for each verb form and store the results in a collector structure.
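One iteration of the computation in the last step might look as follows (a Python sketch; the exact definition of D_KLnorm (v2c) is given in Gries, to appear a, so both the operationalization below – comparing a verb's distribution over {ditransitive, other} with the overall distribution – and the normalization 1 − exp(−D_KL) are only one plausible reading, not the paper's actual formula):

```python
import math

def dkl(p, q):
    # natural-log KL divergence; zero-probability terms contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dkl_norm_v2c(verb_ditr, verb_other, all_ditr, all_other):
    """Directional verb-to-construction association: how much the verb's
    distribution over {ditransitive, other} diverges from the overall one,
    squashed into [0, 1) via 1 - exp(-D_KL) (an assumed normalization)."""
    p = [verb_ditr / (verb_ditr + verb_other),
         verb_other / (verb_ditr + verb_other)]
    q = [all_ditr / (all_ditr + all_other),
         all_other / (all_ditr + all_other)]
    return 1 - math.exp(-dkl(p, q))

# a verb heavily attracted to the ditransitive scores high ...
print(dkl_norm_v2c(200, 50, 1_000, 99_000))
# ... while a verb that mirrors the corpus-wide distribution scores ~0
print(dkl_norm_v2c(1, 99, 1_000, 99_000))
```

The counts passed in (verb in ditransitive, verb elsewhere, all ditransitives, all other slots) come from the cross-tabulation step, recomputed on each file-based resample.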
We plot the verb forms again by using data ellipses, as in Fig. 8. We observe that the verb forms that occupy the top slots, as in Stefanowitsch & Gries (2003:299), are of the lemmas GIVE and TELL. Interestingly, the forms of TELL seem to score slightly higher on association than the forms of GIVE, but we can also see that there […] can affect rankings considerably.

It is worth pointing out that it is possible to add a third dimension – dispersion – and its uncertainty/data ellipse to the mix to generate a three-dimensional version of Fig. 9 with uncertainty spheres. However, to keep complexity manageable, I am not doing this here. Let us now consider a diachronic case study from the domain of legal/forensic linguistics.

[…] To address both issues simultaneously, I used the following procedure, which was applied separately to each of the five decades. […]

• The x-axis represents the Zipf scale frequencies.

• The y-axis represents range, logged to the base of 2.
• The shaded areas around the points/numbers represent 90% and 95% data ellipses.

We can see that […] are words one would think every child or learner acquires or learns early (e.g., faculty or chronicle), but the number of words one might assume to be known to younger kids and earlier learners is still considerable (this seems to be true even for the names): clinic, wolf, designer, drivers, turkey, sandy, sheriff, mississippi, korea, massachusetts, alan. The AoA ratings for the words similar to gender in the 2000s fall right between, and are significantly different from, both the AoA ratings for the words similar to gender in the 1960s and the AoA ratings for the words similar to sex in the 1960s, as is also represented in the ecdf plot in Fig. 12. From the frequency-cum-dispersion changes over time, their uncertainty ellipses, and the changes in vocabulary similar to gender […]

The lookup was done in three stages: I first looked up the exact words listed above in the AoA ratings and used the AoA rating value if that word was included in the ratings. If the word was not included in the ratings, I looked up the lemma form and used that lemma's AoA rating instead (e.g., the singulars horse and family for horses and families, or the infinitives begin and design for the forms begins and designed). AoA ratings for the words for which no AoA ratings were available were set to the highest AoA rating available for the noun in question (gender vs. sex) in the time period in question. I am grateful to Reviewer 1 for suggesting the use of these ratings.

⁸ This is not everything that can be said about how gender related to sex, especially in the 1960s; see Eskridge et al. (2021) for a much more detailed discussion (including concrete examples of 1960s uses of sex that today would involve the word gender, with expressions such as sex roles and psychosexual, or the relation to transsexual and transgender).
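As an aside on the 90%/95% data ellipses used throughout: under a bivariate-normal assumption, a coverage-c data ellipse is centered on the two means, with axes along the eigenvectors of the 2×2 covariance matrix of the replicates and half-lengths sqrt(λ_i · q), where q = −2·ln(1 − c) is the chi-square quantile with 2 degrees of freedom. A self-contained Python sketch (the replicates here are invented, standing in for bootstrapped (frequency, dispersion) pairs):

```python
import math
import random

random.seed(7)

# invented bootstrap replicates of (frequency, dispersion) pairs
pts = [(100 + random.gauss(0, 5), 0.3 + random.gauss(0, 0.02))
       for _ in range(1000)]

def data_ellipse(points, coverage=0.95):
    """Return (center, half-axis lengths, angle in radians) of the
    coverage-level data ellipse of 2-d points (bivariate-normal assumption)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # eigenvalues of the 2x2 covariance matrix: tr/2 +- sqrt(tr^2/4 - det)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    disc = math.sqrt(tr ** 2 / 4 - det)
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    q = -2 * math.log(1 - coverage)  # chi-square quantile, 2 df
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.sqrt(l1 * q), math.sqrt(l2 * q)), angle

center, axes, angle = data_ellipse(pts)
print(center, axes, angle)
```

Non-overlap of two such ellipses is then the visual analogue of a significant difference on the two dimensions jointly.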
Corpus data are typically Zipfian-, not normally, distributed, as discussed in Section 1. To provide a brief example, Fig. 13 shows the distributions of the file sizes (in words) of the BNC (left panel) and the frequencies (in files) of the randomly chosen word furniture (right panel). Clearly, quantifying the uncertainty with a mean and standard deviation is not a good idea for the data represented in these histograms.

There are other advantages as well. For example, the visual representation suggested above not only gives us more information than a mere plotting of a single point estimate for each (year, frequency, dispersion) tuple would offer, but it also provides a nice […] can find temporal stages in diachronic data. This algorithm yielded interesting results but is, in at least all the applications I have seen, based solely on the exact data it is fed, meaning that it might be just as sensitive to corpus sampling artefacts/peculiarities as the case study in Section 1 and might, therefore, benefit from being complemented by the bootstrapping approach outlined here, which gives rise to stages via (lack of) overlap of the data ellipses resulting from the resampling. This idea should be explored further.

The present approach can also be improved. For instance, while the above identification of similar words to aid the interpretation of Zipf scale frequencies and ranges is already helpful, there is one really important way in which I would like to see frequency comparisons be improved. It is customary to compare frequencies of linguistic elements across different, and especially differently sized, corpora with normalized frequencies (e.g., pmw) or Zipf scale frequencies, but such comparisons are still potentially very misleading because, while they involve a correction for different corpus sizes, they do not correct for how the compositions of the corpora (or, here, the corpus decades as a whole) have changed.
For instance, a word type w1 with a Zipf scale frequency z and the range value r in the 1960s decade might be the 1000th most frequent and the 2000th most evenly dispersed word type in that corpus decade, but a word type w2 might have the same Zipf scale frequency z and the same range value r in another corpus decade and yet be the 500th most frequent and the 1000th most evenly dispersed word type, simply because of how the word type distributions differ across the two corpus decades. Gries (2021b) uses this correction and shows that the results for the singular forms gender and sex (using ranks of frequencies and DP-values) support the results presented above and indicate that gender moves up considerably in the frequency and dispersion rankings (i.e., it becomes more frequent and more evenly distributed than many other words per decade), whereas sex remains relatively the same across all decades. This kind of computation is computationally extremely demanding because it means that, even if one is only interested in two word types, one still needs to compute the frequencies and dispersions of all word types in the corpus (parts) under consideration (to compute ranks for the words of interest), and one needs to do so once for every bootstrapping iteration. However, this strategy allows for comparisons that are not just based on the values of two words but take all the corpus changes into consideration. Hopefully, modern computing resources will make approaches such as these more feasible.
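The rank-based logic of that correction can be sketched as follows (Python; the tiny vocabularies and Zipf values are invented toy data – the real computation would rank all word types per decade and per bootstrap iteration):

```python
def rank_of(word, scores):
    """1-based rank of `word` when all types are sorted by descending score."""
    better = sum(1 for s in scores.values() if s > scores[word])
    return better + 1

# invented per-decade Zipf scale frequencies for a four-word vocabulary;
# 'gender' has the identical Zipf value in both decades ...
decade_1960s = {"gender": 4.0, "sex": 5.1, "state": 5.6, "tax": 4.9}
decade_2000s = {"gender": 4.0, "sex": 3.9, "state": 3.8, "tax": 3.7}

# ... but very different positions relative to the rest of the vocabulary
print(rank_of("gender", decade_1960s))  # → 4
print(rank_of("gender", decade_2000s))  # → 1
```

The same identical Zipf value thus corresponds to the bottom rank in one decade and the top rank in the other, which is exactly the distortion the rank-based comparison corrects for.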
Regardless of the additional applications, extensions, and improvements we can come up with, I hope that the advantages of the current suggestions are attractive enough to make the field consider them and deviate from reporting just point estimates and/or simplistic chi-squared/G² applications in general corpus linguistics, theoretical applications (e.g., within cognitive or usage-based linguistics), or psycholinguistic experimentation, as well as in more applied fields such as learner corpus research or, as here, legal/forensic contexts. Hopefully, the increased reliability resulting from the above will allow the field to make further progress.