Statistical models for word frequency distributions: A linguistic evaluation

Computers and the Humanities

Abstract

Three models for word frequency distributions, namely the lognormal law, the generalized inverse Gauss-Poisson law, and the extended generalized Zipf's law, are compared and evaluated with respect to goodness of fit and rationale. Application of these models to frequency distributions of a text, a corpus, and morphological data reveals that no model can lay claim to exclusive validity. Inspection of the extrapolated theoretical vocabulary sizes, moreover, raises doubts as to whether the urn scheme with independent trials is the correct underlying model for word frequency data. The role of morphology in shaping word frequency distributions is discussed, as well as parallels between vocabulary richness in literary studies and morphological productivity in linguistics.

Author information

Additional information

R. Harald Baayen received his PhD at the Free University, Amsterdam, where he was involved in research on morphological productivity. He is now at the Max-Planck Institute for Psycholinguistics, Nijmegen, participating in a project on computational modelling of lexical representation and process.

About this article

Cite this article

Baayen, H. Statistical models for word frequency distributions: A linguistic evaluation. Comput Hum 26, 347–363 (1992). https://doi.org/10.1007/BF00136980

  • DOI: https://doi.org/10.1007/BF00136980
