Constructing n-gram rules for natural language models through exploring the limitation of the Zipf–Mandelbrot law

Published in: Computing 91, 241–264 (2011)

Abstract

The Zipf–Mandelbrot law is widely used to model power-law distributions in ranked data. One of its best-known applications is the linguistic analysis of the distribution of words ranked by their frequency in a text corpus. This paper explores known limitations of the Zipf–Mandelbrot law in modeling actual linguistic data from different domains, in both printed media and online content, and uses them to develop a new algorithm for constructing the n-gram rules needed to build natural language (NL) models for a human-to-computer interface. The statistically oriented n-gram rules are constructed by identifying the area of divergence between the Zipf–Mandelbrot curve and the actual frequency distribution of ranked n-gram text tokens. The tokens are extracted from a large text corpus derived from the online electronic programming guide (EPG) for television shows and movies. Two empirical experiments were carried out to evaluate the EPG-specific language models created with the new algorithm in the context of NL-based information retrieval systems. The experimental results show that the algorithm is effective for developing low-complexity concept models with high coverage of the user’s language models associated with both typed and spoken queries when interacting with an NL-based EPG search interface.
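The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea it describes: rank n-gram frequencies, fit a Zipf–Mandelbrot curve of the standard form f(r) = C / (r + b)^a, and flag the ranks where the observed frequencies depart from the fitted curve. The function names, the grid of candidate `b` values, and the log-space tolerance `tol` are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def fit_zipf_mandelbrot(freqs, b_grid=None):
    """Fit f(r) = C / (r + b)**a to positive, rank-ordered frequencies.

    For each candidate b (hypothetical grid), a and C come from ordinary
    least squares on log f = log C - a * log(r + b); the b with the
    smallest squared error wins. Needs at least two frequencies.
    """
    if b_grid is None:
        b_grid = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0, 8.0]
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    mean_y = sum(ys) / n
    best = None
    for b in b_grid:
        xs = [math.log(r + b) for r in range(1, n + 1)]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, -slope, b, math.exp(intercept))
    _, a, b, c = best
    return a, b, c


def divergent_ranks(freqs, a, b, c, tol=0.5):
    """Return ranks whose log-frequency differs from the fit by more than tol."""
    ranks = []
    for r, f in enumerate(freqs, start=1):
        predicted = c / (r + b) ** a
        if abs(math.log(f) - math.log(predicted)) > tol:
            ranks.append(r)
    return ranks
```

In use, the frequencies would come from something like `[count for _, count in ngram_counts(tokens, 2)]`, and the ranks returned by `divergent_ranks` mark the region where the curve and the data part ways. How that region is then turned into n-gram rules is specific to the paper and is not reproduced here.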

Author information

Correspondence to Harry M. Chang.

Additional information

Communicated by C.H. Cap.

About this article

Cite this article

Chang, H.M. Constructing n-gram rules for natural language models through exploring the limitation of the Zipf–Mandelbrot law. Computing 91, 241–264 (2011). https://doi.org/10.1007/s00607-010-0116-x

  • DOI: https://doi.org/10.1007/s00607-010-0116-x
