Abstract
The Zipf–Mandelbrot law is widely used to model power-law distributions in ranked data. One of its best-known applications is in linguistic analysis, where it describes the distribution of words ranked by their frequency in a text corpus. By exploring known limitations of the Zipf–Mandelbrot law in modeling actual linguistic data from different domains, in both printed media and online content, this work develops a new algorithm to construct n-gram rules for building the natural language (NL) models required for a human-to-computer interface. The construction of these statistically oriented n-gram rules is based on a new computing algorithm that identifies the area of divergence between the Zipf–Mandelbrot curve and the actual frequency distribution of ranked n-gram text tokens extracted from a large text corpus derived from the online electronic programming guide (EPG) for television shows and movies. Two empirical experiments were carried out to evaluate the EPG-specific language models created using the new algorithm in the context of NL-based information retrieval systems. The experimental results show the effectiveness of the algorithm for developing low-complexity concept models with high coverage of the user’s language models associated with both typed and spoken queries when interacting with an NL-based EPG search interface.
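As a rough illustration of the idea of comparing a Zipf–Mandelbrot curve with ranked n-gram frequencies, the following minimal sketch may help. It is not the article's algorithm: the function names, the log-ratio criterion, and the `tolerance` threshold are all assumptions made for this example. The curve parameters `c`, `a`, and `b` are assumed to be supplied by the caller, using the standard form f(r) = c / (r + b)^a.

```python
from collections import Counter
import math


def zipf_mandelbrot(rank, c, a, b):
    """Expected frequency at a 1-based rank under f(r) = c / (r + b) ** a."""
    return c / (rank + b) ** a


def ranked_ngram_counts(tokens, n):
    """Count the n-grams in a token list; return the counts, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return sorted(Counter(grams).values(), reverse=True)


def divergence_ranks(counts, c, a, b, tolerance=0.5):
    """Return the 1-based ranks whose observed count departs from the curve.

    Hypothetical criterion (an assumption, not taken from the article):
    a rank diverges when |log(observed) - log(expected)| > tolerance.
    """
    ranks = []
    for rank, observed in enumerate(counts, start=1):
        expected = zipf_mandelbrot(rank, c, a, b)
        if abs(math.log(observed) - math.log(expected)) > tolerance:
            ranks.append(rank)
    return ranks
```

For example, `divergence_ranks([10, 5, 1], c=10, a=1.0, b=0.0)` returns `[3]`: the first two counts lie exactly on the curve, while the third falls well below the expected value of about 3.33. In practice the parameters would be fitted to the corpus first, a step this sketch leaves out.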
Additional information
Communicated by C.H. Cap.
Cite this article
Chang, H.M. Constructing n-gram rules for natural language models through exploring the limitation of the Zipf–Mandelbrot law. Computing 91, 241–264 (2011). https://doi.org/10.1007/s00607-010-0116-x
DOI: https://doi.org/10.1007/s00607-010-0116-x
Keywords
- Zipf–Mandelbrot law
- Natural language processing
- N-gram statistical language models
- Quantitative linguistics