The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.
In this chapter, we examine power-law distributions and small world graphs (SWGs) originating from natural language data. There are several reasons for the special interest in these structures.
1. Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes and use various rules to introduce edges between words. In many cases, this results in SWGs, whose node degrees in turn often follow a power-law distribution.
2. SWGs appear in many other kinds of real-world data, such as social networks, the link structure of the World Wide Web, and traffic networks. Analyzing these networks in more detail helps to identify their similarities and differences.
3. From an application-driven view, SWGs admit effective clustering strategies that run in nearly linear time. Because the resulting clusters are often related to the growth process of the underlying graph, they are often meaningful. In the case of natural language, these clusters usually reflect semantic and/or syntactic structures.
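The first point can be illustrated with a minimal sketch. It assumes a tokenized corpus, uses sentence-level co-occurrence as one possible edge rule (the chapter speaks only of "various rules"), and estimates the power-law exponent with a plain least-squares fit in log-log space, which is a rough diagnostic rather than a rigorous estimator. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def rank_frequency(tokens):
    """Return word frequencies sorted in descending order (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank), ranks starting at 1.

    For an ideal Zipf distribution (frequency proportional to 1/rank)
    the slope is -1. This is a crude estimate, used here only for illustration.
    """
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def cooccurrence_graph(sentences):
    """Build an undirected graph with words as nodes.

    Assumed edge rule: two words are linked if they share a sentence.
    """
    neighbours = {}
    for sentence in sentences:
        words = set(sentence)
        for w in words:
            neighbours.setdefault(w, set())
        for a, b in combinations(words, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def degree_distribution(neighbours):
    """Map each degree to the number of nodes that have that degree."""
    return Counter(len(adj) for adj in neighbours.values())
```

On a real corpus, one would pass the output of `rank_frequency` to `loglog_slope` and inspect `degree_distribution(cooccurrence_graph(sentences))` on a log-log scale to see whether the node degrees also look power-law distributed.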
Notes
1. LCC, see http://www.corpora.uni-leipzig.de [July 7th, 2007].
2. e.g. http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].
3. http://www.natcorp.ox.ac.uk/ [April 1, 2007]
4. http://www.altavista.com
5. http://www.wikipedia.org
6. For example, a log likelihood ratio of 3.84 corresponds to a 5% error in stating that two words do not occur by chance, a significance of 6.63 corresponds to a 1% error.
7. http://corpora.informatik.uni-leipzig.de/?dict=en [April 1, 2007]
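The thresholds in note 6 are the critical values of the chi-square distribution with one degree of freedom at the 5% and 1% levels. As a hedged sketch of how such a significance value could be computed, the following applies the standard log-likelihood ratio statistic to a 2x2 contingency table of co-occurrence counts. The function and parameter names are hypothetical, and the chapter may use a different variant of the measure.

```python
import math


def log_likelihood_ratio(n_ab, n_a, n_b, n):
    """G^2 statistic for the co-occurrence of words A and B.

    n_ab: units (e.g. sentences) containing both A and B
    n_a, n_b: units containing A, respectively B
    n: total number of units
    """
    observed = [
        n_ab,                  # A and B
        n_a - n_ab,            # A without B
        n_b - n_ab,            # B without A
        n - n_a - n_b + n_ab,  # neither A nor B
    ]
    row_totals = [n_a, n_a, n - n_a, n - n_a]
    col_totals = [n_b, n - n_b, n_b, n - n_b]
    total = 0.0
    for o, r, c in zip(observed, row_totals, col_totals):
        expected = r * c / n  # expected count under independence
        if o > 0:             # a cell with zero observations contributes 0
            total += o * math.log(o / expected)
    return 2.0 * total


def significance_level(g2):
    """Compare G^2 with the chi-square (1 d.f.) critical values from note 6."""
    if g2 >= 6.63:
        return "1%"
    if g2 >= 3.84:
        return "5%"
    return "not significant"
```

A pair whose joint count equals the count expected under independence yields a value of 0, while pairs that co-occur far more often than expected exceed both thresholds.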
Copyright information
© 2009 Birkhäuser Boston, a part of Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Biemann, C., Quasthoff, U. (2009). Networks Generated from Natural Language Text. In: Ganguly, N., Deutsch, A., Mukherjee, A. (eds) Dynamics On and Of Complex Networks. Modeling and Simulation in Science, Engineering and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4751-3_10
DOI: https://doi.org/10.1007/978-0-8176-4751-3_10
Publisher Name: Birkhäuser Boston
Print ISBN: 978-0-8176-4750-6
Online ISBN: 978-0-8176-4751-3
eBook Packages: Computer Science, Computer Science (R0)