Networks Generated from Natural Language Text

Biemann, Chris; Quasthoff, Uwe

doi:10.1007/978-0-8176-4751-3_10

Part of the book series: Modeling and Simulation in Science, Engineering and Technology (MSSET)

The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.

In this chapter, we examine power-law distributions and small world graphs (SWGs) originating from natural language data. There are several reasons for the special interest in these structures.

1. Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes and use various rules to introduce edges between words. In many cases, this results in SWGs, whose node degrees in turn often follow a power-law distribution.
2. SWGs appear in many other kinds of real-world data, such as social networks of many kinds, the link structure of the World Wide Web, and traffic networks. It is interesting to analyze all these networks in more detail to identify similarities and differences.
3. From an application-driven view, SWGs allow effective clustering strategies in nearly linear time. Because these clusters are often related to the growth process of the underlying graph, they are often meaningful. In the case of natural language, these clusters usually reflect semantic and/or syntactic structures.
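
The first point can be illustrated with a minimal sketch. It assumes one particular edge rule, namely linking two words whenever they occur in the same sentence, with no significance filtering; this rule, the toy corpus, and all function names are hypothetical choices for illustration and are not taken from the chapter. On so little text no power law will be visible; the sketch only shows which quantities (rank-frequency pairs and node degrees) would be inspected on a real corpus.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string is treated as one sentence.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat chased a dog",
]

def rank_frequency(sentences):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(word for s in sentences for word in s.split())
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def cooccurrence_graph(sentences):
    """Undirected word graph under the assumed rule:
    two distinct words are linked if they share at least one sentence."""
    neighbours = {}
    for s in sentences:
        words = set(s.split())
        for w in words:
            neighbours.setdefault(w, set())
        for u, v in combinations(sorted(words), 2):
            neighbours[u].add(v)
            neighbours[v].add(u)
    return neighbours

def degree_distribution(graph):
    """Map each degree k to the number of nodes with exactly k neighbours."""
    return Counter(len(nbrs) for nbrs in graph.values())

if __name__ == "__main__":
    for rank, word, freq in rank_frequency(sentences)[:3]:
        print(rank, word, freq)
    graph = cooccurrence_graph(sentences)
    print(sorted(degree_distribution(graph).items()))
```

On a large corpus, plotting frequency against rank and node count against degree on log-log axes is the usual way to check for the straight-line signature of a power law.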

Notes

1. LCC, see http://www.corpora.uni-leipzig.de [July 7th, 2007].
2. e.g. http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].
3. http://www.natcorp.ox.ac.uk/ [April 1, 2007]
4. http://www.altavista.com
5. http://www.wikipedia.org
6. For example, a log likelihood ratio of 3.84 corresponds to a 5% error in stating that two words do not co-occur by chance, and a ratio of 6.63 corresponds to a 1% error.
7. http://corpora.informatik.uni-leipzig.de/?dict=en [April 1, 2007]
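
The thresholds in note 6 are the chi-square critical values for one degree of freedom. As a rough sketch of how such a score could be computed and compared against them, the following assumes the common 2x2 contingency-table form of the log likelihood ratio (G² = 2 Σ O ln(O/E)); the chapter's exact formulation is not shown here, and the function names, argument names, and example counts are hypothetical.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (assumed formulation).

    k11: sentences containing both words
    k12: sentences containing only the first word
    k21: sentences containing only the second word
    k22: sentences containing neither word
    """
    observed = ((k11, k12), (k21, k22))
    n = k11 + k12 + k21 + k22
    row_totals = (k11 + k12, k21 + k22)
    col_totals = (k11 + k21, k12 + k22)
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total

def error_level(score):
    """Map a score to the error levels named in note 6, or None if below both."""
    if score >= 6.63:
        return "1%"
    if score >= 3.84:
        return "5%"
    return None

if __name__ == "__main__":
    # Made-up counts: the two words co-occur far more often than chance predicts.
    score = log_likelihood_ratio(50, 5, 5, 940)
    print(round(score, 2), error_level(score))
```

A table whose cells match the independence expectation exactly yields a score of zero, so no co-occurrence would be reported as significant.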

Author information

Authors and Affiliations

Institute for Computer Science, NLP Department, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany
Chris Biemann, Uwe Quasthoff

Corresponding authors

Correspondence to Chris Biemann or Uwe Quasthoff.

About this chapter

Cite this chapter

Biemann, C., Quasthoff, U. (2009). Networks Generated from Natural Language Text. In: Ganguly, N., Deutsch, A., Mukherjee, A. (eds) Dynamics On and Of Complex Networks. Modeling and Simulation in Science, Engineering and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4751-3_10

DOI: https://doi.org/10.1007/978-0-8176-4751-3_10
Published: 26 February 2009
Publisher Name: Birkhäuser Boston
Print ISBN: 978-0-8176-4750-6
Online ISBN: 978-0-8176-4751-3
eBook Packages: Computer Science (R0)
