Networks Generated from Natural Language Text

Dynamics On and Of Complex Networks

The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.

In this chapter, we examine power-law distributions and small world graphs (SWGs) originating from natural language data. There are several reasons for the special interest in these structures.

  1. Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes and use various rules to introduce edges between words. In many cases, this results in SWGs, which in turn often have a power-law distribution of node degrees.

  2. SWGs appear in many other kinds of real-world data, such as social networks of many kinds, the link structure of the World Wide Web, and traffic networks. It is interesting to analyze all these networks in more detail to identify similarities and differences.

  3. From an application-driven view, SWGs allow effective clustering strategies in nearly linear time. Because these clusters are often related to the growth process of the underlying graph, they are often meaningful. In the case of natural language, these clusters usually reflect semantic and/or syntactic structures.
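
The first point above can be illustrated with a minimal sketch. It assumes one particular edge rule, namely that two words are linked when they appear in the same sentence; the chapter speaks only of "various rules", so this choice, the toy corpus, and all function names are illustrative assumptions rather than the authors' method.

```python
from collections import Counter, defaultdict
from itertools import combinations


def rank_frequency(sentences):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


def cooccurrence_graph(sentences):
    """Build an undirected word graph.

    Assumed edge rule: two distinct words are neighbours if they occur
    together in at least one sentence.
    """
    neighbours = defaultdict(set)
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def degree_distribution(neighbours):
    """Map each degree k to the number of nodes with exactly k neighbours."""
    return Counter(len(adjacent) for adjacent in neighbours.values())


# Hypothetical toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

ranks = rank_frequency(sentences)
graph = cooccurrence_graph(sentences)
degrees = degree_distribution(graph)
```

A corpus this small cannot show a power law. On real data one would typically plot frequency against rank, and node count against degree, on log-log axes and look for an approximately straight line.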

Notes

  1. LCC, see http://www.corpora.uni-leipzig.de [July 7, 2007].

  2. E.g., http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].

  3. http://www.natcorp.ox.ac.uk/ [April 1, 2007].

  4. http://www.altavista.com

  5. http://www.wikipedia.org

  6. For example, a log-likelihood ratio of 3.84 corresponds to a 5% error in stating that two words do not co-occur by chance; a value of 6.63 corresponds to a 1% error.

  7. http://corpora.informatik.uni-leipzig.de/?dict=en [April 1, 2007].
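
The thresholds in note 6 match the upper-tail critical values of a chi-square distribution with one degree of freedom, which is the usual asymptotic reference distribution for the log-likelihood ratio on a 2x2 contingency table. The sketch below checks that correspondence using only the standard library. The function names and the table layout are assumptions made for illustration, not taken from the chapter.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For one degree of freedom this equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of co-occurrence counts.

    Assumed layout: k11 = units containing both words, k12 = first word only,
    k21 = second word only, k22 = neither word.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


p_at_384 = chi2_sf_1df(3.84)  # close to 0.05
p_at_663 = chi2_sf_1df(6.63)  # close to 0.01

# Hypothetical counts: the two words co-occur far more often than chance predicts.
strong = log_likelihood_ratio(50, 5, 5, 940)
# Counts that are exactly independent give a statistic of zero.
independent = log_likelihood_ratio(10, 10, 10, 10)
```

Under these assumptions, a word pair whose statistic exceeds 6.63 would be accepted as a significant co-occurrence at the 1% level.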

Author information

Corresponding authors

Correspondence to Chris Biemann or Uwe Quasthoff.

Copyright information

© 2009 Birkhäuser Boston, a part of Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Biemann, C., Quasthoff, U. (2009). Networks Generated from Natural Language Text. In: Ganguly, N., Deutsch, A., Mukherjee, A. (eds) Dynamics On and Of Complex Networks. Modeling and Simulation in Science, Engineering and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4751-3_10

  • DOI: https://doi.org/10.1007/978-0-8176-4751-3_10

  • Publisher Name: Birkhäuser Boston

  • Print ISBN: 978-0-8176-4750-6

  • Online ISBN: 978-0-8176-4751-3

  • eBook Packages: Computer Science; Computer Science (R0)
