DOI: 10.1145/383952.383990
Article

Enhanced topic distillation using text, markup tags, and hyperlinks

Published: 01 September 2001

ABSTRACT

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
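The coarse-grained mutual reinforcement that the abstract describes can be sketched as follows. This is a minimal illustration of classic whole-page hub/authority iteration, not the enhanced algorithm the paper presents; the toy graph, the node names, and the `hits` and `normalize` helpers are assumptions made for the example. The graph pairs a sparse "relevant" subgraph with a tightly linked "navigation" clique to show how the dense subgraph can win, which is the topic drift the abstract refers to.

```python
import math

def normalize(scores):
    """Scale a score dict to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(edges, iterations=50):
    """Whole-page hub/authority iteration over directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page's authority score sums the hub scores of pages linking to it.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score sums the authority scores of pages it links to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical toy graph (assumption): a sparse relevant subgraph ...
relevant = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
# ... and a dense clique of pages that all link to each other,
# standing in for navigation panels or advertisement links.
nav_pages = ["nav1", "nav2", "nav3", "nav4"]
clique = [(s, d) for s in nav_pages for d in nav_pages if s != d]

EDGES = relevant + clique
hub, auth = hits(EDGES)

# Rank pages by authority score, highest first.
for node in sorted(auth, key=auth.get, reverse=True):
    print(f"{node:5s} authority={auth[node]:.4f} hub={hub[node]:.4f}")
```

In this toy graph the clique pages end up with essentially all of the authority mass while `a1` and `a2` fall toward zero, even though nothing about the clique is relevant to a query. Because each page is a single node, the iteration has no way to discount a block of navigation links inside a page.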


Published in

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN: 1581133316
DOI: 10.1145/383952

Copyright © 2001 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '01 paper acceptance rate: 47 of 201 submissions (23%).
Overall acceptance rate: 792 of 3,983 submissions (20%).
