Enhanced topic distillation using text, markup tags, and hyperlinks

Authors:
Soumen Chakrabarti (IIT Bombay, India)
Mukul Joshi (IIT Bombay, India)
Vivek Tawde (IIT Bombay, India)

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, September 2001, Pages 208–216
https://doi.org/10.1145/383952.383990

Published: 01 September 2001

ABSTRACT

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
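To make the mutual reinforcement between hubs and authorities concrete, here is a minimal sketch of the coarse-grained, page-level iteration that the abstract contrasts itself with (in the style of Kleinberg's hubs-and-authorities method). It is not the enhanced algorithm of this paper, which additionally uses text and markup tag trees. The function names `hits` and `normalize` and the toy graph are assumptions made purely for illustration.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Page-level hub/authority iteration over a directed edge list.

    Each whole page is a single node, which is the coarse model
    the abstract describes as prone to topic drift.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of authority scores of pages it links to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: one sparse "relevant" hub (h1 -> a1, a2) and a
# densely linked "irrelevant" cluster (n1..n3 each link to m1..m3).
edges = [("h1", "a1"), ("h1", "a2")]
edges += [(n, m) for n in ("n1", "n2", "n3") for m in ("m1", "m2", "m3")]

hub, auth = hits(edges)
for node in sorted(auth, key=auth.get, reverse=True):
    print(node, round(auth[node], 4))
```

On this toy graph the densely linked cluster absorbs essentially all of the authority score while the sparse subgraph's scores decay toward zero, which is a small-scale picture of the topic drift the abstract describes.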

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
    2. Search methodologies
      1. Discrete space search
      2. Game tree search
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published in
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN: 1581133316
DOI: 10.1145/383952
Chairmen:
Donald H. Kraft (Louisiana State Univ.)
W. Bruce Croft (University of Massachusetts; for the Americas)
David J. Harper (The Robert Gordon University; for Europe and Africa)
Justin Zobel (RMIT University; for Asia and Australasia)
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SIGIR '01 paper acceptance rate: 47 of 201 submissions, 23%
Overall acceptance rate: 792 of 3,983 submissions, 20%