ABSTRACT
Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, allowing a tightly linked irrelevant subgraph to win over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem becomes especially severe as pages grow more complex, with navigation panels and advertisement links. We present an enhanced topic distillation algorithm that analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees that have high text- and hyperlink-based coherence with respect to the query. These subtrees receive preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 of them from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
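For orientation, the coarse-grained mutual reinforcement that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the standard hub/authority iteration with whole pages as single nodes, not the enhanced algorithm presented in the paper. The function names `hits` and `normalize`, the edge-list input format, and the iteration count are assumptions made for this example.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Baseline hub/authority iteration over (source, target) page links.

    Assumption: each page is a single node, which is the coarse-grained
    model the abstract describes as prone to topic drift.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores
        # of the pages that link to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores
        # of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: three hub pages linking to two authority pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
```

In this toy graph, `a2` receives more in-links than `a1` and ends up with the higher authority score. Because every link on a page contributes equally, a page whose navigation panel or advertisements point into a dense off-topic cluster would reinforce that cluster just as strongly, which is the drift the abstract describes.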
Index Terms
- Enhanced topic distillation using text, markup tags, and hyperlinks