Block-Based Language Modeling Approach Towards Web Search

  • Conference paper
Web Technologies Research and Development - APWeb 2005 (APWeb 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)

Abstract

In the probabilistic language modeling approach to information retrieval, a model is estimated for each document individually. However, as Web pages become more complex, a single page may contain several blocks that discuss different topics. Consequently, the performance of a statistical model of a Web document tends to be degraded by this mixture of topics. In this paper, we argue that segmenting a Web page into several relatively independent blocks assists language modeling, and we propose a Block-based Language Modeling (BLM) approach. Unlike the standard method, BLM splits the modeling process into two parts: the probability of a query occurring in a block, and the probability of a block occurring in a Web page. Given a query, pages with more relevant blocks then tend to be retrieved. Experimental results show that, when a unigram model is used, our approach outperforms the original language modeling approach for Web search in most cases.
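
The two-part decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not give the estimators, so everything below is an assumption made for illustration: the page score is taken as the sum over blocks of P(Q|B) · P(B|D), P(Q|B) is a unigram model with Jelinek-Mercer smoothing against the collection, and P(B|D) is the block's share of the page length. The function names, the `lam` parameter, and the toy pages are hypothetical and are not taken from the paper.

```python
from collections import Counter

def block_query_prob(query, block, collection, total, lam=0.5):
    """P(Q|B): unigram query likelihood for one block.

    Assumption: Jelinek-Mercer smoothing with the collection model;
    the paper's actual smoothing method is not stated in the abstract.
    """
    counts = Counter(block)
    prob = 1.0
    for term in query:
        p_block = counts[term] / len(block) if block else 0.0
        p_coll = collection[term] / total if total else 0.0
        prob *= lam * p_block + (1 - lam) * p_coll
    return prob

def page_score(query, blocks, collection, total, lam=0.5):
    """Assumed mixture: sum over blocks of P(Q|B) * P(B|D)."""
    page_len = sum(len(b) for b in blocks)
    if page_len == 0:
        return 0.0
    score = 0.0
    for block in blocks:
        # Assumption: P(B|D) is the block's share of the page's terms.
        p_block_given_page = len(block) / page_len
        score += block_query_prob(query, block, collection, total, lam) * p_block_given_page
    return score

# Toy data: each page is a list of already-segmented, tokenized blocks.
pages = {
    "a": [["cheap", "flights", "paris"], ["login", "register", "help"]],
    "b": [["weather", "news", "sports"], ["login", "register", "help"]],
}
collection = Counter(t for blocks in pages.values() for b in blocks for t in b)
total = sum(collection.values())

query = ["flights", "paris"]
scores = {name: page_score(query, blocks, collection, total)
          for name, blocks in pages.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting, page "a" has one block that matches the query and therefore ranks above page "b", whose blocks contribute only the smoothed collection probability. Page segmentation itself is outside the scope of the sketch.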

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, S., Huang, S., Xue, GR., Yu, Y. (2005). Block-Based Language Modeling Approach Towards Web Search. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_18

  • DOI: https://doi.org/10.1007/978-3-540-31849-1_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25207-8

  • Online ISBN: 978-3-540-31849-1

  • eBook Packages: Computer Science, Computer Science (R0)
