Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval

Conference paper in: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2021)

Abstract

In this work, we analyze a pseudo-relevance retrieval method based on the results of web search engines. By enriching topics with text data from web search engine result pages and linked contents, we train topic-specific and cost-efficient classifiers that can be used to search test collections for relevant documents. Building on attempts initially made at TREC Common Core 2018 by Grossman and Cormack, we address questions of system performance over time, considering different search engines, queries, and test collections. Our experimental results show how, and to what extent, the considered components affect retrieval performance. Overall, the analyzed method is robust in terms of average retrieval performance and a promising way to use web content for the data enrichment of relevance feedback methods.
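
For illustration, the sketch below shows one way such a web-based feedback loop can be realized. It is a minimal sketch under assumptions, not the authors' implementation (their code is linked in the notes): it assumes scikit-learn and hypothetical inputs, where serp_texts holds text scraped from a topic's search engine result pages and linked contents (treated as pseudo-relevant), negative_texts holds pseudo-irrelevant text such as randomly sampled collection documents, and collection_docs maps document IDs to document text.

    # Minimal sketch (hypothetical names, not the paper's exact pipeline):
    # train a topic-specific classifier on web texts treated as pseudo-relevant
    # and rank the test collection by the predicted relevance probability.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def rank_collection(serp_texts, negative_texts, collection_docs, top_k=1000):
        # TF-IDF features fitted on the web-based enrichment texts only.
        vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
        X = vectorizer.fit_transform(serp_texts + negative_texts)
        y = [1] * len(serp_texts) + [0] * len(negative_texts)
        clf = LogisticRegression(max_iter=1000).fit(X, y)

        # Score every document of the test collection and keep the top k.
        doc_ids = list(collection_docs)
        X_docs = vectorizer.transform(collection_docs[d] for d in doc_ids)
        scores = clf.predict_proba(X_docs)[:, 1]
        return sorted(zip(doc_ids, scores), key=lambda t: t[1], reverse=True)[:top_k]

The resulting ranking can then be evaluated against the test collection's relevance judgments, e.g. with nDCG as in the paper.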

Notes

  1. https://github.com/irgroup/clef2021-web-prf/.

  2. https://spreadprivacy.com/why-use-duckduckgo-instead-of-google/, accessed: May 3rd, 2021.

  3. Grossman and Cormack are affiliated with the University of Waterloo in Canada.

  4. Most tables and figures contain results instantiated with nDCG. For results instantiated with other measures, see the online appendix at https://github.com/irgroup/clef2021-web-prf/blob/master/doc/appendix.pdf.

  5. Pearson's r = 0.9747 and p = 0.0002; a sketch of this computation follows after these notes.

  6. https://doi.org/10.5281/zenodo.4105885.
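
A correlation like the one reported in note 5 can be computed with SciPy's pearsonr; the paired score lists below are placeholders for illustration, not values from the paper.

    # Hypothetical example: Pearson correlation between two lists of paired
    # system scores (placeholder numbers, not results from this paper).
    from scipy.stats import pearsonr

    scores_a = [0.42, 0.47, 0.51, 0.55, 0.58, 0.61, 0.64]
    scores_b = [0.40, 0.46, 0.49, 0.56, 0.57, 0.62, 0.66]
    r, p = pearsonr(scores_a, scores_b)
    print(f"Pearson's r = {r:.4f}, p = {p:.4f}")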

References

  1. Bandyopadhyay, A., Ghosh, K., Majumder, P., Mitra, M.: Query expansion for microblog retrieval. Int. J. Web Sci. 1(4), 368–380 (2012)

  2. Breuer, T., et al.: How to measure the reproducibility of system-oriented IR experiments. In: Proceedings of SIGIR, pp. 349–358 (2020)

  3. Breuer, T., Ferro, N., Maistro, M., Schaer, P.: repro_eval: a Python interface to reproducibility measures of system-oriented IR experiments. In: Proceedings of ECIR, pp. 481–486 (2021)

  4. Cormack, G.V., Smucker, M.D., Clarke, C.L.A.: Efficient and effective spam filtering and re-ranking for large web datasets. Inf. Retr. 14(5), 441–465 (2011). https://doi.org/10.1007/s10791-011-9162-z

  5. Croft, W.B., Harper, D.J.: Using probabilistic models of document retrieval without relevance information. J. Doc. 35(4), 285–295 (1979)

  6. Diaz, F., Metzler, D.: Improving the estimation of relevance models using large external corpora. In: Proceedings of SIGIR, pp. 154–161. ACM (2006)

  7. Grossman, M.R., Cormack, G.V.: MRG_UWaterloo and WaterlooCormack participation in the TREC 2017 Common Core track. In: Proceedings of TREC, vol. 500–324. National Institute of Standards and Technology (NIST) (2017)

  8. Grossman, M.R., Cormack, G.V.: MRG_UWaterloo participation in the TREC 2018 Common Core track. In: Proceedings of TREC (2018)

  9. Hannak, A., Sapiezynski, P., Kakhki, A.M., Krishnamurthy, B., Lazer, D., Mislove, A., Wilson, C.: Measuring personalization of web search. In: Proceedings of World Wide Web Conference, WWW, pp. 527–538 (2013)

  10. Kwok, K., Grunfeld, L., Deng, P.: Improving weak ad-hoc retrieval by web assistance and data fusion. In: Proceedings of Asia Information Retrieval Symposium, AIRS 2005, pp. 17–30 (2005)

  11. Li, Y., Luk, R.W.P., Ho, E.K.S., Chung, K.F.: Improving weak ad-hoc queries using Wikipedia as external corpus. In: Proceedings of SIGIR, pp. 797–798. ACM (2007)

  12. Liu, T.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (2009)

  13. Nallapati, R.: Discriminative models for information retrieval. In: Proceedings of SIGIR, pp. 64–71. ACM (2004)

  14. Raman, K., Udupa, R., Bhattacharya, P., Bhole, A.: On improving pseudo-relevance feedback using pseudo-irrelevant documents. In: Gurrin, C., et al. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 573–576. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_50

  15. Robertson, S., Callan, J.: TREC: experiment and evaluation in information retrieval, pp. 99–122 (2005)

  16. Roy, D., Mitra, M., Ganguly, D.: To clean or not to clean: document preprocessing and reproducibility. J. Data Inf. Qual. 10(4), 18:1–18:25 (2018)

  17. Ruthven, I., Lalmas, M.: A survey on the use of relevance feedback for information access systems. Knowl. Eng. Rev. 18(2), 95–145 (2003)

  18. Voorhees, E.M.: Overview of the TREC 2004 Robust track. In: Proceedings of TREC, vol. 500–261 (2004)

  19. Voorhees, E.M.: Overview of the TREC 2005 Robust Retrieval track. In: Proceedings of TREC, vol. 500–266 (2005)

  20. Walker, S., Robertson, S.E., Boughanem, M., Jones, G.J.F., Jones, K.S.: Okapi at TREC-6: automatic ad hoc, VLC, routing, filtering and QSDR. In: Proceedings of TREC, pp. 125–136 (1997)

  21. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28(4), 20:1–20:38 (2010)

  22. Xu, Y., Jones, G.J.F., Wang, B.: Query dependent pseudo-relevance feedback based on Wikipedia. In: Proceedings of SIGIR, pp. 59–66. ACM (2009)

  23. Xu, Z., Akella, R.: A Bayesian logistic regression model for active relevance feedback. In: Proceedings of SIGIR, pp. 227–234. ACM (2008)

  24. Yi, X., Raghavan, H., Leggetter, C.: Discovering users’ specific geo intention in web search. In: Proceedings of World Wide Web Conference, WWW, pp. 481–490 (2009)

  25. Yu, S., Cai, D., Wen, J., Ma, W.: Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Proceedings of World Wide Web Conference, WWW, pp. 11–18. ACM (2003)

Acknowledgments

This paper is supported by the DFG (project no. 407518790).

Author information

Correspondence to Timo Breuer.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Breuer, T., Pest, M., Schaer, P. (2021). Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval. In: Candan, K.S., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2021. Lecture Notes in Computer Science, vol. 12880. Springer, Cham. https://doi.org/10.1007/978-3-030-85251-1_5

  • DOI: https://doi.org/10.1007/978-3-030-85251-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85250-4

  • Online ISBN: 978-3-030-85251-1

  • eBook Packages: Computer Science (R0)
