Abstract
In this work, we analyze a pseudo-relevance feedback retrieval method based on the results of web search engines. By enriching topics with text data from web search engine result pages and linked contents, we train topic-specific and cost-efficient classifiers that can be used to search test collections for relevant documents. Building upon attempts initially made at TREC Common Core 2018 by Grossman and Cormack, we address questions of system performance over time, considering different search engines, queries, and test collections. Our experimental results show how and to what extent the considered components affect the retrieval performance. Overall, the analyzed method is robust in terms of average retrieval performance and a promising way to use web content for the data enrichment of relevance feedback methods.
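The enrichment idea described above can be sketched in a few lines: text scraped from search engine result pages (SERPs) serves as pseudo-relevant training data, a background sample serves as pseudo-irrelevant data, and a lightweight topic model scores test-collection documents. The following stdlib-only sketch uses hypothetical example texts and a simple per-term log-odds weighting as a stand-in for the topic-specific classifiers the paper actually trains; names and data are illustrative, not the authors' implementation.

```python
# Illustrative sketch of web-based data enrichment for pseudo-relevance feedback.
# The SERP snippets and background texts below are hypothetical stand-ins for
# scraped web content; the paper's real pipeline trains proper classifiers.
import math
from collections import Counter

def term_weights(positives, negatives):
    """Log-odds weight per term: high if frequent in pseudo-relevant web text,
    low if frequent in the pseudo-irrelevant background sample."""
    pos, neg = Counter(), Counter()
    for text in positives:
        pos.update(text.lower().split())
    for text in negatives:
        neg.update(text.lower().split())
    vocab = set(pos) | set(neg)
    # Add-one smoothing keeps unseen terms from producing infinite log-odds.
    return {w: math.log((pos[w] + 1) / (neg[w] + 1)) for w in vocab}

def score(doc, weights):
    """Score a test-collection document by summing its terms' weights."""
    return sum(weights.get(w, 0.0) for w in doc.lower().split())

# Pseudo-relevant positives: hypothetical text from SERPs / linked pages.
serp_texts = [
    "solar panel efficiency photovoltaic cells",
    "photovoltaic materials improve solar energy conversion",
]
# Pseudo-irrelevant negatives: hypothetical text from unrelated topics.
background = [
    "stock market closes higher after earnings reports",
    "team wins championship after overtime thriller",
]

weights = term_weights(serp_texts, background)
docs = [
    "advances in solar photovoltaic efficiency",
    "earnings beat stock market expectations",
]
# Rank test-collection documents by their topic score, highest first.
ranking = sorted(docs, key=lambda d: score(d, weights), reverse=True)
```

Here the on-topic document outranks the off-topic one because its terms occur in the pseudo-relevant web text; swapping the log-odds weighting for a trained classifier (e.g. logistic regression over TF-IDF features) follows the same train-then-score pattern.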
Notes
- 1.
- 2. https://spreadprivacy.com/why-use-duckduckgo-instead-of-google/, accessed: May 3rd, 2021.
- 3. GC are affiliated with the University of Waterloo in Canada.
- 4. Most tables and figures contain results instantiated with nDCG. For results instantiated with other measures, see the online appendix at https://github.com/irgroup/clef2021-web-prf/blob/master/doc/appendix.pdf.
- 5. Pearson's \(r=0.9747\) and \(p=0.0002\).
- 6.
References
Bandyopadhyay, A., Ghosh, K., Majumder, P., Mitra, M.: Query expansion for microblog retrieval. Int. J. Web Sci. 1(4), 368–380 (2012)
Breuer, T., et al.: How to measure the reproducibility of system-oriented IR experiments. In: Proceedings of SIGIR, pp. 349–358 (2020)
Breuer, T., Ferro, N., Maistro, M., Schaer, P.: repro\(\_\)eval: a Python interface to reproducibility measures of system-oriented IR experiments. In: Proceedings of ECIR, pp. 481–486 (2021)
Cormack, G.V., Smucker, M.D., Clarke, C.L.A.: Efficient and effective spam filtering and re-ranking for large web datasets. Inf. Retr. 14(5), 441–465 (2011). https://doi.org/10.1007/s10791-011-9162-z
Croft, W.B., Harper, D.J.: Using probabilistic models of document retrieval without relevance information. J. Doc. 35(4), 285–295 (1979)
Diaz, F., Metzler, D.: Improving the estimation of relevance models using large external corpora. In: Proceedings of SIGIR, pp. 154–161. ACM (2006)
Grossman, M.R., Cormack, G.V.: MRG\(\_\)UWaterloo and WaterlooCormack participation in the TREC 2017 common core track. In: Proceedings of TREC, vol. 500–324. National Institute of Standards and Technology (NIST) (2017)
Grossman, M.R., Cormack, G.V.: MRG\(\_\)UWaterloo participation in the TREC 2018 common core track. In: Proceedings of TREC (2018)
Hannak, A., Sapiezynski, P., Kakhki, A.M., Krishnamurthy, B., Lazer, D., Mislove, A., Wilson, C.: Measuring personalization of web search. In: Proceedings of World Wide Web Conference, WWW, pp. 527–538 (2013)
Kwok, K., Grunfeld, L., Deng, P.: Improving weak ad-hoc retrieval by web assistance and data fusion. In: Proceedings of Asia Information Retrieval Symposium, AIRS 2005, pp. 17–30 (2005)
Li, Y., Luk, R.W.P., Ho, E.K.S., Chung, K.F.: Improving weak ad-hoc queries using wikipedia as external corpus. In: Proceedings of SIGIR, pp. 797–798. ACM (2007)
Liu, T.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (2009)
Nallapati, R.: Discriminative models for information retrieval. In: Proceedings of SIGIR, pp. 64–71. ACM (2004)
Raman, K., Udupa, R., Bhattacharya, P., Bhole, A.: On improving pseudo-relevance feedback using pseudo-irrelevant documents. In: Gurrin, C., et al. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 573–576. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_50
Robertson, S., Callan, J.: TREC: Experiment and Evaluation in Information Retrieval, pp. 99–122 (2005)
Roy, D., Mitra, M., Ganguly, D.: To clean or not to clean: document preprocessing and reproducibility. J. Data Inf. Qual. 10(4), 18:1–18:25 (2018)
Ruthven, I., Lalmas, M.: A survey on the use of relevance feedback for information access systems. Knowl. Eng. Rev. 18(2), 95–145 (2003)
Voorhees, E.M.: Overview of the TREC 2004 robust track. In: Proceedings of TREC, vol. 500–261 (2004)
Voorhees, E.M.: Overview of the TREC 2005 robust retrieval track. In: Proceedings of TREC, vol. 500–266 (2005)
Walker, S., Robertson, S.E., Boughanem, M., Jones, G.J.F., Jones, K.S.: Okapi at TREC-6: automatic ad hoc, VLC, routing, filtering and QSDR. In: Proceedings of TREC, pp. 125–136 (1997)
Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28(4), 20:1–20:38 (2010)
Xu, Y., Jones, G.J.F., Wang, B.: Query dependent pseudo-relevance feedback based on wikipedia. In: Proceedings of SIGIR, pp. 59–66. ACM (2009)
Xu, Z., Akella, R.: A Bayesian logistic regression model for active relevance feedback. In: Proceedings of SIGIR, pp. 227–234. ACM (2008)
Yi, X., Raghavan, H., Leggetter, C.: Discovering users’ specific geo intention in web search. In: Proceedings of World Wide Web Conference, WWW, pp. 481–490 (2009)
Yu, S., Cai, D., Wen, J., Ma, W.: Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Proceedings of World Wide Web Conference, WWW, pp. 11–18. ACM (2003)
Acknowledgments
This work was supported by the DFG (project no. 407518790).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Breuer, T., Pest, M., Schaer, P. (2021). Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval. In: Candan, K.S., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2021. Lecture Notes in Computer Science, vol. 12880. Springer, Cham. https://doi.org/10.1007/978-3-030-85251-1_5
DOI: https://doi.org/10.1007/978-3-030-85251-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85250-4
Online ISBN: 978-3-030-85251-1