Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval

Conference paper in: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2021)

Abstract

In this work, we analyze a pseudo-relevance retrieval method based on the results of web search engines. By enriching topics with text data from web search engine result pages and linked contents, we train topic-specific and cost-efficient classifiers that can be used to search test collections for relevant documents. Building on attempts initially made at TREC Common Core 2018 by Grossman and Cormack, we address questions of system performance over time, considering different search engines, queries, and test collections. Our experimental results show how, and to what extent, the considered components affect retrieval performance. Overall, the analyzed method is robust in terms of average retrieval performance and a promising way to use web content for the data enrichment of relevance feedback methods.
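
For illustration, the sketch below shows one way such a web-based feedback loop can be realized. It is a minimal sketch under assumptions, not the authors' implementation (their code is linked in the notes): it assumes scikit-learn and hypothetical inputs, where serp_texts holds text scraped from a topic's search engine result pages and linked contents (treated as pseudo-relevant), negative_texts holds pseudo-irrelevant text such as randomly sampled collection documents, and collection_docs maps document IDs to document text.

    # Minimal sketch (hypothetical names, not the paper's exact pipeline):
    # train a topic-specific classifier on web texts treated as pseudo-relevant
    # and rank the test collection by the predicted relevance probability.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def rank_collection(serp_texts, negative_texts, collection_docs, top_k=1000):
        # TF-IDF features fitted on the web-based enrichment texts only.
        vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
        X = vectorizer.fit_transform(serp_texts + negative_texts)
        y = [1] * len(serp_texts) + [0] * len(negative_texts)
        clf = LogisticRegression(max_iter=1000).fit(X, y)

        # Score every document of the test collection and keep the top k.
        doc_ids = list(collection_docs)
        X_docs = vectorizer.transform(collection_docs[d] for d in doc_ids)
        scores = clf.predict_proba(X_docs)[:, 1]
        return sorted(zip(doc_ids, scores), key=lambda t: t[1], reverse=True)[:top_k]

The resulting ranking can then be evaluated against the test collection's relevance judgments, e.g. with nDCG as in the paper.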

Notes

  1. https://github.com/irgroup/clef2021-web-prf/.

  2. https://spreadprivacy.com/why-use-duckduckgo-instead-of-google/, accessed: May 3rd, 2021.

  3. Grossman and Cormack are affiliated with the University of Waterloo in Canada.

  4. Most tables and figures contain results instantiated with nDCG. For results instantiated with other measures, see the online appendix at https://github.com/irgroup/clef2021-web-prf/blob/master/doc/appendix.pdf.

  5. Pearson's r = 0.9747 and p = 0.0002; a sketch of this computation follows after these notes.

  6. https://doi.org/10.5281/zenodo.4105885.
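
A correlation like the one reported in note 5 can be computed with SciPy's pearsonr; the paired score lists below are placeholders for illustration, not values from the paper.

    # Hypothetical example: Pearson correlation between two lists of paired
    # system scores (placeholder numbers, not results from this paper).
    from scipy.stats import pearsonr

    scores_a = [0.42, 0.47, 0.51, 0.55, 0.58, 0.61, 0.64]
    scores_b = [0.40, 0.46, 0.49, 0.56, 0.57, 0.62, 0.66]
    r, p = pearsonr(scores_a, scores_b)
    print(f"Pearson's r = {r:.4f}, p = {p:.4f}")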

References

  1. Bandyopadhyay, A., Ghosh, K., Majumder, P., Mitra, M.: Query expansion for microblog retrieval. Int. J. Web Sci. 1(4), 368–380 (2012)

  2. Breuer, T., et al.: How to measure the reproducibility of system-oriented IR experiments. In: Proceedings of SIGIR, pp. 349–358 (2020)

  3. Breuer, T., Ferro, N., Maistro, M., Schaer, P.: repro_eval: a Python interface to reproducibility measures of system-oriented IR experiments. In: Proceedings of ECIR, pp. 481–486 (2021)

  4. Cormack, G.V., Smucker, M.D., Clarke, C.L.A.: Efficient and effective spam filtering and re-ranking for large web datasets. Inf. Retr. 14(5), 441–465 (2011). https://doi.org/10.1007/s10791-011-9162-z

  5. Croft, W.B., Harper, D.J.: Using probabilistic models of document retrieval without relevance information. J. Doc. 35(4), 285–295 (1979)

  6. Diaz, F., Metzler, D.: Improving the estimation of relevance models using large external corpora. In: Proceedings of SIGIR, pp. 154–161. ACM (2006)

  7. Grossman, M.R., Cormack, G.V.: MRG_UWaterloo and WaterlooCormack participation in the TREC 2017 Common Core track. In: Proceedings of TREC, vol. 500–324. National Institute of Standards and Technology (NIST) (2017)

  8. Grossman, M.R., Cormack, G.V.: MRG_UWaterloo participation in the TREC 2018 Common Core track. In: Proceedings of TREC (2018)

  9. Hannak, A., Sapiezynski, P., Kakhki, A.M., Krishnamurthy, B., Lazer, D., Mislove, A., Wilson, C.: Measuring personalization of web search. In: Proceedings of World Wide Web Conference, WWW, pp. 527–538 (2013)

  10. Kwok, K., Grunfeld, L., Deng, P.: Improving weak ad-hoc retrieval by web assistance and data fusion. In: Proceedings of Asia Information Retrieval Symposium, AIRS 2005, pp. 17–30 (2005)

  11. Li, Y., Luk, R.W.P., Ho, E.K.S., Chung, K.F.: Improving weak ad-hoc queries using Wikipedia as external corpus. In: Proceedings of SIGIR, pp. 797–798. ACM (2007)

  12. Liu, T.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (2009)

  13. Nallapati, R.: Discriminative models for information retrieval. In: Proceedings of SIGIR, pp. 64–71. ACM (2004)

  14. Raman, K., Udupa, R., Bhattacharya, P., Bhole, A.: On improving pseudo-relevance feedback using pseudo-irrelevant documents. In: Gurrin, C., et al. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 573–576. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_50

  15. Robertson, S., Callan, J.: TREC: experiment and evaluation in information retrieval, pp. 99–122 (2005)

  16. Roy, D., Mitra, M., Ganguly, D.: To clean or not to clean: document preprocessing and reproducibility. J. Data Inf. Qual. 10(4), 18:1–18:25 (2018)

  17. Ruthven, I., Lalmas, M.: A survey on the use of relevance feedback for information access systems. Knowl. Eng. Rev. 18(2), 95–145 (2003)

  18. Voorhees, E.M.: Overview of the TREC 2004 Robust track. In: Proceedings of TREC, vol. 500–261 (2004)

  19. Voorhees, E.M.: Overview of the TREC 2005 Robust Retrieval track. In: Proceedings of TREC, vol. 500–266 (2005)

  20. Walker, S., Robertson, S.E., Boughanem, M., Jones, G.J.F., Jones, K.S.: Okapi at TREC-6: automatic ad hoc, VLC, routing, filtering and QSDR. In: Proceedings of TREC, pp. 125–136 (1997)

  21. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28(4), 20:1–20:38 (2010)

  22. Xu, Y., Jones, G.J.F., Wang, B.: Query dependent pseudo-relevance feedback based on Wikipedia. In: Proceedings of SIGIR, pp. 59–66. ACM (2009)

  23. Xu, Z., Akella, R.: A Bayesian logistic regression model for active relevance feedback. In: Proceedings of SIGIR, pp. 227–234. ACM (2008)

  24. Yi, X., Raghavan, H., Leggetter, C.: Discovering users’ specific geo intention in web search. In: Proceedings of World Wide Web Conference, WWW, pp. 481–490 (2009)

  25. Yu, S., Cai, D., Wen, J., Ma, W.: Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Proceedings of World Wide Web Conference, WWW, pp. 11–18. ACM (2003)

Acknowledgments

This paper is supported by the DFG (project no. 407518790).

Author information

Correspondence to Timo Breuer.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Breuer, T., Pest, M., Schaer, P. (2021). Evaluating Elements of Web-Based Data Enrichment for Pseudo-relevance Feedback Retrieval. In: Candan, K.S., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2021. Lecture Notes in Computer Science, vol. 12880. Springer, Cham. https://doi.org/10.1007/978-3-030-85251-1_5

  • DOI: https://doi.org/10.1007/978-3-030-85251-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85250-4

  • Online ISBN: 978-3-030-85251-1

  • eBook Packages: Computer Science (R0)
