The Open Web Index

Hendriksen, Gijs; Dinzinger, Michael; Farzana, Sheikh Mastura; Fathima, Noor Afshan; Fröbe, Maik; Schmidt, Sebastian; Zerhoudi, Saber; Granitzer, Michael; Hagen, Matthias; Hiemstra, Djoerd; Potthast, Martin; Stein, Benno

doi:10.1007/978-3-031-56069-9_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14612)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search depend entirely on the terms and conditions of these few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index.

The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index—for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

Acknowledgments

This work has received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).

Author information

Authors and Affiliations

Radboud University, Nijmegen, The Netherlands
Gijs Hendriksen & Djoerd Hiemstra
University of Passau, Passau, Germany
Michael Dinzinger, Saber Zerhoudi & Michael Granitzer
German Aerospace Center (DLR), Cologne, Germany
Sheikh Mastura Farzana
CERN, Geneva, Switzerland
Noor Afshan Fathima
Friedrich-Schiller-Universität Jena, Jena, Germany
Maik Fröbe & Matthias Hagen
Leipzig University, Leipzig, Germany
Sebastian Schmidt & Martin Potthast
ScaDS.AI, Leipzig, Germany
Martin Potthast
Bauhaus-Universität Weimar, Weimar, Germany
Benno Stein

Corresponding author

Correspondence to Gijs Hendriksen.

Editor information

Editors and Affiliations

Georgetown University, Washington, DC, USA
Nazli Goharian
University of Pisa, Pisa, Italy
Nicola Tonellotto
King's College London, London, UK
Yulan He
University College London, London, UK
Aldo Lipani
University of Glasgow, Glasgow, UK
Graham McDonald
University of Glasgow, Glasgow, UK
Craig Macdonald
University of Glasgow, Glasgow, UK
Iadh Ounis

About this paper

Cite this paper

Hendriksen, G. et al. (2024). The Open Web Index. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14612. Springer, Cham. https://doi.org/10.1007/978-3-031-56069-9_10

DOI: https://doi.org/10.1007/978-3-031-56069-9_10
Published: 23 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56068-2
Online ISBN: 978-3-031-56069-9
eBook Packages: Computer Science, Computer Science (R0)
