The Open Web Index

Crawling and Indexing the Web for Public Use

  • Conference paper
Advances in Information Retrieval (ECIR 2024)

Abstract

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search depend entirely on the terms and conditions of these few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index.

The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index—for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

Notes

  1. https://openwebsearch.eu/
  2. https://commoncrawl.org/
  3. https://laion.ai/
  4. https://curlie.org
  5. https://blog.mojeek.com/2022/03/five-billion-pages.html
  6. https://betterweb.qwant.com/en/2023/09/18/web-indexing-where-is-qwants-independence/
  7. https://curlie.org
  8. https://opencode.it4i.eu/openwebsearcheu-public/
  9. https://zenodo.org/communities/owseu/
  10. http://urlfrontier.net
  11. https://iipc.github.io/warc-specifications/
  12. https://urlhaus.abuse.ch/
  13. https://noml.info
  14. https://www.w3.org/2022/tdmrep/
  15. https://github.com/LLNL/magpie/

Acknowledgments

This work has received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).

Author information

Corresponding author

Correspondence to Gijs Hendriksen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hendriksen, G. et al. (2024). The Open Web Index. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14612. Springer, Cham. https://doi.org/10.1007/978-3-031-56069-9_10

  • DOI: https://doi.org/10.1007/978-3-031-56069-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56068-2

  • Online ISBN: 978-3-031-56069-9

  • eBook Packages: Computer Science (R0)
