DOI: 10.1145/3589335.3648327
Research article
Open Access

MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Published: 13 May 2024

ABSTRACT

Recent breakthroughs in large models have highlighted the critical importance of data scale, labels, and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale, information-rich web dataset featuring millions of real clicked query-document labels. The dataset closely mimics real-world web document and query distributions, provides rich information for a wide range of downstream tasks, and encourages research in areas such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems built on large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovation in both machine learning and information retrieval systems research. As the first dataset that meets the large, real, and rich data requirements, MS MARCO Web Search paves the way for future advances in AI and systems research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.
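
To make the click-label retrieval benchmark concrete, the sketch below shows how query-document click labels might be scored against a candidate retrieval run with MRR@10. The file names and TSV layouts (`dev_qrels.tsv`, `my_run.tsv`, and their column orders) are illustrative assumptions, not the dataset's documented format; consult the repository above for the actual schemas and the official evaluation tooling.

```python
# Minimal sketch: scoring a retrieval run against click labels with MRR@10.
# File names and column layouts below are assumptions for illustration only.
import csv
from collections import defaultdict


def load_qrels(path):
    """Load click labels as {query_id: set(clicked_doc_ids)}.
    Assumed TSV columns: query_id, doc_id."""
    qrels = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            qrels[row[0]].add(row[1])
    return qrels


def load_run(path):
    """Load a retrieval run as {query_id: [doc_id, ...]} ordered by rank.
    Assumed TSV columns: query_id, doc_id, rank."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            run[row[0]].append((int(row[2]), row[1]))
    return {q: [doc for _, doc in sorted(docs)] for q, docs in run.items()}


def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first clicked document within the top k."""
    total = 0.0
    for query_id, clicked in qrels.items():
        for i, doc_id in enumerate(run.get(query_id, [])[:k], start=1):
            if doc_id in clicked:
                total += 1.0 / i
                break
    return total / len(qrels) if qrels else 0.0


if __name__ == "__main__":
    qrels = load_qrels("dev_qrels.tsv")  # hypothetical path
    run = load_run("my_run.tsv")         # hypothetical path
    print(f"MRR@10: {mrr_at_k(qrels, run):.4f}")
```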


Supplemental Material

ip6166.mp4: supplemental video (MP4, 74.7 MB)


Published in

WWW '24: Companion Proceedings of the ACM on Web Conference 2024
May 2024, 1928 pages
ISBN: 9798400701726
DOI: 10.1145/3589335

Copyright © 2024 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions, 23%