Research Article (Open Access)
DOI: 10.1145/3589335.3648333

DISKCO: Disentangling Knowledge from Cross-Encoder to Bi-Encoder

Published: 13 May 2024

ABSTRACT

In the field of Natural Language Processing (NLP), sentence-pair classification is important in many real-world applications. Bi-encoders are commonly used for these problems because of their low latency and their ability to act as effective retrievers. However, bi-encoders often underperform cross-encoders by a significant margin. To close this gap, many Knowledge Distillation (KD) techniques have been proposed. Most existing KD methods rely solely on the prediction scores of the cross-encoder model and overlook the fact that cross-encoders and bi-encoders have fundamentally different input structures. In this work, we introduce a novel knowledge distillation approach called DISKCO, which DISentangles the Knowledge learned in Cross-encoder models, especially from their multi-head cross-attention modules, and transfers it to bi-encoder models. DISKCO leverages the information encoded in the cross-attention weights of the trained cross-encoder model and provides it as contextual cues to the student bi-encoder model during training and inference. DISKCO thus combines the benefits of independent encoding for low-latency applications with the knowledge acquired by cross-encoders, resulting in improved performance. Empirically, we demonstrate the effectiveness of DISKCO on proprietary and on various publicly available datasets. Our experiments show that DISKCO outperforms traditional knowledge distillation methods by up to 2%.
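To make the mechanism sketched in the abstract concrete, the toy example below shows one way cross-attention weights could be read off a trained cross-encoder and supplied as per-token cues to a bi-encoder. It is an assumption-laden illustration, not the DISKCO implementation from the paper: it assumes Hugging Face Transformers, bert-base-uncased as a stand-in for both the (normally fine-tuned) teacher and the student trunk, last-layer head-averaged attention, and attention-weighted mean pooling.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModel.from_pretrained("bert-base-uncased")  # stands in for a fine-tuned cross-encoder
student = AutoModel.from_pretrained("bert-base-uncased")  # bi-encoder trunk shared by both inputs

query = "wireless noise cancelling headphones"
doc = "Bluetooth over-ear headphones with active noise cancellation"

# 1) Teacher: run the joint (query, doc) input and keep its attention maps.
joint = tok(query, doc, return_tensors="pt")
with torch.no_grad():
    t_out = teacher(**joint, output_attentions=True)
attn = t_out.attentions[-1].mean(dim=1).squeeze(0)  # last layer, head-averaged: (seq, seq)

# 2) "Disentangle" the cross block: how strongly each query-side token attends to doc tokens.
seg = joint["token_type_ids"].squeeze(0)             # 0 = query segment, 1 = doc segment
q_idx = (seg == 0).nonzero(as_tuple=True)[0]
d_idx = (seg == 1).nonzero(as_tuple=True)[0]
cues = attn[q_idx][:, d_idx].sum(dim=-1)             # one cue per query-side token (illustrative cue form)

# 3) Student: encode query and doc independently; use the cues to weight query tokens.
with torch.no_grad():
    q_enc = student(**tok(query, return_tensors="pt")).last_hidden_state.squeeze(0)
    d_enc = student(**tok(doc, return_tensors="pt")).last_hidden_state.squeeze(0)
w = cues[: q_enc.size(0)].unsqueeze(-1)              # query-side tokens line up with the standalone encoding
q_vec = (w * q_enc).sum(dim=0) / w.sum()
d_vec = d_enc.mean(dim=0)

score = torch.cosine_similarity(q_vec, d_vec, dim=0)
print(float(score))  # in training, this score would be pushed toward the teacher's prediction

In a full training setup one would expect the cues to be precomputed for the training pairs and the student to be optimized with a distillation loss toward the teacher's prediction score alongside the task loss; the snippet above only shows the cue-extraction and scoring path.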


Supplemental Material

ip7335.mp4: supplemental video (MP4, 85.2 MB)


Published in

WWW '24: Companion Proceedings of the ACM on Web Conference 2024
May 2024, 1928 pages
ISBN: 9798400701726
DOI: 10.1145/3589335

          Copyright © 2024 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 May 2024


          Qualifiers

          • research-article

          Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
