ABSTRACT
In Natural Language Processing (NLP), sentence pair classification underpins many real-world applications. Bi-encoders are commonly used for these problems because they satisfy low-latency requirements and can serve as effective retrievers. However, bi-encoders often under-perform cross-encoders by a significant margin. To narrow this gap, many Knowledge Distillation (KD) techniques have been proposed. Most existing KD methods rely solely on the prediction scores of cross-encoder models and overlook the fact that cross-encoders and bi-encoders have fundamentally different input structures. In this work, we introduce a novel knowledge distillation approach called DISKCO, which DISentangles the Knowledge learned in Cross-encoder models, in particular the knowledge captured by multi-head cross-attention, and transfers it to bi-encoder models. DISKCO leverages the information encoded in the cross-attention weights of the trained cross-encoder model and provides it as contextual cues to the student bi-encoder model during training and inference. DISKCO thus combines the benefits of independent encoding for low-latency applications with the knowledge acquired from cross-encoders, resulting in improved performance. Empirically, we demonstrate the effectiveness of DISKCO on proprietary and publicly available datasets. Our experiments show that DISKCO outperforms traditional knowledge distillation methods by up to 2%.
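To make the described mechanism concrete, the sketch below illustrates one plausible reading of the approach: query-to-document attention weights are extracted from a trained cross-encoder and reused as token-importance cues when pooling the student bi-encoder's document embeddings. This is a minimal sketch, not the authors' released implementation; the `bert-base-uncased` backbone, the averaging over layers and heads, and the cue-weighted pooling scheme are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): distill cross-attention cues from a
# cross-encoder teacher into the pooling step of a bi-encoder student.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")          # assumed backbone
cross_encoder = AutoModel.from_pretrained("bert-base-uncased")    # stands in for the trained teacher
bi_encoder = AutoModel.from_pretrained("bert-base-uncased")       # student

query = "wireless noise cancelling headphones"
doc = "bluetooth over-ear headphones with active noise cancellation"

# 1) Run the teacher on the joint (query, doc) input and keep its attention maps.
joint = tok(query, doc, return_tensors="pt")
with torch.no_grad():
    teacher_out = cross_encoder(**joint, output_attentions=True)
# Average over layers and heads -> one (seq_len, seq_len) attention matrix.
attn = torch.stack(teacher_out.attentions).mean(dim=(0, 2))[0]

# 2) Slice out the block where query tokens attend to document tokens
#    (token_type_ids separate the two segments in a BERT-style pair input).
seg = joint["token_type_ids"][0]
q_idx = (seg == 0).nonzero(as_tuple=True)[0]
d_idx = (seg == 1).nonzero(as_tuple=True)[0]
cues = attn[q_idx][:, d_idx].mean(dim=0)[:-1]     # per-document-token cue, drop trailing [SEP]
cues = torch.softmax(cues, dim=0)

# 3) Encode the document independently with the student and pool its token
#    embeddings using the teacher's cues as contextual weights.
doc_only = tok(doc, return_tensors="pt")
token_emb = bi_encoder(**doc_only).last_hidden_state[0]   # (num_doc_tokens + 2, hidden)
doc_vec = (cues.unsqueeze(-1) * token_emb[1:-1]).sum(dim=0)  # cue-weighted pooling, no [CLS]/[SEP]

# 4) The query is still encoded independently (plain mean pooling here), so
#    retrieval keeps its bi-encoder latency profile.
q_only = tok(query, return_tensors="pt")
q_vec = bi_encoder(**q_only).last_hidden_state[0, 1:-1].mean(dim=0)
score = torch.cosine_similarity(q_vec, doc_vec, dim=0)
print(float(score))
```

In this reading, the cue extraction can be precomputed offline for training pairs, so the student still encodes queries and documents independently at serving time.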