ABSTRACT
In Natural Language Processing (NLP), sentence pair classification underpins many real-world applications. Bi-encoders are commonly used for these problems because they satisfy low-latency requirements and can serve as effective retrievers. However, bi-encoders often under-perform cross-encoders by a significant margin. To narrow this gap, many Knowledge Distillation (KD) techniques have been proposed. Most existing KD methods rely solely on the prediction scores of cross-encoder models and overlook the fact that cross-encoders and bi-encoders have fundamentally different input structures. In this work, we introduce a novel knowledge distillation approach called DISKCO, which DISentangles the Knowledge learned in Cross-encoder models, in particular the knowledge captured by multi-head cross-attention, and transfers it to bi-encoder models. DISKCO leverages the information encoded in the cross-attention weights of the trained cross-encoder model and provides it as contextual cues to the student bi-encoder model during training and inference. DISKCO thus combines the benefits of independent encoding for low-latency applications with the knowledge acquired from cross-encoders, resulting in improved performance. Empirically, we demonstrate the effectiveness of DISKCO on proprietary and publicly available datasets. Our experiments show that DISKCO outperforms traditional knowledge distillation methods by up to 2%.
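To make the described mechanism concrete, the sketch below illustrates one plausible reading of the approach: query-to-document attention weights are extracted from a trained cross-encoder and reused as token-importance cues when pooling the student bi-encoder's document embeddings. This is a minimal sketch, not the authors' released implementation; the `bert-base-uncased` backbone, the averaging over layers and heads, and the cue-weighted pooling scheme are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): distill cross-attention cues from a
# cross-encoder teacher into the pooling step of a bi-encoder student.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")          # assumed backbone
cross_encoder = AutoModel.from_pretrained("bert-base-uncased")    # stands in for the trained teacher
bi_encoder = AutoModel.from_pretrained("bert-base-uncased")       # student

query = "wireless noise cancelling headphones"
doc = "bluetooth over-ear headphones with active noise cancellation"

# 1) Run the teacher on the joint (query, doc) input and keep its attention maps.
joint = tok(query, doc, return_tensors="pt")
with torch.no_grad():
    teacher_out = cross_encoder(**joint, output_attentions=True)
# Average over layers and heads -> one (seq_len, seq_len) attention matrix.
attn = torch.stack(teacher_out.attentions).mean(dim=(0, 2))[0]

# 2) Slice out the block where query tokens attend to document tokens
#    (token_type_ids separate the two segments in a BERT-style pair input).
seg = joint["token_type_ids"][0]
q_idx = (seg == 0).nonzero(as_tuple=True)[0]
d_idx = (seg == 1).nonzero(as_tuple=True)[0]
cues = attn[q_idx][:, d_idx].mean(dim=0)[:-1]     # per-document-token cue, drop trailing [SEP]
cues = torch.softmax(cues, dim=0)

# 3) Encode the document independently with the student and pool its token
#    embeddings using the teacher's cues as contextual weights.
doc_only = tok(doc, return_tensors="pt")
token_emb = bi_encoder(**doc_only).last_hidden_state[0]   # (num_doc_tokens + 2, hidden)
doc_vec = (cues.unsqueeze(-1) * token_emb[1:-1]).sum(dim=0)  # cue-weighted pooling, no [CLS]/[SEP]

# 4) The query is still encoded independently (plain mean pooling here), so
#    retrieval keeps its bi-encoder latency profile.
q_only = tok(query, return_tensors="pt")
q_vec = bi_encoder(**q_only).last_hidden_state[0, 1:-1].mean(dim=0)
score = torch.cosine_similarity(q_vec, doc_vec, dim=0)
print(float(score))
```

In this reading, the cue extraction can be precomputed offline for training pairs, so the student still encodes queries and documents independently at serving time.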