A Two-Stage Chinese Medical Video Retrieval Framework with LLM

Lei, Ningjie; Cai, Jinxiang; Qian, Yixin; Zheng, Zhilong; Han, Chao; Liu, Zhiyue; Huang, Qingbao

doi:10.1007/978-3-031-44699-3_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14304)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

With the increasing popularity of online videos, research on video corpus retrieval (VCR) has made significant progress. However, existing VCR models have not performed well in the medical field because of the unique characteristics of the medical VCR task. Specifically, the open-ended queries used in medical VCR are more challenging than image-caption style queries, and the long duration of medical videos places a heavy burden on retrieval efficiency. To address these challenges, we propose a two-stage framework for medical VCR, termed GPT-CMR, based on GPT-3.5 and cross-modal contrastive global-span (CCGS). In the first stage, we leverage the natural language processing capabilities of the large language model (LLM) GPT-3.5 to improve retrieval efficiency. In the second stage, we use the CCGS model to further enhance retrieval accuracy. Additionally, we developed a CCGS-VCR Analyzer that exploits the characteristics of the CCGS model's output without additional training costs. According to the official results, our method achieved first place in Track 2 of the NLPCC 2023 Task 5 competition. Experiments show that our method far exceeds the official baseline in both retrieval efficiency and accuracy.
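The general coarse-then-fine pattern behind a two-stage retriever can be sketched as below. This is a minimal illustration only, not the authors' implementation: the abstract does not say how the GPT-3.5 stage or the CCGS stage scores candidates, so `coarse_score` and `fine_score` are hypothetical stand-ins (word overlap and Jaccard similarity), and the corpus, the query, and the `keep` parameter are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_retrieve(query: str, corpus: Dict[str, str],
                       coarse: Scorer, fine: Scorer,
                       keep: int = 2) -> List[Tuple[str, float]]:
    """Shortlist with a cheap scorer, then rerank the shortlist with a costly one."""
    # Stage 1: score every video's text cheaply and keep the top `keep` ids.
    shortlist = sorted(corpus, key=lambda vid: coarse(query, corpus[vid]),
                       reverse=True)[:keep]
    # Stage 2: run the expensive scorer only on the shortlisted videos.
    ranked = [(vid, fine(query, corpus[vid])) for vid in shortlist]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def coarse_score(query: str, text: str) -> float:
    # Hypothetical stand-in for the first stage: number of shared words.
    return float(len(set(query.split()) & set(text.split())))


fine_calls: List[str] = []  # records how often the costly scorer runs


def fine_score(query: str, text: str) -> float:
    # Hypothetical stand-in for the second stage: Jaccard similarity.
    fine_calls.append(text)
    q, t = set(query.split()), set(text.split())
    return len(q & t) / len(q | t)


# Toy corpus of video descriptions (assumed data, for illustration only).
corpus = {
    "v1": "how to bandage a sprained ankle",
    "v2": "neck stretches for office workers",
    "v3": "how to treat a sprained wrist at home",
    "v4": "healthy breakfast recipes",
}
ranked = two_stage_retrieve("how to treat a sprained ankle", corpus,
                            coarse_score, fine_score, keep=2)
print(ranked)
print("fine scorer calls:", len(fine_calls))
```

The point of the split is that the expensive scorer runs on only `keep` candidates rather than on the whole corpus, which is where the efficiency gain of a two-stage design comes from.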

Acknowledgments

This work was supported by the Guangxi Natural Science Foundation (No. 2022GXNSFAA035627), Guangxi Natural Science Foundation Key Project (Application No. 2023JJD170015), National Natural Science Foundation of China (62276072), Guangxi Scientific and Technological Bases and Talents Special Projects (guikeAD23026213 and guikeAD23026230), Innovation Project of Guangxi Graduate Education, and the Open Research Fund of Guangxi Key Laboratory of Multimedia Communications and Network Technology.

Author information

Authors and Affiliations

School of Electrical Engineering, Guangxi University, Nanning, Guangxi, China
Ningjie Lei, Jinxiang Cai, Yixin Qian, Zhilong Zheng, Chao Han & Qingbao Huang
School of Computer, Electronics and Information, Guangxi University, Nanning, Guangxi, China
Zhiyue Liu
Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, China
Chao Han, Zhiyue Liu & Qingbao Huang

Corresponding author

Correspondence to Qingbao Huang.

Editor information

Editors and Affiliations

Emory University, Atlanta, GA, USA
Fei Liu
Microsoft Research Asia, Beijing, China
Nan Duan
Soochow University, Suzhou, China
Qingting Xu
Soochow University, Suzhou, China
Yu Hong

About this paper

Cite this paper

Lei, N. et al. (2023). A Two-Stage Chinese Medical Video Retrieval Framework with LLM. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14304. Springer, Cham. https://doi.org/10.1007/978-3-031-44699-3_19

DOI: https://doi.org/10.1007/978-3-031-44699-3_19
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44698-6
Online ISBN: 978-3-031-44699-3
eBook Packages: Computer Science, Computer Science (R0)
