Inflated 3D Convolution-Transformer for Weakly-Supervised Carotid Stenosis Grading with Ultrasound Videos

Zhou, Xinrui; Huang, Yuhao; Xue, Wufeng; Yang, Xin; Zou, Yuxin; Ying, Qilong; Zhang, Yuanji; Liu, Jia; Ren, Jie; Ni, Dong

doi:10.1007/978-3-031-43895-0_48

Xinrui Zhou^14,15,16,
Yuhao Huang^14,15,16,
Wufeng Xue^14,15,16,
Xin Yang^14,15,16,
Yuxin Zou^14,15,16,
Qilong Ying^14,15,16,
Yuanji Zhang^14,15,16,17,
Jia Liu¹⁸,
Jie Ren¹⁸ &
…
Dong Ni^14,15,16

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14221))

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

3311 Accesses

Abstract

Localization of the narrowest position of the vessel and corresponding vessel and remnant vessel delineation in carotid ultrasound (US) are essential for carotid stenosis grading (CSG) in clinical practice. However, the pipeline is time-consuming and tough due to the ambiguous boundaries of plaque and temporal variation. To automatize this procedure, a large number of manual delineations are usually required, which is not only laborious but also not reliable given the annotation difficulty. In this study, we present the first video classification framework for automatic CSG. Our contribution is three-fold. First, to avoid the requirement of laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG. Second, to ease the model training, we adopt an inflation strategy for the network, where pre-trained 2D convolution weights can be adapted into the 3D counterpart in our network for an effective warm start. Third, to enhance the feature discrimination of the video, we propose a novel attention-guided multi-dimension fusion (AMDF) transformer encoder to model and integrate global dependencies within and across spatial and temporal dimensions, where two lightweight cross-dimensional attention mechanisms are designed. Our approach is extensively validated on a large clinically collected carotid US video dataset, demonstrating state-of-the-art performance compared with strong competitors.

X. Zhou and Y. Huang—Contribute equally to this work.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: Proceedings of the IEEE/CVF International Conference On Computer Vision, pp. 6836–6846 (2021)
Google Scholar
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, vol. 2, p. 4 (2021)
Google Scholar
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Google Scholar
Chen, C.F.R., Fan, Q., Panda, R.: CrossViT: cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 357–366 (2021)
Google Scholar
Curto, D., et al.: Dyadformer: a multi-modal transformer for long-range modeling of dyadic interactions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2177–2188 (2021)
Google Scholar
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Google Scholar
Fan, H., et al.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6824–6835 (2021)
Google Scholar
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
Google Scholar
Howard, D.P., Gaziano, L., Rothwell, P.M.: Risk of stroke in relation to degree of asymptomatic carotid stenosis: a population-based cohort study, systematic review, and meta-analysis. Lancet Neurol. 20(3), 193–202 (2021)
Article Google Scholar
Li, K., et al.: UniFormer: unified transformer for efficient spatial-temporal representation learning. In: International Conference on Learning Representations (2022)
Google Scholar
Liang, J., et al.: Sketch guided and progressive growing GAN for realistic and editable ultrasound image synthesis. Med. Image Anal. 79, 102461 (2022)
Article Google Scholar
Liu, J., et al.: Deep learning based on carotid transverse B-mode scan videos for the diagnosis of carotid plaque: a prospective multicenter study. Eur. Radiol. 32, 1–10 (2022)
Google Scholar
Liu, Z., et al.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211 (2022)
Google Scholar
Peng, Z., et al.: Conformer: local features coupling global representations for visual recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 367–376 (2021)
Google Scholar
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
Google Scholar
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
Google Scholar
Xu, Y., Zhang, Q., Zhang, J., Tao, D.: ViTAE: vision transformer advanced by exploring intrinsic inductive bias. Adv. Neural. Inf. Process. Syst. 34, 28522–28535 (2021)
Google Scholar
Yan, S., et al.: Multiview transformers for video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3333–3343 (2022)
Google Scholar
Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 591–600 (2020)
Google Scholar

Download references

Acknowledgements

This work was supported by the grant from National Natural Science Foundation of China (Nos. 62171290, 62101343), Shenzhen-Hong Kong Joint Research Program (No. SGDX20201103095613036), Shenzhen Science and Technology Innovations Committee (No. 20200812143441001), Shenzhen College Stable Support Plan (Nos. 20220810145705001, 20200812162245001), and National Natural Science Foundation of China (No 81971632).

Author information

Authors and Affiliations

National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, China
Xinrui Zhou, Yuhao Huang, Wufeng Xue, Xin Yang, Yuxin Zou, Qilong Ying, Yuanji Zhang & Dong Ni
Medical Ultrasound Image Computing (MUSIC) Lab, Shenzhen University, Shenzhen, China
Xinrui Zhou, Yuhao Huang, Wufeng Xue, Xin Yang, Yuxin Zou, Qilong Ying, Yuanji Zhang & Dong Ni
Marshall Laboratory of Biomedical Engineering, Shenzhen University, Shenzhen, China
Xinrui Zhou, Yuhao Huang, Wufeng Xue, Xin Yang, Yuxin Zou, Qilong Ying, Yuanji Zhang & Dong Ni
Shenzhen RayShape Medical Technology Co. Ltd., Shenzhen, China
Yuanji Zhang
The Third Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Jia Liu & Jie Ren

Authors

Xinrui Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Yuhao Huang
View author publications
You can also search for this author in PubMed Google Scholar
Wufeng Xue
View author publications
You can also search for this author in PubMed Google Scholar
Xin Yang
View author publications
You can also search for this author in PubMed Google Scholar
Yuxin Zou
View author publications
You can also search for this author in PubMed Google Scholar
Qilong Ying
View author publications
You can also search for this author in PubMed Google Scholar
Yuanji Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jia Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jie Ren
View author publications
You can also search for this author in PubMed Google Scholar
Dong Ni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dong Ni .

Editor information

Editors and Affiliations

Icahn School of Medicine, Mount Sinai, NYC, NY, USA, Tel Aviv University, Tel Aviv, Israel
Hayit Greenspan
Emory University, Atlanta, GA, USA
Anant Madabhushi
Queen’s University, Kingston, ON, Canada
Parvin Mousavi
The University of British Columbia, Vancouver, BC, Canada
Septimiu Salcudean
Yale University, New Haven, CT, USA
James Duncan
IBM Research, San Jose, CA, USA
Tanveer Syeda-Mahmood
Johns Hopkins University, Baltimore, MD, USA
Russell Taylor

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 668 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhou, X. et al. (2023). Inflated 3D Convolution-Transformer for Weakly-Supervised Carotid Stenosis Grading with Ultrasound Videos. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14221. Springer, Cham. https://doi.org/10.1007/978-3-031-43895-0_48

Download citation

DOI: https://doi.org/10.1007/978-3-031-43895-0_48
Published: 01 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43894-3
Online ISBN: 978-3-031-43895-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The Medical Image Computing and Computer Assisted Intervention Society (opens in a new tab)

Inflated 3D Convolution-Transformer for Weakly-Supervised Carotid Stenosis Grading with Ultrasound Videos