ABSTRACT
Convolutional Neural Networks (CNNs) have been the state-of-the-art technique for numerous image processing tasks in medical imaging. Recently, vision transformers have emerged as a complementary technique, offering on-par performance along with a number of unique characteristics that may be useful for medical image processing. While CNNs have been widely applied to artefact detection and classification in endoscopic images, the Vision Transformer (ViT) has seen little use in this area, and neither architecture has been applied extensively to the classification of colour misalignment artefacts. In this work, we therefore explore the application of ViT to the classification of artefacts in endoscopic images of gastrointestinal tract organs, and compare its performance with that of CNNs on colour misalignment artefacts. Our customised ViT model, based on DeiT (Data-efficient image Transformers), achieves an accuracy of 96.33%, compared with 78.67% for the CNN-based Inception-v3 model and 76.67% for Inception-ResNet-v2. These results demonstrate that, when pretrained on ImageNet, ViT outperforms CNNs in colour misalignment artefact classification, owing to its ability to model the relationships between image patches through self-attention weights. Moreover, the built-in self-attention mechanism offers fresh insight into the model's decision-making process.
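The self-attention weights credited above with modelling patch relationships come from the standard scaled dot-product attention of Vaswani et al., which ViT and DeiT apply over a sequence of image-patch embeddings. The following is a minimal NumPy sketch of that mechanism, not the paper's implementation; the dimensions, random projection matrices, and function name are illustrative assumptions. Each row of the returned weight matrix is a distribution scoring how strongly one patch attends to every other patch.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention (Vaswani et al., 2017).

    x:          (n, d) sequence of n patch embeddings
    wq, wk, wv: (d, d) query/key/value projection matrices
    Returns the attended output and the (n, n) attention weight matrix,
    whose entry (i, j) scores how much patch i attends to patch j.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # softmax over the key axis: each row becomes a distribution over patches
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 16, 8                          # e.g. 16 image patches, 8-dim embeddings
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
print(out.shape, attn.shape)          # (16, 8) (16, 16)
```

In a full ViT this operation runs per head inside every transformer block; visualising the rows of `attn` for the classification token is what makes the model's decision process inspectable, as the abstract notes.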