research-article

Attention-guided Adversarial Attack for Video Object Segmentation

Published: 14 November 2023

Abstract

Video object segmentation (VOS) methods have advanced considerably with the continued development of deep learning. However, deep learning models are vulnerable to adversarial attacks, which mislead a model into making wrong decisions by adding perturbations to the input image that humans cannot perceive. This vulnerability extends to VOS methods and threatens their security. We therefore study adversarial attacks on the VOS task to better identify the vulnerabilities of VOS methods, which in turn provides an opportunity to improve their robustness. In this paper, we propose an attention-guided adversarial attack method. It uses spatial attention blocks to capture features with global dependencies and to construct correlations between consecutive video frames, and it performs multipath aggregation to integrate spatial-temporal perturbations effectively, thereby guiding the deconvolution network to generate adversarial examples with strong attack capability. Specifically, a class loss function is designed so that, based on the enhanced feature map of the object class, the deconvolution network better activates noise in other regions and suppresses the activation related to the object class. In addition, an attentional feature loss is designed to enhance the transferability of the attack. Experimental results on the DAVIS dataset show that the proposed method significantly reduces the segmentation accuracy of OSVOS, with the drop rate of the J&F mean on DAVIS 2016 reaching 73.6%. The generated adversarial examples are also highly transferable to other video object segmentation models.
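
The abstract reports attack strength as a drop rate of the J&F mean. As a minimal sketch of how such a figure might be computed, the Python below implements the region similarity J as the intersection-over-union of binary masks (the usual DAVIS definition) and a drop rate that is assumed here to be the relative decrease of the clean score; the abstract does not define it. The boundary measure F is omitted for brevity, the function names `jaccard` and `drop_rate` are hypothetical, and the toy masks are invented for illustration, not taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def drop_rate(clean_score: float, attacked_score: float) -> float:
    """Relative decrease of a score under attack (assumed definition)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - attacked_score) / clean_score


# Toy 2x3 masks, invented for illustration only (not results from the paper).
ground_truth = [[0, 1, 1], [0, 1, 1]]
clean_prediction = [[0, 1, 1], [0, 1, 0]]
attacked_prediction = [[0, 0, 0], [0, 1, 0]]

j_clean = jaccard(clean_prediction, ground_truth)        # 3 / 4
j_attacked = jaccard(attacked_prediction, ground_truth)  # 1 / 4
print(f"J clean: {j_clean:.2f}, J attacked: {j_attacked:.2f}")
print(f"drop rate: {drop_rate(j_clean, j_attacked):.1%}")
```

Under this assumed definition, a drop rate near 100% would mean the attacked segmentation retains almost none of the clean score, while 0% would mean the attack had no measurable effect.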



    • Published in

      ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 6
      December 2023
      493 pages
      ISSN: 2157-6904
      EISSN: 2157-6912
      DOI: 10.1145/3632517
      Editor: Huan Liu

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 14 November 2023
      • Online AM: 2 September 2023
      • Accepted: 11 August 2023
      • Revised: 23 January 2023
      • Received: 12 January 2022