Abstract
Video Object Segmentation (VOS) methods have achieved many breakthroughs thanks to continuous advances in deep learning. However, deep learning models are vulnerable to malicious adversarial attacks, which mislead a model into making wrong decisions by adding perturbations to the input image that humans cannot perceive. These threats imply that VOS methods are also vulnerable to such attacks, which compromises their security. We therefore study adversarial attacks on the VOS task to better identify the vulnerabilities of VOS methods, which in turn provides an opportunity to improve their robustness. In this paper, we propose an attention-guided adversarial attack method. It uses spatial attention blocks to capture features with global dependencies and thereby construct correlations between consecutive video frames, and it performs multipath aggregation to effectively integrate spatial-temporal perturbations, guiding the deconvolution network to generate adversarial examples with strong attack capability. Specifically, a class loss function is designed so that, based on the enhanced feature map of the object class, the deconvolution network better activates noise in other regions and suppresses the activation related to the object class. In addition, an attentional feature loss is designed to enhance the transferability of the attack. Experimental results on the DAVIS dataset show that the proposed attention-guided adversarial attack method significantly reduces the segmentation accuracy of OSVOS, with the drop rate of the J&F mean on DAVIS 2016 reaching 73.6%. The generated adversarial examples are also highly transferable to other video object segmentation models.
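To make the reported metric concrete, the following minimal sketch shows how a region similarity score (the J in J&F, i.e., the Jaccard index between a predicted and a ground-truth mask) and a relative drop rate might be computed. The function names `region_similarity` and `drop_rate` are hypothetical, and the drop-rate formula (relative decrease of the clean score, in percent) is an assumption; the abstract does not state the exact definition used in the paper.

```python
def region_similarity(pred, gt):
    """Jaccard index (J) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def drop_rate(clean_score, adv_score):
    """Relative drop in percent (assumed definition, not taken from the paper)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return 100.0 * (clean_score - adv_score) / clean_score


if __name__ == "__main__":
    gt = [[1, 1], [0, 0]]
    pred_adv = [[1, 0], [0, 0]]  # toy mask after a hypothetical attack
    j_clean = region_similarity(gt, gt)
    j_adv = region_similarity(pred_adv, gt)
    print(j_clean, j_adv, drop_rate(j_clean, j_adv))
```

Under this assumed definition, a larger drop rate means the attack removes a larger share of the segmentation accuracy the model achieves on clean frames.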
Index Terms
- Attention-guided Adversarial Attack for Video Object Segmentation