DOI: 10.1145/3581783.3611759
Research article

Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

Published: 27 October 2023

ABSTRACT

Audio-visual zero-shot learning (ZSL) has attracted broad attention because it can classify video data from classes that are not observed during training. However, most existing methods suffer from background scene bias and capture few motion details, as they employ a single-stream network that processes scene and motion information as a unified entity. In this paper, we address this challenge by proposing a novel dual-stream architecture, the Motion-Decoupled Spiking Transformer (MDFT), which explicitly decouples contextual semantic information from highly sparse dynamic motion information. Specifically, the Recurrent Joint Learning Unit (RJLU) effectively extracts contextual semantic information and understands the environment in which actions occur by capturing joint knowledge across modalities. By converting RGB images to events, our approach effectively captures motion information while mitigating the influence of background scene bias, leading to more accurate classification. We exploit the inherent strengths of Spiking Neural Networks (SNNs) to process highly sparse event data efficiently. Additionally, we introduce a Discrepancy Analysis Block (DAB) to model audio motion features. To enhance the efficiency of SNNs in extracting dynamic temporal and motion information, we dynamically adjust the threshold of Leaky Integrate-and-Fire (LIF) neurons based on statistical cues from global motion and contextual semantic information. Our experiments demonstrate the effectiveness of MDFT, which consistently outperforms state-of-the-art methods across mainstream benchmarks. Moreover, we find that motion information serves as a powerful regularizer for video networks: using it improves HM and ZSL accuracy by 19.1% and 38.4%, respectively.
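
To make the event-driven pathway concrete, the sketch below illustrates the general idea in PyTorch: an RGB clip is converted to a crude event stream by thresholding frame-to-frame log-intensity differences, and a Leaky Integrate-and-Fire neuron fires against a threshold shifted by a global-motion statistic. This is a minimal illustration of the mechanisms named in the abstract, not the authors' implementation; the conversion rule, the choice of cue, and all names and hyperparameters (`frames_to_events`, `contrast_threshold`, `tau`, `beta`) are assumptions made for demonstration.

```python
# Illustrative sketch only (not the MDFT code): frame-difference "events" plus a
# LIF neuron whose firing threshold is modulated by an external motion cue.
import torch


def frames_to_events(frames: torch.Tensor, contrast_threshold: float = 0.1) -> torch.Tensor:
    """Approximate an event stream from an RGB clip (T, C, H, W) by thresholding
    log-intensity differences between consecutive frames.
    Returns a (T-1, H, W) tensor with values in {-1, 0, +1} (OFF / no event / ON)."""
    intensity = frames.mean(dim=1)                      # crude luminance, (T, H, W)
    log_diff = torch.log1p(intensity[1:]) - torch.log1p(intensity[:-1])
    events = torch.zeros_like(log_diff)
    events[log_diff > contrast_threshold] = 1.0
    events[log_diff < -contrast_threshold] = -1.0
    return events


class DynamicThresholdLIF(torch.nn.Module):
    """LIF neuron with a threshold shifted by an external cue, e.g. a statistic of
    global motion or contextual features (an assumption made for this sketch)."""

    def __init__(self, tau: float = 2.0, base_threshold: float = 1.0, beta: float = 0.5):
        super().__init__()
        self.tau = tau                      # membrane time constant
        self.base_threshold = base_threshold
        self.beta = beta                    # sensitivity of the threshold to the cue

    def forward(self, inputs: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        # inputs: (T, N) input currents; cue: (T,) per-step scalar statistic.
        T, N = inputs.shape
        v = torch.zeros(N)                  # membrane potential
        spikes = []
        for t in range(T):
            v = v + (inputs[t] - v) / self.tau              # leaky integration
            threshold = self.base_threshold + self.beta * cue[t]
            s = (v >= threshold).float()                    # fire on threshold crossing
            v = v * (1.0 - s)                               # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes)          # (T, N) binary spike trains


# Usage: drive the neuron with mean absolute event activity as the global-motion cue.
frames = torch.rand(8, 3, 32, 32)           # a toy 8-frame RGB clip
events = frames_to_events(frames)
cue = events.abs().mean(dim=(1, 2))         # one motion statistic per time step
lif = DynamicThresholdLIF()
out = lif(events.abs().flatten(1), cue)     # (T-1, H*W) spike trains
```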


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
