
Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

Published in Multimedia Tools and Applications

Abstract

In RGB-T tracking, given the object position in the first frame, the tracker predicts the object's position in subsequent frames by exploiting the complementary information of RGB and thermal infrared images. As the amount of available data grows, unsupervised training has great potential for the RGB-T tracking task. Features extracted from different convolutional layers provide information at different levels of abstraction. In this paper, we propose a visual tracking framework that fuses multi-modal and multi-level features with an attention mechanism, taking full advantage of both kinds of information. Specifically, a feature fusion module fuses features from different levels and different modalities simultaneously. To reduce the cost of annotated data, we train the model without supervision using cycle consistency built on a correlation filter. The proposed tracker is evaluated on two popular benchmark datasets, GTOT and RGB-T234. Experimental results show that our tracker performs favorably against other state-of-the-art unsupervised trackers while running at real-time speed.
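
To make the two ideas in the abstract concrete, a minimal PyTorch sketch is given below: an SE-style channel-attention module that fuses RGB and thermal features taken from a shallow and a deep convolutional level, followed by a schematic forward-backward cycle-consistency loss for the correlation filter. All class, function, and parameter names here (AttentionalFusion, cycle_consistency_loss, channels_per_branch, and so on) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionalFusion(nn.Module):
    """Illustrative sketch of attention-based multi-modal, multi-level fusion:
    RGB and thermal (TIR) features from a shallow and a deep layer are
    concatenated, re-weighted by SE-style channel attention, and reduced to
    a single fused map (all sizes and names are assumptions)."""

    def __init__(self, channels_per_branch=64, reduction=4):
        super().__init__()
        fused = channels_per_branch * 4          # 2 modalities x 2 levels
        self.gap = nn.AdaptiveAvgPool2d(1)       # global context per channel
        self.fc = nn.Sequential(
            nn.Linear(fused, fused // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused // reduction, fused),
            nn.Sigmoid(),
        )
        self.reduce = nn.Conv2d(fused, channels_per_branch, kernel_size=1)

    def forward(self, rgb_low, rgb_high, tir_low, tir_high):
        # Bring the deeper maps up to the spatial size of the shallow maps.
        size = rgb_low.shape[-2:]
        rgb_high = F.interpolate(rgb_high, size=size, mode="bilinear",
                                 align_corners=False)
        tir_high = F.interpolate(tir_high, size=size, mode="bilinear",
                                 align_corners=False)
        x = torch.cat([rgb_low, rgb_high, tir_low, tir_high], dim=1)
        # Channel attention decides how much each level/modality contributes.
        w = self.fc(self.gap(x).flatten(1)).unsqueeze(-1).unsqueeze(-1)
        return self.reduce(x * w)   # fused feature for the correlation filter


def cycle_consistency_loss(backward_response, initial_label):
    """Schematic unsupervised objective: track forward through a short clip
    and then backward with the correlation filter; the response mapped back
    onto the first frame should again peak at the initial (pseudo-)label."""
    return F.mse_loss(backward_response, initial_label)
```

As a shape check under these assumptions, four 64-channel inputs at 25x25 (with the deep maps at 13x13 before upsampling) yield one 64x25x25 fused map, which a correlation-filter head could then use for cycle-consistent unsupervised training.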

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62172417, 62101555, and 62106268, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20201346, and in part by the Xuzhou Key Research and Development Program under Grant KC22287.

Author information

Corresponding author

Correspondence to Rui Yao.

Ethics declarations

Conflict of Interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, S., Yao, R., Zhou, Y. et al. Unsupervised RGB-T object tracking with attentional multi-modal feature fusion. Multimed Tools Appl 82, 23595–23613 (2023). https://doi.org/10.1007/s11042-023-14362-9

