Abstract
In RGB-T tracking, given the object position in the first frame, a tracker predicts the object's position in subsequent frames by exploiting the complementary information of RGB and thermal infrared images. As the amount of available data grows, unsupervised training holds great potential for the RGB-T tracking task. It is well known that features extracted from different convolutional layers provide different levels of information about an image. In this paper, we propose a visual tracking framework based on attention-driven fusion of multi-modal and multi-level features. This fusion strategy fully exploits the advantages of both multi-level and multi-modal information. Specifically, a feature fusion module simultaneously fuses features from different levels and different modalities. We employ cycle consistency based on a correlation filter to train the model without supervision, reducing the cost of annotated data. The proposed tracker is evaluated on two popular benchmark datasets, GTOT and RGBT234. Experimental results show that our tracker performs favorably against other state-of-the-art unsupervised trackers at real-time tracking speed.
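To make the fusion idea concrete, the following is a minimal NumPy sketch of one plausible attentional fusion scheme: SE-style channel attention applied to each modality's feature map before element-wise addition. The bottleneck weights `w1`/`w2` are random placeholders, and the reduction ratio and additive fusion are assumptions for illustration, not the paper's actual module.

```python
import numpy as np

def channel_attention(feat, reduction=4, seed=0):
    """SE-style channel attention: squeeze (global average pool over H, W)
    then excite (two-layer bottleneck with sigmoid gating).
    feat: (C, H, W) feature map; weights here are random placeholders."""
    c = feat.shape[0]
    squeezed = feat.mean(axis=(1, 2))                  # (C,) channel descriptor
    rng = np.random.default_rng(seed)                  # placeholder parameters
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ squeezed, 0.0)            # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))        # sigmoid gate, (C,)
    return feat * gate[:, None, None]                  # reweight channels

def fuse_modalities(rgb_feat, tir_feat):
    """Fuse RGB and thermal features by channel-attention reweighting
    followed by element-wise addition."""
    return channel_attention(rgb_feat) + channel_attention(tir_feat, seed=1)

rgb = np.ones((8, 4, 4))   # toy RGB feature map (C=8, H=W=4)
tir = np.ones((8, 4, 4))   # toy thermal feature map
fused = fuse_modalities(rgb, tir)
print(fused.shape)  # (8, 4, 4)
```

In practice the same gating can be applied across features drawn from several convolutional layers as well as both modalities, which is the multi-level, multi-modal setting the abstract describes.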
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 62172417, 62101555, and 62106268, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20201346, and in part by the Xuzhou Key Research and Development Program under Grant KC22287.
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, S., Yao, R., Zhou, Y. et al. Unsupervised RGB-T object tracking with attentional multi-modal feature fusion. Multimed Tools Appl 82, 23595–23613 (2023). https://doi.org/10.1007/s11042-023-14362-9