Abstract
In RGB-T tracking, given the object position in the first frame, a tracker predicts the object's position in subsequent frames by exploiting the complementary information of RGB and thermal infrared images. As the amount of available data grows, unsupervised training holds great potential for the RGB-T tracking task. It is well known that features extracted from different convolutional layers provide different levels of information about an image. In this paper, we propose a visual tracking framework based on attention-driven fusion of multi-modal and multi-level features. This fusion strategy fully exploits the advantages of both multi-level and multi-modal information. Specifically, a feature fusion module simultaneously fuses features from different levels and different modalities. We employ cycle consistency based on a correlation filter to train the model without supervision, reducing the cost of annotated data. The proposed tracker is evaluated on two popular benchmark datasets, GTOT and RGBT234. Experimental results show that our tracker performs favorably against other state-of-the-art unsupervised trackers at real-time tracking speed.
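To make the fusion idea concrete, the following is a minimal NumPy sketch of one plausible attentional fusion scheme: SE-style channel attention applied to each modality's feature map before element-wise addition. The bottleneck weights `w1`/`w2` are random placeholders, and the reduction ratio and additive fusion are assumptions for illustration, not the paper's actual module.

```python
import numpy as np

def channel_attention(feat, reduction=4, seed=0):
    """SE-style channel attention: squeeze (global average pool over H, W)
    then excite (two-layer bottleneck with sigmoid gating).
    feat: (C, H, W) feature map; weights here are random placeholders."""
    c = feat.shape[0]
    squeezed = feat.mean(axis=(1, 2))                  # (C,) channel descriptor
    rng = np.random.default_rng(seed)                  # placeholder parameters
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ squeezed, 0.0)            # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))        # sigmoid gate, (C,)
    return feat * gate[:, None, None]                  # reweight channels

def fuse_modalities(rgb_feat, tir_feat):
    """Fuse RGB and thermal features by channel-attention reweighting
    followed by element-wise addition."""
    return channel_attention(rgb_feat) + channel_attention(tir_feat, seed=1)

rgb = np.ones((8, 4, 4))   # toy RGB feature map (C=8, H=W=4)
tir = np.ones((8, 4, 4))   # toy thermal feature map
fused = fuse_modalities(rgb, tir)
print(fused.shape)  # (8, 4, 4)
```

In practice the same gating can be applied across features drawn from several convolutional layers as well as both modalities, which is the multi-level, multi-modal setting the abstract describes.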
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 62172417, 62101555, and 62106268, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20201346, and in part by the Xuzhou Key Research and Development Program under Grant KC22287.
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, S., Yao, R., Zhou, Y. et al. Unsupervised RGB-T object tracking with attentional multi-modal feature fusion. Multimed Tools Appl 82, 23595–23613 (2023). https://doi.org/10.1007/s11042-023-14362-9