Computer Science ›› 2023, Vol. 50 ›› Issue (10): 96-103. doi: 10.11896/jsjkx.220900075

• Computer Graphics & Multimedia •

Fusion Tracker: Single-object Tracking Framework Fusing Image Features and Event Features

WANG Lin1, LIU Zhe1, SHI Dianxi1,2,3, ZHOU Chenlei3, YANG Shaowu1, ZHANG Yongjun2   

  1 School of Computer Science,National University of Defense Technology,Changsha 410073,China
    2 National Innovation Institute of Defense Technology,Academy of Military Sciences,Beijing 100166,China
    3 Tianjin Artificial Intelligence Innovation Center,Tianjin 300457,China
  • Received: 2022-09-08 Revised: 2022-12-09 Online: 2023-10-10 Published: 2023-10-10
  • Corresponding author: SHI Dianxi (dxshi@nudt.edu.cn)
  • About author: WANG Lin (wanglin12@nudt.edu.cn), born in 1998, postgraduate. His main research interests include event cameras, deep learning and computer vision. SHI Dianxi, born in 1966, Ph.D, professor, Ph.D supervisor. His main research interests include distributed object middleware technology, adaptive software technology, artificial intelligence, and robot operating systems.
  • Supported by:
    National Natural Science Foundation of China(91948303).


Abstract: Object tracking is a fundamental research problem in computer vision. As the mainstream sensor for object tracking, conventional cameras provide rich scene information. However, limited by their sampling principle, conventional cameras suffer from overexposure or underexposure under extreme lighting conditions and from motion blur in high-speed motion scenes. In contrast, an event camera is a bio-inspired sensor that senses changes in light intensity and outputs an event stream; it offers high dynamic range and high temporal resolution, but has difficulty capturing static targets. Inspired by the complementary characteristics of conventional and event cameras, a dual-modal fusion single-object tracking method, called the fusion tracker, is proposed. The method adaptively fuses visual cues from conventional-camera and event-camera data through feature enhancement, and designs an attention-based feature matching network that matches object cues in the template frame against the search frame to establish long-term feature associations and keep the tracker focused on object information. The fusion tracker alleviates the semantic loss caused by correlation operations during feature matching and improves tracking performance. Experiments on two publicly available datasets demonstrate the superiority of the proposed approach, and ablation studies validate the effectiveness of the key components of the fusion tracker. The fusion tracker effectively improves the robustness of object tracking in complex scenes and provides reliable tracking results for downstream applications.

Key words: Object tracking, Deep learning, Event cameras, Feature fusion, Attention mechanisms

CLC Number: TP391