Neural Modeling and Real-Time Environment Training of Human Binocular Stereo Visual Tracking

Wang, Jiaguo; Meng, Xianghao; Xu, Hanyuan; Pei, Yang

doi:10.1007/s12559-022-10091-7

Published: 19 December 2022

Cognitive Computation, Volume 15, pages 710–730 (2023)

Jiaguo Wang¹, Xianghao Meng¹, Hanyuan Xu¹ & Yang Pei¹ (ORCID: orcid.org/0000-0002-6209-7886)

Abstract

Simulating the natural human visual system is beneficial for understanding brain intelligence and for exploiting new aspects of computer vision. Previous studies have proposed many progressive models and experiments for visual tracking; however, only a few consider all the factors involved. Improvements are still required in cross-modal sensory fusion, online training in the physical environment, and the use of machine learning. In this paper, we present a visual tracking study that balances neuroscience models and deep-learning methods. In our visual tracking framework, we modify the original region proposal network and interconnect binocular R-CNNs with a new region of interest (RoI) model. Ground frame prediction is implemented by fusing the localization outputs of the binocular R-CNNs with external sensory information, such as a dense disparity map. In the behavior stage, visual-motor transformation is implemented through online training of saccade, pursuit, and vergence networks in the real environment. As demonstrated on a robot, our framework can learn tracking skills through online parameter updates using physical data collected from the robot. The framework achieves performance highly similar to human behavior and better accuracy than recent models. Moreover, using predictions from our ground vision model to guide binocular RoI pooling improves the efficiency of object recognition and localization and reduces visual tracking errors by 27% compared with the original network. In conclusion, this study proposes an effective binocular tracking framework that draws inspiration from brain structures. It shows improved accuracy and robustness in tracking randomly moving targets.
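The abstract's combination of binocular localization and disparity can be illustrated with textbook stereo geometry. The following is a minimal sketch, not the authors' implementation: it assumes rectified, parallel pinhole cameras and a target on the midline between them, and every name in it (`box_center_x`, `depth_from_disparity`, `vergence_angle`, the focal length and baseline values) is hypothetical. It shows how the horizontal offset between a left and right bounding box yields a depth estimate, and how that depth maps to a symmetric vergence angle.

```python
import math


def box_center_x(box):
    """Horizontal center of a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, _, x_max, _ = box
    return 0.5 * (x_min + x_max)


def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth in meters from Z = f * B / d for rectified parallel cameras.

    x_left, x_right: horizontal target position in each image (pixels).
    focal_px: focal length in pixels (assumed equal for both cameras).
    baseline_m: distance between the camera centers in meters.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a target in front of the cameras")
    return focal_px * baseline_m / disparity


def vergence_angle(depth_m, baseline_m):
    """Total vergence angle in radians for a target on the midline at depth_m."""
    return 2.0 * math.atan2(baseline_m / 2.0, depth_m)


# Hypothetical detections of the same target in the left and right images.
left_box = (320.0, 200.0, 360.0, 260.0)
right_box = (280.0, 200.0, 320.0, 260.0)

depth = depth_from_disparity(
    box_center_x(left_box), box_center_x(right_box), focal_px=800.0, baseline_m=0.1
)
angle = vergence_angle(depth, baseline_m=0.1)
print(f"depth = {depth:.3f} m, vergence = {math.degrees(angle):.3f} deg")
```

In this sketch the two box centers play the role of per-eye localization outputs; a dense disparity map would supply the same `disparity` value per pixel instead of per detection.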

Data Availability

The data and programs used in this study are available from the corresponding author upon reasonable request.

Acknowledgements

We would like to thank Dr. Xinhuan Zhou from the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, for providing technical assistance and proofreading.

Funding

This study was funded by the National Natural Science Foundation of China (grant number 12172287).

Author information

Authors and Affiliations

Northwestern Polytechnical University, Xi’an, 710072, China
Jiaguo Wang, Xianghao Meng, Hanyuan Xu & Yang Pei

Corresponding author

Correspondence to Yang Pei.

Ethics declarations

Ethics Approval

This article does not contain any studies involving human participants performed by any of the authors.

Conflict of Interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (MP4 38904 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, J., Meng, X., Xu, H. et al. Neural Modeling and Real-Time Environment Training of Human Binocular Stereo Visual Tracking. Cogn Comput 15, 710–730 (2023). https://doi.org/10.1007/s12559-022-10091-7

Received: 30 March 2022
Accepted: 04 December 2022
Published: 19 December 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s12559-022-10091-7
