Abstract
We introduce view birdification, the problem of recovering the ground-plane movements of people in a crowd from an egocentric video captured by an observer (e.g., a person or a vehicle) that is also moving in the crowd. Recovered ground-plane movements would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the observer's movement and then localizes surrounding pedestrians for each frame while taking into account the local interactions between them. We introduce three datasets by leveraging synthetic and real trajectories of people in crowds and evaluate the effectiveness of our method. The results demonstrate the accuracy of our method and lay the groundwork for further studies of view birdification as an important but challenging visual understanding problem.
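The cascaded two-stage structure described above (ego-motion first, then per-frame pedestrian localization) can be illustrated with a deliberately simplified sketch. This is not the paper's Bayesian optimizer: the function name `birdify`, the input layout, and the static-crowd averaging heuristic for stage 1 are all illustrative assumptions, and the real method additionally models local interactions between pedestrians.

```python
import numpy as np

def birdify(relative_obs):
    """Toy two-stage cascade (illustrative only, not the paper's method).

    relative_obs: array of shape (T, N, 2) holding N pedestrians'
    ground-plane positions relative to the observer at each of T frames.
    Returns the recovered observer trajectory (T, 2) and pedestrian
    trajectories (T, N, 2), in a world frame anchored at the observer's
    initial position.
    """
    T, N, _ = relative_obs.shape
    observer = np.zeros((T, 2))      # recovered observer trajectory
    people = np.empty((T, N, 2))     # recovered pedestrian trajectories
    people[0] = relative_obs[0]      # frame 0: observer sits at the origin
    for t in range(1, T):
        # Stage 1 (ego-motion): if the crowd is stationary on average,
        # the mean apparent shift of pedestrians between frames equals
        # the negative of the observer's own displacement.
        shift = (relative_obs[t] - relative_obs[t - 1]).mean(axis=0)
        observer[t] = observer[t - 1] - shift
        # Stage 2 (localization): place each pedestrian by compensating
        # the estimated ego position.
        people[t] = observer[t] + relative_obs[t]
    return observer, people
```

With a purely translating observer and a static crowd, this recovers both trajectories exactly; the averaging heuristic breaks down once pedestrians move, which is precisely why the paper's formulation models their interactions.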
Communicated by Xiaowei Zhou.
Supplementary Information
Supplementary file 1 (MP4, 98,460 KB)
Cite this article
Nishimura, M., Nobuhara, S. & Nishino, K. View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements. Int J Comput Vis 131, 2015–2031 (2023). https://doi.org/10.1007/s11263-023-01788-9