
View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements

Published in: International Journal of Computer Vision

Abstract

We introduce view birdification, the problem of recovering the ground-plane movements of people in a crowd from an ego-centric video captured by an observer (e.g., a person or a vehicle) that is itself moving in the crowd. Recovered ground-plane movements would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the observer’s movement and then localizes the surrounding pedestrians for each frame, taking into account the local interactions between them. We introduce three datasets that leverage synthetic and real trajectories of people in crowds and use them to evaluate the effectiveness of our method. The results demonstrate the accuracy of our method and lay the groundwork for further studies of view birdification as an important but challenging visual understanding problem.
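To make the cascade concrete, below is a minimal 2D sketch of the two stages the abstract describes: estimate the observer's ground-plane motion first, then localize the surrounding pedestrians while accounting for their local interactions. Everything in it is an illustrative assumption rather than the paper's actual formulation: the grid search over rotation, the constant-position and constant-velocity priors, the exponential repulsion standing in for local interactions, and all function names and parameter values are hypothetical.

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix for the observer's heading change."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def estimate_ego_motion(prev_world, curr_local, thetas=np.linspace(-0.2, 0.2, 81)):
    """Stage 1: recover the observer's rotation R and translation t.

    prev_world: (N, 2) pedestrian positions at frame t-1, world frame.
    curr_local: (N, 2) the same pedestrians at frame t, observer frame.
    Assumes pedestrians move little between frames, so each pair should
    satisfy prev_world ~ R @ curr_local + t. Rotation is grid-searched;
    the optimal translation for a fixed R is the mean residual.
    """
    best = None
    for th in thetas:
        R = rot2d(th)
        t = (prev_world - curr_local @ R.T).mean(axis=0)   # closed form
        resid = ((prev_world - curr_local @ R.T - t) ** 2).sum()
        if best is None or resid < best[0]:
            best = (resid, R, t)
    return best[1], best[2]

def localize_pedestrians(prev_world, prev_vel, curr_local, R, t,
                         lam=0.5, rep=0.05, radius=1.0):
    """Stage 2: ground-plane positions at frame t, given the ego-motion.

    Blends the lifted observation with a constant-velocity prior, then
    takes one gradient step on a pairwise repulsion potential as a
    stand-in for local pedestrian interactions.
    """
    observed = curr_local @ R.T + t        # observer frame -> world frame
    predicted = prev_world + prev_vel      # constant-velocity motion prior
    fused = lam * predicted + (1.0 - lam) * observed
    n = len(fused)
    diff = fused[:, None, :] - fused[None, :, :]             # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1) + 1e-9              # (N, N)
    push = diff / dist[..., None] * np.exp(-dist / radius)[..., None]
    push[np.arange(n), np.arange(n)] = 0.0                   # no self-repulsion
    return fused + rep * push.sum(axis=1)

# Toy usage: three pedestrians; the observer drifts by (0.1, 0.1) with no rotation,
# so each pedestrian appears shifted by -0.1 in the observer's frame.
prev = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vel = np.full((3, 2), 0.05)
obs = prev - 0.1 + np.random.normal(0.0, 0.02, (3, 2))
R, t = estimate_ego_motion(prev, obs)
curr = localize_pedestrians(prev, vel, obs, R, t)
```

Read this as the Bayesian cascade in miniature: stage 1 maximizes the posterior over the observer's pose under a pedestrian motion prior, and stage 2, holding that pose fixed, maximizes over pedestrian positions. In the paper's actual method the stages are coupled through the geometry of the perceived movements and a proper interaction model; the repulsion step above only gestures at where such a model would enter.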





Author information

Corresponding author: Mai Nishimura.

Additional information

Communicated by Xiaowei Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 98460 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nishimura, M., Nobuhara, S. & Nishino, K. View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements. Int J Comput Vis 131, 2015–2031 (2023). https://doi.org/10.1007/s11263-023-01788-9
