Abstract
We introduce view birdification, the problem of recovering the ground-plane movements of people in a crowd from an egocentric video captured by an observer (e.g., a person or a vehicle) that is also moving in the crowd. Recovered ground-plane movements would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the observer's movement and then localizes surrounding pedestrians for each frame while taking into account the local interactions between them. We introduce three datasets by leveraging synthetic and real trajectories of people in crowds and evaluate the effectiveness of our method. The results demonstrate the accuracy of our method and lay the groundwork for further studies of view birdification as an important but challenging visual understanding problem.
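The cascaded two-stage structure described above (ego-motion first, then per-frame pedestrian localization) can be illustrated with a deliberately simplified sketch. This is not the paper's Bayesian optimizer: the function name `birdify`, the input layout, and the static-crowd averaging heuristic for stage 1 are all illustrative assumptions, and the real method additionally models local interactions between pedestrians.

```python
import numpy as np

def birdify(relative_obs):
    """Toy two-stage cascade (illustrative only, not the paper's method).

    relative_obs: array of shape (T, N, 2) holding N pedestrians'
    ground-plane positions relative to the observer at each of T frames.
    Returns the recovered observer trajectory (T, 2) and pedestrian
    trajectories (T, N, 2), in a world frame anchored at the observer's
    initial position.
    """
    T, N, _ = relative_obs.shape
    observer = np.zeros((T, 2))      # recovered observer trajectory
    people = np.empty((T, N, 2))     # recovered pedestrian trajectories
    people[0] = relative_obs[0]      # frame 0: observer sits at the origin
    for t in range(1, T):
        # Stage 1 (ego-motion): if the crowd is stationary on average,
        # the mean apparent shift of pedestrians between frames equals
        # the negative of the observer's own displacement.
        shift = (relative_obs[t] - relative_obs[t - 1]).mean(axis=0)
        observer[t] = observer[t - 1] - shift
        # Stage 2 (localization): place each pedestrian by compensating
        # the estimated ego position.
        people[t] = observer[t] + relative_obs[t]
    return observer, people
```

With a purely translating observer and a static crowd, this recovers both trajectories exactly; the averaging heuristic breaks down once pedestrians move, which is precisely why the paper's formulation models their interactions.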
Communicated by Xiaowei Zhou.
Supplementary Information
Supplementary file 1 (MP4, 98,460 KB)
Cite this article
Nishimura, M., Nobuhara, S. & Nishino, K. View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements. Int J Comput Vis 131, 2015–2031 (2023). https://doi.org/10.1007/s11263-023-01788-9