Abstract
In this work, we show how to estimate a device’s position and orientation indoors by echolocation, i.e., by interpreting the echoes of an audio signal that the device itself emits. Established visual localization methods rely on the device’s camera and yield excellent accuracy if unique visual features are in view and depicted clearly. We argue that audio sensing can offer complementary information to vision for device localization, since audio is invariant to adverse visual conditions and can reveal scene information beyond a camera’s field of view. We first propose a strategy for learning an audio representation that captures the scene geometry around a device using supervision transfer from vision. Subsequently, we leverage this audio representation to complement vision in three device localization tasks: relative pose estimation, place recognition, and absolute pose regression. Our proposed methods outperform state-of-the-art vision models on new audio-visual benchmarks for the Replica and Matterport3D datasets.
K. Yang and C. Godard—Work done while at Niantic, during Karren’s internship.
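The supervision-transfer strategy described in the abstract — training an audio branch to predict the geometry-aware embedding that a frozen vision model produces for the same scene — can be illustrated with a minimal NumPy sketch. Everything here is a simplifying assumption, not the paper's actual method: both "encoders" are linear maps, the echo features and teacher embeddings are synthetic, and all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a frozen "vision encoder" assigns each of N scenes a
# D-dimensional geometry embedding (here synthesized via a hidden
# linear map, standing in for a real pretrained network).
N, A, D = 256, 32, 8
echoes = rng.normal(size=(N, A))        # echo features, one row per scene
W_hidden = rng.normal(size=(A, D))      # hidden echo-to-geometry relation
vision_emb = echoes @ W_hidden          # teacher targets from "vision"

# Student audio encoder (linear), trained by gradient descent on the
# MSE between its output and the vision embeddings -- the
# supervision-transfer objective in miniature.
W = np.zeros((A, D))
lr = 0.2
for _ in range(300):
    pred = echoes @ W
    grad = echoes.T @ (pred - vision_emb) / N   # gradient of mean squared error
    W -= lr * grad

mse = float(np.mean((echoes @ W - vision_emb) ** 2))
```

After training, the audio branch reproduces the vision embeddings on this toy data (`mse` near zero), which is the sense in which the audio representation "inherits" scene geometry from vision before being reused for the downstream pose tasks.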
References
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: CVPR (2016)
Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)
Babenko, A., Lempitsky, V.: Aggregating local deep features for image retrieval. In: ICCV (2015)
Balntas, V., Li, S., Prisacariu, V.: RelocNet: continuous metric learning relocalisation using neural nets. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 782–799. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_46
Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC (2016)
Bhowmik, A., Gumhold, S., Rother, C., Brachmann, E.: Reinforced feature points: optimizing feature detection and description for a high-level task. In: CVPR, June 2020
Brachmann, E., et al.: DSAC - differentiable RANSAC for camera localization. In: CVPR (2017)
Brachmann, E., Michel, F., Krull, A., Yang, M.Y., Gumhold, S., Rother, C.: Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In: CVPR (2016)
Brachmann, E., Rother, C.: Learning less is more - 6D camera localization via 3D surface regression. In: CVPR (2018)
Brachmann, E., Rother, C.: Expert sample consensus applied to camera re-localization. In: ICCV (2019)
Brachmann, E., Rother, C.: Neural-guided RANSAC: learning where to sample model hypotheses. In: ICCV (2019)
Brachmann, E., Rother, C.: Visual camera re-localization from RGB and RGB-D images using DSAC. TPAMI (2021)
Brahmbhatt, S., Gu, J., Kim, K., Hays, J., Kautz, J.: Geometry-aware learning of maps for camera localization. In: CVPR (2018)
Bui, M., et al.: 6D camera relocalization in ambiguous scenes via continuous multimodal inference. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 139–157. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58523-5_9
Cai, R., Hariharan, B., Snavely, N., Averbuch-Elor, H.: Extreme rotation estimation using dense correlation volumes. In: CVPR (2021)
Castle, R., Klein, G., Murray, D.W.: Video-rate localization in multiple maps for wearable augmented reality. In: 2008 12th IEEE International Symposium on Wearable Computers, pp. 15–22. IEEE (2008)
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017)
Chen, C., Al-Halah, Z., Grauman, K.: Semantic audio-visual navigation. In: CVPR (2021)
Chen, C., et al.: Audio-visual embodied navigation. arXiv preprint arXiv:1912.11474 (2019)
Chen, C., et al.: SoundSpaces: audio-visual navigation in 3D environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 17–36. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_2
Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S.K., Grauman, K.: Learning to set waypoints for audio-visual navigation. arXiv preprint arXiv:2008.09622 (2020)
Chen, K., Snavely, N., Makadia, A.: Wide-baseline relative camera pose estimation with directional learning. In: CVPR (2021)
Chen, Z., Hu, X., Owens, A.: Structure from silence: learning scene structure from ambient sound. arXiv preprint arXiv:2111.05846 (2021)
Christensen, J.H., Hornauer, S., Yu, S.X.: BatVision: learning to see 3D spatial layout with two ears. In: ICRA (2020)
Debski, A., Grajewski, W., Zaborowski, W., Turek, W.: Open-source localization device for indoor mobile robots. Procedia Comput. Sci. 76, 139–146 (2015)
Dokmanić, I., Parhizkar, R., Walther, A., Lu, Y.M., Vetterli, M.: Acoustic echoes reveal room shape. Proc. Natl. Acad. Sci. 110(30), 12186–12191 (2013)
Dusmanu, M., et al.: D2-Net: a trainable CNN for joint detection and description of local features. In: CVPR (2019)
Eliakim, I., Cohen, Z., Kosa, G., Yovel, Y.: A fully autonomous terrestrial bat-like acoustic robot. PLoS Comput. Biol. 14(9), e1006406 (2018)
En, S., Lechervy, A., Jurie, F.: RPNet: an end-to-end network for relative camera pose estimation. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11129, pp. 738–745. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11009-3_46
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Fisher III, J.W., Darrell, T., Freeman, W., Viola, P.: Learning joint statistical models for audio-visual fusion and segregation. In: NeurIPS (2000)
Gan, C., Zhao, H., Chen, P., Cox, D., Torralba, A.: Self-supervised moving vehicle tracking with stereo sound. In: ICCV (2019)
Gao, R., Chen, C., Al-Halah, Z., Schissler, C., Grauman, K.: VisualEchoes: spatial image representation learning through echolocation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 658–676. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_38
Gao, R., Grauman, K.: 2.5D visual sound. In: CVPR (2019)
Gao, R., Oh, T.H., Grauman, K., Torresani, L.: Listen to look: action recognition by previewing audio. In: CVPR (2020)
Garg, S., Fischer, T., Milford, M.: Where is your place, visual place recognition? In: IJCAI (2021)
Greene, N.: Environment mapping and other applications of world projections. IEEE Comput. Graphics Appl. 6(11), 21–29 (1986)
Hausler, S., Garg, S., Xu, M., Milford, M., Fischer, T.: Patch-NetVLAD: multi-scale fusion of locally-global descriptors for place recognition. In: CVPR (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hershey, J., Movellan, J.: Audio vision: using audio-visual synchrony to locate sounds. In: NeurIPS (1999)
Hu, J., Ozay, M., Zhang, Y., Okatani, T.: Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In: WACV (2019)
Humenberger, M., et al.: Robust image retrieval-based visual localization using Kapture. arXiv:2007.13867 (2020)
Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: CVPR (2010)
Kazakos, E., Nagrani, A., Zisserman, A., Damen, D.: Epic-fusion: audio-visual temporal binding for egocentric action recognition. In: ICCV (2019)
Kendall, A., Cipolla, R.: Geometric loss functions for camera pose regression with deep learning. In: CVPR (2017)
Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: ICCV (2015)
Kidron, E., Schechner, Y.Y., Elad, M.: Pixels that sound. In: CVPR (2005)
Laskar, Z., Melekhov, I., Kalia, S., Kannala, J.: Camera relocalization by computing pairwise relative poses using convolutional neural network. In: ICCV Workshops (2017)
Li, X., Wang, S., Zhao, Y., Verbeek, J., Kannala, J.: Hierarchical scene coordinate classification and regression for visual localization. In: CVPR (2020)
Li, Y., Snavely, N., Huttenlocher, D.P.: Location recognition using prioritized feature matching. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 791–804. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15552-9_57
Li, Y., Snavely, N., Huttenlocher, D., Fua, P.: Worldwide pose estimation using 3D point clouds. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 15–29. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33718-5_2
Lim, H., Sinha, S.N., Cohen, M.F., Uyttendaele, M.: Real-time image-based 6-DOF localization in large-scale environments. In: CVPR (2012)
Lindell, D.B., Wetzstein, G., Koltun, V.: Acoustic non-line-of-sight imaging. In: CVPR (2019)
Liu, D., Cui, Y., Yan, L., Mousas, C., Yang, B., Chen, Y.: DenserNet: weakly supervised visual localization using multi-scale feature aggregation. In: AAAI (2021)
Long, X., Gan, C., De Melo, G., Wu, J., Liu, X., Wen, S.: Attention clusters: purely attention based local feature integration for video classification. In: CVPR (2018)
Masone, C., Caputo, B.: A survey on deep visual place recognition. IEEE Access 9, 19516–19547 (2021)
Melekhov, I., Ylioinas, J., Kannala, J., Rahtu, E.: Relative camera pose estimation using convolutional neural networks. In: Blanc-Talon, J., Penne, R., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2017. LNCS, vol. 10617, pp. 675–687. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70353-4_57
Morgado, P., Li, Y., Vasconcelos, N.: Learning representations from audio-visual spatial alignment. In: NeurIPS, vol. 33, pp. 4733–4744 (2020)
Morgado, P., Vasconcelos, N., Langlois, T., Wang, O.: Self-supervised generation of spatial audio for 360 video. In: NeurIPS, vol. 31 (2018)
Morgado, P., Vasconcelos, N., Misra, I.: Audio-visual instance discrimination with cross-modal agreement. In: CVPR (2021)
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML (2011)
Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 639–658. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_39
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E.H., Freeman, W.T.: Visually indicated sounds. In: CVPR (2016)
Owens, A., Wu, J., McDermott, J.H., Freeman, W.T., Torralba, A.: Ambient sound provides supervision for visual learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 801–816. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_48
Parida, K.K., Srivastava, S., Sharma, G.: Beyond image to depth: improving depth prediction using echoes. In: CVPR (2021)
Politis, A., Mesaros, A., Adavanne, S., Heittola, T., Virtanen, T.: Overview and evaluation of sound event localization and detection in DCASE 2019. IEEE/ACM Trans. Audio Speech Language Process. 29, 684–698 (2020)
Poursaeed, O., et al.: Deep fundamental matrix estimation without correspondences. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11131, pp. 485–497. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11015-4_35
Purushwalkam, S., et al.: Audio-visual floorplan reconstruction. In: ICCV (2021)
Raguram, R., Frahm, J.-M., Pollefeys, M.: A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5303, pp. 500–513. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88688-4_37
Ranftl, R., Koltun, V.: Deep fundamental matrix estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 292–309. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_18
Revaud, J., Weinzaepfel, P., de Souza, C.R., Humenberger, M.: R2D2: repeatable and reliable detector and descriptor. In: NeurIPS (2019)
de Sa, V.R.: Learning classification with unlabeled data. In: NeurIPS (1994)
Sarlin, P.E., Cadena, C., Siegwart, R., Dymczyk, M.: From coarse to fine: robust hierarchical localization at large scale. In: CVPR (2019)
Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: learning feature matching with graph neural networks. In: CVPR (2020)
Sarlin, P.E., et al.: Back to the feature: learning robust camera localization from pixels to pose. In: CVPR (2021). arxiv.org/abs/2103.09213
Sattler, T., Havlena, M., Radenovic, F., Schindler, K., Pollefeys, M.: Hyperpoints and fine vocabularies for large-scale location recognition. In: ICCV (2015)
Sattler, T., Leibe, B., Kobbelt, L.: Improving image-based localization by active correspondence search. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 752–765. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33718-5_54
Sattler, T., Leibe, B., Kobbelt, L.: Efficient & effective prioritized matching for large-scale image-based localization. In: PAMI (2017)
Sattler, T., et al.: Are large-scale 3D models really necessary for accurate visual localization? In: CVPR (2017)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV (2019)
Shavit, Y., Ferens, R., Keller, Y.: Learning multi-scene absolute pose regression with transformers. In: ICCV, pp. 2733–2742, October 2021
Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene coordinate regression forests for camera relocalization in RGB-D images. In: CVPR (2013)
Singh, N., Mentch, J., Ng, J., Beveridge, M., Drori, I.: Image2reverb: cross-modal reverb impulse response synthesis. In: ICCV (2021)
Sohl-Dickstein, J., et al.: A device for human ultrasonic echolocation. IEEE Trans. Biomed. Eng. 62(6), 1526–1534 (2015)
Straub, J., et al.: The replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797 (2019)
Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: detector-free local feature matching with transformers. In: CVPR (2021)
Sun, W., Jiang, W., Trulls, E., Tagliasacchi, A., Yi, K.M.: ACNe: attentive context normalization for robust permutation-equivariant learning. In: CVPR, June 2020
Svarm, L., Enqvist, O., Oskarsson, M., Kahl, F.: Accurate localization and pose estimation for large 3D models. In: CVPR (2014)
Svärm, L., Enqvist, O., Kahl, F., Oskarsson, M.: City-scale localization for cameras with known vertical direction. TPAMI (2017)
Taira, H., et al.: InLoc: indoor visual localization with dense matching and view synthesis. In: CVPR (2018)
Taira, H., et al.: InLoc: indoor visual localization with dense matching and view synthesis. TPAMI (2021)
Taubner, F., Tschopp, F., Novkovic, T., Siegwart, R., Furrer, F.: LCD-line clustering and description for place recognition. In: 2020 International Conference on 3D Vision (3DV) (2020)
Thrun, S.: Affine structure from sound. In: NeurIPS (2005)
Torii, A., Arandjelovic, R., Sivic, J., Okutomi, M., Pajdla, T.: 24/7 place recognition by view synthesis. In: CVPR (2015)
Türkoğlu, M.Ö., Brachmann, E., Schindler, K., Brostow, G., Monszpart, A.: Visual camera re-localization using graph neural networks and relative pose supervision. In: 3DV. IEEE (2021)
Tyszkiewicz, M., Fua, P., Trulls, E.: DISK: learning local features with policy gradient. In: NeurIPS (2020)
Valentin, J., Nießner, M., Shotton, J., Fitzgibbon, A., Izadi, S., Torr, P.: Exploiting uncertainty in regression forests for accurate camera relocalization. In: CVPR (2015)
Vasudevan, A.B., Dai, D., Van Gool, L.: Semantic object prediction and spatial sound super-resolution with binaural sounds. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 638–655. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_37
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017)
Villalpando, A.P., Schillaci, G., Hafner, V.V., Guzmán, B.L.: Ego-noise predictions for echolocation in wheeled robots. In: ALIFE 2019: The 2019 Conference on Artificial Life, pp. 567–573. MIT Press (2019)
Wagner, D., Reitmayr, G., Mulloni, A., Drummond, T., Schmalstieg, D.: Real-time detection and tracking for augmented reality on mobile phones. IEEE Trans. Visual Comput. Graphics 16(3), 355–368 (2009)
Walch, F., Hazirbas, C., Leal-Taixé, L., Sattler, T., Hilsenbeck, S., Cremers, D.: Image-based localization using LSTMs for structured feature correlation. In: ICCV (2017)
Wang, W., Tran, D., Feiszli, M.: What makes training multi-modal classification networks hard? In: CVPR (2020)
Winkelbauer, D., Denninger, M., Triebel, R.: Learning to localize in new environments from synthetic training data. In: ICRA (2021)
Wu, Z., Jiang, Y.G., Wang, X., Ye, H., Xue, X.: Multi-stream multi-class fusion of deep networks for video classification. In: ACM Multimedia, pp. 791–800 (2016)
Yang, K., Lin, W.Y., Barman, M., Condessa, F., Kolter, Z.: Defending multimodal fusion models against single-source adversaries. In: CVPR (2021)
Yang, K., Russell, B., Salamon, J.: Telling left from right: learning spatial correspondence of sight and sound. In: CVPR (2020)
Yi, K.M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., Fua, P.: Learning to find good correspondences. In: CVPR (2018)
Yue, H., Miao, J., Yu, Y., Chen, W., Wen, C.: Robust loop closure detection based on bag of superpoints and graph verification. In: IROS (2019)
Zhang, Z., et al.: Generative modeling of audible shapes for object perception. In: ICCV (2017)
Zhou, Q., Sattler, T., Pollefeys, M., Leal-Taixé, L.: To learn or not to learn: visual localization from essential matrices. In: ICRA (2019)
Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: CVPR (2019)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yang, K., Firman, M., Brachmann, E., Godard, C. (2022). Camera Pose Estimation and Localization with Active Audio Sensing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_16