
ROS open-source audio recognizer: ROAR environmental sound detection tools for robot programming


Abstract

Advances in audio recognition have enabled the real-world success of a wide variety of interactive voice systems over the last two decades. More recently, these same techniques have shown promise in recognizing non-speech audio events. Sounds are ubiquitous in real-world manipulation, such as the click of a button, the crash of an object being knocked over, and the whine of activation from an electric power tool. Surprisingly, very few autonomous robots leverage audio feedback to improve their performance. Modern audio recognition techniques are capable of learning and recognizing real-world sounds, but few implementations are easily incorporated into modern robotic programming frameworks. This paper presents a new software library known as the ROS Open-source Audio Recognizer (ROAR). ROAR provides a complete set of end-to-end tools for online supervised learning of new audio events, feature extraction, automatic one-class Support Vector Machine model tuning, and real-time audio event detection. Through implementation on a Barrett WAM arm, we show that combining the contextual information of the manipulation action with a set of learned audio events yields significant improvements in robotic task-completion rates.
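To make the abstract's end-to-end pipeline concrete, here is a minimal sketch of its core stages: extracting fixed-length features from short audio clips, automatically tuning a one-class Support Vector Machine on positive examples of a single sound event, and classifying new clips in real time. ROAR itself is built on ROS; librosa and scikit-learn are used below purely as stand-ins, and every concrete choice here (mean-pooled MFCC features, the nu/gamma grid, the tuning score) is an illustrative assumption rather than ROAR's actual design.

```python
# Sketch of a ROAR-style audio event pipeline. All parameter choices
# below are illustrative assumptions, not values from the paper.
import numpy as np
import librosa
from sklearn.model_selection import ParameterGrid
from sklearn.svm import OneClassSVM


def clip_features(audio, sr, n_mfcc=13):
    """Mean MFCC vector for one audio clip (one training example)."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)


def tune_one_class_svm(train_X, holdout_X):
    """Grid-search nu and gamma, keeping the model that accepts the
    largest fraction of held-out positive examples (a simple stand-in
    for ROAR's automatic model tuning)."""
    best_model, best_score = None, -1.0
    for params in ParameterGrid({"nu": [0.01, 0.05, 0.1, 0.2],
                                 "gamma": [0.01, 0.1, 1.0]}):
        model = OneClassSVM(kernel="rbf", **params).fit(train_X)
        score = float((model.predict(holdout_X) == 1).mean())
        if score > best_score:
            best_model, best_score = model, score
    return best_model


def detect(model, audio, sr):
    """True if a new clip matches the learned sound event."""
    return model.predict(clip_features(audio, sr)[None, :])[0] == 1
```

In use, one such model would be trained per sound event (button click, object crash, tool whine), and, as the abstract notes, detections become most useful when gated by manipulation context, for example consulting only the button-click model while the robot is executing a pressing action.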



Acknowledgments

This work was supported by funding from the DARPA Autonomous Robotic Manipulation Software Track (US Army RDECOM contract W91CRB-10-C-0127) and by the University of Pennsylvania.

Author information


Corresponding author

Correspondence to Joseph M. Romano.

Electronic Supplementary Material

Below is the Electronic Supplementary Material.

ESM 1 (MP4, 118,736 kb)



Cite this article

Romano, J.M., Brindza, J.P. & Kuchenbecker, K.J. ROS open-source audio recognizer: ROAR environmental sound detection tools for robot programming. Auton Robot 34, 207–215 (2013). https://doi.org/10.1007/s10514-013-9323-6
