
Deep Artwork Detection and Retrieval for Automatic Context-Aware Audio Guides

Published: 28 June 2017

Abstract

In this article, we address the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors. As an autonomous agent, our guide perceives the context and interacts with users in an appropriate fashion: it understands what the visitor is looking at, whether the visitor is moving inside the museum hall, and whether he or she is talking with a friend. The guide performs automatic recognition of artworks and provides configurable interface features that improve the user experience and the enjoyment of multimedia materials through semi-automatic interaction.
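
The context-aware behavior described above amounts to a small sensor-fusion policy. The sketch below is purely illustrative (it is not the authors' implementation; all names and signals are hypothetical) and shows one plausible way to combine artwork recognition, motion sensing, and voice activity detection into a playback decision:

```python
from enum import Enum, auto

class GuideAction(Enum):
    PLAY_DESCRIPTION = auto()  # start or resume the audio description
    PAUSE_AUDIO = auto()       # visitor is talking: do not interrupt
    STAY_SILENT = auto()       # nothing relevant to describe

def decide_action(artwork_id, is_walking, voice_detected, currently_playing):
    """Toy context-aware policy (hypothetical, for illustration only)."""
    if voice_detected:
        # The visitor is talking with someone: pause rather than talk over them.
        return GuideAction.PAUSE_AUDIO
    if artwork_id is not None and not is_walking:
        # Standing in front of a recognized artwork: describe it.
        return GuideAction.PLAY_DESCRIPTION
    if currently_playing:
        # Let a running description finish while the visitor wanders.
        return GuideAction.PLAY_DESCRIPTION
    return GuideAction.STAY_SILENT
```

In practice such a policy would run on every recognition cycle, typically with some hysteresis so that the audio does not toggle on brief sensor glitches.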

Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose a compact Convolutional Neural Network (CNN) that performs object classification and localization; reusing the CNN features computed for these tasks, we also perform robust artwork recognition. To improve recognition accuracy, we apply additional video processing using shape-based filtering, artwork tracking, and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1, and tested in a real-world environment (the Bargello Museum in Florence).
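
To make the feature-reuse idea concrete, here is a minimal sketch of CNN-feature retrieval with temporal filtering. It assumes the gallery descriptors are pre-computed, L2-normalized CNN features; the class name, thresholds, and window sizes are hypothetical and not taken from the paper:

```python
import numpy as np
from collections import Counter, deque

class ArtworkRecognizer:
    """Illustrative sketch: match a detected artwork's CNN descriptor against
    a gallery of known artworks, then smooth predictions over time."""

    def __init__(self, gallery_features, gallery_labels, window=10, min_votes=6):
        # gallery_features: (N, D) array of L2-normalized descriptors.
        # gallery_labels: list of N artwork identifiers.
        self.gallery = gallery_features
        self.labels = gallery_labels
        self.history = deque(maxlen=window)  # recent per-frame predictions
        self.min_votes = min_votes

    def match(self, feature):
        # On L2-normalized vectors, cosine similarity is a dot product.
        feature = feature / np.linalg.norm(feature)
        sims = self.gallery @ feature
        best = int(np.argmax(sims))
        return self.labels[best], float(sims[best])

    def update(self, feature, sim_threshold=0.6):
        # Per-frame match, rejected if the similarity is too low.
        label, sim = self.match(feature)
        self.history.append(label if sim >= sim_threshold else None)
        # Temporal filtering: report a label only if it dominates the window.
        winner, votes = Counter(self.history).most_common(1)[0]
        return winner if winner is not None and votes >= self.min_votes else None
```

A real system would feed this from the detector's bounding boxes and reset the history whenever the tracked box is lost, which is roughly the role played by the artwork tracking mentioned above.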

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 3s
Special Section on Deep Learning for Mobile Multimedia and Special Section on Best Papers from ACM MMSys/NOSSDAV 2016
August 2017, 258 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3119899

              Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 28 June 2017
              • Revised: 1 March 2017
              • Accepted: 1 March 2017
              • Received: 1 November 2016
Published in TOMM Volume 13, Issue 3s
