Abstract
In this article, we address the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors. As an autonomous agent, our guide perceives the context and interacts with users accordingly. To do so, it understands what the visitor is looking at, whether the visitor is moving inside the museum hall, and whether he or she is talking with a friend. The guide performs automatic recognition of artworks and provides configurable interface features that improve the user experience and the enjoyment of multimedia materials through semi-automatic interaction.
Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we also perform robust artwork recognition. To improve the recognition accuracy, we perform additional video processing using shape-based filtering, artwork tracking, and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1 and tested in a real-world environment (Bargello Museum of Florence).
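The abstract does not say how the temporal filtering step works. As a minimal sketch only, one common way to stabilize per-frame recognition results is a sliding-window majority vote: a label is reported only once it has been seen in enough recent frames. The class name `TemporalLabelFilter`, the parameters `window` and `min_votes`, and the example labels are all hypothetical and are not taken from the article.

```python
from collections import Counter, deque
from typing import Deque, Optional


class TemporalLabelFilter:
    """Sliding-window majority vote over per-frame recognition labels.

    Illustrative assumption, not the authors' published method.
    """

    def __init__(self, window: int = 15, min_votes: int = 10) -> None:
        if not 0 < min_votes <= window:
            raise ValueError("min_votes must be between 1 and window")
        self.min_votes = min_votes
        # Only the most recent `window` frame labels are kept.
        self.history: Deque[Optional[str]] = deque(maxlen=window)

    def update(self, label: Optional[str]) -> Optional[str]:
        """Add one frame's label (None means nothing was recognized).

        Returns the label that holds at least `min_votes` of the frames
        in the window, or None if no label is stable enough yet.
        """
        self.history.append(label)
        votes = Counter(l for l in self.history if l is not None)
        if not votes:
            return None
        best, count = votes.most_common(1)[0]
        return best if count >= self.min_votes else None


if __name__ == "__main__":
    # Hypothetical per-frame outputs of an artwork recognizer.
    frames = ["david", "bacchus", "david", None, "david", "david"]
    smoother = TemporalLabelFilter(window=5, min_votes=3)
    for frame_label in frames:
        print(frame_label, "->", smoother.update(frame_label))
```

Under this assumption, a larger `window` suppresses more spurious single-frame detections at the cost of a longer delay before the guide reacts to a newly framed artwork.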
Supplemental Material
Supplemental movie, appendix, image, and software files for "Deep Artwork Detection and Retrieval for Automatic Context-Aware Audio Guides"