
Deep Artwork Detection and Retrieval for Automatic Context-Aware Audio Guides

Published: 28 June 2017

Abstract

In this article, we address the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors. As an autonomous agent, our guide perceives the context and interacts with users in an appropriate fashion: it understands what the visitor is looking at, whether the visitor is moving inside the museum hall, and whether he or she is talking with a friend. The guide performs automatic recognition of artworks and provides configurable interface features that improve the user experience and the enjoyment of multimedia materials through semi-automatic interaction.
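
The context-aware behavior described above amounts to a small sensor-fusion policy. The sketch below is purely illustrative (it is not the authors' implementation; all names and signals are hypothetical) and shows one plausible way to combine artwork recognition, motion sensing, and voice activity detection into a playback decision:

```python
from enum import Enum, auto

class GuideAction(Enum):
    PLAY_DESCRIPTION = auto()  # start or resume the audio description
    PAUSE_AUDIO = auto()       # visitor is talking: do not interrupt
    STAY_SILENT = auto()       # nothing relevant to describe

def decide_action(artwork_id, is_walking, voice_detected, currently_playing):
    """Toy context-aware policy (hypothetical, for illustration only)."""
    if voice_detected:
        # The visitor is talking with someone: pause rather than talk over them.
        return GuideAction.PAUSE_AUDIO
    if artwork_id is not None and not is_walking:
        # Standing in front of a recognized artwork: describe it.
        return GuideAction.PLAY_DESCRIPTION
    if currently_playing:
        # Let a running description finish while the visitor wanders.
        return GuideAction.PLAY_DESCRIPTION
    return GuideAction.STAY_SILENT
```

In practice such a policy would run on every recognition cycle, typically with some hysteresis so that the audio does not toggle on brief sensor glitches.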

Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose a compact Convolutional Neural Network (CNN) that performs object classification and localization; reusing the CNN features computed for these tasks, we also perform robust artwork recognition. To improve recognition accuracy, we apply additional video processing using shape-based filtering, artwork tracking, and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1, and tested in a real-world environment (the Bargello Museum in Florence).
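
To make the feature-reuse idea concrete, here is a minimal sketch of CNN-feature retrieval with temporal filtering. It assumes the gallery descriptors are pre-computed, L2-normalized CNN features; the class name, thresholds, and window sizes are hypothetical and not taken from the paper:

```python
import numpy as np
from collections import Counter, deque

class ArtworkRecognizer:
    """Illustrative sketch: match a detected artwork's CNN descriptor against
    a gallery of known artworks, then smooth predictions over time."""

    def __init__(self, gallery_features, gallery_labels, window=10, min_votes=6):
        # gallery_features: (N, D) array of L2-normalized descriptors.
        # gallery_labels: list of N artwork identifiers.
        self.gallery = gallery_features
        self.labels = gallery_labels
        self.history = deque(maxlen=window)  # recent per-frame predictions
        self.min_votes = min_votes

    def match(self, feature):
        # On L2-normalized vectors, cosine similarity is a dot product.
        feature = feature / np.linalg.norm(feature)
        sims = self.gallery @ feature
        best = int(np.argmax(sims))
        return self.labels[best], float(sims[best])

    def update(self, feature, sim_threshold=0.6):
        # Per-frame match, rejected if the similarity is too low.
        label, sim = self.match(feature)
        self.history.append(label if sim >= sim_threshold else None)
        # Temporal filtering: report a label only if it dominates the window.
        winner, votes = Counter(self.history).most_common(1)[0]
        return winner if winner is not None and votes >= self.min_votes else None
```

A real system would feed this from the detector's bounding boxes and reset the history whenever the tracked box is lost, which is roughly the role played by the artwork tracking mentioned above.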

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 3s
Special Section on Deep Learning for Mobile Multimedia and Special Section on Best Papers from ACM MMSys/NOSSDAV 2016
August 2017, 258 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3119899

              Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 28 June 2017
              • Revised: 1 March 2017
              • Accepted: 1 March 2017
              • Received: 1 November 2016
Published in TOMM Volume 13, Issue 3s
