survey

Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics

Authors:
Peratham Wiriyathammabhum

University of Maryland, College Park, MD

University of Maryland, College Park, MD

0000-0001-5567-3104
View Profile

,
Douglas Summers-Stay

U.S. Army Research Laboratory, Adelphi, MD

U.S. Army Research Laboratory, Adelphi, MD
View Profile

,
Cornelia Fermüller

University of Maryland, College Park, MD

University of Maryland, College Park, MD
View Profile

,
Yiannis Aloimonos

University of Maryland, College Park, MD

University of Maryland, College Park, MD
View Profile

Authors Info & Claims

ACM Computing Surveys Volume 49 Issue 4Article No.: 71pp 1–44https://doi.org/10.1145/3009906

Published:12 December 2016Publication History

ACM Computing Surveys

Abstract

Integrating computer vision and natural language processing is a novel interdisciplinary field that has received a lot of attention recently. In this survey, we provide a comprehensive introduction of the integration of computer vision and natural language processing in multimedia and robotics applications with more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies to integrate computer vision and natural language processing models as a unified theme of distributional semantics. We make an analog of distributional semantics in computer vision and natural language processing as image embedding and word embedding, respectively. We also present a unified view for the field and propose possible future directions.

References

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292 (2015).Google Scholar
Eren Erdal Aksoy, Alexey Abramov, Johannes Dörr, Kejun Ning, Babette Dellen, and Florentin Wörgötter. 2011. Learning the semantics of object--action relations by observation. Int. J. Robot. Res. (2011), 0278364911410459. Google ScholarDigital Library
Yiannis Aloimonos and Cornelia Fermüller. 2015. The cognitive dialogue: A new model for vision implementing common sense reasoning. Image Vis. Comput. 34 (2015), 42--44. Google ScholarDigital Library
Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 1 (2014), 2773--2832. Google ScholarDigital Library
Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. 2012a. A method of moments for mixture models and hidden Markov models. In COLT, Vol. 1. 4.Google Scholar
Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, and Sham M. Kakade. 2012b. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems. 917--925. Google ScholarDigital Library
Andrew J. Anderson, Elia Bruni, Ulisse Bordignon, Massimo Poesio, and Marco Baroni. 2013. Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In EMNLP. 1960--1970.Google Scholar
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In NAACL.Google Scholar
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 39--48.Google ScholarCross Ref
Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychol. Rev. 116, 3 (2009), 463.Google ScholarCross Ref
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425--2433. Google ScholarDigital Library
Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. 2009. A survey of robot learning from demonstration. Robot. Auton. Syst. 57, 5 (2009), 469--483. Google ScholarDigital Library
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.Google Scholar
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics--Volume 1. Association for Computational Linguistics, 86--90. Google ScholarDigital Library
Gökhan Bakir. 2007. Predicting structured data. MIT press, 2007. Google ScholarDigital Library
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2012. Abstract meaning representation (AMR) 1.0 specification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. ACL. 1533--1544.Google Scholar
Albert Bandura. 1974. Psychological Modeling: Conflicting Theories. Transaction Publishers.Google Scholar
Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, and others. 2012a. Video in sentences out. In UAI 2012. Google ScholarDigital Library
Andrei Barbu, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. 2012b. Simultaneous object detection, tracking, and event recognition. In ACS 2012.Google Scholar
Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M. Blei, and Michael I. Jordan. 2003. Matching words and pictures. J. Mach. Learn. Res. 3 (2003), 1107--1135. Google ScholarDigital Library
Kobus Barnard and David Forsyth. 2001. Learning the semantics of words and pictures. In Proceedings of the 8th IEEE International Conference on Computer Vision, 2001 (ICCV 2001), Vol. 2. IEEE, 408--415.Google ScholarCross Ref
Marco Baroni. 2016. Grounding distributional semantics in the visual world. Lang. Ling. Compass 10, 1 (2016), 3--13.Google ScholarCross Ref
Francisco Barranco, Cornelia Fermüller, and Yiannis Aloimonos. 2014. Contour motion estimation for asynchronous event-driven cameras. Proc. IEEE 102, 10 (2014), 1537--1556.Google ScholarCross Ref
Daniel Barrett, Andrei Barbu, N. Siddharth, and Jeffrey Siskind. 2016. Saying what you're looking for: Linguistics meets video search. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 10 (Oct. 2016).Google ScholarDigital Library
Jonathan Barron and Jitendra Malik. 2015. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell. 37, 8 (2015), 1670--1687.Google ScholarDigital Library
Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-up robust features (SURF). Comput. Vis. Image Understand. 110, 3 (2008), 346--359. Google ScholarDigital Library
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In Computer Vision--ECCV 2006. Springer, 404--417. Google ScholarDigital Library
Michael Beetz, Suat Gedikli, Jan Bandouch, Bernhard Kirchlechner, Nico von Hoyningen-Huene, and Alexander Perzylo. 2007. Visually tracking football games based on TV broadcasts. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI). Google ScholarDigital Library
Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19, 7 (1997), 711--720. Google ScholarDigital Library
Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neur. Comput. 15, 6 (2003), 1373--1396. Google ScholarDigital Library
Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. 2015. Representing meaning with a combination of logical form and vectors. arXiv preprint arXiv:1505.06816 (2015).Google Scholar
Yoshua Bengio, Aaron Courville, and Pierre Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 8 (2013), 1798--1828. Google ScholarDigital Library
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3 (2003), 1137--1155. Google ScholarDigital Library
Yoshua Bengio, Hugo Larochelle, Pascal Lamblin, Dan Popovici, Aaron Courville, Clarence Simard, Jerome Louradour, and Dumitru Erhan. 2007. Deep architectures for baby AI. (2007).Google Scholar
A. Berg, J. Deng, and L. Fei-Fei. 2010. Large scale visual recognition challenge (ILSVRC), 2010. Retrieved from http://www. image-net.org/challenges/LSVRC (2010).Google Scholar
Tamara Berg and Alexander C. Berg. 2009. Finding iconic images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009 (CVPR Workshops 2009). IEEE, 1--8.Google ScholarCross Ref
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee-Whye Teh, Erik Learned-Miller, and David A. Forsyth. 2004. Names and faces in the news. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), Vol. 2. IEEE, II--848. Google ScholarDigital Library
Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. 2010. Automatic attribute discovery and characterization from noisy web data. In Computer Vision--ECCV 2010. Springer, 663--676. Google ScholarDigital Library
Tamara L. Berg, David Forsyth, and others. 2006. Animals on the web. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1463--1470. Google ScholarDigital Library
Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. 55 (2016), 409--442. Google ScholarCross Ref
David M. Blei and Michael I. Jordan. 2003. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM, 127--134. Google ScholarDigital Library
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022. Google ScholarCross Ref
Benjamin S. Bloom and others. 1956. Taxonomy of educational objectives. Vol. 1: Cognitive domain. McKay, New York, NY (1956), 20--24.Google Scholar
Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. 2005. Three-dimensional face recognition. Int. J. Comput. Vis. 64, 1 (2005), 5--30. Google ScholarDigital Library
Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 136--145. Google ScholarDigital Library
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res. 49 (2014), 1--47. Google ScholarDigital Library
Donna Byron, Alexander Koller, Jon Oberlander, Laura Stoia, and Kristina Striegnitz. 2007. Generating instructions in virtual environments (GIVE): A challenge and an evaluation testbed for NLG. (2007).Google Scholar
Angelo Cangelosi. 2006. The grounding and sharing of symbols. Pragm. Cogn. 14, 2 (2006), 275--285.Google ScholarCross Ref
Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, Vol. 5. 3. Google ScholarDigital Library
Marisa Carrasco. 2011. Visual attention: The past 25 years. Vis. Res. 51, 13 (2011), 1484--1525.Google ScholarCross Ref
Joao Carreira and Cristian Sminchisescu. 2010. Constrained parametric min-cuts for automatic object segmentation. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3241--3248.Google ScholarCross Ref
Angel X. Chang, Manolis Savva, and Christopher D. Manning. 2014. Semantic parsing for text to 3d scene generation. ACL 2014 (2014), 17.Google Scholar
Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. 2015. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision. 1017--1025. Google ScholarDigital Library
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.Google ScholarCross Ref
Anthony Chemero. 2003. An outline of a theory of affordances. Ecological Psychology 15, 2 (2003), 181--195.Google ScholarCross Ref
David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, 190--200. Google ScholarDigital Library
David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. ACM, 128--135. Google ScholarDigital Library
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015a. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).Google Scholar
Xinlei Chen, Ashish Shrivastava, and Arpan Gupta. 2013. Neil: Extracting visual knowledge from web data. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 1409--1416. Google ScholarDigital Library
Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015b. Revisiting word embedding for contrasting meaning. In Proceedings of ACL.Google ScholarCross Ref
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder--decoder approaches. Syntax Sem. Struct. Stat. Transl. (2014), 103.Google Scholar
Myung Jin Choi, Antonio Torralba, and Alan S. Willsky. 2012. Context models and out-of-context objects. Pattern Recogn. Lett. 33, 7 (2012), 853--862. Google ScholarDigital Library
Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. NUS-WIDE: A real-world web image database from national university of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 48. Google ScholarDigital Library
Stephen Clark and Stephen Pulman. 2007. Combining symbolic and distributional models of meaning. In AAAI Spring Symposium: Quantum Interaction. 52--55.Google Scholar
Michael D. Cohen and Paul Bacdayan. 1994. Organizational routines are stored as procedural memory: Evidence from a laboratory study. Organiz. Sci. 5, 4 (1994), 554--568. Google ScholarDigital Library
Nadav Cohen, Or Sharir, and Amnon Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory. 698--728.Google Scholar
Silvia Coradeschi, Amy Loutfi, and Britta Wrede. 2013. A short review of symbol grounding in robotic and intelligent systems. KI-Künstliche Intell. 27, 2 (2013), 129--136.Google ScholarCross Ref
Silvia Coradeschi and Alessandro Saffiotti. 2000. Anchoring symbols to sensor data: Preliminary report. In AAAI/IAAI. 129--135. Google ScholarDigital Library
Nelson Cowan. 2008. What are the differences between long-term, short-term, and working memory? Progr. Brain Res. 169 (2008), 323--338.Google ScholarCross Ref
Trevor Darrell. 2010. Learning Representations for Real-world Recognition. Retrieved from http://www.eecs.berkeley.edu/&sim;trevor/colloq.pdf UCB EECS Colloquium {Accessed: 2015 11 1}.Google Scholar
Pradipto Das, Chenliang Xu, Richard Doell, and Jason Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2634--2641. Google ScholarDigital Library
Hal Daumé III. 2007. Frustratingly easy domain adaptation. ACL 2007 (2007), 256.Google Scholar
Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Mach. Learn. 75, 3 (2009), 297--325. Google ScholarDigital Library
Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269--1277. Google ScholarDigital Library
Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi, Yejin Choi, Hal Daumé III, Alexander C. Berg, and others. 2012. Detecting visual text. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 762--772. Google ScholarDigital Library
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634.Google ScholarCross Ref
Alexey Dosovitskiy, Philipp Fischery, Eddy Ilg, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox, and others. 2015. Flownet: Learning optical flow with convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2758--2766. Google ScholarDigital Library
Susan T. Dumais. 2007. LSA and information retrieval: Getting back to basics. Handb. Latent Semant. Anal. (2007), 293--321.Google Scholar
Hugh Durrant-Whyte and Tim Bailey. 2006. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 13, 2 (2006), 99--110.Google ScholarCross Ref
Pinar Duygulu, Kobus Barnard, Joao F. G. de Freitas, and David A. Forsyth. 2002. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Computer Vision ECCV 2002. Springer, 97--112. Google ScholarDigital Library
Aleksandrs Ecins, Cornelia Fermuller, and Yiannis Aloimonos. 2014. Shadow free segmentation in still images using local density measure. In Proceedings of the 2014 IEEE International Conference on Computational Photography (ICCP). IEEE, 1--8.Google ScholarCross Ref
Aleksandrs Ecins, Cornelia Fermuller, and Yiannis Aloimonos. 2016. Cluttered scene segmentation using the symmetry constraint. In Proceedings of the International Conference in Robotics and Automation (ICRA).Google ScholarDigital Library
H. Eichenbaum. 2008. Memory. Scholarpedia 3, 3 (2008), 1747.Google ScholarCross Ref
Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In EMNLP. 1292--1302.Google Scholar
Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, Vol. 452. 457.Google ScholarCross Ref
Oren Etzioni, Michele Banko, and Michael J. Cafarella. 2006. Machine reading. In AAAI, Vol. 6. 1517--1519. Google ScholarDigital Library
Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303--338. Google ScholarDigital Library
Rui Fang, Changsong Liu, Lanbo She, and Joyce Y. Chai. 2013. Towards situated dialogue: Revisiting referring expression generation. In EMNLP. 392--402.Google Scholar
Ali Farhadi. 2011. Designing Representational Architectures in Recognition. University of Illinois at Urbana-Champaign. Champaign, IL, USA.Google Scholar
Ali Farhadi, Ian Endres, and Derek Hoiem. 2010. Attribute-centric recognition for cross-category generalization. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2352--2359.Google ScholarCross Ref
Alireza Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009). IEEE, 1778--1785.Google ScholarCross Ref
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision--ECCV 2010. Springer, 15--29. Google ScholarDigital Library
Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166 (2014).Google Scholar
Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 (2015).Google Scholar
Li Fei-Fei, Rob Fergus, and Pietro Perona. 2007. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106, 1 (2007), 59--70. Google ScholarDigital Library
S. L. Feng, Raghavan Manmatha, and Victor Lavrenko. 2004. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004 (CVPR’04)., Vol. 2. IEEE, II--1002. Google ScholarDigital Library
Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, and Margaret Mitchell. 2015. A survey of current datasets for vision and language research. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 207--213.Google ScholarCross Ref
Ronald A. Fisher. 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 2 (1936), 179--188.Google ScholarCross Ref
Daryl Fougnie. 2008. The relationship between attention and working memory. New Res. Short-term Mem. (2008), 1--45.Google Scholar
D. F. Fouhey, A. Gupta, and A. Zisserman. 2016. 3D shape attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google Scholar
Jianlong Fu, Jinqiao Wang, Xin-Jing Wang, Yong Rui, and Hanqing Lu. 2015. What visual attributes characterize an object class? In Computer Vision--ACCV 2014. Springer, 243--259.Google Scholar
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems. 2296--2304. Google ScholarDigital Library
D. Garcia-Gasulla, J. Béjar, U. Cortés, E. Ayguadé, and J. Labarta. 2015. Extracting visual patterns from deep learning representations. arXiv preprint arXiv:1507.08818 (2015).Google Scholar
Peter Gärdenfors. 2014. The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press.Google ScholarCross Ref
Konstantina Garoufi. 2014. Planning-based models of natural language generation. Lang. Ling. Compass 8, 1 (2014), 1--10.Google ScholarCross Ref
Konstantina Garoufi and Alexander Koller. 2010. Automated planning for situated natural language generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1573--1582. Google ScholarDigital Library
Konstantina Garoufi, Maria Staudte, Alexander Koller, and Matthew W. Crocker. 2016. Exploiting listener gaze to improve situated communication in dynamic virtual environments. Cognitive Science 40, 7 (2016), 1671--1703.Google ScholarCross Ref
Dan Garrette, Katrin Erk, and Raymond Mooney. 2014. A formal approach to linking logical form and vector-space lexical semantics. In Computing Meaning. Springer, 27--48.Google Scholar
Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448. Google ScholarDigital Library
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 580--587. Google ScholarDigital Library
Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis. 106, 2 (2014), 210--233. Google ScholarDigital Library
Kristen Grauman and Bastian Leibe. 2010. Visual Object Recognition. Number 11. Morgan 8 Claypool Publishers. Google ScholarDigital Library
Douglas Greenlee. 1978. Semiotic and significs. Int. Stud. Philos. 10 (1978), 251--254.Google ScholarCross Ref
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Sarad Venugopalan, Randy Mooney, Trevor Darrell, and Kate Saenko. 2013. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 2712--2719. Google ScholarDigital Library
Gutemberg Guerra-Filho and Yiannis Aloimonos. 2007. A language for human action. Computer 40, 5 (2007), 42--51. Google ScholarDigital Library
Abhinav Gupta. 2009. Beyond nouns and verbs. (2009).Google Scholar
Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2015. Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. Int. J. Comput. Vis. 112, 2 (2015), 133--149. Google ScholarDigital Library
Saurabh Gupta and Jitendra Malik. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015).Google Scholar
Xintong Han, Bharat Singh, Vlad I. Morariu, and Larry S. Davis. 2015. Fast automatic video retrieval using web images. arXiv preprint arXiv:1512.03384 (2015).Google Scholar
Emily M. Hand and Rama Chellappa. 2016. Attributes for improved attributes: A multi-task network for attribute classification. arXiv preprint arXiv:1604.07360 (2016).Google Scholar
Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2014. Simultaneous detection and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV).Google ScholarCross Ref
Stevan Harnad. 1990. The symbol grounding problem. Physica D 42, 1 (1990), 335--346. Google ScholarDigital Library
Zellig S. Harris. 1954. Distributional structure. Word 10, 2--3 (1954), 146--162.Google ScholarCross Ref
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.Google Scholar
Geremy Heitz and Daphne Koller. 2008. Learning spatial context: Using stuff to find things. In Computer Vision--ECCV 2008. Springer, 30--43. Google ScholarDigital Library
Geoffrey E. Hinton. 1984. Distributed representations. Technical Report: Carnegie Melon University.Google Scholar
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neur. Comput. 9, 8 (1997), 1735--1780. Google ScholarDigital Library
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. (2013), 853--899. Google ScholarDigital Library
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 50--57. Google ScholarDigital Library
Bernhard Hommel, Jochen Müsseler, Gisa Aschersleben, and Wolfgang Prinzb. 2001. The theory of event coding (TEC): A framework for perception and action planning. Behav. Brain Sci. 24 (2001), 849--937.Google ScholarCross Ref
Thanarat Horprasert, David Harwood, and Larry S. Davis. 1999. A statistical approach for real-time robust background subtraction and shadow detection. In IEEE ICCV, Vol. 99. 1--19.Google Scholar
Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report. Technical Report 07-49, University of Massachusetts, Amherst.Google Scholar
Mark J. Huiskes and Michael S. Lew. 2008. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval. ACM, 39--43. Google ScholarDigital Library
Julian Jaynes. 2000. The Origin of Consciousness in the Breakdown of the Bicameral Mind. Houghton Mifflin Harcourt.Google Scholar
Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. 2003. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 119--126. Google ScholarDigital Library
Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarCross Ref
Benjamin Johnston, Fangkai Yang, Rogan Mendoza, Xiaoping Chen, and Mary-Anne Williams. 2008. Ontology based object categorization for robots. In Practical Aspects of Knowledge Management. Springer, 219--231. Google ScholarDigital Library
Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2015. Learning visual features from large weakly supervised data. arXiv preprint arXiv:1511.02251 (2015).Google Scholar
Alap Karapurkar. 2008. Modeling human activities. Scholarly Paper Archive, Department of Computer Science, University of Maryland, College Park, MD, 20742.Google Scholar
Andrej Karpathy and Li Fei-Fei. 2015a. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128--3137.Google ScholarCross Ref
Andrej Karpathy and Li Fei-Fei. 2015b. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3276--3284. Google ScholarDigital Library
Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 2 (2002), 171--184. Google ScholarDigital Library
Alexander Koller and Matthew Stone. 2007. Sentence generation as a planning problem. ACL 2007 (2007), 336.Google Scholar
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalanditis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2016), 45.Google Scholar
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. NAACL HLT 2013 (2013), 10.Google Scholar
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105. Google ScholarDigital Library
German Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving boolean structures from distributional vectors. Trans. Assoc. Comput. Ling. 3 (2015), 375--388.Google ScholarCross Ref
Gaurav Kulkarni, Visruth Premraj, Vicente Ordonez, Sudipta Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 12 (2013), 2891--2903. Google ScholarDigital Library
Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.Google Scholar
Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision. IEEE, 365--372.Google Scholar
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. 2012. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 359--368. Google ScholarDigital Library
Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. 2011. A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1817--1824.Google ScholarCross Ref
Kevin Lai and Dieter Fox. 2010. Object recognition in 3D point clouds using web data and domain adaptation. Int. J. Robot. Res. 29, 8 (2010), 1019--1037. Google ScholarDigital Library
Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009). IEEE, 951--958.Google ScholarCross Ref
Victor Lavrenko, R. Manmatha, and Jiwoon Jeon. 2003. A model for learning the semantics of pictures. In Advances in Neural Information Processing Systems. None. Google ScholarDigital Library
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 2169--2178. Google ScholarDigital Library
Dieu-Thu Le, Jasper Uijlings, and Raffaella Bernardi. 2014. Tuhoi: Trento universal human object interaction dataset. V8L Net 2014 (2014), 17.Google Scholar
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278--2324.Google ScholarCross Ref
Chee Wee Leong and Rada Mihalcea. 2011. Going beyond text: A hybrid image-text approach for measuring word relatedness. In IJCNLP. 1403--1407.Google Scholar
Stephen C. Levinson. 2001. Pragmatics. In International Encyclopedia of Social and Behavioral Sciences: Vol. 17. Pergamon, 11948--11954.Google Scholar
Omer Levy and Yoav Goldberg. 2014a. Dependencybased word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 2. 302--308.Google Scholar
Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems. 2177--2185. Google ScholarDigital Library
Li-Jia Li and Li Fei-Fei. 2007. What, where and who? Classifying events by scene and object recognition. In Proceedings of the IEEE 11th International Conference on Computer Vision. IEEE, 1--8.Google ScholarCross Ref
Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 220--228. Google ScholarDigital Library
Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G. M. Snoek, and Alberto Del Bimbo. 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput. Surv. 49, 1 (2016), 14. Google ScholarDigital Library
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Vol. 8.Google Scholar
Changsong Liu and Joyce Yue Chai. 2015. Learning to mediate perceptual differences in situated human-robot dialogue. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. AAAI Press, 2288--2294. Google ScholarDigital Library
Changsong Liu, Lanbo She, Rui Fang, and Joyce Y. Chai. 2014a. Probabilistic labeling for efficient referential grounding based on collaborative discourse. In ACL (2). 13--18.Google Scholar
Jingen Liu, Benjamin Kuipers, and Silvio Savarese. 2011. Recognizing human actions by attributes. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3337--3344. Google ScholarDigital Library
Tie-Yan Liu. 2009. Learning to rank for information retrieval. Found. Trends Inform. Retriev. 3, 3 (2009), 225--331. Google ScholarDigital Library
Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. 2014b. Single-view 3d scene parsing by attributed grammar. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 684--691. Google ScholarDigital Library
Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431--3440.Google ScholarCross Ref
David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 2 (2004), 91--110. Google ScholarDigital Library
James MacQueen and others. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. 281--297.Google Scholar
Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar. 2008. A new baseline for image annotation. In Computer Vision--ECCV 2008. Springer, 316--329. Google ScholarDigital Library
Alexis Maldonado, Humberto Alvarez, and Michael Beetz. 2012. Improving robot manipulation through fingertip perception. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2947--2954.Google ScholarCross Ref
Jitendra Malik, Pablo Arbeláez, João Carreira, Katerina Fragkiadaki, Ross Girshick, Georgia Gkioxari, Saurabh Gupta, Bharath Hariharan, Abhishek Kar, and Shubham Tulsiani. 2016. The three Rs of computer vision: Recognition, reconstruction and reorganization. Pattern Recogn. Lett. 72 (2016), 4--14. Google ScholarDigital Library
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision. 1--9. Google ScholarDigital Library
Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What’s cookin’? Interpreting cooking videos using text, speech and vision. In NAACL 2015.Google ScholarCross Ref
Matthew Marge, Claire Bonial, Brendan Byrne, Taylor Cassidy, A. William Evans, Susan G. Hill, and Clare Voss. 2016. Applying the wizard-of-oz technique to multimodal human-robot dialogue. In Proceedings of RO-MAN (To appear).Google Scholar
David Marr. 1982. Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY. Google ScholarDigital Library
David R. Martin, Charless C. Fowlkes, and Jitendra Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26, 5 (2004), 530--549. Google ScholarDigital Library
Cynthia Matuszek^*, Nicholas FitzGerald^*, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 2012 International Conference on Machine Learning. Edinburgh, Scotland.Google Scholar
Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. 2013. Learning to parse natural language commands to a robot control system. In Experimental Robotics. Springer, 403--415.Google Scholar
Nikolaos Mavridis. 2015. A review of verbal and non-verbal human--robot interactive communication. Robot. Auton. Syst. 63 (2015), 22--35. Google ScholarDigital Library
Nikolaos Mavridis and Deb Roy. 2006. Grounded situation models for robots: Where words and percepts meet. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4690--4697.Google ScholarCross Ref
Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015a. Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785--794. Google ScholarDigital Library
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015b. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43--52. Google ScholarDigital Library
Jon D. Mcauliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems. 121--128.Google Scholar
Brian McMahan and Matthew Stone. 2015. A Bayesian model of grounded color semantics. Trans. Assoc. Comput. Ling. 3 (2015), 103--115.Google ScholarCross Ref
Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behav. Res. Methods 37, 4 (2005), 547--559.Google ScholarCross Ref
Chet Meyers and Thomas B. Jones. 1993. Promoting Active Learning. Strategies for the College Classroom. ERIC.Google Scholar
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111--3119. Google ScholarDigital Library
George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 39--41. Google ScholarDigital Library
Marvin Minsky. 2006. The emotion machine. New York: Pantheon (2006).Google ScholarDigital Library
Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 747--756. Google ScholarDigital Library
Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Comput. Ling. 39, 3 (2013), 555--590.Google ScholarCross Ref
Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI. 1598--1601. Google ScholarDigital Library
Raymond J. Mooney. 2013. Grounded Language Learning. (7 2013). 27th AAAI Conference on Artificial Intelligence, Washington 2013 Retrieved November 2, 2015 from http://videolectures.net/ aaai2013_mooney_language_learning/.Google Scholar
Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. 1999. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management. Citeseer, 1--9.Google Scholar
Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics. Citeseer, 246--252.Google Scholar
Charles William Morris. 1938. Foundations of the theory of signs. (1938).Google Scholar
Venkatesh N. Murthy, Subhransu Maji, and R. Manmatha. 2015. Automatic image annotation using deep learning representations. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 603--606. Google ScholarDigital Library
Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Affordance detection of tool parts from geometric features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).Google ScholarCross Ref
Douglas L. Nelson, Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The university of south Florida free association, rhyme, and word fragment norms. Behav. Res. Methods Instrum. Comput. 36, 3 (2004), 402--407.Google ScholarCross Ref
Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. 2015. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS). Google ScholarDigital Library
Aude Oliva and Antonio Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 3 (2001), 145--175. Google ScholarDigital Library
Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2013. From large scale image categorization to entry-level categories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Google ScholarDigital Library
Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems. 1143--1151. Google ScholarDigital Library
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311--318. Google ScholarDigital Library
Devi Parikh. 2009. Modeling context for image understanding: When, for what, and how? (2009).Google ScholarDigital Library
Devi Parikh and Kristen Grauman. 2011. Relative attributes. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 503--510. Google ScholarDigital Library
Seyoung Park, Bruce Xiaohan Nie, and Song-Chun Zhu. 2016. Attribute and-or grammar for joint parsing of human attributes, part and pose. arXiv preprint arXiv:1605.02112 (2016).Google Scholar
Katerina Pastra and Yiannis Aloimonos. 2012. The minimalist grammar of action. Philos. Trans. Roy. Soc. B: Biol. Sci. 367, 1585 (2012), 103--117.Google ScholarCross Ref
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014) 12 (2014), 1532--1543.Google Scholar
Jean Piaget. 2013. Play, Dreams and Imitation in Childhood. Vol. 25. Routledge.Google Scholar
Tony Plate. 1997. A common framework for distributed representation schemes for compositional structure. Connectionist Systems for Knowledge Representation and Deduction (1997), 15--34.Google Scholar
Robert Pless and Richard Souvenir. 2009. A survey of manifold learning for images. IPSJ Trans. Comput. Vis. Appl. 1 (2009), 83--94.Google ScholarCross Ref
J. Pont-Tuset, P. Arbelaez, J. Barron, F. Marques, and J. Malik. 2016. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Trans. Pattern Anal. Mach. Intelli. (2016). Google ScholarDigital Library
Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 1--10. Google ScholarDigital Library
Cecilia Quiroga-Clare. 2003. Language ambiguity: A curse and a blessing. Transl. J. 7, 1 (2003).Google Scholar
Gabriel A. Radvansky and Jeffrey M. Zacks. 2014. Event Cognition. Oxford University Press.Google Scholar
Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems. 2953--2961. Google ScholarDigital Library
Giacomo Rizzolatti and Laila Craighero. 2004. The mirror-neuron system. Annu. Rev. Neurosci. 27 (2004), 169--192.Google ScholarCross Ref
Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. 2014. Coherent multi-sentence video description with variable level of detail. In Pattern Recognition. Springer, 184--195.Google Scholar
Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3202--3212.Google ScholarCross Ref
Stephen Roller and Sabine Schulte Im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1146--1157.Google Scholar
Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the ACL.Google ScholarCross Ref
Sam T. Roweis and Lawrence K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 5500 (2000), 2323--2326.Google ScholarCross Ref
Deb Roy. 2005. Grounding words in perception and action: Computational insights. Trends Cogn. Sci. 9, 8 (2005), 390.Google ScholarCross Ref
Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 77, 1-3 (2008), 157--173. Google ScholarDigital Library
Fereshteh Sadeghi, C. Lawrence Zitnick, and Ali Farhadi. 2015. VISALOGY: Answering visual analogy questions. In Advances in Neural Information Processing Systems (NIPS-15). Google ScholarDigital Library
Mohammad Amin Sadeghi and Ali Farhadi. 2011. Recognition using visual phrases. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1745--1752. Google ScholarDigital Library
Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon (January 1, 2005). Dissertations available from ProQuest. Paper AAI3179808. http://repository.upenn.edu/dissertations/AAI3179808.Google Scholar
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. CoNLL 2015 (2015), 258.Google Scholar
Nishant Shukla, Caiming Xiong, and Song-Chun Zhu. 2015. A unified framework for human-robot knowledge transfer. In Proceedings of the 2015 AAAI Fall Symposium Series.Google Scholar
Narayanaswamy Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2014. Seeing what you’re told: Sentence-guided activity recognition in video. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 732--739. Google ScholarDigital Library
Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In ACL (1). 572--582.Google Scholar
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).Google Scholar
Bharat Singh, Xintong Han, Zhe Wu, Vlad I. Morariu, and Larry S. Davis. 2015. Selecting relevant web trained concepts for automated event retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 4561--4569. Google ScholarDigital Library
Jeffrey Mark Siskind. 2001. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J. Artif. Intell. Res. 15 (2001), 31--90. Google ScholarDigital Library
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling. 2 (2014), 207--218.Google ScholarCross Ref
Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 129--136.Google Scholar
Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15 (2014), 2949--2980. Google ScholarDigital Library
Mark Steedman. 1996. Surface structure and interpretation. (1996).Google Scholar
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, and others. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems. 2440--2448. Google ScholarDigital Library
Douglas Summers-Stay, Ching L. Teo, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2012. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4104--4111.Google ScholarCross Ref
Douglas Alan Summers-Stay. 2013. Productive vision: Methods for automatic image comprehension. (2013).Google Scholar
Yuyin Sun, Liefeng Bo, and Dieter Fox. 2013. Attribute based object identification. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2096--2103.Google ScholarCross Ref
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1--9.Google ScholarCross Ref
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.Google Scholar
Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lars Wolf. 2014. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1701--1708. Google ScholarDigital Library
Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. 2015. Book2movie: Aligning video scenes with book chapters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1827--1835.Google ScholarCross Ref
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 896--903. Google ScholarDigital Library
Stefanie Tellex, Ross Knepper, Adrian Li, Daniela Rus, and Nicholas Roy. 2014. Asking for help using inverse semantics. Proceedings of Robotics: Science and Systems, Berkeley, USA (2014).Google ScholarCross Ref
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI. Google ScholarDigital Library
Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 5500 (2000), 2319--2323.Google ScholarCross Ref
Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Fast 2D border ownership assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5117--5125.Google ScholarCross Ref
Ching L. Teo, Yezhou Yang, Hal Daumé III, Cornelia Fermüller, and Yiannis Aloimonos. 2012. Towards a Watson that sees: Language-guided action recognition for robots. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 374--381.Google ScholarCross Ref
Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING).Google Scholar
Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI). Google ScholarDigital Library
Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic Robotics. MIT Press.Google Scholar
Joseph Tighe and Svetlana Lazebnik. 2010. Superparsing: Scalable nonparametric image parsing with superpixels. In European Conference on Computer Vision. Springer, 352--365. Google ScholarDigital Library
Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070 (2015).Google Scholar
Antonio Torralba, Alexei Efros, and others. 2011. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1521--1528. Google ScholarDigital Library
Anne-Marie Tousch, Stéphane Herbin, and Jean-Yves Audibert. 2012. Semantic hierarchies for image annotation: A survey. Pattern Recogn. 45, 1 (2012), 333--345. Google ScholarDigital Library
Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. 2005. Image parsing: Unifying segmentation, detection, and recognition. Int. J. Comput. Vis. 63, 2 (2005), 113--140. Google ScholarDigital Library
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 384--394. Google ScholarDigital Library
Matthew Turk and Alex Pentland. 1991. Eigenfaces for recognition. J. Cogn. Neurosci. 3, 1 (1991), 71--86. Google ScholarDigital Library
Jasper R. R. Uijlings and Vittorio Ferrari. 2015. Situational object boundary detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4712--4721.Google Scholar
Laurens J. P. van der Maaten, Eric O. Postma, and H. Jaap van den Herik. 2009. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 10, 1--41 (2009), 66--71.Google Scholar
Bernard Vauquois. 1968. Structures profondes et traduction automatique. Le système du CETA. Rev. Roum. Ling. 13, 2 (1968), 105--130.Google Scholar
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566--4575.Google ScholarCross Ref
Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534--4542. Google ScholarDigital Library
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT.Google Scholar
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164.Google ScholarCross Ref
Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 319--326. Google ScholarDigital Library
Matthew R. Walter, Matthew E. Antone, Ekapol Chuangsuwanich, Andrew Correa, Randall Davis, Luke Fletcher, Emilio Frazzoli, Yuli Friedman, James R. Glass, Jonathan P. How, Jeong Hwan Jeon, Sertac Karaman, Brandon Luders, Nicholas Roy, Stefanie Tellex, and Seth J. Teller. 2015. A situationally aware voice-commandable robotic forklift working alongside people in unstructured outdoor environments. J. Field Robot. 32, 4 (2015), 590--628. Google ScholarDigital Library
Chong Wang, David Blei, and Fei-Fei Li. 2009. Simultaneous image classification and annotation. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 1903--1910.Google Scholar
Meng Wang, Bingbing Ni, Xian-Sheng Hua, and Tat-Seng Chua. 2012. Assistive tagging: A survey of multimedia tagging with human-computer joint exploration. ACM Comput. Surv. 44, 4 (2012), 25. Google ScholarDigital Library
Ronald J. Williams. 1988. On the use of backpropagation in associative reinforcement learning. In Proceedings of the IEEE International Conference on Neural Networks, 1988. IEEE, 263--270.Google ScholarCross Ref
Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2015b. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google Scholar
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015a. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1912--1920.Google Scholar
Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In ICML.Google Scholar
Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. 2015b. A multi-scale multiple instance video description network. arXiv preprint arXiv:1505.05914 (2015).Google Scholar
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015a. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning. 2048--2057.Google Scholar
Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2015c. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Yezhou Yang, Cornelia Fermuller, and Yiannis Aloimonos. 2013. Detection of manipulation action consequences (MAC). In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2563--2570. Google ScholarDigital Library
Yezhou Yang, Cornelia Fermuller, Yiannis Aloimonos, and Eren Erdal Aksoy. 2015. Learning the semantics of manipulation action. The 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Vol. 1. Association for Computational Linguistics, 676--686.Google ScholarCross Ref
Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. 2015a. Neural self talk: Image understanding via continuous questioning and answering. arXiv preprint arXiv:1512.03460 (2015).Google Scholar
Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. 2015b. Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI-15). Google ScholarDigital Library
Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 444--454. Google ScholarDigital Library
Yezhou Yang, Ching L. Teo, Cornelia Fermuller, and Yiannis Aloimonos. 2013. Robots with language: Multi-label visual recognition using NLP. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 4256--4262.Google ScholarCross Ref
Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. 2011. Human action recognition by learning bases of action attributes and parts. In Proceedings of the 2011 International Conference on Computer Vision. IEEE, 1331--1338. Google ScholarDigital Library
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515. Google ScholarDigital Library
Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarCross Ref
Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A. Smith. 2014. Learning word representations with hierarchical sparse coding. arXiv preprint arXiv:1406.2035 (2014).Google Scholar
Nivasan Yogeswaran, Wenting Dang, William Taube Navaraj, Dhayalan Shakthivel, Saleem Khan, Emre Ozan Polat, Shoubhik Gupta, Hadi Heidari, Mohsen Kaboli, Leandro Lorenzelli, and others. 2015. New materials and advances in making electronic skin for interactive robots. Adv. Robot. 29, 21 (2015), 1359--1373.Google ScholarCross Ref
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Ling. 2 (2014), 67--78.Google ScholarCross Ref
Haonan Yu, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2015b. A compositional framework for grounding language inference, generation, and acquisition in video. J. Artif. Intell. Res. (2015), 601--713. Google ScholarDigital Library
Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded language learning from video described with sentences. In ACL (1). 53--63.Google Scholar
Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2015c. Video paragraph captioning using hierarchical recurrent neural networks. arXiv preprint arXiv:1510.07712 (2015).Google Scholar
Licheng Yu, Eunbyung Park, Alexander C. Berg, and Tamara L. Berg. 2015a. Visual madlibs: Fill in the blank description generation and question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2461--2469. Google ScholarDigital Library
Xiaodong Yu, Cornelia Fermuller, Ching Lik Teo, Yezhou Yang, and Yiannis Aloimonos. 2011. Active scene recognition with vision and language. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 810--817. Google ScholarDigital Library
Konstantinos Zampogiannis, Yezhou Yang, Cornelia Fermuller, and Yiannis Aloimonos. 2015. Learning the spatial semantics of manipulation actions through preposition grounding. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1389--1396.Google ScholarCross Ref
John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence. 1050--1055. Google ScholarDigital Library
Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). Google ScholarDigital Library
Dengsheng Zhang, Md Monirul Islam, and Guojun Lu. 2012. A review on automatic image annotation techniques. Pattern Recogn. 45, 1 (2012), 346--362. Google ScholarDigital Library
Rong Zhao and William I. Grosky. 2002. Bridging the semantic gap in image retrieval. Distributed Multimedia Databases: Techniques and Applications (2002), 14--36. Google ScholarDigital Library
Wenyi Zhao, Rama Chellappa, P. Jonathon Phillips, and Azriel Rosenfeld. 2003. Face recognition: A literature survey. ACM Comput. Surv. 35, 4 (2003), 399--458. Google ScholarDigital Library
Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1529--1537. Google ScholarDigital Library
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarCross Ref
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015a. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. 19--27. Google ScholarDigital Library
Yuke Zhu, Ce Zhang, Christopher Ré, and Li Fei-Fei. 2015b. Building a large-scale multimodal knowledge base for visual question answering. arXiv preprint arXiv:1507.05670 (2015).Google Scholar

Index Terms

Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Robotics
2. Computing methodologies
  1. Artificial intelligence
  2. Machine learning

Recommendations

Learning distributed word representation with multi-contextual mixed embedding

Learning distributed word representations has been a popular method for various natural language processing applications such as word analogy and similarity, document classification and sentiment analysis. However, most existing word embedding models ...
Read More
Word2vec’s Distributed Word Representation for Hindi Word Sense Disambiguation
Distributed Computing and Internet Technology
Abstract
Word Sense Disambiguation (WSD) is the task of extracting an appropriate sense of an ambiguous word in a sentence. WSD is an essential task for language processing, as it is a pre-requisite for determining the closest interpretations of various ...
Read More
A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing

Since Sag et al. (2002) highlighted a key problem that had been underappreciated in the past in natural language processing (NLP), namely idiosyncratic multiword expressions (MWEs) such as idioms, quasi-idioms, cliches, quasi-cliches, institutionalized ...
Read More

Reviews

Reviewer: Epaminondas Kapetanios

Robot learning operates at the crossroads of disciplines such as machine learning, robotics engineering, and developmental robotics for lifelong learning. Robot skills can be divided into four categories: sensorimotor (locomotion, grasping); interactive (joint manipulation of an object); linguistic; and autonomous self-exploration or exploration through guidance from a human teacher. Therefore, robot learning can be closely related to subject areas such as adaptive control, for improving sensorimotor skills via dynamically adapting controllers; reinforcement learning, for understanding, taking actions, and planning; and developmental robotics, for more degrees of autonomous learning modalities such as those existent in human children, where lifelong learning is expected to be cumulative and of progressively increasing complexity. In this context, the paper addresses the areas of computer vision and natural language understanding, with emphasis on robot learning, exceptionally well. In particular, it provides an excellent starting point for someone to do research in this area. Also, from a teaching point of view, it provides an excellent reading list for postgraduate students. Although the paper is well written, it takes a long time to reach the point when it concentrates on computer vision and natural language understanding specifically for robots. Instead of concentrating on related work about the integration of computer vision and natural language for robots, the authors first take two separate long journeys into the semantics of natural language processing (NLP) and computer vision and/or image annotation. There is a huge body of related work about the semantics in these two separate areas, and there are many other survey papers. Therefore, the reader may feel a bit disappointed having gone through this survey, particularly if he or she is already familiar with aspects of semantic computing in NLP and computer vision. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Computing Surveys Volume 49, Issue 4
December 2017
666 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3022634
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611
Issue’s Table of Contents
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 December 2016
- Accepted: 1 October 2016
- Revised: 1 July 2016
- Received: 1 February 2016
Published in csur Volume 49, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Language and vision
computer vision
distributional semantics
image captioning
image embedding
imitation learning
lexical semantics
multimedia
natural language processing
robotics
semantic parsing
survey
symbol grounding
visual attribute
word embedding
word2vec
Qualifiers
- survey
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 32
  Total Citations
  View Citations
- 2,422
  Total Downloads
- Downloads (Last 12 months)266
- Downloads (Last 6 weeks)40
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics

ACM Computing Surveys

Abstract

References

Cited By

Index Terms

Recommendations

Learning distributed word representation with multi-contextual mixed embedding

Word2vec’s Distributed Word Representation for Hindi Word Sense Disambiguation

A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing

Reviews

Access critical reviews of Computing literature here