survey

Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics

Published: 12 December 2016
Abstract

Integrating computer vision and natural language processing is a novel interdisciplinary field that has recently received considerable attention. In this survey, we provide a comprehensive introduction to the integration of computer vision and natural language processing in multimedia and robotics applications, drawing on more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies for integrating computer vision and natural language processing models under the unified theme of distributional semantics. We draw an analogy between distributional semantics in computer vision and in natural language processing, realized as image embedding and word embedding, respectively. Finally, we present a unified view of the field and propose possible future directions.

References

  1. Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292 (2015).Google ScholarGoogle Scholar
  2. Eren Erdal Aksoy, Alexey Abramov, Johannes Dörr, Kejun Ning, Babette Dellen, and Florentin Wörgötter. 2011. Learning the semantics of object--action relations by observation. Int. J. Robot. Res. (2011), 0278364911410459. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Yiannis Aloimonos and Cornelia Fermüller. 2015. The cognitive dialogue: A new model for vision implementing common sense reasoning. Image Vis. Comput. 34 (2015), 42--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 1 (2014), 2773--2832. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. 2012a. A method of moments for mixture models and hidden Markov models. In COLT, Vol. 1. 4.Google ScholarGoogle Scholar
  6. Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, and Sham M. Kakade. 2012b. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems. 917--925. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Andrew J. Anderson, Elia Bruni, Ulisse Bordignon, Massimo Poesio, and Marco Baroni. 2013. Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In EMNLP. 1960--1970.Google ScholarGoogle Scholar
  8. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In NAACL.Google ScholarGoogle Scholar
  9. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 39--48.Google ScholarGoogle ScholarCross RefCross Ref
  10. Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychol. Rev. 116, 3 (2009), 463.Google ScholarGoogle ScholarCross RefCross Ref
  11. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425--2433. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. 2009. A survey of robot learning from demonstration. Robot. Auton. Syst. 57, 5 (2009), 469--483. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.Google ScholarGoogle Scholar
  14. Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics--Volume 1. Association for Computational Linguistics, 86--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Gökhan Bakir. 2007. Predicting structured data. MIT press, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2012. Abstract meaning representation (AMR) 1.0 specification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. ACL. 1533--1544.Google ScholarGoogle Scholar
  17. Albert Bandura. 1974. Psychological Modeling: Conflicting Theories. Transaction Publishers.Google ScholarGoogle Scholar
  18. Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, and others. 2012a. Video in sentences out. In UAI 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Andrei Barbu, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. 2012b. Simultaneous object detection, tracking, and event recognition. In ACS 2012.Google ScholarGoogle Scholar
  20. Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M. Blei, and Michael I. Jordan. 2003. Matching words and pictures. J. Mach. Learn. Res. 3 (2003), 1107--1135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Kobus Barnard and David Forsyth. 2001. Learning the semantics of words and pictures. In Proceedings of the 8th IEEE International Conference on Computer Vision, 2001 (ICCV 2001), Vol. 2. IEEE, 408--415.Google ScholarGoogle ScholarCross RefCross Ref
  22. Marco Baroni. 2016. Grounding distributional semantics in the visual world. Lang. Ling. Compass 10, 1 (2016), 3--13.Google ScholarGoogle ScholarCross RefCross Ref
  23. Francisco Barranco, Cornelia Fermüller, and Yiannis Aloimonos. 2014. Contour motion estimation for asynchronous event-driven cameras. Proc. IEEE 102, 10 (2014), 1537--1556.Google ScholarGoogle ScholarCross RefCross Ref
  24. Daniel Barrett, Andrei Barbu, N. Siddharth, and Jeffrey Siskind. 2016. Saying what you're looking for: Linguistics meets video search. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 10 (Oct. 2016).Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Jonathan Barron and Jitendra Malik. 2015. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell. 37, 8 (2015), 1670--1687.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-up robust features (SURF). Comput. Vis. Image Understand. 110, 3 (2008), 346--359. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In Computer Vision--ECCV 2006. Springer, 404--417. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Michael Beetz, Suat Gedikli, Jan Bandouch, Bernhard Kirchlechner, Nico von Hoyningen-Huene, and Alexander Perzylo. 2007. Visually tracking football games based on TV broadcasts. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19, 7 (1997), 711--720. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neur. Comput. 15, 6 (2003), 1373--1396. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. 2015. Representing meaning with a combination of logical form and vectors. arXiv preprint arXiv:1505.06816 (2015).Google ScholarGoogle Scholar
  32. Yoshua Bengio, Aaron Courville, and Pierre Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 8 (2013), 1798--1828. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3 (2003), 1137--1155. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Yoshua Bengio, Hugo Larochelle, Pascal Lamblin, Dan Popovici, Aaron Courville, Clarence Simard, Jerome Louradour, and Dumitru Erhan. 2007. Deep architectures for baby AI. (2007).Google ScholarGoogle Scholar
  35. A. Berg, J. Deng, and L. Fei-Fei. 2010. Large scale visual recognition challenge (ILSVRC), 2010. Retrieved from http://www. image-net.org/challenges/LSVRC (2010).Google ScholarGoogle Scholar
  36. Tamara Berg and Alexander C. Berg. 2009. Finding iconic images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009 (CVPR Workshops 2009). IEEE, 1--8.Google ScholarGoogle ScholarCross RefCross Ref
  37. Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee-Whye Teh, Erik Learned-Miller, and David A. Forsyth. 2004. Names and faces in the news. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), Vol. 2. IEEE, II--848. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. 2010. Automatic attribute discovery and characterization from noisy web data. In Computer Vision--ECCV 2010. Springer, 663--676. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Tamara L. Berg, David Forsyth, and others. 2006. Animals on the web. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1463--1470. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. 55 (2016), 409--442. Google ScholarGoogle ScholarCross RefCross Ref
  41. David M. Blei and Michael I. Jordan. 2003. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM, 127--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022. Google ScholarGoogle ScholarCross RefCross Ref
  43. Benjamin S. Bloom and others. 1956. Taxonomy of educational objectives. Vol. 1: Cognitive domain. McKay, New York, NY (1956), 20--24.Google ScholarGoogle Scholar
  44. Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. 2005. Three-dimensional face recognition. Int. J. Comput. Vis. 64, 1 (2005), 5--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 136--145. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res. 49 (2014), 1--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Donna Byron, Alexander Koller, Jon Oberlander, Laura Stoia, and Kristina Striegnitz. 2007. Generating instructions in virtual environments (GIVE): A challenge and an evaluation testbed for NLG. (2007).Google ScholarGoogle Scholar
  48. Angelo Cangelosi. 2006. The grounding and sharing of symbols. Pragm. Cogn. 14, 2 (2006), 275--285.Google ScholarGoogle ScholarCross RefCross Ref
  49. Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, Vol. 5. 3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Marisa Carrasco. 2011. Visual attention: The past 25 years. Vis. Res. 51, 13 (2011), 1484--1525.Google ScholarGoogle ScholarCross RefCross Ref
  51. Joao Carreira and Cristian Sminchisescu. 2010. Constrained parametric min-cuts for automatic object segmentation. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3241--3248.Google ScholarGoogle ScholarCross RefCross Ref
  52. Angel X. Chang, Manolis Savva, and Christopher D. Manning. 2014. Semantic parsing for text to 3d scene generation. ACL 2014 (2014), 17.Google ScholarGoogle Scholar
  53. Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. 2015. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision. 1017--1025. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.Google ScholarGoogle ScholarCross RefCross Ref
  55. Anthony Chemero. 2003. An outline of a theory of affordances. Ecological Psychology 15, 2 (2003), 181--195.Google ScholarGoogle ScholarCross RefCross Ref
  56. David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, 190--200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. ACM, 128--135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015a. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).Google ScholarGoogle Scholar
  59. Xinlei Chen, Ashish Shrivastava, and Arpan Gupta. 2013. Neil: Extracting visual knowledge from web data. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 1409--1416. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015b. Revisiting word embedding for contrasting meaning. In Proceedings of ACL.Google ScholarGoogle ScholarCross RefCross Ref
  61. Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder--decoder approaches. Syntax Sem. Struct. Stat. Transl. (2014), 103.Google ScholarGoogle Scholar
  62. Myung Jin Choi, Antonio Torralba, and Alan S. Willsky. 2012. Context models and out-of-context objects. Pattern Recogn. Lett. 33, 7 (2012), 853--862. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. NUS-WIDE: A real-world web image database from national university of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Stephen Clark and Stephen Pulman. 2007. Combining symbolic and distributional models of meaning. In AAAI Spring Symposium: Quantum Interaction. 52--55.Google ScholarGoogle Scholar
  65. Michael D. Cohen and Paul Bacdayan. 1994. Organizational routines are stored as procedural memory: Evidence from a laboratory study. Organiz. Sci. 5, 4 (1994), 554--568. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Nadav Cohen, Or Sharir, and Amnon Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory. 698--728.Google ScholarGoogle Scholar
  67. Silvia Coradeschi, Amy Loutfi, and Britta Wrede. 2013. A short review of symbol grounding in robotic and intelligent systems. KI-Künstliche Intell. 27, 2 (2013), 129--136.Google ScholarGoogle ScholarCross RefCross Ref
  68. Silvia Coradeschi and Alessandro Saffiotti. 2000. Anchoring symbols to sensor data: Preliminary report. In AAAI/IAAI. 129--135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Nelson Cowan. 2008. What are the differences between long-term, short-term, and working memory? Progr. Brain Res. 169 (2008), 323--338.Google ScholarGoogle ScholarCross RefCross Ref
  70. Trevor Darrell. 2010. Learning Representations for Real-world Recognition. Retrieved from http://www.eecs.berkeley.edu/∼trevor/colloq.pdf UCB EECS Colloquium {Accessed: 2015 11 1}.Google ScholarGoogle Scholar
  71. Pradipto Das, Chenliang Xu, Richard Doell, and Jason Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2634--2641. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Hal Daumé III. 2007. Frustratingly easy domain adaptation. ACL 2007 (2007), 256.Google ScholarGoogle Scholar
  73. Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Mach. Learn. 75, 3 (2009), 297--325. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269--1277. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi, Yejin Choi, Hal Daumé III, Alexander C. Berg, and others. 2012. Detecting visual text. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 762--772. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634.Google ScholarGoogle ScholarCross RefCross Ref
  77. Alexey Dosovitskiy, Philipp Fischery, Eddy Ilg, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox, and others. 2015. Flownet: Learning optical flow with convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2758--2766. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Susan T. Dumais. 2007. LSA and information retrieval: Getting back to basics. Handb. Latent Semant. Anal. (2007), 293--321.Google ScholarGoogle Scholar
  79. Hugh Durrant-Whyte and Tim Bailey. 2006. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 13, 2 (2006), 99--110.Google ScholarGoogle ScholarCross RefCross Ref
  80. Pinar Duygulu, Kobus Barnard, Joao F. G. de Freitas, and David A. Forsyth. 2002. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Computer Vision ECCV 2002. Springer, 97--112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Aleksandrs Ecins, Cornelia Fermuller, and Yiannis Aloimonos. 2014. Shadow free segmentation in still images using local density measure. In Proceedings of the 2014 IEEE International Conference on Computational Photography (ICCP). IEEE, 1--8.Google ScholarGoogle ScholarCross RefCross Ref
  82. Aleksandrs Ecins, Cornelia Fermuller, and Yiannis Aloimonos. 2016. Cluttered scene segmentation using the symmetry constraint. In Proceedings of the International Conference in Robotics and Automation (ICRA).Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. H. Eichenbaum. 2008. Memory. Scholarpedia 3, 3 (2008), 1747.Google ScholarGoogle ScholarCross RefCross Ref
  84. Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In EMNLP. 1292--1302.Google ScholarGoogle Scholar
  85. Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, Vol. 452. 457.Google ScholarGoogle ScholarCross RefCross Ref
  86. Oren Etzioni, Michele Banko, and Michael J. Cafarella. 2006. Machine reading. In AAAI, Vol. 6. 1517--1519. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303--338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Rui Fang, Changsong Liu, Lanbo She, and Joyce Y. Chai. 2013. Towards situated dialogue: Revisiting referring expression generation. In EMNLP. 392--402.Google ScholarGoogle Scholar
  89. Ali Farhadi. 2011. Designing Representational Architectures in Recognition. University of Illinois at Urbana-Champaign. Champaign, IL, USA.Google ScholarGoogle Scholar
  90. Ali Farhadi, Ian Endres, and Derek Hoiem. 2010. Attribute-centric recognition for cross-category generalization. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2352--2359.Google ScholarGoogle ScholarCross RefCross Ref
  91. Alireza Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009). IEEE, 1778--1785.Google ScholarGoogle ScholarCross RefCross Ref
  92. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision--ECCV 2010. Springer, 15--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166 (2014).Google ScholarGoogle Scholar
  94. Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 (2015).Google ScholarGoogle Scholar
  95. Li Fei-Fei, Rob Fergus, and Pietro Perona. 2007. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106, 1 (2007), 59--70. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. S. L. Feng, Raghavan Manmatha, and Victor Lavrenko. 2004. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004 (CVPR’04)., Vol. 2. IEEE, II--1002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, and Margaret Mitchell. 2015. A survey of current datasets for vision and language research. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 207--213.Google ScholarGoogle ScholarCross RefCross Ref
  98. Ronald A. Fisher. 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 2 (1936), 179--188.Google ScholarGoogle ScholarCross RefCross Ref
  99. Daryl Fougnie. 2008. The relationship between attention and working memory. New Res. Short-term Mem. (2008), 1--45.Google ScholarGoogle Scholar
  100. D. F. Fouhey, A. Gupta, and A. Zisserman. 2016. 3D shape attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarGoogle Scholar
  101. Jianlong Fu, Jinqiao Wang, Xin-Jing Wang, Yong Rui, and Hanqing Lu. 2015. What visual attributes characterize an object class? In Computer Vision--ACCV 2014. Springer, 243--259.Google ScholarGoogle Scholar
  102. Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems. 2296--2304. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. D. Garcia-Gasulla, J. Béjar, U. Cortés, E. Ayguadé, and J. Labarta. 2015. Extracting visual patterns from deep learning representations. arXiv preprint arXiv:1507.08818 (2015).Google ScholarGoogle Scholar
  104. Peter Gärdenfors. 2014. The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press.Google ScholarGoogle ScholarCross RefCross Ref
  105. Konstantina Garoufi. 2014. Planning-based models of natural language generation. Lang. Ling. Compass 8, 1 (2014), 1--10.Google ScholarGoogle ScholarCross RefCross Ref
  106. Konstantina Garoufi and Alexander Koller. 2010. Automated planning for situated natural language generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1573--1582. Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Konstantina Garoufi, Maria Staudte, Alexander Koller, and Matthew W. Crocker. 2016. Exploiting listener gaze to improve situated communication in dynamic virtual environments. Cognitive Science 40, 7 (2016), 1671--1703.Google ScholarGoogle ScholarCross RefCross Ref
  108. Dan Garrette, Katrin Erk, and Raymond Mooney. 2014. A formal approach to linking logical form and vector-space lexical semantics. In Computing Meaning. Springer, 27--48.Google ScholarGoogle Scholar
  109. Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 580--587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis. 106, 2 (2014), 210--233. Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Kristen Grauman and Bastian Leibe. 2010. Visual Object Recognition. Number 11. Morgan 8 Claypool Publishers. Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Douglas Greenlee. 1978. Semiotic and significs. Int. Stud. Philos. 10 (1978), 251--254.Google ScholarGoogle ScholarCross RefCross Ref
  114. Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Sarad Venugopalan, Randy Mooney, Trevor Darrell, and Kate Saenko. 2013. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 2712--2719. Google ScholarGoogle ScholarDigital LibraryDigital Library
  115. Gutemberg Guerra-Filho and Yiannis Aloimonos. 2007. A language for human action. Computer 40, 5 (2007), 42--51. Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. Abhinav Gupta. 2009. Beyond nouns and verbs. (2009).Google ScholarGoogle Scholar
  117. Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2015. Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. Int. J. Comput. Vis. 112, 2 (2015), 133--149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  118. Saurabh Gupta and Jitendra Malik. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015).Google ScholarGoogle Scholar
  119. Xintong Han, Bharat Singh, Vlad I. Morariu, and Larry S. Davis. 2015. Fast automatic video retrieval using web images. arXiv preprint arXiv:1512.03384 (2015).Google ScholarGoogle Scholar
  120. Emily M. Hand and Rama Chellappa. 2016. Attributes for improved attributes: A multi-task network for attribute classification. arXiv preprint arXiv:1604.07360 (2016).Google ScholarGoogle Scholar
  121. Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2014. Simultaneous detection and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV).Google ScholarGoogle ScholarCross RefCross Ref
  122. Stevan Harnad. 1990. The symbol grounding problem. Physica D 42, 1 (1990), 335--346. Google ScholarGoogle ScholarDigital LibraryDigital Library
  123. Zellig S. Harris. 1954. Distributional structure. Word 10, 2--3 (1954), 146--162.Google ScholarGoogle ScholarCross RefCross Ref
  124. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.Google ScholarGoogle Scholar
  125. Geremy Heitz and Daphne Koller. 2008. Learning spatial context: Using stuff to find things. In Computer Vision--ECCV 2008. Springer, 30--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Geoffrey E. Hinton. 1984. Distributed representations. Technical Report: Carnegie Melon University.Google ScholarGoogle Scholar
  127. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neur. Comput. 9, 8 (1997), 1735--1780. Google ScholarGoogle ScholarDigital LibraryDigital Library
  128. Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. (2013), 853--899. Google ScholarGoogle ScholarDigital LibraryDigital Library
  129. Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 50--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  130. Bernhard Hommel, Jochen Müsseler, Gisa Aschersleben, and Wolfgang Prinzb. 2001. The theory of event coding (TEC): A framework for perception and action planning. Behav. Brain Sci. 24 (2001), 849--937.Google ScholarGoogle ScholarCross RefCross Ref
  131. Thanarat Horprasert, David Harwood, and Larry S. Davis. 1999. A statistical approach for real-time robust background subtraction and shadow detection. In IEEE ICCV, Vol. 99. 1--19.Google ScholarGoogle Scholar
  132. Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49. University of Massachusetts, Amherst.
  133. Mark J. Huiskes and Michael S. Lew. 2008. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval. ACM, 39--43.
  134. Julian Jaynes. 2000. The Origin of Consciousness in the Breakdown of the Bicameral Mind. Houghton Mifflin Harcourt.
  135. Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. 2003. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 119--126.
  136. Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  137. Benjamin Johnston, Fangkai Yang, Rogan Mendoza, Xiaoping Chen, and Mary-Anne Williams. 2008. Ontology based object categorization for robots. In Practical Aspects of Knowledge Management. Springer, 219--231.
  138. Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2015. Learning visual features from large weakly supervised data. arXiv preprint arXiv:1511.02251 (2015).
  139. Alap Karapurkar. 2008. Modeling human activities. Scholarly Paper Archive, Department of Computer Science, University of Maryland, College Park, MD, 20742.
  140. Andrej Karpathy and Li Fei-Fei. 2015a. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128--3137.
  141. Andrej Karpathy and Li Fei-Fei. 2015b. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  142. Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3276--3284.
  143. Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 2 (2002), 171--184.
  144. Alexander Koller and Matthew Stone. 2007. Sentence generation as a planning problem. ACL 2007 (2007), 336.
  145. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2016), 45.
  146. Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. NAACL HLT 2013 (2013), 10.
  147. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105.
  148. German Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving boolean structures from distributional vectors. Trans. Assoc. Comput. Ling. 3 (2015), 375--388.
  149. Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 12 (2013), 2891--2903.
  150. Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.
  151. Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision. IEEE, 365--372.
  152. Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. 2012. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 359--368.
  153. Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. 2011. A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1817--1824.
  154. Kevin Lai and Dieter Fox. 2010. Object recognition in 3D point clouds using web data and domain adaptation. Int. J. Robot. Res. 29, 8 (2010), 1019--1037.
  155. Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009). IEEE, 951--958.
  156. Victor Lavrenko, R. Manmatha, and Jiwoon Jeon. 2003. A model for learning the semantics of pictures. In Advances in Neural Information Processing Systems.
  157. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 2169--2178.
  158. Dieu-Thu Le, Jasper Uijlings, and Raffaella Bernardi. 2014. Tuhoi: Trento universal human object interaction dataset. V&L Net 2014 (2014), 17.
  159. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278--2324.
  160. Chee Wee Leong and Rada Mihalcea. 2011. Going beyond text: A hybrid image-text approach for measuring word relatedness. In IJCNLP. 1403--1407.
  161. Stephen C. Levinson. 2001. Pragmatics. In International Encyclopedia of Social and Behavioral Sciences: Vol. 17. Pergamon, 11948--11954.
  162. Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 2. 302--308.
  163. Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems. 2177--2185.
  164. Li-Jia Li and Li Fei-Fei. 2007. What, where and who? Classifying events by scene and object recognition. In Proceedings of the IEEE 11th International Conference on Computer Vision. IEEE, 1--8.
  165. Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 220--228.
  166. Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G. M. Snoek, and Alberto Del Bimbo. 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput. Surv. 49, 1 (2016), 14.
  167. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Vol. 8.
  168. Changsong Liu and Joyce Yue Chai. 2015. Learning to mediate perceptual differences in situated human-robot dialogue. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. AAAI Press, 2288--2294.
  169. Changsong Liu, Lanbo She, Rui Fang, and Joyce Y. Chai. 2014a. Probabilistic labeling for efficient referential grounding based on collaborative discourse. In ACL (2). 13--18.
  170. Jingen Liu, Benjamin Kuipers, and Silvio Savarese. 2011. Recognizing human actions by attributes. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3337--3344.
  171. Tie-Yan Liu. 2009. Learning to rank for information retrieval. Found. Trends Inform. Retriev. 3, 3 (2009), 225--331.
  172. Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. 2014b. Single-view 3d scene parsing by attributed grammar. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 684--691.
  173. Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431--3440.
  174. David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 2 (2004), 91--110.
  175. James MacQueen and others. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. 281--297.
  176. Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar. 2008. A new baseline for image annotation. In Computer Vision--ECCV 2008. Springer, 316--329.
  177. Alexis Maldonado, Humberto Alvarez, and Michael Beetz. 2012. Improving robot manipulation through fingertip perception. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2947--2954.
  178. Jitendra Malik, Pablo Arbeláez, João Carreira, Katerina Fragkiadaki, Ross Girshick, Georgia Gkioxari, Saurabh Gupta, Bharath Hariharan, Abhishek Kar, and Shubham Tulsiani. 2016. The three Rs of computer vision: Recognition, reconstruction and reorganization. Pattern Recogn. Lett. 72 (2016), 4--14.
  179. Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision. 1--9.
  180. Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What’s cookin’? Interpreting cooking videos using text, speech and vision. In NAACL 2015.
  181. Matthew Marge, Claire Bonial, Brendan Byrne, Taylor Cassidy, A. William Evans, Susan G. Hill, and Clare Voss. 2016. Applying the wizard-of-oz technique to multimodal human-robot dialogue. In Proceedings of RO-MAN (To appear).
  182. David Marr. 1982. Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY.
  183. David R. Martin, Charless C. Fowlkes, and Jitendra Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26, 5 (2004), 530--549.
  184. Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 2012 International Conference on Machine Learning. Edinburgh, Scotland.
  185. Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. 2013. Learning to parse natural language commands to a robot control system. In Experimental Robotics. Springer, 403--415.
  186. Nikolaos Mavridis. 2015. A review of verbal and non-verbal human--robot interactive communication. Robot. Auton. Syst. 63 (2015), 22--35.
  187. Nikolaos Mavridis and Deb Roy. 2006. Grounded situation models for robots: Where words and percepts meet. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4690--4697.
  188. Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015a. Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785--794.
  189. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015b. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43--52.
  190. Jon D. Mcauliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems. 121--128.
  191. Brian McMahan and Matthew Stone. 2015. A Bayesian model of grounded color semantics. Trans. Assoc. Comput. Ling. 3 (2015), 103--115.
  192. Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behav. Res. Methods 37, 4 (2005), 547--559.
  193. Chet Meyers and Thomas B. Jones. 1993. Promoting Active Learning. Strategies for the College Classroom. ERIC.
  194. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111--3119.
  195. George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 39--41.
  196. Marvin Minsky. 2006. The emotion machine. New York: Pantheon (2006).
  197. Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 747--756.
  198. Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Comput. Ling. 39, 3 (2013), 555--590.
  199. Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI. 1598--1601.
  200. Raymond J. Mooney. 2013. Grounded Language Learning. (July 2013). 27th AAAI Conference on Artificial Intelligence, Washington, 2013. Retrieved November 2, 2015 from http://videolectures.net/aaai2013_mooney_language_learning/.
  201. Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. 1999. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management. Citeseer, 1--9.
  202. Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics. Citeseer, 246--252.
  203. Charles William Morris. 1938. Foundations of the theory of signs. (1938).
  204. Venkatesh N. Murthy, Subhransu Maji, and R. Manmatha. 2015. Automatic image annotation using deep learning representations. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 603--606.
  205. Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Affordance detection of tool parts from geometric features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
  206. Douglas L. Nelson, Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behav. Res. Methods Instrum. Comput. 36, 3 (2004), 402--407.
  207. Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. 2015. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS).
  208. Aude Oliva and Antonio Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 3 (2001), 145--175.
  209. Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2013. From large scale image categorization to entry-level categories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  210. Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems. 1143--1151.
  211. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311--318.
  212. Devi Parikh. 2009. Modeling context for image understanding: When, for what, and how? (2009).
  213. Devi Parikh and Kristen Grauman. 2011. Relative attributes. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 503--510.
  214. Seyoung Park, Bruce Xiaohan Nie, and Song-Chun Zhu. 2016. Attribute and-or grammar for joint parsing of human attributes, part and pose. arXiv preprint arXiv:1605.02112 (2016).
  215. Katerina Pastra and Yiannis Aloimonos. 2012. The minimalist grammar of action. Philos. Trans. Roy. Soc. B: Biol. Sci. 367, 1585 (2012), 103--117.
  216. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12 (2014), 1532--1543.
  217. Jean Piaget. 2013. Play, Dreams and Imitation in Childhood. Vol. 25. Routledge.
  218. Tony Plate. 1997. A common framework for distributed representation schemes for compositional structure. Connectionist Systems for Knowledge Representation and Deduction (1997), 15--34.
  219. Robert Pless and Richard Souvenir. 2009. A survey of manifold learning for images. IPSJ Trans. Comput. Vis. Appl. 1 (2009), 83--94.
  220. J. Pont-Tuset, P. Arbelaez, J. Barron, F. Marques, and J. Malik. 2016. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  221. Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 1--10.
  222. Cecilia Quiroga-Clare. 2003. Language ambiguity: A curse and a blessing. Transl. J. 7, 1 (2003).
  223. Gabriel A. Radvansky and Jeffrey M. Zacks. 2014. Event Cognition. Oxford University Press.
  224. Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems. 2953--2961.
  225. Giacomo Rizzolatti and Laila Craighero. 2004. The mirror-neuron system. Annu. Rev. Neurosci. 27 (2004), 169--192.
  226. Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. 2014. Coherent multi-sentence video description with variable level of detail. In Pattern Recognition. Springer, 184--195.
  227. Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3202--3212.
  228. Stephen Roller and Sabine Schulte im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1146--1157.
  229. Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the ACL.
  230. Sam T. Roweis and Lawrence K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 5500 (2000), 2323--2326.
  231. Deb Roy. 2005. Grounding words in perception and action: Computational insights. Trends Cogn. Sci. 9, 8 (2005), 390.
  232. Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 77, 1--3 (2008), 157--173.
  233. Fereshteh Sadeghi, C. Lawrence Zitnick, and Ali Farhadi. 2015. VISALOGY: Answering visual analogy questions. In Advances in Neural Information Processing Systems (NIPS-15).
  234. Mohammad Amin Sadeghi and Ali Farhadi. 2011. Recognition using visual phrases. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1745--1752.
  235. Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon (January 1, 2005). Dissertations available from ProQuest. Paper AAI3179808. http://repository.upenn.edu/dissertations/AAI3179808.
  236. Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. CoNLL 2015 (2015), 258.
  237. Nishant Shukla, Caiming Xiong, and Song-Chun Zhu. 2015. A unified framework for human-robot knowledge transfer. In Proceedings of the 2015 AAAI Fall Symposium Series.
  238. Narayanaswamy Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2014. Seeing what you’re told: Sentence-guided activity recognition in video. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 732--739.
  239. Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In ACL (1). 572--582.
  240. Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  241. Bharat Singh, Xintong Han, Zhe Wu, Vlad I. Morariu, and Larry S. Davis. 2015. Selecting relevant web trained concepts for automated event retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 4561--4569.
  242. Jeffrey Mark Siskind. 2001. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J. Artif. Intell. Res. 15 (2001), 31--90.
  243. Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling. 2 (2014), 207--218.
  244. Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 129--136.
  245. Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15 (2014), 2949--2980.
  246. Mark Steedman. 1996. Surface structure and interpretation. (1996).
  247. Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, and others. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems. 2440--2448.
  248. Douglas Summers-Stay, Ching L. Teo, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2012. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4104--4111.
  249. Douglas Alan Summers-Stay. 2013. Productive vision: Methods for automatic image comprehension. (2013).
  250. Yuyin Sun, Liefeng Bo, and Dieter Fox. 2013. Attribute based object identification. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2096--2103.
  251. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1--9.
  252. Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.
  253. Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1701--1708.
  254. Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. 2015. Book2movie: Aligning video scenes with book chapters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1827--1835.
  255. Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  256. Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 896--903.
  257. Stefanie Tellex, Ross Knepper, Adrian Li, Daniela Rus, and Nicholas Roy. 2014. Asking for help using inverse semantics. Proceedings of Robotics: Science and Systems, Berkeley, USA (2014).
  258. Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.
  259. Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 5500 (2000), 2319--2323.
  260. Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Fast 2D border ownership assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5117--5125.
  261. Ching L. Teo, Yezhou Yang, Hal Daumé III, Cornelia Fermüller, and Yiannis Aloimonos. 2012. Towards a Watson that sees: Language-guided action recognition for robots. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 374--381.
  262. Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING).
  263. Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  264. Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic Robotics. MIT Press.Google ScholarGoogle Scholar
  265. Joseph Tighe and Svetlana Lazebnik. 2010. Superparsing: Scalable nonparametric image parsing with superpixels. In European Conference on Computer Vision. Springer, 352--365. Google ScholarGoogle ScholarDigital LibraryDigital Library
  266. Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070 (2015).Google ScholarGoogle Scholar
  267. Antonio Torralba, Alexei Efros, and others. 2011. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1521--1528. Google ScholarGoogle ScholarDigital LibraryDigital Library
  268. Anne-Marie Tousch, Stéphane Herbin, and Jean-Yves Audibert. 2012. Semantic hierarchies for image annotation: A survey. Pattern Recogn. 45, 1 (2012), 333--345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  269. Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. 2005. Image parsing: Unifying segmentation, detection, and recognition. Int. J. Comput. Vis. 63, 2 (2005), 113--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  270. Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 384--394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  271. Matthew Turk and Alex Pentland. 1991. Eigenfaces for recognition. J. Cogn. Neurosci. 3, 1 (1991), 71--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  272. Jasper R. R. Uijlings and Vittorio Ferrari. 2015. Situational object boundary detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4712--4721.Google ScholarGoogle Scholar
  273. Laurens J. P. van der Maaten, Eric O. Postma, and H. Jaap van den Herik. 2009. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 10, 1--41 (2009), 66--71.Google ScholarGoogle Scholar
  274. Bernard Vauquois. 1968. Structures profondes et traduction automatique. Le système du CETA. Rev. Roum. Ling. 13, 2 (1968), 105--130.Google ScholarGoogle Scholar
  275. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566--4575.Google ScholarGoogle ScholarCross RefCross Ref
  276. Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534--4542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  277. Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT.Google ScholarGoogle Scholar
  278. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164.Google ScholarGoogle ScholarCross RefCross Ref
  279. Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 319--326. Google ScholarGoogle ScholarDigital LibraryDigital Library
  280. Matthew R. Walter, Matthew E. Antone, Ekapol Chuangsuwanich, Andrew Correa, Randall Davis, Luke Fletcher, Emilio Frazzoli, Yuli Friedman, James R. Glass, Jonathan P. How, Jeong Hwan Jeon, Sertac Karaman, Brandon Luders, Nicholas Roy, Stefanie Tellex, and Seth J. Teller. 2015. A situationally aware voice-commandable robotic forklift working alongside people in unstructured outdoor environments. J. Field Robot. 32, 4 (2015), 590--628. Google ScholarGoogle ScholarDigital LibraryDigital Library
  281. Chong Wang, David Blei, and Fei-Fei Li. 2009. Simultaneous image classification and annotation. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 1903--1910.Google ScholarGoogle Scholar
  282. Meng Wang, Bingbing Ni, Xian-Sheng Hua, and Tat-Seng Chua. 2012. Assistive tagging: A survey of multimedia tagging with human-computer joint exploration. ACM Comput. Surv. 44, 4 (2012), 25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  283. Ronald J. Williams. 1988. On the use of backpropagation in associative reinforcement learning. In Proceedings of the IEEE International Conference on Neural Networks, 1988. IEEE, 263--270.Google ScholarGoogle ScholarCross RefCross Ref
  284. Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2015b. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle Scholar
  285. Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015a. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1912--1920.Google ScholarGoogle Scholar
  286. Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In ICML.Google ScholarGoogle Scholar
  287. Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. 2015b. A multi-scale multiple instance video description network. arXiv preprint arXiv:1505.05914 (2015).Google ScholarGoogle Scholar
  288. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015a. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning. 2048--2057.Google ScholarGoogle Scholar
  289. Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2015c. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle ScholarCross RefCross Ref
  290. Yezhou Yang, Cornelia Fermuller, and Yiannis Aloimonos. 2013. Detection of manipulation action consequences (MAC). In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2563--2570. Google ScholarGoogle ScholarDigital LibraryDigital Library
  291. Yezhou Yang, Cornelia Fermuller, Yiannis Aloimonos, and Eren Erdal Aksoy. 2015. Learning the semantics of manipulation action. The 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Vol. 1. Association for Computational Linguistics, 676--686.Google ScholarGoogle ScholarCross RefCross Ref
  292. Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. 2015a. Neural self talk: Image understanding via continuous questioning and answering. arXiv preprint arXiv:1512.03460 (2015).Google ScholarGoogle Scholar
  293. Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. 2015b. Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI-15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  294. Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 444--454. Google ScholarGoogle ScholarDigital LibraryDigital Library
  295. Yezhou Yang, Ching L. Teo, Cornelia Fermuller, and Yiannis Aloimonos. 2013. Robots with language: Multi-label visual recognition using NLP. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 4256--4262.Google ScholarGoogle ScholarCross RefCross Ref
  296. Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. 2011. Human action recognition by learning bases of action attributes and parts. In Proceedings of the 2011 International Conference on Computer Vision. IEEE, 1331--1338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  297. Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515. Google ScholarGoogle ScholarDigital LibraryDigital Library
  298. Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarGoogle ScholarCross RefCross Ref
  299. Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Google ScholarGoogle ScholarCross RefCross Ref
  300. Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A. Smith. 2014. Learning word representations with hierarchical sparse coding. arXiv preprint arXiv:1406.2035 (2014).Google ScholarGoogle Scholar
  301. Nivasan Yogeswaran, Wenting Dang, William Taube Navaraj, Dhayalan Shakthivel, Saleem Khan, Emre Ozan Polat, Shoubhik Gupta, Hadi Heidari, Mohsen Kaboli, Leandro Lorenzelli, and others. 2015. New materials and advances in making electronic skin for interactive robots. Adv. Robot. 29, 21 (2015), 1359--1373.Google ScholarGoogle ScholarCross RefCross Ref
  302. Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Ling. 2 (2014), 67--78.Google ScholarGoogle ScholarCross RefCross Ref
  303. Haonan Yu, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2015b. A compositional framework for grounding language inference, generation, and acquisition in video. J. Artif. Intell. Res. (2015), 601--713. Google ScholarGoogle ScholarDigital LibraryDigital Library
  304. Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded language learning from video described with sentences. In ACL (1). 53--63.Google ScholarGoogle Scholar
  305. Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2015c. Video paragraph captioning using hierarchical recurrent neural networks. arXiv preprint arXiv:1510.07712 (2015).Google ScholarGoogle Scholar
  306. Licheng Yu, Eunbyung Park, Alexander C. Berg, and Tamara L. Berg. 2015a. Visual madlibs: Fill in the blank description generation and question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2461--2469. Google ScholarGoogle ScholarDigital LibraryDigital Library
  307. Xiaodong Yu, Cornelia Fermuller, Ching Lik Teo, Yezhou Yang, and Yiannis Aloimonos. 2011. Active scene recognition with vision and language. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 810--817. Google ScholarGoogle ScholarDigital LibraryDigital Library
  308. Konstantinos Zampogiannis, Yezhou Yang, Cornelia Fermuller, and Yiannis Aloimonos. 2015. Learning the spatial semantics of manipulation actions through preposition grounding. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1389--1396.Google ScholarGoogle ScholarCross RefCross Ref
  309. John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence. 1050--1055. Google ScholarGoogle ScholarDigital LibraryDigital Library
  310. Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  311. Dengsheng Zhang, Md Monirul Islam, and Guojun Lu. 2012. A review on automatic image annotation techniques. Pattern Recogn. 45, 1 (2012), 346--362. Google ScholarGoogle ScholarDigital LibraryDigital Library
  312. Rong Zhao and William I. Grosky. 2002. Bridging the semantic gap in image retrieval. Distributed Multimedia Databases: Techniques and Applications (2002), 14--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  313. Wenyi Zhao, Rama Chellappa, P. Jonathon Phillips, and Azriel Rosenfeld. 2003. Face recognition: A literature survey. ACM Comput. Surv. 35, 4 (2003), 399--458. Google ScholarGoogle ScholarDigital LibraryDigital Library
  314. Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1529--1537. Google ScholarGoogle ScholarDigital LibraryDigital Library
  315. Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarGoogle ScholarCross RefCross Ref
  316. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015a. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. 19--27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  317. Yuke Zhu, Ce Zhang, Christopher Ré, and Li Fei-Fei. 2015b. Building a large-scale multimodal knowledge base for visual question answering. arXiv preprint arXiv:1507.05670 (2015).Google ScholarGoogle Scholar

                Reviews

                Epaminondas Kapetanios

                Robot learning operates at the crossroads of disciplines such as machine learning, robotics engineering, and developmental robotics for lifelong learning. Robot skills can be divided into four categories: sensorimotor (locomotion, grasping); interactive (joint manipulation of an object); linguistic; and autonomous self-exploration or exploration through guidance from a human teacher. Therefore, robot learning can be closely related to subject areas such as adaptive control, for improving sensorimotor skills via dynamically adapting controllers; reinforcement learning, for understanding, taking actions, and planning; and developmental robotics, for more degrees of autonomous learning modalities such as those existent in human children, where lifelong learning is expected to be cumulative and of progressively increasing complexity.

                In this context, the paper addresses the areas of computer vision and natural language understanding, with emphasis on robot learning, exceptionally well. In particular, it provides an excellent starting point for someone to do research in this area. Also, from a teaching point of view, it provides an excellent reading list for postgraduate students.

                Although the paper is well written, it takes a long time to reach the point when it concentrates on computer vision and natural language understanding specifically for robots. Instead of concentrating on related work about the integration of computer vision and natural language for robots, the authors first take two separate long journeys into the semantics of natural language processing (NLP) and computer vision and/or image annotation. There is a huge body of related work about the semantics in these two separate areas, and there are many other survey papers. Therefore, the reader may feel a bit disappointed having gone through this survey, particularly if he or she is already familiar with aspects of semantic computing in NLP and computer vision.
Online Computing Reviews Service
