
Computational models for integrating linguistic and visual information: A survey

Published in: Artificial Intelligence Review

Abstract

This paper surveys research on developing computational models for integrating linguistic and visual information. It begins with a discussion of systems that have actually been implemented and continues with computationally motivated theories of human cognition. Since existing research spans several disciplines (e.g., natural language understanding, computer vision, knowledge representation) as well as several application areas, an important contribution of this paper is to categorize existing research based on inputs and objectives. Finally, some key issues related to integrating information from two such diverse sources are outlined and related to existing research. Throughout, the central issue addressed is the correspondence problem: how to associate visual events with words and vice versa.




Cite this article

Srihari, R.K. Computational models for integrating linguistic and visual information: A survey. Artif Intell Rev 8, 349–369 (1994). https://doi.org/10.1007/BF00849725
