
Computational models for integrating linguistic and visual information: A survey

Published in: Artificial Intelligence Review

Abstract

This paper surveys research on developing computational models for integrating linguistic and visual information. It begins with a discussion of systems that have actually been implemented and continues with computationally motivated theories of human cognition. Since existing research spans several disciplines (e.g., natural language understanding, computer vision, knowledge representation) as well as several application areas, an important contribution of this paper is to categorize existing research based on inputs and objectives. Finally, some key issues related to integrating information from two such diverse sources are outlined and related to existing research. Throughout, the central issue addressed is the correspondence problem: how to associate visual events with words and vice versa.




Cite this article

Srihari, R.K. Computational models for integrating linguistic and visual information: A survey. Artif Intell Rev 8, 349–369 (1994). https://doi.org/10.1007/BF00849725
