ABSTRACT
Multimodal dialogue systems are attracting increasing attention as they offer a more natural and informative way for human-computer interaction. As one of its core components, the belief tracker estimates the user's goal at each step of the dialogue and provides a direct way to validate the system's ability of dialogue understanding. However, existing studies on belief trackers are largely limited to the textual modality and cannot easily capture the rich semantics in multimodal systems such as those with product images. For example, in the fashion domain, the visual appearance of clothes plays a crucial role in understanding the user's intention. In this case, existing belief trackers may fail to generate accurate belief states for a multimodal dialogue system.
In this paper, we present the first neural multimodal belief tracker (NMBT) to demonstrate how multimodal evidence can facilitate semantic understanding and dialogue state tracking. Given the multimodal inputs, the model applies a textual encoder to represent textual utterances while giving special consideration to the semantics revealed in the visual modality. It learns concept-level fashion semantics by delving deep into image sub-regions and integrating concept probabilities via multiple instance learning. Then, at each turn, an adaptive attention mechanism learns to automatically emphasize different evidence sources of both visual and textual modalities for more accurate dialogue state prediction. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain, and the results show that our method achieves superior performance compared to a wide range of baselines.
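The two aggregation steps described above can be sketched in isolation. The snippet below is an illustrative simplification, not the authors' implementation: it assumes a noisy-OR pooling (a standard choice in multiple instance learning) to combine per-sub-region concept probabilities into an image-level concept probability, and a softmax-weighted fusion of evidence vectors as a stand-in for the adaptive attention mechanism. All function names and inputs are hypothetical.

```python
import math


def noisy_or(region_probs):
    """MIL-style pooling: the image exhibits a concept if at least one
    sub-region does, so the image-level probability is the complement of
    the product of per-region complements."""
    prod = 1.0
    for p in region_probs:
        prod *= (1.0 - p)
    return 1.0 - prod


def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(evidence_vectors, relevance_scores):
    """Simplified adaptive attention: weight each evidence source
    (e.g. visual concept features, textual utterance features) by a
    learned relevance score, then take the weighted sum."""
    weights = softmax(relevance_scores)
    dim = len(evidence_vectors[0])
    fused = [0.0] * dim
    for w, vec in zip(weights, evidence_vectors):
        for i, v in enumerate(vec):
            fused[i] += w * v
    return fused


# Example: three sub-regions weakly indicate the concept "floral";
# noisy-OR pooling yields a higher image-level probability.
image_prob = noisy_or([0.2, 0.5, 0.1])  # -> 0.64

# Fuse a visual and a textual evidence vector with equal relevance.
fused = attention_fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # -> [0.5, 0.5]
```

In a trained tracker, the relevance scores would be produced by a learned scoring network conditioned on the dialogue context rather than supplied by hand as here.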