DOI: 10.1145/3308558.3313598
Research article, WWW '19 Conference Proceedings

Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems

Published: 13 May 2019

ABSTRACT

Multimodal dialogue systems are attracting increasing attention because they offer a more natural and informative way for human-computer interaction. As one of their core components, the belief tracker estimates the user's goal at each step of the dialogue and provides a direct way to validate a system's dialogue understanding. However, existing studies on belief trackers are largely limited to the textual modality and cannot easily be extended to capture the rich semantics in multimodal systems, such as those with product images. For example, in the fashion domain, the visual appearance of clothes plays a crucial role in understanding the user's intention. In such cases, existing belief trackers may fail to generate accurate belief states for a multimodal dialogue system.

In this paper, we present the first neural multimodal belief tracker (NMBT) to demonstrate how multimodal evidence can facilitate semantic understanding and dialogue state tracking. Given the multimodal inputs, the model applies a textual encoder to represent textual utterances while giving special consideration to the semantics revealed in the visual modality. It learns concept-level fashion semantics by delving into image sub-regions and integrating concept probabilities via multiple instance learning. Then, at each turn, an adaptive attention mechanism learns to automatically emphasize different evidence sources from both the visual and textual modalities for more accurate dialogue state prediction. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain, and the results show that our method achieves superior performance compared to a wide range of baselines.
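To make the two visual mechanisms named above concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes noisy-OR pooling as the multiple-instance aggregator (a common choice for integrating region-level concept probabilities) and a simple learned gate as the adaptive attention over the two evidence sources. All layer names, dimensions, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyOrMIL(nn.Module):
    """Pool region-level concept probabilities into image-level ones.

    Noisy-OR aggregation (an assumption, not confirmed by the abstract):
    P(concept | image) = 1 - prod_r (1 - P(concept | region_r)).
    """

    def __init__(self, region_dim: int, num_concepts: int):
        super().__init__()
        self.concept_head = nn.Linear(region_dim, num_concepts)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, region_dim), e.g. CNN sub-region features
        p_region = torch.sigmoid(self.concept_head(regions))
        # Probability that at least one sub-region exhibits each concept
        return 1.0 - torch.prod(1.0 - p_region, dim=1)  # (batch, num_concepts)


class AdaptiveAttentionFusion(nn.Module):
    """Per-turn soft gate over a textual and a visual evidence vector."""

    def __init__(self, text_dim: int, concept_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(concept_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, text_vec: torch.Tensor, concept_vec: torch.Tensor) -> torch.Tensor:
        # Project both evidence sources into a shared space: (batch, 2, hidden_dim)
        sources = torch.stack(
            [self.text_proj(text_vec), self.visual_proj(concept_vec)], dim=1
        )
        # One scalar score per source, normalized across the two sources
        weights = F.softmax(self.score(torch.tanh(sources)).squeeze(-1), dim=1)
        # Weighted sum: the fused evidence vector for belief-state prediction
        return (weights.unsqueeze(-1) * sources).sum(dim=1)


# Usage with illustrative shapes: 36 image sub-regions, 512-d features
mil = NoisyOrMIL(region_dim=512, num_concepts=100)
fusion = AdaptiveAttentionFusion(text_dim=256, concept_dim=100, hidden_dim=128)
regions = torch.randn(4, 36, 512)        # a batch of 4 dialogue turns
utterance = torch.randn(4, 256)          # output of some textual encoder
fused = fusion(utterance, mil(regions))  # (4, 128), feeds a state-prediction head
```

The gate lets a turn such as "show me something in that color" lean on the visual concept probabilities, while purely textual turns keep their weight on the utterance encoding; the actual NMBT architecture and attention parameterization are detailed in the full paper.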


Published in

WWW '19: The World Wide Web Conference
May 2019, 3620 pages
ISBN: 9781450366748
DOI: 10.1145/3308558

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
