Abstract
Task-oriented multimodal dialogue systems have important application value and promising development prospects. Although existing methods have made significant progress, three challenges remain: (1) Most existing methods focus on improving the accuracy of dialogue state tracking and dialogue act prediction, but neglect the need to leverage knowledge from the knowledge base to enrich textual responses in multi-turn dialogues. (2) A feature that distinguishes multimodal dialogue from plain-text dialogue is the use of visual information, yet existing methods overlook the importance of accurately providing visual information to improve user satisfaction. (3) Most existing multimodal dialogue methods ignore the classification of response types, which is needed to assign appropriate response generators automatically. To address these issues, we present a user-satisfactory multimodal dialogue system, USMD for short. USMD consists of four modules. The general response generator, built on generative pre-training 2.0 (GPT-2), produces dialogue acts and general textual responses. The knowledge-enriched response generator leverages a structured knowledge base under the guidance of a knowledge graph. The image recommender attends to both latent and explicit visual cues, employing a deep multimodal fusion model to obtain informative image representations. Finally, the response classifier automatically selects the appropriate generator to answer the user based on user and agent actions. Extensive experiments on benchmark multimodal dialogue datasets show that the proposed USMD model achieves state-of-the-art performance.
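The four-module design described above can be illustrated with a minimal sketch. All function names, the rule-based routing, and the toy knowledge base below are illustrative assumptions for exposition only; the paper's actual modules are neural models (a GPT-2 generator, a knowledge-graph-guided generator, a multimodal fusion image recommender, and a learned response classifier), not these stand-ins.

```python
from dataclasses import dataclass

# Hypothetical sketch of the USMD pipeline; every name here is an
# assumption, not the paper's implementation.

@dataclass
class DialogueTurn:
    user_utterance: str

def general_response(turn: DialogueTurn) -> str:
    # Stand-in for the GPT-2-based generator that produces dialogue
    # acts and a general textual response.
    return f"[general] reply to: {turn.user_utterance}"

def knowledge_enriched_response(turn: DialogueTurn, kb: dict) -> str:
    # Stand-in for the knowledge-graph-guided generator that grounds
    # the reply in structured knowledge-base entries.
    facts = [v for k, v in kb.items() if k in turn.user_utterance.lower()]
    return f"[knowledge] {'; '.join(facts) or 'no matching facts'}"

def recommend_images(turn: DialogueTurn, image_index: dict) -> list:
    # Stand-in for the image recommender that fuses latent and
    # explicit visual cues to pick informative images.
    return [img for tag, img in image_index.items()
            if tag in turn.user_utterance.lower()]

def respond(turn: DialogueTurn, kb: dict, image_index: dict):
    # Stand-in for the response classifier: route the turn to the
    # appropriate generator based on simple surface cues.
    text = turn.user_utterance.lower()
    if any(tag in text for tag in image_index):
        return recommend_images(turn, image_index)
    if any(key in text for key in kb):
        return knowledge_enriched_response(turn, kb)
    return general_response(turn)

kb = {"hotel": "Hotel A has free Wi-Fi"}
image_index = {"photo": "hotel_a.jpg"}
print(respond(DialogueTurn("show me a photo"), kb, image_index))
print(respond(DialogueTurn("tell me about the hotel"), kb, image_index))
```

The design point the sketch captures is the one stated in the abstract: classifying the response type first, then delegating to a specialized generator, rather than forcing one generator to handle text, knowledge grounding, and image selection at once.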
Acknowledgements
This work is supported by grants from the Scientific Research Programme of Beijing Municipal Education Commission (KZ202110011017), the Natural Science Foundation of Shandong Province (ZR2020MF136), and the Open Research Fund of the Beijing Key Laboratory of Big Data Technology for Food Safety (No. BTBD-2020KF05), Beijing Technology and Business University.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
Data Availability Statement
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, J., Li, H., Wang, L. et al. A multimodal dialogue system for improving user satisfaction via knowledge-enriched response and image recommendation. Neural Comput & Applic 35, 13187–13206 (2023). https://doi.org/10.1007/s00521-023-08409-z