
A multimodal dialogue system for improving user satisfaction via knowledge-enriched response and image recommendation

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Task-oriented multimodal dialogue systems have important application value and promising development prospects. Existing methods have made significant progress, but the following challenges remain: (1) Most existing methods focus on improving the accuracy of dialogue state tracking and dialogue act prediction, while overlooking the need to leverage knowledge from the knowledge base to supplement textual responses in multi-turn dialogues. (2) A feature that distinguishes multimodal dialogue from plain-text dialogue is the use of visual information; however, existing methods ignore the importance of accurately providing visual information to improve user satisfaction. (3) Most existing multimodal dialogue systems do not classify response types and therefore cannot automatically assign the appropriate response generator. To address these issues, we present a user-satisfactory multimodal dialogue system, USMD for short. Specifically, USMD consists of four modules. The general response generator, built on GPT-2, generates dialogue acts and general textual responses. The knowledge-enriched response generator leverages a structured knowledge base under the guidance of a knowledge graph. The image recommender attends to both latent and explicit visual cues, using a deep multimodal fusion model to obtain informative image representations. Finally, the response classifier automatically selects the appropriate generator to answer the user based on user and agent actions. Extensive experiments on benchmark multimodal dialogue datasets show that the proposed USMD model achieves state-of-the-art performance.
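The abstract describes a four-module pipeline. As a rough illustration only, and not the authors' implementation, the sketch below shows how a response classifier could route each turn to a general generator, a knowledge-enriched generator, or an image recommender. All class names, the keyword-based act heuristic, and the toy word-overlap ranking are hypothetical stand-ins for the GPT-2, knowledge-graph, and deep multimodal fusion components described above.

```python
# Minimal, runnable sketch (not the authors' code) of the USMD control flow:
# a response classifier routes each turn to one of three hypothetical modules.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One dialogue turn: the user utterance plus any attached image ids."""
    utterance: str
    image_ids: List[str] = field(default_factory=list)


class GeneralResponseGenerator:
    """Stand-in for the GPT-2-based generator of dialogue acts and plain text."""

    def predict_act(self, history: List[Turn]) -> str:
        # A real system would decode the act with GPT-2; this keyword
        # heuristic is purely illustrative.
        last = history[-1].utterance.lower()
        if "show" in last or "picture" in last:
            return "request_image"
        if "rating" in last or "address" in last:
            return "request_knowledge"
        return "general"

    def generate(self, history: List[Turn]) -> str:
        return "Sure, happy to help with that."


class KnowledgeEnrichedGenerator:
    """Stand-in for the knowledge-graph-guided generator over a structured KB."""

    def __init__(self, kb: Dict[str, Dict[str, str]]):
        self.kb = kb  # entity -> {attribute: value}

    def generate(self, history: List[Turn], entity: str) -> str:
        facts = ", ".join(f"{k}: {v}" for k, v in self.kb.get(entity, {}).items())
        return f"Here is what I found about {entity}: {facts}."


class ImageRecommender:
    """Stand-in for the multimodal-fusion image ranker."""

    def __init__(self, catalogue: Dict[str, str]):
        self.catalogue = catalogue  # image id -> textual description

    def recommend(self, history: List[Turn], top_k: int = 1) -> List[str]:
        # Toy relevance score: word overlap between query and description.
        query = set(history[-1].utterance.lower().split())
        ranked = sorted(
            self.catalogue,
            key=lambda i: -len(query & set(self.catalogue[i].lower().split())),
        )
        return ranked[:top_k]


def respond(history, general, knowledge, images, entity="venue"):
    """Response classifier: choose the generator from the predicted act."""
    act = general.predict_act(history)
    if act == "request_image":
        return {"act": act, "images": images.recommend(history)}
    if act == "request_knowledge":
        return {"act": act, "text": knowledge.generate(history, entity)}
    return {"act": act, "text": general.generate(history)}


if __name__ == "__main__":
    history = [Turn("Can you show me a picture of the cafe?")]
    print(respond(
        history,
        GeneralResponseGenerator(),
        KnowledgeEnrichedGenerator({"venue": {"rating": "4.5 / 5"}}),
        ImageRecommender({"img_01": "photo of the cafe interior"}),
    ))
```

In the paper itself, the dialogue act is produced by the GPT-2-based generator and image relevance comes from a deep multimodal fusion model; the stubs above only mirror the routing logic, not those components.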



Acknowledgements

This work was supported by grants from the Scientific Research Programme of the Beijing Municipal Education Commission (KZ202110011017), the Natural Science Foundation of Shandong Province (ZR2020MF136), and the Open Research Fund of the Beijing Key Laboratory of Big Data Technology for Food Safety (BTBD-2020KF05), Beijing Technology and Business University.

Author information

Corresponding author

Correspondence to Chunlei Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article, and no other conflicts of interest relevant to its content.

Data Availability Statement

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, J., Li, H., Wang, L. et al. A multimodal dialogue system for improving user satisfaction via knowledge-enriched response and image recommendation. Neural Comput & Applic 35, 13187–13206 (2023). https://doi.org/10.1007/s00521-023-08409-z
