Object Category-Based Visual Dialog for Effective Question Generation

  • Conference paper
  • Published in: Computational Visual Media (CVM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14593)

Abstract

GuessWhat?! is a visual dialog dataset consisting of series of goal-oriented questions and answers between a questioner and an answerer. The task is for the questioner to identify the target object in an image based on the dialogue history. A key challenge for the questioner model is to generate informative, strategic questions that effectively narrow down the search space. However, previous models lack questioning strategies and rely only on the visual features of objects, ignoring their category information, which leads to uninformative, redundant, or irrelevant questions. To overcome this limitation, we propose an Object-Category based Visual Dialogue (OCVD) model that leverages the category information of objects to generate more diverse and instructive questions. Our model incorporates a category selection module that dynamically updates the category information according to the answers, and it adopts a linear category-based search strategy. We evaluate our model on the GuessWhat?! dataset and demonstrate its superiority over previous methods in terms of both generation quality and dialogue effectiveness.
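To make the abstract's linear category-based search concrete, here is a minimal sketch assuming the questioner knows each candidate object's detected category and can query a yes/no answerer. All identifiers (category_search, oracle) and the question template are hypothetical illustrations, not taken from the paper; the actual OCVD questioner generates natural-language questions and updates category beliefs with a learned category selection module.

```python
# Hypothetical sketch of a linear category-based search; names and the
# question template are illustrative, not the OCVD model itself.
from collections import Counter


def category_search(objects, oracle):
    """Prune candidate objects by asking one yes/no question per category.

    objects: list of (object_id, category) pairs, e.g. from an object detector.
    oracle:  callable mapping a question string to "yes"/"no", standing in
             for the GuessWhat?! answerer.
    """
    candidates = list(objects)
    # Visit categories linearly, most frequent first, so each answer
    # removes as many candidates as possible.
    for category, _ in Counter(c for _, c in candidates).most_common():
        if len({c for _, c in candidates}) <= 1:
            break  # one category left; switch to attribute/spatial questions
        if oracle(f"Is it a {category}?") == "yes":
            candidates = [(i, c) for i, c in candidates if c == category]
        else:
            candidates = [(i, c) for i, c in candidates if c != category]
    return candidates


# Example: three detected objects, target is object 2 (a dog).
answers = lambda q: "yes" if "dog" in q else "no"
print(category_search([(1, "cat"), (2, "dog"), (3, "dog")], answers))
```

This sketch only captures the pruning logic of a linear search over categories; in the full model the category information carried forward after each answer is what drives question generation.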

Author information

Corresponding author

Correspondence to Yingchen Zhou.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xu, F., Zhou, Y., Zhong, Z., Li, G. (2024). Object Category-Based Visual Dialog for Effective Question Generation. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14593. Springer, Singapore. https://doi.org/10.1007/978-981-97-2092-7_16

  • DOI: https://doi.org/10.1007/978-981-97-2092-7_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2091-0

  • Online ISBN: 978-981-97-2092-7

  • eBook Packages: Computer Science, Computer Science (R0)
