Abstract
GuessWhat?! is a goal-oriented visual dialogue dataset in which a questioner asks a series of yes/no questions of an answerer in order to identify a target object in an image from the dialogue history. A key challenge for the questioner model is to generate informative, strategic questions that effectively narrow the search space. Previous models, however, lack an explicit questioning strategy and rely only on the visual features of objects, ignoring their category information, which leads to uninformative, redundant, or irrelevant questions. To overcome this limitation, we propose an Object-Category based Visual Dialogue (OCVD) model that leverages object category information to generate more diverse and instructive questions. Our model incorporates a category selection module that dynamically updates category information according to the answers and adopts a linear category-based search strategy. We evaluate our model on the GuessWhat?! dataset and demonstrate its superiority over previous methods in terms of generation quality and dialogue effectiveness.
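The abstract's "linear category-based search strategy" can be illustrated with a minimal sketch. This is not the authors' implementation; the candidate representation, the frequency-ordered traversal, and the `oracle` interface are all assumptions made for illustration: the questioner queries one object category at a time and prunes the candidate set from each yes/no answer.

```python
# Hypothetical sketch of a linear category-based search, in the spirit of
# the OCVD questioner described in the abstract (not the authors' code).
# Candidate objects are (object_id, category) pairs; the oracle stands in
# for the answerer and returns True iff the target has the asked category.

def category_search(objects, oracle):
    """Prune candidates by asking about one category at a time."""
    candidates = list(objects)
    dialogue = []
    # Linear strategy (assumed here): visit distinct categories,
    # most frequent among the candidates first.
    categories = sorted(
        {c for _, c in candidates},
        key=lambda c: -sum(1 for _, cc in candidates if cc == c),
    )
    for cat in categories:
        answer = oracle(cat)
        dialogue.append((f"Is it a {cat}?", "yes" if answer else "no"))
        if answer:
            # "Yes" narrows the search space to that category.
            candidates = [o for o in candidates if o[1] == cat]
            break
        # "No" eliminates the whole category from consideration.
        candidates = [o for o in candidates if o[1] != cat]
    return candidates, dialogue

objects = [(1, "person"), (2, "person"), (3, "dog"), (4, "car")]
remaining, dialogue = category_search(objects, lambda c: c == "dog")
```

In this toy run the first question rules out both "person" objects at once, which is the intuition behind using category information: one answer can eliminate an entire group of visually distinct candidates.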
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xu, F., Zhou, Y., Zhong, Z., Li, G. (2024). Object Category-Based Visual Dialog for Effective Question Generation. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14593. Springer, Singapore. https://doi.org/10.1007/978-981-97-2092-7_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2091-0
Online ISBN: 978-981-97-2092-7