Image captioning for cultural artworks: a case study on ceramics

Zheng, Baoying; Liu, Fang; Zhang, Mohan; Zhou, Tongqing; Cui, Shenglan; Ye, Yunfan; Guo, Yeting

doi:10.1007/s00530-023-01178-8

Regular Paper
Published: 23 September 2023

Multimedia Systems, Volume 29, pages 3223–3243 (2023)

Baoying Zheng¹, Fang Liu¹, Mohan Zhang¹, Tongqing Zhou², Shenglan Cui¹, Yunfan Ye¹ & Yeting Guo²

Abstract

When viewing ancient artworks, people try to build connections with them to ‘read’ the correct messages from the past. A proper descriptive caption is essential for viewers to attain universal understanding and cognitive appreciation. Recent advances in tailoring deep learning for image analysis predominantly focus on generating captions for natural images. However, these techniques are ill-suited for interpreting ancient artworks, which exhibit diverse appearances, various design functions, and, more importantly, implicit cultural metaphors that can hardly be summarized in a short caption or sentence. This work presents the design and implementation of a novel framework, termed ARTalk, for comprehensive image captioning of ancient artworks, with ceramics as the running case. First, we conduct an exploratory study on understanding ancient artwork captions, elaborate 15 factors via semi-structured discussions with experts, and form a dedicated caption template based on a statistical importance analysis of the factors. Second, we build a dataset (i.e., CArt15K) with factor-granularity annotations on the visuals and texts of ceramics. Third, we jointly fine-tune multiple deep networks for automatic factor extraction and construct a knowledge graph for metaphor inference. We train the networks on CArt15K, evaluate performance against baselines, and conduct a qualitative analysis of practical generation. We have also implemented a prototype of ARTalk for interactively assisting experts in caption generation. We will release the CArt15K dataset for further research.
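To make the pipeline in the abstract more concrete, the following is a minimal Python sketch of how extracted factors and a small metaphor knowledge graph might be combined through a caption template. It is not the authors' implementation: the factor names (`dynasty`, `glaze`, `shape`, `motifs`), the triples, the template wording, and all function names are assumptions chosen for illustration, and the deep networks that would extract the factors are replaced by a hand-written dictionary.

```python
# Illustrative sketch only: factor names, triples, and template wording are
# hypothetical and do not come from the paper or the CArt15K dataset.

# Toy metaphor knowledge graph as (motif, relation, meaning) triples.
METAPHOR_TRIPLES = [
    ("peony", "symbolizes", "prosperity"),
    ("bat", "symbolizes", "good fortune"),
    ("lotus", "symbolizes", "purity"),
]


def build_graph(triples):
    """Index triples by head entity for simple lookup."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
    return graph


def infer_metaphors(graph, motifs):
    """Collect the meanings linked to each detected motif, without duplicates."""
    meanings = []
    for motif in motifs:
        for relation, tail in graph.get(motif, []):
            if relation == "symbolizes" and tail not in meanings:
                meanings.append(tail)
    return meanings


def fill_template(factors, meanings):
    """Fill a fixed caption template with factor values and inferred meanings."""
    caption = f"A {factors['dynasty']} dynasty {factors['glaze']} {factors['shape']}"
    motifs = factors.get("motifs", [])
    if motifs:
        caption += f" decorated with {' and '.join(motifs)} motifs"
    caption += "."
    if meanings:
        caption += f" The decoration conveys {' and '.join(meanings)}."
    return caption


# Stand-in for the output of the factor-extraction networks.
factors = {
    "dynasty": "Qing",
    "glaze": "famille-rose",
    "shape": "vase",
    "motifs": ["peony", "bat"],
}

graph = build_graph(METAPHOR_TRIPLES)
caption = fill_template(factors, infer_metaphors(graph, factors["motifs"]))
print(caption)
```

In this sketch, the template handles the explicit, visually grounded factors, while the graph lookup supplies the implicit cultural meaning that a plain image captioner would have no way to produce from pixels alone.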

Data availability

The data are available from the corresponding author on reasonable request.

Notes

https://github.com/doccano/doccano.
Our metaphor knowledge graph is constructed in the context of Chinese culture.

Funding

This work is supported by The National Natural Science Foundation of China (No. 62172155).

Author information

Authors and Affiliations

School of Design, Hunan University, Juzizhou, Changsha, 410082, Hunan, China
Baoying Zheng, Fang Liu, Mohan Zhang, Shenglan Cui & Yunfan Ye
College of Computer, National University of Defense Technology, Dongfeng Road, Changsha, 410073, Hunan, China
Tongqing Zhou & Yeting Guo

Corresponding authors

Correspondence to Fang Liu or Yunfan Ye.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The data from the Taipei Palace Museum and the Metropolitan Museum of Art have no use restrictions. We only use the collected data from the Beijing Palace Museum for the test experiment.

Additional information

Communicated by M. Buzzelli.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: User questionnaire

See Figs. 10 and 11.

Appendix B: Exploratory study

See Table 8.

Table 8 Definitions, examples, frequency statistics, and types of the 15 basic factors elaborated for ceramics description

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zheng, B., Liu, F., Zhang, M. et al. Image captioning for cultural artworks: a case study on ceramics. Multimedia Systems 29, 3223–3243 (2023). https://doi.org/10.1007/s00530-023-01178-8

Received: 06 March 2023
Accepted: 29 August 2023
Published: 23 September 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s00530-023-01178-8
