
Image captioning for cultural artworks: a case study on ceramics

  • Regular Paper
  • Published in Multimedia Systems (2023)

Abstract

When viewing ancient artworks, people try to build connections with them to ‘read’ the correct messages from the past. A proper descriptive caption is essential for viewers to attain universal understanding and cognitive appreciation. Recent advances in tailoring deep learning for image analysis predominantly focus on generating captions for natural images. However, these techniques are ill-suited for interpreting ancient artworks, which exhibit distinctive appearances, varied design functions, and, more importantly, implicit cultural metaphors that can hardly be summarized in a short caption or sentence. This work presents the design and implementation of a novel framework, termed ARTalk, for comprehensive image captioning of ancient artworks, with ceramics as the running case. First, we launch an exploratory study on understanding ancient artwork captions, identify 15 factors through semi-structured discussions with experts, and derive a dedicated caption template via statistical importance analysis of the factors. Second, we build a dataset (i.e., CArt15K) with factor-granularity annotations on the visuals and texts of ceramics. Third, we jointly fine-tune multiple deep networks for automatic factor extraction and construct a knowledge graph for metaphor inference. We train the networks on CArt15K, evaluate their performance against baselines, and conduct qualitative analysis of practical caption generation. We have also implemented a prototype of ARTalk for interactively assisting experts in caption generation, and we will release the CArt15K dataset for further research.
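
To make the three-stage flow in the abstract concrete, the following is a minimal Python sketch of an ARTalk-style pipeline: per-factor attribute extraction, template-based caption filling, and metaphor lookup. All function names, factor names, the template, and the example values are hypothetical illustrations, not the paper's actual models, caption template, or 15 factors.

    from typing import Dict, List

    # Hypothetical caption template; the paper derives its own from
    # statistical importance analysis of the 15 factors.
    TEMPLATE = "This {dynasty} {shape} features a {glaze} glaze decorated with {motifs}."

    def extract_factors(image_path: str) -> Dict[str, str]:
        # Stand-in for the jointly fine-tuned deep networks; returns
        # fixed values so the sketch runs without any model weights.
        return {
            "dynasty": "Qing",
            "shape": "vase",
            "glaze": "blue-and-white",
            "motifs": "bats and peaches",
        }

    def infer_metaphors(motifs: List[str]) -> List[str]:
        # Stand-in for the metaphor knowledge-graph lookup (cf. Note 2).
        lookup = {"bats": "happiness", "peaches": "longevity"}
        return [lookup[m] for m in motifs if m in lookup]

    def generate_caption(image_path: str) -> str:
        factors = extract_factors(image_path)
        caption = TEMPLATE.format(**factors)
        motif_list = [m.strip() for m in factors["motifs"].replace(" and ", ", ").split(", ")]
        metaphors = infer_metaphors(motif_list)
        if metaphors:
            caption += " Together, these motifs convey " + " and ".join(metaphors) + "."
        return caption

    # Prints: "This Qing vase features a blue-and-white glaze decorated
    # with bats and peaches. Together, these motifs convey happiness and
    # longevity."
    print(generate_caption("ceramic.jpg"))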


Data availability

The data are available from the corresponding author on reasonable request.

Notes

  1. https://github.com/doccano/doccano.

  2. Our metaphor knowledge graph is constructed in the context of Chinese culture (see the illustrative sketch below).
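
For readers unfamiliar with such graphs, here is a minimal sketch of how motif-to-metaphor triples from Chinese iconography might be stored and queried, including a joint reading for a motif combination. The class name, relations, joint-key convention, and example triples are illustrative assumptions, not the paper's actual graph schema.

    from collections import defaultdict
    from typing import Dict, List, Tuple

    class MetaphorKG:
        """Toy motif->metaphor graph stored as (relation, metaphor) edges."""

        def __init__(self) -> None:
            self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

        def add(self, motif: str, relation: str, metaphor: str) -> None:
            self.edges[motif].append((relation, metaphor))

        def infer(self, motifs: List[str]) -> Dict[str, List[Tuple[str, str]]]:
            # Collect metaphors for each single motif, plus any joint
            # reading for the combination ("a+b" is a toy key convention).
            result = {m: self.edges[m] for m in motifs if m in self.edges}
            joint = "+".join(sorted(motifs))
            if joint in self.edges:
                result[joint] = self.edges[joint]
            return result

    kg = MetaphorKG()
    # Illustrative triples from common Chinese iconography: the bat (fu)
    # is a homophone of "fortune", and the peach symbolizes longevity.
    kg.add("bat", "symbolizes", "happiness")
    kg.add("peach", "symbolizes", "longevity")
    kg.add("bat+peach", "jointly_symbolizes", "happiness and long life")

    print(kg.infer(["bat", "peach"]))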


Funding

This work is supported by the National Natural Science Foundation of China (No. 62172155).

Author information


Corresponding authors

Correspondence to Fang Liu or Yunfan Ye.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The data from the Taipei Palace Museum and the Metropolitan Museum of Art have no use restrictions; the data collected from the Beijing Palace Museum are used only for the test experiment.

Additional information

Communicated by M. Buzzelli.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: User questionnaire

See Figs. 10 and 11.

Fig. 10

A 7-point Likert scale questionnaire on the ability of ARTalk-generated ceramic captions, descriptions, and metaphorical sentences to enhance cultural understanding

Fig. 11

A 7-point Likert scale questionnaire on the correctness, accuracy, readability, and usefulness of independent metaphors and joint metaphors

Appendix B: Exploratory study

See Table 8.

Table 8 Definitions, examples, frequency statistics, and types of the 15 basic factors identified for describing ceramics

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zheng, B., Liu, F., Zhang, M. et al. Image captioning for cultural artworks: a case study on ceramics. Multimedia Systems 29, 3223–3243 (2023). https://doi.org/10.1007/s00530-023-01178-8

