Abstract
When viewing ancient artworks, people try to build connections with them to ‘read’ the correct messages from the past. A proper descriptive caption is essential for viewers to attain universal understanding and cognitive appreciation. Recent advances in tailoring deep learning to image analysis predominantly focus on generating captions for natural images. However, these techniques are ill-suited for interpreting ancient artworks, which exhibit diverse appearances, various design functions, and, more importantly, implicit cultural metaphors that can hardly be summarized in a single short caption or sentence. This work presents the design and implementation of a novel framework, termed ARTalk, for comprehensive image captioning for ancient artworks, with ceramics as the running case. First, we launch an exploratory study on understanding ancient artwork captions, elaborate 15 factors via semi-structured discussions with experts, and form a dedicated caption template based on a statistical importance analysis of the factors. Second, we build a dataset (i.e., CArt15K) with factor-granularity annotations on the visuals and texts of ceramics. Third, we jointly fine-tune multiple deep networks for automatic factor extraction and construct a knowledge graph for metaphor inference. We train the networks on CArt15K, evaluate their performance against baselines, and conduct a qualitative analysis of caption generation in practice. We have also implemented a prototype of ARTalk that interactively assists experts in caption generation. We will release the CArt15K dataset for further research.
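As a rough illustration of how a factor-based caption template might combine with a metaphor lookup, the following minimal Python sketch fills a template from extracted factors and appends any metaphors found in a toy knowledge graph. It is not the ARTalk implementation: the factor names (`shape`, `dynasty`, `patterns`), the template wording, the graph contents, and the function names are all hypothetical and chosen only for illustration, since the abstract does not list the 15 factors or the graph schema.

```python
# Hypothetical caption template with three illustrative factor slots.
# The actual template and its 15 factors are not given in the abstract.
TEMPLATE = "This {shape} from the {dynasty} dynasty is decorated with {patterns} motifs."

# Toy metaphor knowledge graph stored as (motif, relation) -> meaning edges.
# The two entries are illustrative examples, not data from CArt15K.
METAPHOR_GRAPH = {
    ("bat", "symbolizes"): "good fortune",
    ("peony", "symbolizes"): "wealth and honor",
}


def infer_metaphors(patterns):
    """Return (motif, meaning) pairs for motifs that have an edge in the toy graph."""
    return [
        (motif, METAPHOR_GRAPH[(motif, "symbolizes")])
        for motif in patterns
        if (motif, "symbolizes") in METAPHOR_GRAPH
    ]


def generate_caption(factors):
    """Fill the template from extracted factors, then append inferred metaphors."""
    patterns = factors["patterns"]
    sentences = [
        TEMPLATE.format(
            shape=factors["shape"],
            dynasty=factors["dynasty"],
            patterns=" and ".join(patterns),
        )
    ]
    for motif, meaning in infer_metaphors(patterns):
        sentences.append(f"The {motif} symbolizes {meaning}.")
    return " ".join(sentences)


# Factors as they might come out of an upstream extraction step (assumed here).
example = {"shape": "vase", "dynasty": "Qing", "patterns": ["bat", "cloud"]}
caption = generate_caption(example)
print(caption)
```

In this sketch the visual factors are simply given as a dictionary; in a pipeline of the kind the abstract describes, they would instead be predicted by the fine-tuned networks before the template and graph lookup are applied.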
Data availability
The data are available from the corresponding author on reasonable request.
Notes
Our metaphor knowledge graph is constructed in the context of Chinese culture.
Funding
This work is supported by the National Natural Science Foundation of China (No. 62172155).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The data from the Taipei Palace Museum and the Metropolitan Museum of Art have no use restrictions. We only use the collected data from the Beijing Palace Museum for the test experiment.
Additional information
Communicated by M. Buzzelli.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zheng, B., Liu, F., Zhang, M. et al. Image captioning for cultural artworks: a case study on ceramics. Multimedia Systems 29, 3223–3243 (2023). https://doi.org/10.1007/s00530-023-01178-8
DOI: https://doi.org/10.1007/s00530-023-01178-8