
Entity recognition based on heterogeneous graph reasoning of visual region and text candidate

Published in: Machine Learning (2024)

Abstract

Entity recognition plays a crucial role in various domains, such as natural language processing, information retrieval, and question-answering systems. While significant progress has been made in recognizing entities from plain text, entity recognition from multimodal data remains underexplored due to disparities in semantic representation across modalities. To address this challenge, we propose a novel entity recognition model, Heterogeneous Graph Reasoning (HGR), which leverages the complementary nature of visual and textual data. HGR uses image objects to support text entity extraction by mining latent correspondences between textual entities and image objects, realized through two modules: Vision Refine and Graph Cross Inference. The Vision Refine module selects semantically relevant objects hidden in the image to aid text entity extraction. The Graph Cross Inference module builds cross-association inference between visual regions and textual entities through graph construction, heterogeneous graph fusion, visual region refinement, and cross inference. To validate the effectiveness of our model, we conduct extensive experiments on four multimodal datasets: two from Chinese unmanned surface vehicle and journalism scenarios (USV and NEWS) and two public English multimodal datasets (Twitter-2015 and Twitter-2017). The results demonstrate the superiority of our model, with F1-score improvements of 1.55%, 0.12%, 0.22%, and 0.99% on the four datasets, respectively, compared to the second-best state-of-the-art model.
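To make the two-stage idea in the abstract concrete, the following is a minimal, hypothetical sketch in PyTorch of how a Vision Refine step and a Graph Cross Inference step could be wired together. It is not the authors' released implementation (which is not yet public): the class names `VisionRefine` and `GraphCrossInference`, the bilinear relevance scoring, the `top_k` selection, and the single cross-attention step standing in for heterogeneous graph fusion are all illustrative assumptions.

```python
# Hypothetical sketch of the HGR pipeline described in the abstract; all module
# names and design details below are assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class VisionRefine(nn.Module):
    """Select the top-k image object features most relevant to the sentence."""

    def __init__(self, dim: int, top_k: int = 3):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)  # relevance score for (object, sentence) pairs
        self.top_k = top_k

    def forward(self, objects: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # objects: (num_objects, dim); sentence: (dim,)
        scores = self.score(objects, sentence.expand_as(objects)).squeeze(-1)
        keep = scores.topk(min(self.top_k, objects.size(0))).indices
        return objects[keep]  # refined, semantically relevant visual regions


class GraphCrossInference(nn.Module):
    """Let each text-token node attend over visual-region nodes; a single
    cross-attention step standing in for the paper's graph fusion and cross inference."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # text: (num_tokens, dim); regions: (num_regions, dim)
        attn = torch.softmax(self.q(text) @ self.k(regions).T / text.size(-1) ** 0.5, dim=-1)
        return text + attn @ self.v(regions)  # visually enriched token features


if __name__ == "__main__":
    dim = 32
    tokens = torch.randn(6, dim)    # placeholder for contextual token embeddings
    objects = torch.randn(10, dim)  # placeholder for detected object/region features
    refined = VisionRefine(dim)(objects, tokens.mean(0))
    fused = GraphCrossInference(dim)(tokens, refined)
    print(fused.shape)  # torch.Size([6, 32]); a full NER model would feed these to a tagger
```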


Data availability

The Twitter-15 and Twitter-17 datasets are available at https://github.com/jefferyYu/UMT. Our expanded datasets will be made publicly available after paper publication.

Code availability

Our code will be made publicly available after paper publication.


Funding

This work is partially supported by the National Natural Science Foundation of China (Grant nos. 72204155 and 62202282) and by the Shanghai Youth Science and Technology Talents Sailing Program (Grant no. 22YF1413700).

Author information

Authors and Affiliations

Authors

Contributions

XW directed the study from idea and experiments to writing. NZ contributed to the idea, experiments, and theoretical analysis. JL contributed to the experiments and experimental analysis. YC and ZL contributed to the writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Nengjun Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

This work does not involve human subjects or animals and therefore raises no ethical concerns.

Consent to participate

Not applicable.

Consent for publication

All authors consent to submission and publication.

Additional information

Editors: Bingxue Zhang, Feida Zhu, Bin Yang, João Gama

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, X., Zhu, N., Li, J. et al. Entity recognition based on heterogeneous graph reasoning of visual region and text candidate. Mach Learn (2024). https://doi.org/10.1007/s10994-023-06456-0

