
Entity recognition based on heterogeneous graph reasoning of visual region and text candidate

Published in: Machine Learning (2024)

Abstract

Entity recognition plays a crucial role in various domains, such as natural language processing, information retrieval, and question-answering systems. While significant progress has been made in recognizing entities from plain text, entity recognition from multimodal data remains underexplored due to disparities in semantic representation across modalities. To address this challenge, we propose a novel entity recognition model, Heterogeneous Graph Reasoning (HGR), which leverages the complementary nature of visual and textual data. HGR uses image objects to support text entity extraction by mining latent correspondences between textual entities and image objects, realized through two modules: Vision Refine and Graph Cross Inference. The Vision Refine module selects semantically relevant objects hidden in the image to aid text entity extraction. The Graph Cross Inference module builds cross-association inference between visual regions and textual entities through graph construction, heterogeneous graph fusion, visual region refinement, and cross inference. To validate the effectiveness of our model, we conduct extensive experiments on four multimodal datasets: two from Chinese unmanned surface vehicle and journalism scenarios (USV and NEWS) and two public English multimodal datasets (Twitter-2015 and Twitter-2017). The results demonstrate the superiority of our model, with F1-score improvements of 1.55%, 0.12%, 0.22%, and 0.99% on the four datasets, respectively, compared to the second-best state-of-the-art model.
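To make the two-stage idea in the abstract concrete, the following is a minimal, hypothetical sketch in PyTorch of how a Vision Refine step and a Graph Cross Inference step could be wired together. It is not the authors' released implementation (which is not yet public): the class names `VisionRefine` and `GraphCrossInference`, the bilinear relevance scoring, the `top_k` selection, and the single cross-attention step standing in for heterogeneous graph fusion are all illustrative assumptions.

```python
# Hypothetical sketch of the HGR pipeline described in the abstract; all module
# names and design details below are assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class VisionRefine(nn.Module):
    """Select the top-k image object features most relevant to the sentence."""

    def __init__(self, dim: int, top_k: int = 3):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)  # relevance score for (object, sentence) pairs
        self.top_k = top_k

    def forward(self, objects: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # objects: (num_objects, dim); sentence: (dim,)
        scores = self.score(objects, sentence.expand_as(objects)).squeeze(-1)
        keep = scores.topk(min(self.top_k, objects.size(0))).indices
        return objects[keep]  # refined, semantically relevant visual regions


class GraphCrossInference(nn.Module):
    """Let each text-token node attend over visual-region nodes; a single
    cross-attention step standing in for the paper's graph fusion and cross inference."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # text: (num_tokens, dim); regions: (num_regions, dim)
        attn = torch.softmax(self.q(text) @ self.k(regions).T / text.size(-1) ** 0.5, dim=-1)
        return text + attn @ self.v(regions)  # visually enriched token features


if __name__ == "__main__":
    dim = 32
    tokens = torch.randn(6, dim)    # placeholder for contextual token embeddings
    objects = torch.randn(10, dim)  # placeholder for detected object/region features
    refined = VisionRefine(dim)(objects, tokens.mean(0))
    fused = GraphCrossInference(dim)(tokens, refined)
    print(fused.shape)  # torch.Size([6, 32]); a full NER model would feed these to a tagger
```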


Data availability

The Twitter-15 and Twitter-17 datasets are available at https://github.com/jefferyYu/UMT. Our expanded datasets will be made publicly available after paper publication.

Code availability

Our code will be made publicly available after paper publication.


Funding

This work is partially supported by the National Natural Science Foundation of China (Grant nos. 72204155 and 62202282) and by the Shanghai Youth Science and Technology Talents Sailing Program (Grant no. 22YF1413700).

Author information

Authors and Affiliations

Authors

Contributions

XW directed the study from idea and experiments to writing. NZ contributed to the idea, experiments, and theoretical analysis. JL contributed to the experiments and experimental analysis. YC and ZL contributed to the writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Nengjun Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

This work does not involve human subjects or animals and therefore raises no ethical concerns.

Consent to participate

Not applicable.

Consent for publication

All authors consent to submission and publication.

Additional information

Editors: Bingxue Zhang, Feida Zhu, Bin Yang, João Gama

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, X., Zhu, N., Li, J. et al. Entity recognition based on heterogeneous graph reasoning of visual region and text candidate. Mach Learn (2024). https://doi.org/10.1007/s10994-023-06456-0

