research-article

MPMRC-MNER: A Unified MRC framework for Multimodal Named Entity Recognition based Multimodal Prompt

Authors:
Xigang Bao, Renmin University of China, Beijing, China (ORCID: 0009-0002-3250-2403)
Mengyuan Tian, Renmin University of China, Beijing, China (ORCID: 0009-0006-0565-3838)
Zhiyuan Zha, Renmin University of China, Beijing, China (ORCID: 0000-0001-8702-4088)
Biao Qin, Renmin University of China, Beijing, China (ORCID: 0000-0002-4304-675X)

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, October 2023, Pages 47–56
https://doi.org/10.1145/3583780.3614975

Published: 21 October 2023

ABSTRACT

Multimodal named entity recognition (MNER) is a vision-language task that aims to detect entity spans in a sentence-image pair and classify them into the corresponding entity types. Existing methods often regard an image as a set of visual objects and try to explicitly capture the relations between visual objects and entities. However, because visual objects often differ from entities in both quantity and type, these methods may suffer from the bias that visual objects introduce rather than benefit from them. Inspired by the success of textual prompt-based fine-tuning (PF) approaches, we propose MPMRC-MNER, a multimodal prompt-based machine reading comprehension (MRC) framework that implicitly aligns text and image to improve MNER. Specifically, we transform the text-only query in MRC into a multimodal prompt containing image tokens and text tokens. To better integrate the image tokens and text tokens, we design a prompt-aware attention mechanism for cross-modal fusion. Finally, contrastive learning with two types of contrastive losses is designed to learn more consistent representations of the two modalities and to reduce noise. Extensive experiments and analyses on two public MNER datasets, Twitter2015 and Twitter2017, show that our model outperforms state-of-the-art methods.
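
As a rough illustration of how a contrastive objective can pull the representations of two modalities together, the sketch below computes a generic symmetric InfoNCE-style loss over paired text and image vectors. It is not the authors' formulation: the abstract does not specify the two contrastive losses, and the function names (`info_nce`, `symmetric_contrastive_loss`), the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss, assuming anchors[i] is the positive match of candidates[i].

    All other candidates in the batch act as negatives for anchor i.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the matching candidate.
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(text_vecs, image_vecs, temperature=0.1):
    """Average of the text-to-image and image-to-text InfoNCE losses (hypothetical helper)."""
    return 0.5 * (
        info_nce(text_vecs, image_vecs, temperature)
        + info_nce(image_vecs, text_vecs, temperature)
    )


# Toy 2-D representations: pair i of text_vecs belongs with pair i of the image list.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]
swapped_images = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = symmetric_contrastive_loss(text_vecs, aligned_images)
swapped_loss = symmetric_contrastive_loss(text_vecs, swapped_images)
```

In this toy setting the loss is lower when each text vector points in roughly the same direction as its paired image vector than when the pairs are swapped, which is the behavior a contrastive alignment term is meant to reward.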

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published in
CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN: 9798400701245
DOI: 10.1145/3583780
General Chairs:
Ingo Frommholz, University of Wolverhampton, UK
Frank Hopfgartner, University of Koblenz, Germany
Mark Lee, University of Birmingham, UK
Michael Oakes, University of Birmingham, UK
Program Chairs:
Mounia Lalmas, Spotify, UK
Min Zhang, Tsinghua University, China
Rodrygo Santos, Federal University of Minas Gerais, Brazil
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
contrastive learning
multimodal named entity recognition
multimodal prompt
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%