research-article

Free Access

Just Accepted

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Authors:
Yuankun Liu

School of Computer Science and Technology, North University of China, Taiyuan, China

School of Computer Science and Technology, North University of China, Taiyuan, China

0009-0009-3600-1586
View Profile

,
Xiang Yuan

School of Software and Microelectronics, Peking University, Beijing, China

School of Software and Microelectronics, Peking University, Beijing, China

0009-0003-4377-8031
View Profile

,
Haochen Li

School of Software and Microelectronics, Peking University, Beijing China

School of Software and Microelectronics, Peking University, Beijing China

0000-0002-3197-4073
View Profile

,
Zhijie Tan

School of Software and Microelectronics, Peking University, Beijing China

School of Software and Microelectronics, Peking University, Beijing China

0000-0003-4934-1001
View Profile

,
Jinsong Huang

School of Software and Microelectronics, Peking University, Beijing China

School of Software and Microelectronics, Peking University, Beijing China

0000-0003-4337-7003
View Profile

,
Jingjie Xiao

School of Software and Microelectronics, Peking University, Beijing China

School of Software and Microelectronics, Peking University, Beijing China

0000-0003-2027-0369
View Profile

,
Weiping Li

School of Software and Microelectronics, Peking University, Beijing China

School of Software and Microelectronics, Peking University, Beijing China

0000-0003-2958-3097
View Profile

,
Tong Mo

School of Software and Microelectronics, Peking University, Beijing China

School of Software and Microelectronics, Peking University, Beijing China

0000-0002-3564-4610
View Profile

ACM Transactions on Multimedia Computing, Communications, and ApplicationsAccepted on April 2024https://doi.org/10.1145/3664816

Online AM:11 May 2024Publication History

ACM Transactions on Multimedia Computing, Communications, and Applications

Abstract

Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning for images and texts. The primary challenge for image-text retrieval is cross-modal semantic heterogeneity, where the semantic features of visual and textual modalities are rich but distinct. Scene graph is an effective representation for images and texts as it explicitly models objects and their relations. Existing scene graph based methods have not fully taken the features regarding various granularities implicit in scene graph into consideration (e.g. triplets), the inadequate feature matching incurs the absence of non-trivial semantic information (e.g. inner relations among triplets). Therefore, we propose a Semantic-Consistency Enhanced Multi-Level Scene Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained. Firstly, under the scene graph representation, we perform feature matching including low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Secondly, to enhance the semantic-consistency for object-fused triplets carrying key correlation information, we propose a dual-step constraint mechanism in mid-level matching. Thirdly, to guide the model to learn the semantic-consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performances with significant improvements.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision. 382–398.Google ScholarCross Ref
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6077–6086.Google ScholarCross Ref
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473(2014).Google Scholar
Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. 2023. Vlp: A survey on vision-language pre-training. Machine Intelligence Research 20, 1 (2023), 38–56.Google ScholarCross Ref
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12655–12663.Google ScholarCross Ref
Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15789–15798.Google ScholarCross Ref
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision. 104–120.Google ScholarDigital Library
Yuhao Cheng, Xiaoguang Zhu, Jiuchao Qian, Fei Wen, and Peilin Liu. 2022. Cross-modal graph matching network for image-text retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 4(2022), 1–23.Google ScholarDigital Library
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805(2018).Google Scholar
Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. 2021. Similarity reasoning and filtration for image-text matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 1218–1226.Google ScholarCross Ref
Yu Duan, Yun Xiong, Yao Zhang, Yuwei Fu, and Yangyong Zhu. 2021. Hsgmp: Heterogeneous scene graph message passing for cross-modal retrieval. In Proceedings of the 2021 International Conference on Multimedia Retrieval. 82–91.Google ScholarDigital Library
Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612(2017).Google Scholar
Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, and Jianqing Fan. 2022. Constructing phrase-level semantic labels to form multi-grained supervision for image-text retrieval. In Proceedings of the 2022 International Conference on Multimedia Retrieval. 137–145.Google ScholarDigital Library
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.Google ScholarCross Ref
Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. 2018. Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6163–6171.Google ScholarCross Ref
Zhong Ji, Kexin Chen, and Haoran Wang. 2021. Step-wise hierarchical alignment network for image-text matching. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. 765–771.Google ScholarCross Ref
Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3668–3678.Google ScholarCross Ref
Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882(2014).Google Scholar
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations. 1–14.Google Scholar
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision. 201–216.Google ScholarDigital Library
Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11336–11344.Google ScholarCross Ref
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning. 19730–19742.Google Scholar
Kun Li, Jiaxiu Li, Dan Guo, Xun Yang, and Meng Wang. 2023. Transformer-based visual grounding with cross-modality interaction. ACM Transactions on Multimedia Computing, Communications and Applications 19, 6(2023), 1–19.Google ScholarDigital Library
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In IEEE/CVF International Conference on Computer Vision. 4654–4662.Google ScholarCross Ref
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2022. Image-text embedding learning via visual and textual semantic reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1(2022), 641–656.Google ScholarCross Ref
Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. 2022. Clip-event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16420–16429.Google ScholarCross Ref
Wenhui Li, Song Yang, Qiang Li, Xuanya Li, and An-An Liu. 2023. Commonsense-guided semantic and relational consistencies for image-text retrieval. IEEE Transactions on Multimedia(2023).Google ScholarDigital Library
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision. 740–755.Google ScholarCross Ref
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph structured network for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10921–10930.Google ScholarCross Ref
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2023).Google Scholar
Jian Liu, Yufeng Chen, and Jinan Xu. 2022. Multimedia event extraction from news with a unified contrastive learning framework. In Proceedings of the 30th ACM International Conference on Multimedia. 1945–1953.Google ScholarDigital Library
Yongfei Liu, Bo Wan, Xiaodan Zhu, and Xuming He. 2020. Learning cross-modal context graph for visual grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11645–11652.Google ScholarCross Ref
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019).Google Scholar
Manh-Duy Nguyen, Binh T Nguyen, and Cathal Gurrin. 2021. A deep local and global scene-graph matching for image-text retrieval. arXiv preprint arXiv:2106.02400(2021).Google Scholar
Jiaming Pei, Kaiyang Zhong, Zhi Yu, Lukun Wang, and Kuruva Lakshmanna. 2023. Scene graph semantic inference for image and text matching. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 5(2023), 1–23.Google ScholarDigital Library
Liang Peng, Yang Yang, Zheng Wang, Zi Huang, and Heng Tao Shen. 2020. Mra-net: Improving vqa via multi-modal relation attention network. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 1(2020), 318–329.Google ScholarDigital Library
Clifton Poth, Jonas Pfeiffer, Andreas Rücklé, and Iryna Gurevych. 2021. What to pre-train on? Efficient intermediate task selection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 10585–10605.Google ScholarCross Ref
Jun Rao, Fei Wang, Liang Ding, Shuhan Qi, Yibing Zhan, Weifeng Liu, and Dacheng Tao. 2022. Where does the performance improvement come from? -A reproducibility concern about image-text retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2727–2737.Google Scholar
Zhangxiang Shi, Tianzhu Zhang, Xi Wei, Feng Wu, and Yongdong Zhang. 2022. Decoupled cross-modal phrase-attention network for image-sentence matching. IEEE Transactions on Image Processing(2022).Google Scholar
Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. 6105–6114.Google Scholar
Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3716–3725.Google ScholarCross Ref
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).Google Scholar
Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, and Xilin Chen. 2020. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1508–1517.Google ScholarCross Ref
Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, and Jure Leskovec. 2023. Vqa-gnn: Reasoning with multimodal knowledge via graph neural networks for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 21582–21592.Google ScholarCross Ref
Yun Wang, Tong Zhang, Xueya Zhang, Zhen Cui, Yuge Huang, Pengcheng Shen, Shaoxin Li, and Jian Yang. 2021. Wasserstein coupled graph learning for cross-modal retrieval. In 2021 IEEE/CVF International Conference on Computer Vision. 1793–1802.Google ScholarCross Ref
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. 2019. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6609–6618.Google ScholarCross Ref
Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10685–10694.Google ScholarCross Ref
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.Google ScholarCross Ref
Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5831–5840.Google ScholarCross Ref
Pengpeng Zeng, Lianli Gao, Xinyu Lyu, Shuaiqi Jing, and Jingkuan Song. 2021. Conceptual and syntactical cross-modal alignment with cross-level consistency for image-text matching. In Proceedings of the 29th ACM International Conference on Multimedia. 2205–2213.Google ScholarDigital Library

Index Terms

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Visual content-based indexing and retrieval

Recommendations

Cross-modal Graph Matching Network for Image-text Retrieval
Image-text retrieval is a fundamental cross-modal task whose main idea is to learn image-text matching. Generally, according to whether there exist interactions during the retrieval process, existing image-text retrieval methods can be classified into ...
Read More
ISES: Instance-level Semantic Enhanced Score Model for Image-Text Retrieval
AAIA '23: Proceedings of the 2023 International Conference on Advances in Artificial Intelligence and Applications

Image-Text Retrieval (ITR) is an important measure of the performance of text and image mutual retrieval, by searching for semantically relevant information in the relevant modality and then matching the corresponding modality, the key difficulty lies in ...
Read More
Semantic Completion: Enhancing Image-Text Retrieval with Information Extraction and Compression
Advances in Knowledge Discovery and Data Mining
Abstract
Image-text retrieval is an essential branch in the field of information retrieval, facing the serious challenge of the cross-modal semantic gap. Although significant progress has been made in recent years, most research has ignored an essential ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications Just Accepted
ISSN:1551-6857
EISSN:1551-6865
Table of Contents

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Online AM: 11 May 2024
- Accepted: 26 April 2024
- Revised: 20 March 2024
- Received: 28 November 2023
Published in tomm Just Accepted

Check for updates
Author Tags
Image-Text Retrieval
Scene Graph Matching
Image-Text Similarity Reasoning
Cross-Modal Alignment
Contrastive Learning
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 101
  Total Downloads
- Downloads (Last 12 months)101
- Downloads (Last 6 weeks)101
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

ACM Transactions on Multimedia Computing, Communications, and Applications

Abstract

References

Cited By

Index Terms

Recommendations

Cross-modal Graph Matching Network for Image-text Retrieval

ISES: Instance-level Semantic Enhanced Score Model for Image-Text Retrieval

Semantic Completion: Enhancing Image-Text Retrieval with Information Extraction and Compression

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

ACM Transactions on Multimedia Computing, Communications, and Applications

Abstract

References

Cited By

Index Terms

Recommendations

Cross-modal Graph Matching Network for Image-text Retrieval

ISES: Instance-level Semantic Enhanced Score Model for Image-Text Retrieval

Semantic Completion: Enhancing Image-Text Retrieval with Information Extraction and Compression

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media