Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

Wang, Jingyi; Huang, Jinfa; Zhang, Can; Deng, Zhidong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2305.08522 (cs)

[Submitted on 15 May 2023]

Abstract: Dynamic scene graphs generated from video clips could help enhance the semantic visual understanding in a wide range of challenging tasks such as environmental perception, autonomous navigation, and task planning of self-driving vehicles and mobile robots. In the process of temporal and spatial modeling during dynamic scene graph generation, it is particularly intractable to learn time-variant relations in dynamic scene graphs among frames. In this paper, we propose a Time-variant Relation-aware TRansformer (TR$^2$), which aims to model the temporal change of relations in dynamic scene graphs. Explicitly, we leverage the difference of text embeddings of prompted sentences about relation labels as the supervision signal for relations. In this way, cross-modality feature guidance is realized for the learning of time-variant relations. Implicitly, we design a relation feature fusion module with a transformer and an additional message token that describes the difference between adjacent frames. Extensive experiments on the Action Genome dataset prove that our TR$^2$ can effectively model the time-variant relations. TR$^2$ significantly outperforms previous state-of-the-art methods under two different settings by 2.1% and 2.6% respectively.
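
The explicit supervision idea in the abstract (using the difference between text embeddings of prompted relation sentences as a target for how relation features change between frames) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the prompt template, the letter-frequency `toy_embed` stand-in for a real text encoder, the function names, and the mean-squared-error loss are hypothetical and are not taken from the paper, whose actual encoder, prompts, and loss are not given on this page.

```python
import math

DIM = 26  # one slot per letter a-z; a toy choice, not from the paper


def prompt(relation, obj):
    """Hypothetical prompt template turning a relation label into a sentence."""
    return f"a person is {relation} a {obj}"


def toy_embed(sentence):
    """Stand-in for a real text encoder: L2-normalized letter counts."""
    vec = [0.0] * DIM
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def text_difference(rel_prev, rel_next, obj):
    """Text-embedding change between the relation labels of two adjacent frames."""
    e_prev = toy_embed(prompt(rel_prev, obj))
    e_next = toy_embed(prompt(rel_next, obj))
    return [b - a for a, b in zip(e_prev, e_next)]


def difference_loss(visual_prev, visual_next, target):
    """Assumed loss: MSE between the visual feature change and the text target."""
    visual_diff = [b - a for a, b in zip(visual_prev, visual_next)]
    return sum((v - t) ** 2 for v, t in zip(visual_diff, target)) / len(target)


# Relation unchanged between frames: the target difference is the zero vector.
unchanged = text_difference("holding", "holding", "cup")

# Relation changes between frames: the target difference is nonzero.
changed = text_difference("holding", "touching", "cup")

# Visual features that do not move are penalized when the relation changed.
static = [0.0] * DIM
loss_static = difference_loss(static, static, changed)
```

In this sketch the target is zero whenever the relation label stays the same across adjacent frames, so only genuine relation changes push the visual relation features to change. A real system would replace `toy_embed` with a pretrained text encoder and would likely project visual and text features into a shared space first.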

Comments:	Preprint. Accepted by ICRA 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.08522 [cs.CV]
	(or arXiv:2305.08522v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.08522

Submission history

From: Jingyi Wang
[v1] Mon, 15 May 2023 10:30:38 UTC (2,863 KB)
