Scene Graph Generation: A Comprehensive Survey

Deep learning techniques have led to remarkable breakthroughs in generic object detection and have spawned many scene-understanding tasks in recent years. The scene graph has been a focus of research because of its powerful semantic representation and its applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image into a semantic structural scene graph, which requires the correct labeling of detected objects and their relationships. Although this is a challenging task, the community has proposed many SGG approaches and achieved good results. In this paper, we provide a comprehensive survey of recent achievements in this field brought about by deep learning techniques. We review 138 representative works that cover different input modalities, and systematically summarize existing methods of image-based SGG from the perspective of feature extraction and fusion. We attempt to connect and systematize the existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. Finally, we finish this survey with in-depth discussions of current problems and future research directions. This survey will help readers develop a better understanding of the current research status and ideas.


INTRODUCTION
The ultimate goal of computer vision (CV) is to build intelligent systems, which can extract valuable information from digital images, videos, or other modalities as humans do. In the past decades, machine learning (ML) has significantly contributed to the progress of CV. Inspired by the ability of humans to interpret and understand visual scenes effortlessly, visual scene understanding has long been advocated as the holy grail of CV and has already attracted much attention from the research community.
Visual scene understanding includes numerous subtasks, which can be generally divided into two parts: recognition and application tasks. These recognition tasks can be described at several semantic levels. Most of the earlier works, which mainly concentrated on image classification, only assign a single label to an image, e.g., an image of a cat or a car, and go further in assigning multiple annotations without localizing where in the image each annotation belongs [38]. A large number of neural network models have emerged and even achieved near human-like performance in image classification tasks [27], [29], [33], [34]. Furthermore, several other complex tasks, such as semantic segmentation at the pixel level, and object detection and instance segmentation at the instance level, have suggested the decomposition of an image into foreground objects vs. background clutter. The pixel-level tasks aim at classifying each pixel of an image into a class (or category) [37]. The instance-level tasks focus on detecting and recognizing individual objects in the given scene and delineating an object with a bounding box or a segmentation mask, respectively. A recently proposed approach named Panoptic Segmentation (PS) takes into account both per-pixel class and instance labels [32].
With the advancement of deep neural networks (DNN), we have witnessed important breakthroughs in object-centric tasks and various commercialized applications based on existing state-of-the-art models [17], [19], [21], [22], [23]. However, scene understanding goes beyond the localization of objects. The higher-level tasks lay emphasis on exploring the rich semantic relationships between objects, as well as the interaction of objects with their surroundings, such as visual relationship detection (VRD) [15], [24], [26], [41] and human-object interaction (HOI) [14], [16], [20]. These tasks are equally significant and more challenging. To a certain extent, their development depends on the performance of individual instance recognition techniques. Meanwhile, a deeper semantic understanding of image content can also contribute to visual recognition tasks [2], [6], [36], [39], [120]. Divvala et al. [40] investigated various forms of context models, which can improve the accuracy of object-centric recognition tasks. In the last few years, researchers have combined computer vision with natural language processing (NLP) and proposed a number of advanced research directions, such as image captioning, visual question answering (VQA), visual dialog and so on. These vision-and-language topics require a rich understanding of our visual world and offer various application scenarios for intelligent systems. Although rapid advances have been achieved in scene understanding at all levels, there is still a long way to go. Overall perception and effective representation of information are still bottlenecks. As indicated by a series of previous works [1], [44], [191], building an efficient structured representation that captures comprehensive semantic knowledge is a crucial step towards a deeper understanding of visual scenes. Such a representation can not only offer contextual cues for fundamental recognition challenges, but also provide a promising alternative for high-level intelligent vision tasks. The scene graph, proposed by Johnson et al. [1], is a visually-grounded graph over the object instances in a specific scene, where the nodes correspond to object bounding boxes with their object categories, and the edges represent their pair-wise relationships.

Fig. 1: A visual illustration of a scene graph structure and some of its applications. Scene graph generation models take an image as input and generate a visually-grounded scene graph. An image caption can be generated from a scene graph directly. Conversely, image generation models invert the process by generating realistic images from a given sentence or scene graph. Referring Expression (REF) comprehension marks the region of the input image corresponding to a given expression, where the region and the expression map to the same subgraph of the scene graph. Scene graph-based image retrieval takes a query as input and treats retrieval as a scene graph matching problem. For Visual Question Answering (VQA), the answer can sometimes be found directly on the scene graph, and even for more complex visual reasoning the scene graph is helpful.
Because of its structured abstraction and greater semantic representation capacity compared to image features, the scene graph has an inherent potential to tackle and improve other vision tasks. As shown in Fig. 1, a scene graph parses an image into a simple and meaningful structure and acts as a bridge between the visual scene and its textual description. Many tasks that combine vision and language can be handled with scene graphs, including image captioning [3], [12], [18], visual question answering [4], [5], content-based image retrieval (CBIR) [1], [7], image generation [8], [9] and referring expression comprehension (REF) [35]. Some tasks take an image as input, parse it into a scene graph, and then generate a reasonable text as output.
Other tasks invert the process by extracting scene graphs from the text description and then generating realistic images or retrieving the corresponding visual scene. Xu et al. [199] have produced a thorough survey on scene graph generation, which analyses SGG methods based on five typical models (CRF, TransE, CNN, RNN/LSTM and GNN) and also discusses the important contributions made by prior knowledge. Moreover, a thorough investigation of the main applications of scene graphs was presented. Nevertheless, in our survey paper, we provide a structure and an analysis from a quite different perspective, covering scene graph generation from 2D input, 3D input and video input, respectively. In particular, for 2D scene graph generation, explicit and implicit features, as well as three conceptual methodologies, are analysed. Furthermore, a more careful and thorough discussion of the typical datasets, along with the respective performance evaluations, is presented. We provide a comprehensive and systematic review of recent research outputs on scene graph generation. We first focus on generating scene graphs from static images and then discuss other input modalities. We provide a survey of 138 papers on SGG¹, which have appeared since 2016 in the leading computer vision, pattern recognition, and machine learning conferences and journals. Our goal is to help the reader study and understand this research topic, which has gained significant momentum in the past few years. The main contributions of this article are as follows: 1) We cover almost all the contemporary literature related to this area, and present a comprehensive review of 138 papers on scene graph generation. These papers are classified by input modality (i.e., image, video and 3D mesh). 2) We propose a general framework for 2D scene graph generation from a global perspective, and analyze these methods from the point of view of feature extraction and refinement. 3) We provide insightful analysis of all aspects of scene graph generation, including the generation frameworks, object and relationship feature representation, input modalities, training data, training strategies and evaluation metrics.
The rest of this paper is organized as follows: Section 2 gives the definition of a scene graph and thoroughly analyses the characteristics of visual relationships and the structure of a scene graph. Section 3 surveys scene graph generation methods, which are classified and summarized from the perspective of feature fusion. Section 4 summarizes almost all currently published datasets. Section 5 compares and discusses the performance of some key methods on the most commonly used datasets. Finally, Section 6 summarizes open problems in the current research and discusses potential future research directions. Section 7 concludes the paper with some important remarks.

¹ We provide a curated list of scene graph papers, publicly available at https://github.com/mqjyl/awesome-scene-graph

SCENE GRAPH
A scene graph is a structural representation, which can capture detailed semantics by explicitly modeling objects ("man", "fire hydrant", "shorts"), attributes of objects ("fire hydrant is yellow"), and relationships between paired objects ("man jumping over fire hydrant"), as shown in Fig. 1. Therefore, the fundamental elements of a scene graph are objects, attributes and relationships, and its basic substructures are relationship triplets. Objects are the core building blocks of an image and can be located with a set of bounding boxes (BB). An object can have zero or more attributes, which can be colors (e.g., yellow), states (e.g., standing), materials (e.g., wooden), etc. Relations are what bind relationship triplets together, each connecting two objects. These relations can be actions (e.g., jumping over), spatial relations (e.g., is behind), descriptive verbs (e.g., wear), prepositions (e.g., with), comparatives (e.g., taller than), prepositional phrases (e.g., drive on), etc. [10], [28], [30], [110]. In short, a scene graph is a set of visual triples in the form of (subject, relation, object) or (object, is, attribute). The latter is also considered a relationship; we use the "is" relation for uniformity [10], [11].
In this survey paper, we focus on the triplet description of a static scene. Given a visual scene $S \in \mathcal{S}$ [62], such as an image, a video or a 3D mesh, its scene graph is a set of visual triples $R_S$, where $O_S$ is the object set, $A_S$ is the attribute set and $P_S$ is the relation set, which includes the "is" relation $p_{S,is}$ in which only one object is involved. Each object $o_{S,k} = (l_{S,k}, b_{S,k}) \in O_S$ has a semantic label $l_{S,k} \in O_l$ and is grounded with a bounding box (BB) $b_{S,k}$ in scene $S$, where $k \in \{1, \ldots, |O_S|\}$. Each relation $p_{S,i \to j} \in P_S \subseteq P$ is the core of a visual triple $r_{S,i \to j} = (o_{S,i}, p_{S,i \to j}, o_{S,j}) \in R_S$ with $i \neq j$, where the third element $o_{S,j}$ may instead be an attribute $a_{S,j} \subseteq A_S$ if $p_{S,i \to j}$ is $p_{S,is}$. As the relationship is one-way, we express $r_{S,i \to j}$ as $(s_{S,i}, p_{S,i \to j}, o_{S,j})$ to maintain semantic accuracy, where $s_{S,i}$ denotes the subject. From the point of view of graph theory, a scene graph is a directed graph with three types of nodes: object, attribute, and relation. However, for the convenience of semantic expression, a node of a scene graph is seen as an object with all its attributes, while a relation is called an edge, meaning that the two objects are neighbors. A subgraph can be formed around an object, made up of all the visual triplets involving that object. Therefore, the subgraph contains all the adjacent nodes of the object, and these adjacent nodes directly reflect the context of the object. From the top-down view, a scene graph can be broken down into several subgraphs, a subgraph can be split into several triplets, and a triplet can be split into individual objects with their attributes and relations. Accordingly, we can find a region in the scene corresponding to each substructure, be it a subgraph, a triplet, or an object. From this strict correspondence, a conclusion can be drawn: given a dataset with definite relation categories and object classes, the scene graph corresponding to a given scene is structurally unique (differences in semantic expression aside), though most of the time it is incomplete. This uniqueness supports the argument that using a scene graph as a replacement for a visual scene at the language level is reasonable.
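To make this definition concrete, the following minimal Python sketch represents a scene graph as typed objects and enumerates its visual triples; the class and field names are our own illustrative choices, not those of any SGG library.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SGObject:
    label: str                                # semantic label l_k, e.g. "man"
    bbox: Tuple[float, float, float, float]   # bounding box b_k = (x1, y1, x2, y2)
    attributes: List[str]                     # zero or more attributes, e.g. ["yellow"]

@dataclass
class SGRelation:
    subject: int    # index of the subject object in `objects`
    predicate: str  # relation label, e.g. "jumping over"
    object: int     # index of the object

@dataclass
class SceneGraph:
    objects: List[SGObject]
    relations: List[SGRelation]

    def triples(self):
        """Enumerate all visual triples, including (object, is, attribute)."""
        for rel in self.relations:
            yield (self.objects[rel.subject].label, rel.predicate,
                   self.objects[rel.object].label)
        for obj in self.objects:
            for attr in obj.attributes:
                yield (obj.label, "is", attr)

# Example corresponding to Fig. 1:
g = SceneGraph(
    objects=[SGObject("man", (30, 20, 120, 200), []),
             SGObject("fire hydrant", (100, 120, 160, 210), ["yellow"])],
    relations=[SGRelation(0, "jumping over", 1)],
)
print(list(g.triples()))
# [('man', 'jumping over', 'fire hydrant'), ('fire hydrant', 'is', 'yellow')]
```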
Compared with scene graphs, the well-known knowledge graph is represented as multi-relational data with enormous numbers of fact triples in the form of (head entity type, relation, tail entity type) [112], [180]. Here, we have to emphasize that the visual relationships in a scene graph are different from those in social networks and knowledge bases. In vision, images and their visual relationships are incidental and not intentionally constructed. In particular, visual relationships are usually image-specific because they only depend on the content of the particular image they appear in. Although a scene graph may be generated from a textual description in some language-to-vision tasks, such as image generation, the relationships in a scene graph are always situation-specific: each of them has a corresponding visual feature in the output image. Objects in scenes are not independent and tend to cluster. Sadeghi et al. [43] coined the term visual phrases to introduce composite intermediates between objects and scenes. This was the first time that researchers recognized the research value of visual relationships. Visual phrases, which integrate linguistic representations of relationship triplets, encode the interactions between objects and scenes.
Compositionality is the most important feature of a scene graph. On the surface, it is an elevation of scene semantic expression from independent objects to visual phrases. There is, however, a deeper meaning which can be interpreted in two aspects: the frequency of a visual phrase and the common-sense constraints on relationship prediction. For example, when "man", "horse" and "hat" are detected individually in an image, the most likely visual triples are ("man", "ride", "horse"), ("man", "wearing", "hat"), etc. ("hat", "on", "horse") is possible, though not common. But ("horse", "wearing", "hat") is normally unreasonable. From the above analysis, we can see that compositionality can indirectly reflect the possibility that a relationship is formed between two objects and the probabilities of relationship categories if they exist. Therefore, it is instructive for relationship reasoning and object recognition. Zellers et al. [42] presented a fine-grained description of compositionality by examining the structural repetitions of motifs, which are small connected subgraphs with a well-defined structure. These repetitions are mainly reflected in three aspects. First, different kinds of objects appear at different frequencies in a dataset, and the range of attributes varies between different types of objects. For instance, the most common attribute of clothes is color, but for a person, it is a state. Second, there are strong regularities in the local graph structure such that the distribution of relations is highly skewed for the corresponding object categories. In prior examples, relations occur easily between person ("man") and clothes ("hat"), but not between animals ("horse") and clothes ("hat").
Moreover, the most likely relation is "wearing" when object categories are person and clothes. Third, structural patterns exist even in larger subgraphs. Statistically, over 50% of images in the Visual Genome [30] dataset have at least one motif involving two relationships [42].
The relationships described above are static and instantaneous because the information is grounded in an image that can only capture a specific moment or a certain scene. However, when video becomes the information carrier, a visual relation is no longer instantaneous but a time-varying process. A digital video consists of a series of images called frames, which means that relations span multiple frames and have different durations. Visual relationships in a video can be used to construct a Spatio-Temporal Scene Graph, which includes entity nodes that are neighbors in both the temporal and spatial dimensions. In the video domain, human-centric relationship detection and action recognition always reinforce each other [47], [48], [145]. Action detection aims to detect the actions of interest by spatio-temporally localizing the action subjects. It only cares about the person's features, not whether the action has a receiver. Although the action is a crucial type of human-centric relation, relationship detection additionally requires the system to visually understand both entities involved, of which the subject of the action is a person.

SCENE GRAPH GENERATION
Scene graph generation parses an image or an image sequence into a structured representation, aiming to bridge the gap between visual and semantic perception and, ultimately, to achieve a complete understanding of visual scenes. However, it is difficult to generate an accurate and complete scene graph. In general, scene graph generation is a bottom-up process in which entities constitute the triplets and then these triplets are connected to form the entire scene graph. Obviously, the essential task is to detect the visual relationships, i.e., (subject, relation, object) triples, abbreviated as (s, r, o).
Visual relationship detection has attracted the attention of the research community since the pioneering work by Lu et al. [28] and the release of the ground-breaking large-scale scene graph dataset Visual Genome (VG) by Krishna et al. [30]. Given a visual scene $S$ and its scene graph $T_S$ [31], [62]:
• $B_S = \{b_{S,1}, \ldots, b_{S,n}\} \subseteq \mathbb{R}^4$ is the region candidate set, with element $b_{S,i}$ denoting the bounding box of the $i$-th region.
• $O_S = \{o_{S,1}, \ldots, o_{S,n}\} \subseteq \mathbb{N}$ is the object set, with element $o_{S,i}$ denoting the corresponding class label of region $b_{S,i}$.
• $R_S = \{r_{S,1\to 2}, r_{S,1\to 3}, \ldots, r_{S,n\to n-1}\}$ is the relation set, with element $r_{S,i\to j}$ corresponding to a visual triple $t_{S,i\to j} = (s_{S,i}, r_{S,i\to j}, o_{S,j})$, where $s_{S,i}$ and $o_{S,j}$ denote the subject and object, respectively. This set also includes the "is" relation, in which only one object is involved.
When attribute detection and relationship prediction are considered as two independent processes, we can decompose the probability distribution of the scene graph $p(T_S|S)$ into four components, similar to [31]:

$p(T_S|S) = p(B_S|S)\, p(O_S|B_S, S)\, p(A_S|O_S, B_S, S)\, p(R_S|O_S, B_S, S).$

In this equation, the bounding box component $p(B_S|S)$ generates a set of candidate regions that cover most of the crucial objects directly from the input image. The object component $p(O_S|B_S, S)$ predicts the class label of each detected region. These two steps are exactly the same as in two-stage object detection methods and can be implemented with the widely used Faster R-CNN detector [17]. Conditioned on the predicted labels, the attribute component $p(A_S|O_S, B_S, S)$ infers all possible attributes of each object, while the relationship component $p(R_S|O_S, B_S, S)$ infers the relationship of each object pair [31]. Once we obtain all visual triplets, a scene graph is formed. Since attribute detection is generally regarded as an independent research topic, visual relationship detection and scene graph generation are often regarded as the same task. The probability of a scene graph $T_S$ can then be decomposed into three factors:

$p(T_S|S) = p(B_S|S)\, p(O_S|B_S, S)\, p(R_S|O_S, B_S, S).$

In the remainder of this section, we provide a detailed review of more than a hundred deep learning-based methods proposed up to 2020 on visual relationship detection and scene graph generation. Due to the great differences in input information, representation, sub-tasks and other aspects, works on 2D scene graphs, 3D scene graphs and spatio-temporal scene graphs are introduced separately.
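The factorization above suggests a simple two-stage inference procedure. The sketch below illustrates it with placeholder callables: `detector` and `predicate_classifier` are assumed interfaces standing in for a real detection and relation-classification model, not any particular library's API.

```python
from itertools import permutations
from collections import namedtuple

Detection = namedtuple("Detection", ["bbox", "label", "feature"])

def generate_scene_graph(image, detector, predicate_classifier, score_thresh=0.5):
    """Two-stage SGG following the factorization
    p(T_S|S) = p(B_S|S) * p(O_S|B_S,S) * p(R_S|O_S,B_S,S)."""
    # p(B_S|S) and p(O_S|B_S,S): detect and classify regions
    # (e.g. with a Faster R-CNN style detector).
    detections = [Detection(*d) for d in detector(image)]
    triples = []
    # p(R_S|O_S,B_S,S): score every ordered object pair against the predicate classes.
    for i, j in permutations(range(len(detections)), 2):
        subj, obj = detections[i], detections[j]
        predicate, score = predicate_classifier(subj, obj)
        if score >= score_thresh:
            triples.append((subj.label, predicate, obj.label, score))
    return triples
```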
Note: in this paper we use "relationship" or "triplet" to refer to a (subject, relation, object) tuple and "relation" or "predicate" for a relation element.

2D Scene Graph Generation
Currently, there are two approaches to generating scene graphs [13]. The mainstream approach follows a two-step pipeline that first detects objects and then solves a classification task to determine the relation of each object pair. The other approach jointly infers the objects and their relations based on the object region proposals. The main difference between these two approaches is whether the relation features are used to update the object features. To generate a complete scene graph, both approaches should detect all existing objects or object proposals in the image as far as possible, group them into pairs, and use the features of their union area (denoted as relation features) as the basic representation for predicate inference. In this section, we focus on the two-step approach; the basic framework for 2D scene graph generation is shown in Fig. 2. Given an image, a scene graph generation method generally first generates triplet proposals with a Region Proposal Network (RPN), although they are sometimes taken from ground-truth human annotations. Each triplet proposal is made up of subject, object and predicate ROIs, where the predicate ROI is the box that tightly covers both the subject and the object. Then, in the feature representation step, each object proposal yields appearance, spatial, label, depth and mask features, and each predicate proposal yields appearance, spatial, depth and mask features. These multimodal features are vectorized and can be combined and refined in the third step, feature refinement, using message passing mechanisms, attention mechanisms and visual translation embedding approaches. Finally, classifiers are used to predict the categories of objects and predicates, and the scene graph is generated.
In this section, SGG methods for 2D inputs will be reviewed and analyzed according to the following strategies.
First, we review the feature representation methods for subjects, objects and predicates in SGG. We classify these methods into "explicit features" and "implicit features". Explicit features correspond to "multimodal features", which include the appearance, spatial, label, depth and mask features of objects; these are reviewed in Section 3.1.1. Implicit features correspond to indirect and complementary features, including "prior information" (e.g., statistical priors, semantic association priors, language priors) and "commonsense knowledge", which are described in Section 3.1.2 and Section 3.1.3, respectively.
Second, the feature refinement methods for subjects, objects and predicates are presented. We categorise these methods into "Message Passing", "Attention Mechanism", "Visual Embedding" and others, which are analyzed in depth in Section 3.1.4, Section 3.1.5, Section 3.1.6 and Section 3.1.7, respectively.

Multimodal Features
The contingent, image-specific nature of visual relationships gives the visual feature a dominant position in relationship recognition. In fact, when thinking of the whole detection process as an end-to-end approach that only takes an image as input, the other features, including the spatial, label, depth and mask features, can be considered transformations of different intermediate forms of the visual features. The research focus of low-level object detection and instance segmentation is to obtain these transformations; the focus of SGG, however, is to use them. The simplest effective way to predict relationships in SGG is to take the original features or their concatenation (extracted in the feature extraction stage of the general process depicted in Fig. 2) as inputs to a classifier (e.g., CNN, MLP, SVM, etc.) that produces a confidence score for each relation category. In this section, we describe how to use the original features from three progressive aspects: "Appearance-Semantic", "Appearance-Semantic-Spatial", and "Appearance-Semantic-Spatial-Context".
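As a minimal illustration of this concatenate-and-classify baseline, the following sketch scores predicate classes from concatenated subject/object appearance, spatial and label-embedding features; the feature dimensions and layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ConcatPredicateClassifier(nn.Module):
    """Scores each predicate class from concatenated multimodal features."""
    def __init__(self, appear_dim=1024, spatial_dim=8, label_dim=200, num_predicates=50):
        super().__init__()
        # subject/object appearance + box geometry + subject/object label embeddings
        in_dim = 2 * appear_dim + spatial_dim + 2 * label_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, num_predicates),
        )

    def forward(self, subj_appear, obj_appear, spatial, subj_emb, obj_emb):
        x = torch.cat([subj_appear, obj_appear, spatial, subj_emb, obj_emb], dim=-1)
        return self.mlp(x)  # per-predicate confidence scores (logits)
```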
Appearance-Semantic Features: Sadeghi et al. [43] proposed visual phrases as richer-level visual composites and learned an appearance model to recognize each phrase as a whole. Lu et al. [28] formally proposed the VRD task on static images, together with the first VRD dataset, and further developed the first VRD method based on a deep convolutional neural network. They analyzed the semantic importance of visual relation recognition and leveraged language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship. Appearance-Semantic-Spatial Features: On this basis, Zhu et al. [83] explored the integration of the spatial distribution of objects to facilitate visual relation detection. The spatial distribution can not only reflect the positional relations of objects, but also describe structural information between them. They described the spatial distribution of objects using properties of regions, which comprise positional relations, size relations, distance relations and shape relations. Moreover, Sharifzadeh et al. [46] utilized 3D information in visual relation detection by synthetically generating depth maps with an RGB-to-Depth model incorporated within the relation detection framework. They extracted pairwise feature vectors, including depth, spatial, label and appearance features, and concatenated them as relation features for inference.
Inspired by the idea of object proposals, Zhang et al. [84] introduced Relationship Proposal Networks (Rel-PN). A 3-branch RPN is applied to produce a set of candidate boxes that represent subject, relationship, and object proposals, based on the fact that subjects and objects come from different distributions. A proposal selection module then selects the candidate pairs that satisfy spatial constraints. The resulting pairs are passed to two separate network modules designed to evaluate the relationship compatibility using visual and spatial criteria, respectively. Finally, the visual and spatial scores are combined with different weights, controlled by α, to obtain the final score for predicates. In another work [11], the authors added a semantic module to produce a semantic score for predicates; all three scores are then added up to obtain the overall score.
Liang et al. [24] also considered three types of features and proposed to cascade a multi-cue based convolutional neural network with a structural ranking loss function. For an input image $x$, they first extract the feature representations of the visual appearance cue, spatial location cue and semantic embedding cue for each relationship instance tuple $r = (s, p, o) \in R$. The learned features from the multiple cues are further concatenated and fused into a joint feature vector through one fully connected layer. Then the compatibility score between the object pair $(s, o)$ and predicate $p$ can be formulated as $\Phi(s, p, o) = \mathbf{w}_p^{\top} \mathbf{f}(s, o)$, where $\mathbf{f}(s, o)$ is the fused joint feature vector and $\mathbf{w}_p$ denotes the parameters to be learned for the $p$-th predicate. Appearance-Semantic-Spatial-Context Features: Previous studies typically extract features from a restricted object-object pair region and focus on local interaction modeling to infer the objects and pairwise relations. Xu et al. [25] further introduce global visual information into this procedure with Multi-Scale Context Modeling (MSCM), which contains two different modules: object-centric context and region-centric context. The 2D feature maps of regions and their corresponding objects' features are extracted by ROI-pooling and a fully-connected layer, respectively, and fed into the modules sequentially. The object-centric context module is designed to further encode the object-to-object and object-to-region interactions, while the region-centric context module encodes the region-to-region and region-to-object interactions. Moreover, bi-directional message propagation is implemented to reinforce the semantic learning between the two modules. The visual features and multi-scale context are finally integrated for scene graph inference.

Prior Information
The scene graph is a semantically structured description of a visual world. Intuitively, the SGG task can be regarded as a two-stage semantic tag retrieval process, and the determination of the relation category often depends on the labels of the two participating objects. We expounded on the compositionality of a scene graph in detail in Section 2. Although visual relationships are scene-specific, in a relationship triplet (s, p, o) there exist strong semantic statistical dependencies between the predicate p and the object categories s and o. These statistics can either be learned by the model during training or calculated directly from the training annotations. On the other hand, some predicates may occur many times while other predicates occur only once or twice throughout the whole dataset, so most visual relationships have insufficient training examples. This issue is well known as the long-tail distribution of relationships, which makes it costly to collect enough training images for all relationships [15], [90], [104], [107]. In this section, we describe how to use the prior information of a specific dataset in SGG from three aspects, i.e., "Statistical Priors", "Semantic Association Priors", and "Language Priors", respectively.
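As a concrete illustration, such statistics can be computed directly from the training annotations as an empirical conditional distribution over predicates; the sketch below is generic and not tied to any particular method.

```python
from collections import Counter, defaultdict

def build_frequency_prior(triplet_annotations):
    """triplet_annotations: iterable of (subject_class, predicate, object_class).
    Returns P(predicate | subject_class, object_class) estimated from counts."""
    pair_counts = Counter()
    triple_counts = defaultdict(Counter)
    for s, p, o in triplet_annotations:
        pair_counts[(s, o)] += 1
        triple_counts[(s, o)][p] += 1
    return {
        pair: {p: c / pair_counts[pair] for p, c in preds.items()}
        for pair, preds in triple_counts.items()
    }

prior = build_frequency_prior([("man", "riding", "horse"),
                               ("man", "feeding", "horse"),
                               ("man", "riding", "horse")])
print(prior[("man", "horse")])  # approximately {'riding': 0.67, 'feeding': 0.33}
```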
Statistical Priors: Baier et al. [87] described several link prediction methods, such as DistMult and ComplEx, for deriving the absolute frequency of a triplet (s, p, o) from the training data. However, Liao et al. [85] assume that it is an inherent semantic relationship connecting the two words, rather than a mathematical distance in the embedding space, that matters. They proposed to use a generic bi-directional RNN (BRNN) to predict the semantic connection between the participating objects in a relationship from the perspective of natural language. The BRNN takes three inputs in sequence (the subject word vector, relative spatial information and the object word vector), and produces one output (the predicate prediction).
Chen et al. [31] exploit the statistical co-occurrence probabilities of all possible relationships for each object pair using a graph gated neural network. For each object pair with predicted labels (a subject $o_i$ and an object $o_j$), they construct a graph with a subject node, an object node, and $K$ relation nodes. Each node $v \in V = \{o_i, o_j, r_1, r_2, \ldots, r_K\}$ has a hidden state $h_v^t$ at timestep $t$. Let $m_{o_i o_j r_k}$ denote the correlation between $o_i$ and relation node $r_k$, as well as between $o_j$ and relation node $r_k$. At timestep $t$, each relation node aggregates messages $a_v^t$ from the object nodes, while each object node aggregates messages from the relation nodes, with the messages weighted by the statistical correlations $m_{o_i o_j r_k}$. Then, the hidden state $h_v^t$ is updated from $a_v^t$ and its previous hidden state by a gated mechanism.
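A simplified sketch of one such gated update step is given below; the way the co-occurrence statistics weight the messages, and the dimensions, are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GatedRelationUpdate(nn.Module):
    """One step of gated message passing between two object nodes and K relation nodes."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, h_subj, h_obj, h_rel, m):
        # h_subj, h_obj: (hidden_dim,) hidden states of the two object nodes
        # h_rel: (K, hidden_dim) hidden states of the K relation nodes
        # m: (K,) statistical correlation between this object pair and each relation
        msg_to_rel = m.unsqueeze(1) * (h_subj + h_obj)      # (K, hidden_dim)
        msg_to_obj = (m.unsqueeze(1) * h_rel).sum(dim=0)    # (hidden_dim,)
        h_rel_new = self.gru(msg_to_rel, h_rel)
        h_subj_new = self.gru(msg_to_obj.unsqueeze(0), h_subj.unsqueeze(0)).squeeze(0)
        h_obj_new = self.gru(msg_to_obj.unsqueeze(0), h_obj.unsqueeze(0)).squeeze(0)
        return h_subj_new, h_obj_new, h_rel_new
```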
Semantic Association Priors: In another work, Zhang et al. [15] used semantic associations to compensate for infrequent classes on a large and imbalanced benchmark with an extremely skewed class distribution. They learned a visual and a semantic module that map features from the two modalities into a shared space, and employed a modified triplet loss to learn the joint visual and semantic embedding. For each positive visual-semantic pair $(x_i^l, y_i^l)$, $l \in \{s, p, o\}$, and its corresponding set of negative pairs $(x_i^l, y_{ij}^{l-})$, they calculate the similarities and feed them into a softmax layer followed by a triplet-softmax loss, so that the similarity of positive pairs is pushed towards 1 and that of negative pairs towards 0, i.e.:

$L_l = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s(x_i^l,\, y_i^l)}}{e^{s(x_i^l,\, y_i^l)} + \sum_{j=1}^{K} e^{s(x_i^l,\, y_{ij}^{l-})}},$

where $N$ is the number of positive ROIs, $K$ is the number of negative samples per positive ROI, and $s(\cdot, \cdot)$ is the cosine similarity function. Based on this work, Abdelkarim et al. [97] highlighted the long-tail recognition problem and adopted a weighted version of the softmax triplet loss above.
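In modern frameworks, a loss of this form reduces to a cross-entropy over one positive and K negative similarities. A small PyTorch sketch, assuming the cosine similarities have already been computed:

```python
import torch
import torch.nn.functional as F

def triplet_softmax_loss(pos_sim, neg_sim):
    """pos_sim: (N,) cosine similarities of the positive visual-semantic pairs.
    neg_sim: (N, K) cosine similarities of the K negative pairs per positive ROI."""
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)   # (N, 1 + K)
    targets = torch.zeros(pos_sim.size(0), dtype=torch.long,
                          device=pos_sim.device)                 # positive is index 0
    return F.cross_entropy(logits, targets)
```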
Language Priors: To further illustrate that language priors play a guiding role in relation recognition, we provide some other observations based on existing datasets, observations that have motivated a lot of research. First, the visual appearance of relationships that share the same predicate but have different agents varies greatly [26]. For instance, "television-on-wall" (Fig. 3a) and "cat-on-suitcase" (Fig. 3d) have the same predicate type "on", but they have distinct visual and spatial features. Second, the type of relation between two objects is determined not only by their relative spatial information but also by their categories. For example, the relative position between the kid and the horse (Fig. 3b) is very similar to that between the dog and the horse (Fig. 3e), but in natural language one prefers to describe the relationship as "dog-sitting on-horse" rather than "dog-riding-horse". It is also very rare to say "person-sitting on-horse". On the other hand, the relationships between observed objects are naturally based on our language knowledge. For example, we would use the expression "sitting on" or "playing" for a seesaw but not "riding" (Fig. 3c), even though the pose is very similar to that of "riding" the horse in Fig. 3b. Third, relationships are semantically similar when they appear in similar contexts. That is, in a given context, i.e., an object pair, the probabilities of different predicates describing this pair are related to their semantic similarity. For example, "person-ride-horse" (Fig. 3b) is similar to "person-ride-elephant" (Fig. 3f), since "horse" and "elephant" belong to the same animal category [28].
Lu et al. [28] proposed the first visual relationship detection pipeline, which leverages language priors (LP) to fine-tune the prediction. They scored each pair of object proposals $\langle O_1, O_2 \rangle$ using a visual appearance module and a language module. In the training phase, to optimize the projection function $f(\cdot)$ so that it projects similar relationships close to one another, they used a heuristic based on the semantic distance $d(r, r')$, defined as the sum of the cosine distances, in word2vec space, between the objects and the predicates of the two relationships $r$ and $r'$. Similarly, Plesse et al. [105] computed the similarity between each neighbor $r' \in \{r_1, \ldots, r_K\}$ and the query $r$ with a softmax function. Based on this LP model, Jung et al. [104] further summarized some major difficulties of visual relationship detection and performed extensive experiments on all possible models with variant modules. From the perspective of collective learning on multi-relational data, Hwang et al. [106] designed an efficient multi-relational tensor factorization algorithm that yields highly informative priors. The first step is to construct a relationship tensor $X \in \mathbb{R}^{n \times n \times m}$, where $X(i, j, k)$ contains the number of occurrences of the $i$-th object and the $j$-th object with the $k$-th predicate in the dataset. $X$ is a stack of $m$ matrices $X_k \in \mathbb{R}^{n \times n}$ for $k \in \{1, \ldots, m\}$, and each $X_k$ contains information about the $k$-th predicate among all objects. As the relationship tensor $X$ is extremely sparse, it is necessary to decompose and regularize it. The tensor-based relational module then refines the relationship estimation and regularizes the learning process of the LP module [28] as a dense relational prior. Based on the Alternating Least Squares (ALS) method, the decomposition model can be written as $X_k \approx A R_k A^{\top}$, where $A \in \mathbb{R}^{n \times r}$ of rank $r$ is the latent representation of the objects and $R_k \in \mathbb{R}^{r \times r}$ is the relationship-specific factor matrix, under the assumption that $r \leq n$. Analogously, Dupty et al. [107] learned conditional triplet joint distributions in the form of their normalized low-rank non-negative tensor decompositions.
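A small sketch of the semantic distance d(r, r') used by such language-prior models, assuming a dictionary `word_vec` that maps words to pre-trained embeddings (e.g., from word2vec):

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def relationship_distance(r1, r2, word_vec):
    """d(r, r') as the sum of cosine distances, in word-embedding space,
    between the two subjects, the two predicates and the two objects."""
    (s1, p1, o1), (s2, p2, o2) = r1, r2
    return (cosine_distance(word_vec[s1], word_vec[s2]) +
            cosine_distance(word_vec[p1], word_vec[p2]) +
            cosine_distance(word_vec[o1], word_vec[o2]))
```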
In addition, some other papers have also tried to mine the value of language prior knowledge for relationship prediction. Donadello et al. [108] encoded visual relationship detection with Logic Tensor Networks (LTNs), which exploit both the similarities with other seen relationships and background knowledge, expressed with logical constraints between subjects, relations and objects. In order to leverage the inherent structures of the predicate categories, Zhou et al. [184] proposed to firstly build the language hierarchy and then utilize the Hierarchy Guided Feature Learning (HGFL) strategy to learn better region features of both the coarse-grained level and the fine-grained level. Liang et al. [110] proposed a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in an image sample. Recently, Wen et al. [109] proposed the Rich and Fair semantic extraction network (RiFa), which is able to extract richer semantics and preserve the fairness for relations with imbalanced distributions.

Commonsense Knowledge
As previously stated, a number of models emphasize the importance of language priors. However, due to the long-tail distribution of relationships, it is costly to collect enough training images for all relationships [90]. On the other hand, existing datasets are biased in terms of object and relation labels, or often come with noisy and missing annotations (i.e., the annotators may miss some visual relationships with low saliency and use inconsistent or even incorrect words to represent the same subject-object pair or predicate), which makes the development of a reliable scene graph prediction model very challenging. Therefore, some researchers have proposed to extract commonsense knowledge to refine the object and phrase features and improve generalizability in scene graph generation. In this section, we analyze three fundamental sub-issues of commonsense knowledge applied to SGG, i.e., its source, formulation and usage, as illustrated in Fig. 4. To be specific, the source of commonsense is generally the internal training samples [88], an external knowledge base [89], or both [90], [91], and it can be transformed into different formulations [92]. It is mainly applied to refine the original features, but also to other typical procedures of the pipeline [93].
Source: When commonsense is mined from the internal training samples [88], co-occurrence statistics are collected directly from the annotations. The object co-occurrence probability can be computed as $m^{o}_{ij} = c_{ij} / d_i$, where $c_{ij}$ and $d_i$ represent the number of occurrences of the object pair $(o_i, o_j)$ and of object $o_i$, respectively, $N$ is the number of object classes, and $i, j \in N$. Similarly, $m_{pk}$ denotes an element of the conditional probability matrix $M_r \in \mathbb{R}^{P \times R}$ for relationships $\langle o_i - r_k - o_j \rangle$ and can be calculated as $m_{pk} = t_{pk} / e_{ij}$, where $t_{pk}$ and $e_{ij}$ represent the number of occurrences of the relationship given the subject-object pair and of the object pair itself, respectively. Here $P = N \times N$ is the number of object pairs with $p \in P$, and $R$ is the number of relationship classes with $k \in R$.
After traversing all the object pairs and triplets in the training set, the elements of these matrices are normalized and used to guide the message propagation for feature refinement. However, considering the tremendous amount of valuable information in large-scale external bases, e.g., Wikipedia and ConceptNet, increasing efforts have been devoted to distilling knowledge from these resources. Gu et al. [89] proposed a knowledge-based module, which improves the feature refinement procedure by reasoning over a basket of commonsense knowledge retrieved from ConceptNet. The entire framework can be divided into the following steps: (1) generate object and relation proposals; (2) refine object and subgraph features with external knowledge; (3) generate the scene graph; (4) reconstruct the input image via an additional generative path. Image reconstruction is performed to provide image-level supervision during training.
Yu et al. [90] introduced a Linguistic Knowledge Distillation Framework that obtains linguistic knowledge by mining both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), and then constructs a teacher network to distill the knowledge into a student network that predicts visual relationships from visual, semantic and spatial representations. The teacher network is constructed from the student network by optimizing the following criterion:

$t(Y) = \arg\min_{t}\ \mathrm{KL}\big(t(Y)\,\|\,s_{\Phi}(Y|X)\big) - C\,\mathbb{E}_{t}\big[L(X, Y)\big],$

where $t(Y)$ and $s_{\Phi}(Y|X)$ are the prediction results of the teacher and student networks, $C$ is a balancing term, and $\Phi$ is the parameter set of the student network. KL measures the KL-divergence between the teacher's and the student's prediction distributions. The linguistic knowledge $L(X, Y)$ is modeled by a conditional probability that encodes the strong correlation between the object pair $\langle subj, obj \rangle$ and the predicate. Zhan et al. [91] proposed a novel multi-modal feature based undetermined relationship learning network (MF-URLN), which extracts and fuses features of object pairs from three complementary modules: visual, spatial, and linguistic. The linguistic module provides two kinds of features: external and internal linguistic features. The former are the semantic representations of the subject and object generated by a word2vec model pretrained on Wikipedia 2014. The latter are the probability distributions of all relationship triplets in the training set according to the subject's and object's categories.
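Returning to the distillation idea of [90], the sketch below illustrates a generic student-side objective in which the student matches both the ground-truth labels and the teacher's soft predictions; the weighting scheme (`alpha`) and the exact combination are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, labels, alpha=0.5):
    """Train the student against the hard ground-truth labels and the
    knowledge-regularized teacher's soft predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                    reduction="batchmean")
    return alpha * hard + (1.0 - alpha) * soft
```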
Formulation: Apart from the actual source of knowledge, it is also important to consider the formulation and how to incorporate the knowledge in an efficient and comprehensive manner. As shown in several previous studies [31], [88], statistical correlation has been the most common formulation of knowledge: co-occurrence matrices over both the object pairs and the relationships are employed in an explicit way. Similarly, the linguistic knowledge in [91] is modeled by a conditional probability that encodes the strong correlation between the object pair $\langle subj, obj \rangle$ and the predicate. However, Lin et al. [92] pointed out that such formulations are generally composable, complex and image-specific, which leads to poor learning improvements. They proposed Atom Correlation Based Graph Propagation (ACGP) for the scene graph generation task. The key idea is to separate the relationship semantics to form new nodes and to decompose the conventional multi-dependency reasoning path $\langle subject - predicate - object \rangle$ into four different types of atom correlations, i.e., $\langle subject - object \rangle$, $\langle subject - predicate \rangle$, $\langle predicate - predicate \rangle$ and $\langle predicate - object \rangle$, which are much more flexible and easier to learn. This consequently results in four kinds of knowledge graphs, $T = (N, R)$, where $N$ denotes the visual concept set with respect to the different atom correlations, and $R$ denotes the co-occurrence probabilities between two visual concepts. Next, information propagation is performed using a graph convolutional network (GCN) under the guidance of the knowledge graphs, to produce the evolved node features $N_u \in \mathbb{R}^{C \times p}$:

$N_u = \sigma\big(A\, N_S\, W_{gp}\big),$

where $W_{gp} \in \mathbb{R}^{(p+e) \times p}$ is the trainable transformation matrix and $\sigma(\cdot)$ denotes the activation function. The visual features of the region proposals from the detector are projected to a global category space and concatenated with linguistic embeddings to form the initial node features $N_S$. $A \in \mathbb{R}^{C \times C}$ represents the node adjacency matrix, whose values are defined by the knowledge graph $T$, and $C$ is the total number of object categories and relationship categories. Usage: In general, commonsense knowledge is used to guide the refinement of the original features [88], [89], [92], but there are also attempts to exploit it in other parts of the scene graph model. Yao et al. [93] demonstrated a framework that can train scene graph models in an unsupervised manner, based on knowledge bases extracted from the triplets of Web-scale image captions. The relationships from the knowledge base are regarded as the potential relation candidates for the corresponding pairs. The first step is to align the knowledge base with the images and initialize a probability distribution $D_S$ over the candidates: for an object pair $(s, o)$ proposed by the detector, a relation $r_i$ receives non-zero probability only if it belongs to the set of relation labels retrieved from the knowledge base $\Lambda$, and zero otherwise. In every non-initial iteration $t$ ($t > 1$), this distribution is updated by a convex combination of the internal prediction from the scene graph model and the external semantic signals. Taking this distribution as distant labels, the model parameters $\theta_{t-1}$ are optimized by maximizing the log-likelihood of $D_S^t$, where the objective $L_p(\cdot)$ is an entropy-based log-likelihood function.
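A minimal sketch of the propagation step $N_u = \sigma(A N_S W_{gp})$, assuming the adjacency matrix and initial node features have already been constructed:

```python
import torch
import torch.nn as nn

class KnowledgeGraphPropagation(nn.Module):
    """One GCN layer propagating node features along a co-occurrence
    adjacency matrix A (C x C) built from a knowledge graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # trainable transformation W_gp
        self.act = nn.ReLU()

    def forward(self, node_feats, adj):
        # node_feats: (C, in_dim) initial node features N_S
        # adj: (C, C) co-occurrence probabilities from the knowledge graph T
        return self.act(adj @ self.W(node_feats))        # evolved node features N_u
```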
Inspired by hierarchical reasoning in the human prefrontal cortex, Yu et al. [94] built a Cognition Tree (CogTree) for all the relationship categories in a coarse-to-fine manner. For all samples with the same ground-truth relationship class, they predict the relationships with a biased model and calculate the distribution of the predicted label frequency, from which the hierarchical structure of the CogTree is built. The tree can be divided into four layers (root, concept, coarse-fine, fine-grained) and progressively divides coarse concepts with a clear distinction into fine-grained relationships that share similar features. On the basis of this structure, a model-independent loss function, the tree-based class-balanced (TCB) loss, is introduced into the training procedure. The loss is defined over the tracking path of the Cognition Tree, where the $k$-th node $S_k$ denotes the ground-truth node (a coarse-fine or fine-grained node) at layer $k$, with $k \in K$ and $K = 2$ or $3$, respectively; $z_{S_k}$ denotes the predicted probability of the ground-truth node, $z_j \in Z(S_{k-1})$ denotes the probability of the internal nodes, and $w_{S_k}$ is the class-balanced weight of the respective node. This loss suppresses the inter-concept and intra-concept noise in a hierarchical way, which finally contributes to an unbiased scene graph prediction.
Recently, Zareian et al. [95] proposed a Graph Bridging Network (GB-NET). This model is based on an assumption that a scene graph can be seen as an image-conditioned instantiation of a commonsense knowledge graph. They generalized the formulation of scene graphs into knowledge graphs where predicates are nodes rather than edges and reformulated SGG from object and relation classification into graph linking. The GB-NET is an iterative process of message passing inside a heterogeneous graph. It consists of a commonsense graph and an initialized scene graph connected by some bridge edges. The commonsense graph is made up of commonsense entity nodes (CE) and commonsense predicate nodes (CP), and the commonsense edges are compiled from three sources, WordNet, ConceptNet, and the Visual Genome training set. The scene graph is initialized with scene entity nodes (SE), i.e., detected objects, and scene predicate nodes (SP) for all entity pairs. Its goal is to create bridge edges between the two graphs that connect each instance (SE and SP node) to its corresponding class (CE and CP node).
Another work by Zareian et al. [96] closely matches the discussion of this section. It points out two specific issues of current research on knowledge-based SGG methods: (1) external sources of commonsense tend to be incomplete and inaccurate; (2) statistical information such as co-occurrence frequency is limited in revealing the complex, structured patterns of commonsense. Therefore, they proposed a novel mathematical formalization of visual commonsense and extracted it with a global-local attention multi-head transformer, implemented by training the encoders on a corpus of annotated scene graphs to predict the missing elements of a scene. Moreover, to compensate for the disagreement between commonsense reasoning and visual prediction, they disentangle commonsense and perception into two separately trained models and build a cascaded fusion architecture to balance the results. The commonsense is then used to adjust the final prediction.
This section has analyzed the knowledge-based scene graph generation methods from three different aspects: source (internal training samples and external large-scale knowledge bases), formulation (statistical correlation) and the novel applications of knowledge in different parts of the scene graph model. In general, external large-scale bases and specially-designed formulations of statistical correlation, as the main characteristics of commonsense knowledge, have drawn significant attention in recent years. However, [93], [94] have demonstrated that, apart from feature refinement, the knowledge can also contribute in other ways. Given its enriched information and its complex structure, which is similar to that of the scene graph, such knowledge may directly boost the reasoning path in the future. Moreover, the graph-based structure also makes it well suited to guide the message passing in GNN- and GCN-based scene graph generation methods.

Message Passing Mechanism
The essence of the message here is the feature, and message passing occurs between the elements of the scene graph, including objects and relationships. For a source element, the message is its own feature; for the target, it is contextual information. Intuitively, individual predictions of objects and relationships can benefit from their surrounding context. We can understand the effect of context at three levels. First, for a triplet, the predictions of its different phrase components depend on each other. This is the compositionality of the scene graph that we mentioned in Section 2. For example, the visual connection of a subject (man), which appears to be sitting on something, and an object (horse), with the appearance of a human on it, helps to strengthen the evidence for the predicate "ride". In return, the specific visual features of "ride" also help to infer the subject (man) and object (horse). Second, since triplets are not isolated, messages can be passed between them. Message passing at the subgraph level is based on the assumption that objects which have relationships are semantically dependent, and that relationships which share object(s) are also semantically related to each other and may even share appearance features. Third, visual relationships are image-specific, so learning feature representations from a global view is meaningful for relationship prediction. The global information is scattered over every object; specifically, for each object proposal generated by the RPN, all other proposals contain its contextual information. Therefore, several models are based on a wide range of message passing to refine object features and extract phrase features. We organize the discussion in this subsection from two main perspectives: the local propagation within triplet items and the global propagation among all the elements, as presented in Fig. 5.

Fig. 5: Message passing schemes, including the local propagation within triplet items [26] and the global propagation among all the elements. The global schemes, according to their specific prior layout structure, can be further divided into the following forms: fully-connected graph [59], [77], chain [73], [74] and tree [78], [79].
ViP-CNN, proposed by Li et al. [26], is a phrase-guided visual relationship detection framework, which can be divided into two parts: triplet proposal and phrase recognition. In the triplet proposal part, a VGG-Net is used to extract CNN features, which are then used to propose class-agnostic regions of interest (ROIs) with the RPN approach. These ROIs are grouped into triplet proposals, which are passed through triplet non-maximum suppression (triplet NMS) to reduce redundancy. The remaining triplets are used for the phrase recognition branch. In phrase recognition, each triplet proposal has three feature extraction branches, for the subject, object and predicate, respectively. First, three convolutional layers are applied on the shared feature map. Then the three proposals are fed into the ROI pooling layer to extract corresponding features of a fixed size. These features are processed by several subsequent fully-connected (FC) layers for category estimation and bounding box regression. Moreover, a phrase-guided message passing structure (PMPS) is introduced to exchange information between the branches in the convolutional and FC layers.
Dai et al. [49] proposed an effective framework called Deep Relational Network (DR-Net), which uses Faster R-CNN to locate a set of candidate objects. Each candidate object comes with a bounding box and an appearance feature. For each candidate pair of objects, the framework extracts the appearance feature of an enclosing box that encompasses both objects with a small margin, and the spatial feature represented as dual spatial masks derived from the bounding boxes. These two features are subsequently concatenated and further compressed via two fully-connected layers. This compressed pair feature, together with the appearance features of the individual objects, is fed to the DR-Net for joint inference. Through multiple inference units, whose parameters capture the statistical relations between triplet components, the DR-Net outputs the posterior probabilities of s, r, and o. The iterative updating procedure can be unrolled into a network that consists of a sequence of computing layers. At each step it takes in a fixed set of inputs, i.e., the observed features $x_s$, $x_r$, and $x_o$, and refines the estimates of the posterior probabilities.
Another interesting model is Zoom-Net [72], which propagates spatiality-aware object features to interact with the predicate features and broadcasts predicate features to reinforce the features of the subject and object. The core of Zoom-Net is a Spatiality-Context-Appearance Module, abbreviated as SCA-M, which consists of two spatiality-aware feature alignment cells (i.e., Contrast ROI Pooling and Pyramid ROI Pooling) for message passing between the different components of a triplet. Given the ROI-pooled features of the subject (S), predicate (P) and object (O), the SCA-M integrates local and global contextual information in a spatiality-aware manner.
Li et al. [41] developed an end-to-end Multi-level Scene Description Network (MSDN), which simultaneously detects objects, recognizes their relationships and predicts captions at salient image regions. First, ROIs of objects, phrases and region captions are generated. Then ROI-pooling is used to obtain their visual features. These features pass through two fully connected layers and then pass messages to each other. Finally, the refined features are used to classify objects and predicates and to generate captions. Message passing is guided by a dynamic graph constructed from the object and caption region proposals. For a phrase proposal, however, messages come from caption region proposals that might cover multiple object pairs and contain contextual information with a larger scope than a triplet. In comparison, the Context-based Captioning and Scene Graph Generation Network (C2SGNet) [73] also simultaneously generates region captions and scene graphs from input images, but the message passing between phrase and region proposals is unidirectional, i.e., the region proposals require additional context information about the relationships between object pairs. Moreover, in an extension of the MSDN model, Li et al. [13] proposed a subgraph-based scene graph generation approach called Factorizable Network (F-Net), where object pairs referring to similar interacting regions are clustered into a subgraph and share the phrase representation. The main process can be summarized in the following steps: (1) group the object proposals into pairs and establish a fully-connected graph, where every two objects are connected by two directed edges; (2) cluster the fully-connected graph into several subgraphs to obtain a factorized connection graph by treating each subgraph as a node; (3) employ ROI-pooling to obtain the corresponding appearance features for objects and subgraphs; (4) pass messages between subgraph and object features along the factorized connection graph with a Spatial-weighted Message Passing (SMP) structure for feature refinement; (5) recognize object categories and their relations (predicates) by fusing the subgraph features and object feature pairs.
Although MSDN and F-Net extended the scope of message passing, a subgraph is regarded as a whole to send and receive messages. Liao et al. [53] proposed the semantics guided graph relation neural network (SGRNN), in which the target and source must be an object or a predicate within a subgraph. It first establishes an undirected fully-connected graph by associating any two objects as a possible relationship. Then, the connections that are semantically weakly dependent are removed through a semantics guided relation proposal network (SRePN), and a semantically connected graph is formed. To refine the feature of a target entity (object or relationship), source-target-aware message passing is performed by exploiting contextual information from the objects and relationships that the target is semantically correlated with. The source entities of a target are its neighborhood and second-order neighborhood in the expanded semantically connected graph (considering the possible relationships as nodes). The scope of messaging is the same as the feature inter-refinement of objects and relations in [89].
When all the other objects are considered as carriers of global contextual information for each object, they pass messages to each other throughout a fully-connected graph. However, inference on a densely connected graph is very expensive. As shown in previous works [64], [65], dense graph inference can be approximated by mean field in Conditional Random Fields (CRF). Moreover, Johnson et al. [1] designed a CRF model that reasons about the connections between an image and its ground-truth scene graph, and used these scene graphs as queries to retrieve images with similar semantic meanings. Zheng et al. [66], [67] combined the strengths of CNNs and CRFs, and formulated mean-field inference as a Recurrent Neural Network (RNN). Therefore, it is reasonable to use CRFs or RNNs to formulate the scene graph generation problem [49], [56].
Cong et al. [56] adopted CRFs for scene graph inference, which can be formulated as finding the optimal assignment $x^* = \arg\max_x P(x)$, where $P(x)$ takes the form of a Gibbs distribution $P(x) = \frac{1}{Z}\exp\left(-E(x)\right)$ and the Gibbs energy $E(x) = \sum_i \psi_u(x_i) + \sum_{i,j} \psi_p(x_i, x_j)$ is composed of unary and pairwise potentials. The unary potential $\psi_u(x_i)$ measures the cost of assigning a label to the $i$-th node $x_i$ (an object instance node or a relationship node), and the pairwise potential $\psi_p(x_i, x_j)$ measures the cost of that assignment given the label assignment $x_j$ of the $j$-th node, where the $j$-th node is one of the 1-hop neighbors of the $i$-th node. The pairwise potential $\psi_p(x_i, x_j)$, as the message to be passed, is calculated based on the label word embeddings of the neighbors. Xu et al. [54] also used mean field to approximate the scene graph inference procedure, passing messages containing contextual information between a pair of bipartite sub-graphs of the scene graph. Since objects (nodes) and relations (edges) appear alternately in a scene graph, they are divided into two sets that form two disjoint sub-graphs which are essentially dual to each other. The primal graph defines channels for messages to pass from edge GRUs to node GRUs, while the dual graph defines channels for messages to pass from node GRUs to edge GRUs.
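A minimal sketch of this kind of alternating node/edge GRU refinement (the aggregation here is a plain mean over incoming messages; the learned pooling weights and gating details of [54] are omitted, so this is an assumption-laden simplification rather than the original model):

```python
import torch
import torch.nn as nn

class BipartiteMessagePassing(nn.Module):
    """Alternating refinement of node (object) and edge (relation) hidden
    states with two GRU cells, in the spirit of the primal/dual graphs above."""
    def __init__(self, dim=512, steps=2):
        super().__init__()
        self.node_gru = nn.GRUCell(dim, dim)
        self.edge_gru = nn.GRUCell(dim, dim)
        self.steps = steps

    def forward(self, node_h, edge_h, edges):
        # node_h: (N, dim), edge_h: (E, dim)
        # edges: list of (subject_index, object_index) for each edge
        for _ in range(self.steps):
            # edge -> node messages: mean over edges incident to each node
            incoming = [[] for _ in range(node_h.size(0))]
            for e, (s, o) in enumerate(edges):
                incoming[s].append(edge_h[e])
                incoming[o].append(edge_h[e])
            node_msg = torch.stack([
                torch.stack(msgs).mean(0) if msgs else torch.zeros_like(node_h[0])
                for msgs in incoming])
            node_h = self.node_gru(node_msg, node_h)

            # node -> edge messages: mean of the two endpoint states
            edge_msg = torch.stack([(node_h[s] + node_h[o]) / 2 for s, o in edges])
            edge_h = self.edge_gru(edge_msg, edge_h)
        return node_h, edge_h
```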
The iterative message passing model of Xu et al. [54] is considered a milestone in scene graph generation, demonstrating that RNN-based models can be used to encode the contextual cues for visual relationship recognition. Building on this idea, Zellers et al. [42] presented a novel model, Stacked Motif Network (MOTIFNET), which uses LSTMs to create a contextualized representation of each object. Given a potentially large set of predicted bounding boxes $B_S$, to model objects ($p(O_S \mid B_S, S)$ in Eq. 2), they linearize $B_S$ into a sequence and use a bidirectional LSTM to create a contextualized representation of each box:
$$C = \mathrm{biLSTM}\big([f_i; l_i]_{i=1,\dots,n}\big),$$
where $f_i$ and $l_i$ are the visual and the semantic features of the $i$-th object proposal, respectively. Dhingra et al. [55] proposed an object communication module based on a bi-directional GRU layer and used two different transformer encoders to: (1) further refine the object features and directly output the respective object class predictions; (2) gather information for the edges, with its input being the final output of the former encoder after 6 repetitions. Using frequency softening and bias adaptation to deal with the long-tailed distribution, the final relationship prediction is performed with the union feature of the subject-object pair and the encoded object features from the earlier transformer modules.
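A minimal sketch of the bidirectional-LSTM object context encoding described above (feature dimensions and the box ordering are illustrative assumptions; MOTIFNET additionally decodes object labels and builds a separate edge context, which are omitted here):

```python
import torch
import torch.nn as nn

class ObjectContextEncoder(nn.Module):
    """Encodes a sequence of object proposals with a bidirectional LSTM,
    producing a contextualized representation for each box."""
    def __init__(self, vis_dim=4096, emb_dim=200, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(vis_dim + emb_dim, hidden,
                            bidirectional=True, batch_first=True)

    def forward(self, vis_feats, label_embs):
        # vis_feats: (num_boxes, vis_dim) ROI features f_i
        # label_embs: (num_boxes, emb_dim) embeddings l_i of predicted labels
        x = torch.cat([vis_feats, label_embs], dim=-1).unsqueeze(0)
        ctx, _ = self.lstm(x)            # (1, num_boxes, 2 * hidden)
        return ctx.squeeze(0)            # contextualized representation per box
```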
The Counterfactual critic Multi-Agent Training (CMAT) approach [116] is another important extension, in which an agent represents a detected object. Each agent communicates with the others for T rounds to encode the visual context. In each round of communication, an LSTM is used to encode the agent interaction history and extract the internal state of each agent. Many other message passing methods based on RNNs have been developed. Chen et al. [74] used an RNN module to capture instance-level context, including object co-occurrence, spatial location dependency and label relations. Dai et al. [76] used a bi-directional RNN and Shin et al. [73] used a bi-directional LSTM as a replacement. Masui et al. [75] proposed three triplet units (TUs) for selecting a correct SPO triplet at each step of an LSTM, while achieving class-number scalability by outputting a single fact without calculating a score for every combination of SPO triplets. Note that almost all existing methods exploit the visual context by modeling message passing between objects. To construct a high-quality scene graph, a prior layout structure of proposals (objects and unions) forms the basis of modeling. It can be summed up in four forms: triplet set, chain, tree and fully-connected graph. Accordingly, RNNs and their variants (LSTM, GRU), as sequential models, are used to encode context for chains, while TreeLSTM [79] is used for trees and GNNs (or CRFs) [59], [60], [77] for graphs.
Tang et al. [78] constructed a dynamic tree structure, dubbed VCTREE, that places objects into a visual context, and then adopted a bidirectional TreeLSTM to encode the visual contexts. VCTREE construction can be divided into three stages: (1) learn a score matrix $S$, where each element is defined as the product of the object correlation and the pairwise task-dependency; (2) obtain a maximum spanning tree using Prim's algorithm, with a root $i$ satisfying $\arg\max_i \sum_{j \neq i} S_{ij}$; (3) convert the multi-branch tree into an equivalent binary tree (i.e., VCTREE) by changing non-leftmost edges into right branches. The ways of context encoding and decoding for objects and predicates are similar to [42], but LSTM is replaced with TreeLSTM. In [42], Zellers et al. tried several ways to order the bounding regions in their analysis. Here, the tree structure in VCTREE can be seen as another way to order the bounding regions.
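A minimal sketch of stage (2), extracting a maximum spanning tree from a learned score matrix with Prim's algorithm (the score matrix below is an arbitrary symmetric toy example; root selection and tie handling are simplified assumptions):

```python
import numpy as np

def max_spanning_tree(S):
    """Prim's algorithm on a score matrix S (higher score = stronger link).
    Returns the root index and a parent array defining the tree."""
    S = S.copy()
    np.fill_diagonal(S, 0)                      # ignore self-scores
    n = S.shape[0]
    root = int(np.argmax(S.sum(axis=1)))        # root = argmax_i sum_{j != i} S_ij
    in_tree = {root}
    parent = [-1] * n
    while len(in_tree) < n:
        best, best_score = None, -np.inf
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and S[i, j] > best_score:
                    best, best_score = (i, j), S[i, j]
        i, j = best
        parent[j] = i                           # attach j under i
        in_tree.add(j)
    return root, parent

# toy example with 4 objects
S = np.random.rand(4, 4)
S = (S + S.T) / 2                               # symmetric pairwise scores
print(max_spanning_tree(S))
```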
Owing to its graph structure, SGG is also treated as a graph inference process in several other techniques. Hu et al. [113] explicitly modeled objects and interactions by an interaction graph, a directed graph built on object proposals based on the spatial relationships between objects, and then proposed a message-passing algorithm to propagate the contextual information. The concatenation of the enhanced interaction embeddings and the relative spatial locations is used for edge classification.
Besides, there are some other relevant works which propose methods that model on a pre-determined graph. Zhou et al. [114] mined and measured the relevance of predicates using relative locations and constructed a location-based Gated Graph Neural Network (GGNN) to improve the relationship representation. As described in detail in Section 3.1.2, Chen et al. [31] built a graph to associate the regions detected in the image according to the statistical co-occurrence probabilities of objects from different categories in the training set, and employed a graph neural network to propagate messages through the graph. Dornadula et al. [61] initialized a fully connected graph, i.e., all objects are connected to all other objects by all predicate edges, and updated their representations using message passing protocols within a well-designed graph convolution framework. Zareian et al. [95] formed a heterogeneous graph by using bridge edges to connect a commonsense graph and an initialized fully connected graph. They then employed a variant of GGNN to propagate information among nodes and updated node representations and bridge edges. Wang et al. [115] constructed a virtual graph with two types of nodes (objects $v_i^o$ and relations $v_{ij}^r$) and three types of edges, and then refined the representations of objects and relationships with an explicit message passing mechanism.

Attention Mechanisms
Attention mechanisms blossomed soon after the success of RAM [51] for image classification. They enable models to focus on the most significant parts of the input. Bahdanau et al. [52] proposed the soft attention approach for neural machine translation (NMT), which allows the model to align relevant source words when generating each target word. In the task of scene graph generation, attention mechanisms are often used to refine object features and extract relationship features. As with iterative message passing models, there are two objectives: refine local features and fuse contextual information. Therefore, two types of attention mechanisms in SGG (as illustrated in Fig. 6), i.e., Self-Attention and Context-Aware Attention mechanisms, will be analyzed in this subsection. Fig. 6: Two kinds of attention mechanisms in SGG: (1) the Self-Attention mechanism [63] aggregates multimodal features of one object to generate a comprehensive representation; (2) Context-Aware Attention [58] learns contextual features using graph parsing.
Attention mechanisms can be used in both the feature representation and the feature refinement stages of the basic framework shown in Fig. 2. In the feature representation stage, attention can be applied in the spatial domain, channel domain or mixed domain to produce a more precise appearance representation of object regions and unions of object pairs. In the feature refinement stage, attention is used to update each object and relationship representation by integrating contextual information. Inspired by the aforementioned works, these attention methods can be divided into two categories: self-attention and context-aware attention. Intuitively, the context-aware approaches can also be viewed as message passing mechanisms with attention.
Self-Attention Mechanisms. Zheng et al. [63] proposed a multi-level attention visual relation detection model (MLA-VRD), which uses multi-stage attention for appearance feature extraction and multi-cue attention for feature fusion. In order to capture discriminative information from the visual appearance, the channel-wise attention is applied in each convolutional block of the backbone network to improve the representative ability of low-level appearance features, and the spatial attention learns to capture the salient interaction regions in the union bounding box of the object pair. The multi-cue attention is designed to combine appearance, spatial and semantic cues dynamically according to their significance for relation detection.
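A minimal sketch of the multi-cue attention idea, i.e., fusing appearance, spatial and semantic cues with learned per-cue weights (the shared embedding dimension and the single-layer scoring function are illustrative assumptions, not the exact MLA-VRD design):

```python
import torch
import torch.nn as nn

class MultiCueAttention(nn.Module):
    """Fuses appearance, spatial and semantic cues with learned per-cue
    attention weights, assuming each cue is already projected to `dim`."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # significance of each cue

    def forward(self, appearance, spatial, semantic):
        cues = torch.stack([appearance, spatial, semantic], dim=1)    # (B, 3, dim)
        weights = torch.softmax(self.score(cues).squeeze(-1), dim=1)  # (B, 3)
        fused = (weights.unsqueeze(-1) * cues).sum(dim=1)             # (B, dim)
        return fused
```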
In another work, Zhou et al. [80] combined multi-stage and multi-cue attention to structure the Language and Position Guided Attention module (LPGA), where language and position information are exploited to guide the generation of more efficient attention maps. Zhuang et al. [57] proposed a context-aware model, which applies an attention-pooling layer to the activations of the conv5_3 layer of VGG-16 as an appearance feature representation of the union region. For each relation class, there is a corresponding attention model imposed on the feature map to generate a relation class-specific attention-pooling vector. Han et al. [118] argued that the context-aware model pays less attention to small-scale objects. Therefore, they proposed the Vision Spatial Attention Network (VSA-Net), which employs a two-dimensional normal distribution attention scheme to effectively model small objects. The attention is added to the corresponding position of the image according to the spatial information of the Faster R-CNN outputs:
$$h(x, y) = \frac{1}{2\pi\sigma^2}\exp\Big(-\frac{(x-\mu_x)^2 + (y-\mu_y)^2}{2\sigma^2}\Big), \qquad (17)$$
where $x$ and $y$ are the coordinates of any point of the feature map, and $\mu$, $\sigma$ are the mean and the variance, respectively, which depend on the sizes of the subject and object bounding boxes: $\mu_x = x_{min} + w/2$ and $\mu_y = y_{min} + h/2$, where $x_{min}$, $y_{min}$ are the coordinates of the top-left point of the bounding box in the feature map, $w$, $h$ indicate its width and height, and $\sigma$ grows with $w$ and $h$. It is observed that when the object's scale is larger, $\sigma$ is larger, and thus the attention $h$ is relatively smaller, and vice versa. Kolesnikov et al. [121] proposed Box Attention and incorporated box attention maps in the convolutional layers of the base detection model. The box attention map for an image is represented as a binary image $m$ of the same size as $I$.
Context-Aware Attention Mechanisms. Yang et al. [58] proposed Graph R-CNN based on the graph convolutional neural network (GCN) [59], which can be factorized into three logical stages: (1) produce a set of localized object regions, (2) utilize a relation proposal network (RePN) that learns to efficiently compute relatedness scores between object pairs, which are used to intelligently prune unlikely scene graph connections, and (3) apply an attentional graph convolution network (aGCN) to propagate a higher-order context throughout the sparse graph. In the aGCN, for a target node $i$ in the graph, the representations of its neighboring nodes $\{z_j \mid j \in N(i)\}$ are first transformed via a learned linear transformation $W$. Then, these transformed representations are gathered with predetermined weights $\alpha$, followed by a nonlinear function $\sigma$ (ReLU). This layer-wise propagation can be written as:
$$z_i^{(l+1)} = \sigma\Big(z_i^{(l)} + \sum_{j\in N(i)} \alpha_{ij}\, W z_j^{(l)}\Big).$$
The attention $\alpha_{ij}$ for node $i$ is:
$$\alpha_{ij} = \mathrm{softmax}_j\Big(w_h^{\top}\,\sigma\big(W_a\,[z_i^{(l)}; z_j^{(l)}]\big)\Big),$$
where $w_h$ and $W_a$ are learned parameters and $[\cdot;\cdot]$ is the concatenation operation. From this derivation, it can be seen that the aGCN is similar to the Graph Attention Network (GAT) [60]. In a conventional GCN, the connections in the graph are known and the coefficients $\alpha_{ij}$ are preset based on the symmetrically normalized adjacency matrix. Qi et al. [62] also leveraged a graph self-attention module to embed entities, but the strategy to determine connections (i.e., edges that represent relevant object pairs likely to have relationships) is different from that of the RePN: the RePN uses a multi-layer perceptron (MLP) to learn to efficiently estimate the relatedness of an object pair, whereas in [62] the adjacency matrix is determined by the spatial positions of the nodes.
Lin et al. [165] designed a direction-aware message passing (DMP) module based on GAT to enhance the node features with node-specific contextual information; it takes the representations of the neighboring nodes $\{x_j \mid j \in N(i)\}$ as the messages passed to the $i$-th node, and the output $z_i$ for the $i$-th node is computed from these attention-weighted messages through two linear transformation matrices $W_v$ and $W_z$. Moreover, Zhang et al. [81] used a context-aware attention mechanism directly on the fully-connected graph to refine object region features and performed comparative experiments with soft attention and hard attention in ablation studies. Dornadula et al. [61] introduced another interesting GCN-based attention model, which treats predicates as learned semantic and spatial functions that are trained within a graph convolution network on a fully connected graph, where object representations form the nodes and the predicate functions act as edges.
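A minimal sketch of this kind of context-aware attentional aggregation over a node's neighborhood (a generic GAT/aGCN-style layer under simplified assumptions about the attention parameterization, not a faithful reproduction of any single paper):

```python
import torch
import torch.nn as nn

class AttentionalGraphLayer(nn.Module):
    """One attentional graph-convolution step: each node gathers transformed
    neighbor features, weighted by learned attention coefficients."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)       # neighbor transform
        self.W_a = nn.Linear(2 * dim, dim)             # attention projection
        self.w_h = nn.Linear(dim, 1, bias=False)       # attention score

    def forward(self, z, neighbors):
        # z: (N, dim); neighbors[i]: list of indices j in N(i)
        out = []
        for i, nbrs in enumerate(neighbors):
            if not nbrs:
                out.append(z[i])                        # isolated node: keep as is
                continue
            z_j = self.W(z[nbrs])                                    # (k, dim)
            pair = torch.cat([z[i].expand_as(z_j), z_j], dim=-1)     # (k, 2*dim)
            alpha = torch.softmax(
                self.w_h(torch.relu(self.W_a(pair))).squeeze(-1), dim=0)
            out.append(z[i] + (alpha.unsqueeze(-1) * z_j).sum(0))
        return torch.stack(out)
```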

Visual Translation Embedding
Each visual relation involves a subject, an object and a predicate, resulting in a greater skew of rare relations, especially when the co-occurrence of some pairs of objects is infrequent in the dataset. Some types of relations contain very limited examples. This long-tail problem heavily affects the scalability and generalization ability of learned models. Another problem is the large intra-class divergence [122], i.e., relations that have the same predicate but different subjects or objects are essentially different. Therefore, there are two challenges for visual relation detection models. The first is to find the right representation of visual relations that can handle the large variability in their appearance, which depends on the involved entities. The second is to handle the scarcity of training data for zero-shot visual relation triplets. Visual embedding approaches aim at learning a compositional representation for subject, object and predicate by learning separate visual-language embedding spaces, where each of these entities is mapped close to the language embedding of its associated annotation. By constructing a mathematical relationship between the visual-semantic embeddings of subject, predicate and object, an end-to-end architecture can be built and trained to learn a visual translation vector for prediction. In this section, we divide the visual translation embedding methods according to the translations (as illustrated in Fig. 7), including Translation between Subject and Object, and Translation among Subject, Object and Predicate. Fig. 7: Two types of visual translation embedding approaches, according to whether the predicate is embedded into the N-dimensional space [70] or not [68], beyond the subject and object embeddings.

Translation Embedding between Subject and Object:
Translation-based models in knowledge graphs are good at learning embeddings while also preserving the structural information of the graph [138], [139], [140]. Inspired by Translation Embedding (TransE) [138] for representing large-scale knowledge bases, Zhang et al. [68] proposed a Visual Translation Embedding network (VTransE), which places objects in a low-dimensional relation space where a relationship can be modeled as a simple vector translation, i.e., subject + predicate ≈ object. Suppose $x_s, x_o \in \mathbb{R}^M$ are the M-dimensional features of the subject and object. VTransE learns a relation translation vector $t_p \in \mathbb{R}^r$ ($r \ll M$) and two projection matrices $W_s, W_o \in \mathbb{R}^{r \times M}$ from the feature space to the relation space. The visual relation can then be represented as:
$$W_s x_s + t_p \approx W_o x_o. \qquad (22)$$
The overall feature $x_s$ or $x_o$ is a weighted concatenation of three features: semantic, spatial and appearance. The semantic information is an (N+1)-d vector of object classification probabilities (i.e., N classes and 1 background) from the object detection network, rather than the word2vec embedding of the label.
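A minimal sketch of scoring candidate predicates against this translation constraint (dimensions, the per-predicate translation table and the simple negative-distance score are illustrative assumptions; VTransE itself is trained end-to-end together with the detector):

```python
import torch
import torch.nn as nn

class VTransESketch(nn.Module):
    """Projects subject/object features into a relation space and scores
    each predicate p by how well W_s x_s + t_p approximates W_o x_o."""
    def __init__(self, feat_dim=1024, rel_dim=128, n_pred=70):
        super().__init__()
        self.W_s = nn.Linear(feat_dim, rel_dim, bias=False)
        self.W_o = nn.Linear(feat_dim, rel_dim, bias=False)
        self.t_p = nn.Parameter(torch.randn(n_pred, rel_dim))  # one translation per predicate

    def forward(self, x_s, x_o):
        # x_s, x_o: (B, feat_dim) subject/object features
        trans = self.W_o(x_o) - self.W_s(x_s)                   # (B, rel_dim)
        # score each predicate by negative distance to its translation vector
        dist = torch.cdist(trans, self.t_p)                     # (B, n_pred)
        return -dist                                            # higher = better match
```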
In an extension of VTransE, Hung et al. [69] proposed the Union Visual Translation Embedding network (UVTransE), which learns three projection matrices $W_s$, $W_o$, $W_u$ that map the respective feature vectors of the bounding boxes enclosing the subject, object, and union of subject and object into a common embedding space, as well as translation vectors $t_p$ (to be consistent with VTransE) in the same space corresponding to each of the predicate labels present in the dataset. Let $x_s$, $x_o$, $x_u$ be the concatenated features (appearance and location) of the subject, object, and union respectively; a relationship $\langle S, P, O \rangle$ then meets the constraint $t_p \approx W_u x_u - W_s x_s - W_o x_o$.
Translation Embedding among Subject, Object and Predicate:
Another extension is ATR-Net (Attention-Translation-Relation Network), proposed by Gkanatsios et al. [70], which projects the visual features of the subject region, the object region and their union into a score space as $S$, $O$ and $P$, guided by multi-head language and spatial attention. Letting $A$ denote the attention matrix over all predicates, Eq. 22 is reformulated in this attention-weighted score space. Contrary to VTransE, the authors do not directly align $P$ and $O - S$ by minimizing $||P + S - O||$; instead, they create two separate score spaces for predicate classification ($p$) and object relevance ($r$), respectively, and impose loss constraints, $L_P^e$ and $L_{OS}^e$ ($e$ can be $p$ or $r$), to force both $P$ and $O - S$ to match the task's ground truth. Subsequently, Qi et al. [62] introduced a semantic transformation module into their network structure to represent $\langle S, P, O \rangle$ in the semantic domain. This module leverages both the visual features (i.e., $f_i$, $f_j$ and $f_{ij}$) and the semantic features (i.e., $v_i$, $v_j$ and $v_{ij}$), which are concatenated and projected into a common semantic space to learn the relationship between pair-wise entities, and an L2 loss is used to guide the learning process. The Multimodal Attentional Translation Embeddings (MATransE) model [71], built upon VTransE, learns a projection of $\langle S, P, O \rangle$ into a score space where S + P ≈ O, by guiding the features' projection with attention to satisfy:
$$W_s(s,o,m)\,x_s + W_p(s,o,m)\,x_p \approx W_o(s,o,m)\,x_o, \qquad (26)$$
where the $x$'s are the visual appearance features and the $W(s,o,m)$'s are the projection matrices, learned by employing a Spatio-Linguistic Attention module (SLA-M) that uses the binary masks' convolutional features $m$ and encodes the subject and object classes with pre-trained word embeddings $s$, $o$. Compared with Eq. 22, Eq. 26 can be interpreted as $W_p(s,o,m)\,x_p \approx W_o(s,o,m)\,x_o - W_s(s,o,m)\,x_s$. Therefore, there are two branches, a P-branch and an OS-branch, to learn the relation translation vector $t_p$ separately. To satisfy Eq. 26, MATransE enforces a score-level alignment by jointly minimizing the loss of each of the P- and OS-branches with respect to the ground truth using deep supervision.
Another TransE-inspired model is RLSV (Representation Learning via Jointly Structural and Visual Embedding) [112]. The architecture of RLSV is a three-layered hierarchical projection that projects a visual triple onto the attribute space, the relation space, and the visual space in order. This makes the head entity and the tail entity, which are packed with attributes, projected onto the same space as the relation, instantiated, and translated by the relation vector. It jointly combines the structural embeddings and the visual embeddings of a visual triple $t = (s, r, o)$ as new representations $(x_s, x_r, x_o)$ and scores it, in TransE fashion, as $f(t) = \|x_s + x_r - x_o\|$.

Other SGG methods
Although many of the above models use more than one method, we take the foremost one that best reflects the idea of each paper as the main classification reference. Beyond that, there are several other noteworthy DL architectures for scene graph generation and visual relationship detection. Jin et al. [197] proposed SABRA, which classifies all the negative triplet samples into five general types and uses Balanced Negative Proposal Sampling to obtain training samples with a balanced distribution. This approach controls the weight of different sample types, diminishing the side-effect of the negative samples on prediction. Wei et al. [186] proposed HOSENet to concentrate on the semantic overlap between objects, by building two object sets ($L_s$, $L_o$) and calculating the similarity between subject and object within a triplet. Wang et al. [195] believed that larger targets represent major relationship positions in a scene; thus they built a tree-like structure and a Relation Ranking Module (RRM) to pay more attention to the relationships and objects within the salient region. Herzig et al. [98] defined the permutation-invariance property for structured prediction models and defined a graph labeling function F that is graph-permutation invariant (GPI). Knyazev et al. [185] employed generative adversarial networks (GANs) conditioned on scene graphs to generate replacements for parts of real scene graphs to increase the diversity of the training distribution. Huang et al. [181] designed a new Contrasting Cross-Entropy loss, which promotes the detection of rare relations by suppressing incorrect frequent ones. Zareian et al. [128] proposed a graph-based weakly supervised learning framework based on a novel graph alignment algorithm, which enables training without bounding box annotations. Fukuzawa and Toshiyuki [82] proposed an approach that reduces visual relationship detection to an object detection problem: they used RetinaNet to detect objects and computed relationship features for each pair of bounding boxes with a Gradient Boost Decision Tree (GBMT) based on the output of object detection. Yang et al. [136] presented an approach for inferring accurate support relations between objects from given RGB-D images of cluttered indoor scenes. Yet additional models include LinkNet [45], STL [137], CogTree [183], HetH [195], TCN-VRP [190], RONNIE [119], Px2graph [100], BLOCK [130], and MR-NET [131].

Spatio-Temporal Scene Graph Generation
Recently, with the development of relationship detection models in the context of still images (ImgVRD), some researchers have begun to pay attention to understanding visual relationships in videos (VidVRD). Compared to images, videos provide a more natural set of features for detecting visual relations, such as the dynamic interactions between objects. Due to their temporal nature, videos enable us to model and reason about a more comprehensive set of visual relationships, such as those requiring temporal observations (e.g., ⟨man, lift up, box⟩ vs. ⟨man, put down, box⟩), as well as relationships that are often correlated through time (e.g., ⟨woman, pay, money⟩ followed by ⟨woman, buy, coffee⟩). Meanwhile, motion features extracted from the spatio-temporal content of videos help to disambiguate similar predicates, such as "walk" and "run" (Fig. 8a). Another significant difference between VidVRD and ImgVRD is that the visual relations in a video are usually changeable over time, while those of images are fixed. For instance, objects may be occluded or move out of one or more frames temporarily, which causes the occurrence and disappearance of visual relations. Even when two objects consistently appear in the same video frames, the interactions between them may change temporally [101]. Fig. 8b shows an example of a temporally changing visual relation between two objects within a video, with two visual relation instances containing their relationship triplets and the object trajectories of the subjects and objects. Different from static images, and because of the additional temporal channel, dynamic relationships in videos are often correlated in both the spatial and temporal dimensions. All the relationships in a video can collectively form a spatio-temporal graph structure, as mentioned in [99], [102], [145], [171]. Therefore, we redefine VidVRD as

Spatio-Temporal Scene Graph Generation (ST-SGG).
To be consistent with the definition of the 2D scene graph, we also define a spatio-temporal scene graph as a set of visual triplets $R_S$. However, each $r_{S, i\to j} = (s_{S,i}, p_{S, i\to j}, o_{S,j}) \in R_S$ has a subject $s_{S,i} = (l_{S,k_1}, T_s)$ and an object $o_{S,j} = (l_{S,k_2}, T_o)$ that both come with a trajectory ($T_s$ and $T_o$, respectively) rather than a fixed bounding box. Specifically, $T_s$ and $T_o$ are two sequences of bounding boxes, which respectively enclose the subject and object within the maximal duration of the visual relation. Therefore, VidVRD aims to detect each visual relation instance in $R_S$ together with the bounding box trajectories of its subject and object.
ST-SGG relies on video object detection (VOD). Mainstream methods address VOD by integrating the latest techniques in both image-based object detection and multi-object tracking [175], [176], [177]. Although recent sophisticated deep neural networks have achieved superior performance in image object detection [17], [19], [173], [174], object detection in videos still suffers from low accuracy because of the presence of blur, camera motion and occlusion in videos, which hamper accurate object localization with bounding box trajectories. Inevitably, these problems propagate to downstream video relationship detection and are even amplified.
Shang et al. [101] first proposed the VidVRD task and introduced a basic pipeline solution, which adopts a bottom-up strategy. The subsequent models almost always use this pipeline, which decomposes the VidVRD task into three independent parts: multi-object tracking, relation prediction, and relation instance association. They first split videos into segments with a fixed duration and predict visual relations between co-occurring short-term object tracklets for each video segment. Then they generate complete relation instances by a greedy association procedure. Their object tracklet proposal is implemented based on a video object detection method similar to [178] on each video segment. The relation prediction process consists of two steps: relationship feature extraction and relationship modeling. Given a pair of object tracklet proposals $(T_s, T_o)$ in a segment, they (1) extract the improved dense trajectory (iDT) features [179] with HoG, HoF and MBH in video segments, which capture both the motion and the low-level visual characteristics; (2) extract the relative characteristics between $T_s$ and $T_o$, which describe the relative position, size and motion between the two objects; and (3) add the classeme feature [68]. The concatenation of these three types of features, as the overall relationship feature vector, is fed into three predictors to classify the observed relation triplets. The dominant way to obtain the final video-level relationships is greedy local association, which greedily merges two adjacent segments if they contain the same relation.
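A minimal sketch of the greedy local association step that merges segment-level predictions into video-level relation instances (the data layout and the merge criterion are simplified assumptions; a real implementation also checks that the object tracklets of adjacent segments actually overlap, not just that the triplet labels match):

```python
def greedy_associate(segment_relations):
    """segment_relations: list over consecutive segments; each element is a
    list of (subject_label, predicate, object_label) triplets detected there.
    Adjacent segments holding the same triplet are merged into one instance."""
    video_level = []           # (triplet, start_segment, end_segment)
    active = {}                # triplet -> start index of the open instance
    for idx, triplets in enumerate(segment_relations):
        current = set(triplets)
        # close instances whose triplet is no longer detected
        for trip in list(active):
            if trip not in current:
                video_level.append((trip, active.pop(trip), idx - 1))
        # open new instances
        for trip in current:
            active.setdefault(trip, idx)
    for trip, start in active.items():            # close remaining instances
        video_level.append((trip, start, len(segment_relations) - 1))
    return video_level

# toy example: three 30-frame segments
segments = [[("dog", "run_behind", "person")],
            [("dog", "run_behind", "person"), ("dog", "bite", "frisbee")],
            [("dog", "bite", "frisbee")]]
print(greedy_associate(segments))
```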
Tsai et al. [102] proposed a Gated Spatio-Temporal Energy Graph (GSTEG) that models the spatial and temporal structure of relationship entities in a video by a spatialtemporal fully-connected graph, where each node represents an entity and each edge denotes the statistical dependencies between the connected nodes. It also utilizes an energy function with adaptive parameterization to meet the diversity of relations, and achieves the state-of-the-art performance. The construction of the graph is realized by linking all segments as a Markov Random Fields (MRF) conditioned on a global observation.
Shang et al. [144] published another dataset, VidOR, and launched the ACM MM 2019 Video Relation Understanding (VRU) Challenge to encourage researchers to explore visual relationships in videos [142]. In this challenge, Zheng et al. [141] used the Deep Structural Ranking (DSR) [24] model to predict relations. Different from the pipeline in [101], they associate the short-term preliminary trajectories before relation prediction by using a sliding-window method to locate the endpoint frames of a relationship triplet, rather than performing relational association at the end. Similarly, Sun et al. [143] also associate the preliminary trajectories up front by applying a kernelized correlation filter (KCF) tracker to extend the preliminary trajectories generated by Seq-NMS in a concurrent way and generate complete object trajectories to further associate the short-term ones.

3D Scene Graph Generation
Classic computer vision methods aim to recognize objects and scenes in static images with the use of mathematical models or statistical learning, and then progress to motion recognition, target tracking, action recognition, etc., in videos. The ultimate goal is to accurately obtain the shapes, positions and attributes of the objects in three-dimensional space, so as to realize detection, recognition, tracking and interaction of objects in the real world. In the computer vision field, one of the most important branches of 3D research is the representation of 3D information. Common 3D representations are multiple views, point clouds, polygonal meshes, wireframe meshes and voxels of various resolutions. To extend the concept of the scene graph to 3D space, researchers have been trying to design a structured text representation to encode 3D information. Although existing scene graph research concentrates on 2D static scenes, based on these findings as well as on the development of 3D object detection [149], [150], [151], [155] and 3D semantic scene segmentation [152], [153], [154], scene graphs in 3D have recently started to gain more popularity [44], [146], [147], [148]. Compared with the 2D scene graph generation problem at the image level, understanding and representing the interaction of objects in three-dimensional space is usually more complicated.
Stuart et al. [146] were the first to introduce the term "3D scene graph" and defined the problem and the model related to the prediction of 3D scene graph representations across multiple views. However, there is no essential difference in structure between their 3D scene graph and a 2D scene graph. Moreover, Zhang et al. [135] started with the cardinal direction relations and analyzed support relations between a group of connected objects grounded in a set of RGB-D images of the same static scene from different views. Kim et al. [147] proposed a 3D scene graph model for robotics, which however takes the traditional scene graph only as a sparse and semantic representation of three-dimensional physical environments for intelligent agents. To be precise, they just use the scene graph for the 3D scene understanding task. Similarly, Johanna et al. [148] tried to understand indoor reconstructions by constructing a 3D semantic scene graph. None of these works have proposed an ideal way to model the 3D space and multi-level semantics.
Until now, there has been no unified definition and representation of the 3D scene graph. However, as an extension of the 2D scene graph to 3D spaces, a 3D scene graph should be designed as a simple structure that encodes the relevant semantics within environments in an accurate, applicable, usable, and scalable way, such as object categories and relations between objects as well as physical attributes. It is noteworthy that Armeni et al. [44] creatively proposed a novel 3D scene graph model, which performs a hierarchical mapping of 3D models of large spaces in four layers: camera, object, room and building, and described a semi-automatic algorithm to build the scene graph. Recently, Rosinol et al. [172] defined 3D Dynamic Scene Graphs as a unified representation for actionable spatial perception. More formally, this 3D scene graph is a layered directed graph where nodes represent spatial concepts (e.g., objects, rooms, agents) and edges represent pair-wise spatio-temporal relations (e.g., "agent A is in room B at time t"). They provide an example of a single-story indoor environment, which includes 5 layers (from low to high abstraction level): Metric-Semantic Mesh, Objects and Agents, Places and Structures, Rooms, and Building. Whether it is a four- [44] or five-layer [172] structure, we can get a hint that a 3D scene contains rich semantic information that goes far beyond the 2D scene graph representation.

DATASETS
In this section, we provide a summary of some of the most widely used datasets for visual relationship and scene graph generation. These datasets are grouped into three categories: 2D images, videos and 3D representations.

2D Datasets
The majority of the research on visual relationship detection and scene graph generation has focused on 2D images; therefore, several 2D image datasets are available and their statistics are summarized in Table 1. The following are some of the most popular ones: Visual Phrase [43] is on visual phrase recognition and detection. The dataset contains 8 object categories from Pascal VOC2008 [157] and 17 visual phrases that are formed by either an interaction between objects or activities of single objects. There are 2,769 images including 822 negative samples and on average 120 images per category. A total of 5,067 bounding boxes (1,796 for visual phrases + 3,271 for objects) were manually marked.
Scene Graph [1] is the first dataset of real-world scene graphs. The full dataset consists of 5,000 images selected from the intersection of the YFCC100m [161] and Microsoft COCO [162] datasets, each of which has a human-generated scene graph.
Visual Relationship Detection (VRD) [28] dataset intends to benchmark the scene graph generation task. It highlights the long-tail distribution of infrequent relationships. The public benchmark based on this dataset uses 4,000 images for training and test on the remaining 1,000 images. The relations broadly fit into categories, such as action, verbal, spatial, preposition and comparative.
Visual Genome [30] has the largest number of relation triplets with the most diverse object categories and relation labels to date. Unlike VRD, which was constructed by computer vision experts, VG is annotated by crowd workers, and thus a substantial fraction of the object annotations have poor quality and overlapping bounding boxes and/or ambiguous object names. In an attempt to eliminate the noise, prior works have explored semi-automatic ways (e.g., class merging and filtering) to clean up object and relation annotations and constructed their own VG versions. Of these, VG200 [68], VG150 [54], VG-MSDN [41] and sVG [49] have released their cleansed annotations and are the most frequently used. Other works [26], [56], [72], [83], [86], [90], [110], [114], [117] use paper-specific and non-publicly available splits, preventing direct future comparisons with their experiments. Moreover, [15] presents experiments on a large-scale version of VG, named VG80K, and [156] proposes a new split that has not been benchmarked yet. Sun [111] constructed two datasets for hierarchical visual relationship detection (HVRD) based on the VRD and VG datasets, named H-VRD and H-VG, by expanding their flat relationship category spaces to hierarchical ones. The statistics of these datasets are summarized in Table 2.
VG150 [54] is constructed by pre-processing VG to improve the quality of object annotations. On average, this annotation refinement process has corrected 22 bounding boxes and/or names, deleted 7.4 boxes, and merged 5.4 duplicate bounding boxes per image. The benchmark uses the most frequent 150 object categories and 50 predicates for evaluation. As a result, each image has a scene graph of around 11.5 objects and 6.2 relationships.
VrR-VG [156] is also based on Visual Genome. Its pre-processing aims at reducing duplicate relationships by hierarchical clustering and filtering out the visually-irrelevant relationships. As a result, the dataset keeps the top 1,600 objects and 117 visually-relevant relationships of Visual Genome. Their hypothesis for identifying visually-irrelevant relationships is that if a relationship label in different triplets is predictable according to any information except visual information, the relationship is visually-irrelevant. This definition is a bit far-fetched but helps to eliminate redundant relationships.
Open Images [10] is a dataset of 9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. The images are very diverse and often contain complex scenes with several objects (8.3 per image on average). It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations. The boxes have largely been manually drawn by professional annotators to ensure accuracy and consistency. Open Images also offers visual relationship annotations, indicating pairs of objects in particular relations (e.g., "woman playing guitar", "beer on table"), object properties (e.g., "table is wooden"), and human actions (e.g., "woman is jumping"). In total it has 3.3M annotations from 1,466 distinct relationship triplets. So far, there are six released versions which are available on the official website and [10] describes Open Images V4 in details, i.e., from the data collection and annotation to the detailed statistics about the data and the evaluation of the models trained on it.
UnRel [128] is a challenging dataset that contains 1,000 images collected from the web with 76 unusual language triplet queries such as "person ride giraffe". All images are annotated at box-level for the given triplet queries. Since the triplet queries of UnRel are rare (and thus likely not seen at training), it is often used to evaluate the generalization performance of the algorithm.
SpatialSense [159] is a dataset specializing in spatial relation recognition. A key feature of the dataset is that it is constructed through adversarial crowdsourcing: a human annotator is asked to come up with adversarial examples to confuse a recognition system.
SpatialVOC2K [158] is the first multilingual image dataset with spatial relation annotations and object features for image-to-text generation. It consists of all 2,026 images  with 9,804 unique object pairs from the PASCAL VOC2008 dataset. For each image, they provided additional annotations for each ordered object pair, i.e., (a) the single best, and (b) all possible prepositions that correctly describe the spatial relationship between objects. The preposition set contains 17 English prepositions and 17 French prepositions.

Video Datasets
The area of video relation understanding aims at promoting novel solutions and research on the topics of object detection, object tracking, action recognition, relation detection and spatio-temporal analysis, which are integral parts of a comprehensive visual system of the future. So far, there are two public datasets for video relation understanding. ImageNet-VidVRD [101] is the first video visual relation detection dataset, constructed by selecting 1,000 videos from the training set and the validation set of ILSVRC2016-VID [163]. Based on these 1,000 videos, the number of object categories is increased to 35. It contains a total of 3,219 relationship triplets (i.e., the number of visual relation types) with 132 predicate categories. All videos were decomposed in advance into segments of 30 frames with 15 overlapping frames, and all the predicates appearing in each segment were labeled to obtain segment-level visual relation instances.
VidOR [144] consists of 10,000 user-generated videos (98.6 hours) together with dense annotations on 80 categories of objects and 50 categories of predicates. The whole dataset is divided into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing. All the annotated categories of objects and predicates appear in each of the train/val/test sets. Specifically, objects are annotated with a bounding-box trajectory to indicate their spatio-temporal locations in the videos; and relationships are temporally annotated with start and end frames. The videos were selected from YFCC-100M multimedia collection and the average length of the videos is about 35 seconds. The relations are divided into two types, spatial relations (8 categories) and action relations (42 categories) and the annotation method is different for the two types of relations.

3D Datasets
Three-dimensional data is usually provided via multi-view images, point clouds, meshes, or voxels. Recently, several 3D datasets related to scene graphs have been released to satisfy the needs of SGG study.
3D Scene Graph is constructed by annotating the Gibson Environment Database [160] using the automated 3D scene graph generation pipeline proposed in [44]. Gibson's underlying database of spaces includes 572 full buildings composed of 1,447 floors covering a total area of 211k m². It is collected from real indoor spaces using 3D scanning and reconstruction and provides the corresponding 3D mesh model of each building. Meanwhile, for each space, RGB images, depth and surface normals are provided. A fraction of the spaces are annotated with semantic objects.
3DSGG, proposed in [147], is a large-scale 3D dataset that extends 3RScan with semantic scene graph annotations, containing relationships, attributes and class hierarchies. A scene graph here is a set of tuples (N, R) between nodes N and edges R. Each node is defined by a hierarchy of classes $c = (c_1, \cdots, c_d)$ and a set of attributes A that describe the visual and physical appearance of the object instance. The edges define the semantic relations between the nodes. This representation shows that a 3D scene graph can easily be rendered to 2D.

PERFORMANCE EVALUATION
In this section, we first introduce some commonly used evaluation modes and criteria for the scene graph generation task. Then, we provide the quantitative performance of promising models on popular datasets. Since there is no uniform definition of a 3D scene graph, we will introduce these contents around 2D scene graphs and spatio-temporal scene graphs.

Tasks
Given an image, the scene graph generation task consists of localizing a set of objects, classifying their category labels, and predicting relations between each pair of these objects. Most prior works often evaluated their SGG models on several of the following common sub-tasks. We preserve the names of tasks as defined in [28] and [54] here, despite the inconsistent terms used in other papers and the inconsistencies on whether they are in fact classification or detection tasks.
1) Phrase Detection (PhrDet) [28]: Outputs a subject-predicate-object label and localizes the entire relationship in one bounding box with at least 0.5 overlap with the ground truth box. It is also called Union boxes detection in [49].
2) Predicate Classification (PredCls) [54]: Given a set of localized objects with category labels, decide which pairs interact and classify each pair's predicate.
3) Scene Graph Classification (SGCls) [54]: Given a set of localized objects, predict the predicate as well as the object categories of the subject and the object in every pairwise relationship.
4) Scene Graph Generation (SGGen) [54]: Detect a set of objects and predict the predicate between each pair of detected objects. This task is also called Relationship Detection (RelDet) in [28] or Two boxes detection in [49]. It is similar to phrase detection, but with the difference that both the bounding boxes of the subject and the object need at least 50 percent overlap with their ground truth.
Since SGGen only scores a single complete triplet, the result cannot reflect the detection quality of each component in the whole scene graph. Therefore, Yang et al. [58] proposed Comprehensive Scene Graph Generation (SGGen+) as an augmentation of SGGen. SGGen+ not only considers the triplets in the graph, but also the singletons (objects and predicates). To be clear, SGGen+ is essentially a metric rather than a task.
There are also some paper-specific task settings including Triple Detection [87], Relation Retrieval [68] and so on.
In the video based visual relationship detection task, there are two standard evaluation modes: Relation Detection and Relation Tagging. The detection task aims to generate a set of relationship triplets with tracklet proposals from a given video, while the tagging task only considers the accuracy of the predicted video relation triplets and ignores the object localization results.

Metrics
Recall@K. The conventional metric for the evaluation of SGG is the image-level Recall@K (R@K), which computes the fraction of times the correct relationship is predicted in the top K confident relationship predictions. In addition to the most commonly used R@50 and R@100, some works also use the more challenging R@20 for a more comprehensive evaluation. Some methods compute R@K with the constraint that merely one relationship can be obtained for a given object pair. Some other works omit this constraint so that multiple relationships can be obtained, leading to higher values. There is a hyperparameter k, often not clearly stated in some works, which measures the maximum number of predictions allowed per object pair. Most works treat PhrDet as a multi-class problem and use k = 1 to reward the correct top-1 prediction for each pair, while other works [63], [85], [105] tackle it as a multi-label problem and use a k equal to the number of predicate classes to allow for predicate co-occurrences [70]. Some works [42], [70], [90], [129], [164] have also identified this inconsistency and interpret it as whether there is a graph constraint (i.e., k is the maximum number of edges allowed between a pair of object nodes). The unconstrained metric (i.e., no graph constraint) evaluates models more reliably, since it does not require a perfect triplet match to be the top-1 prediction, which is an unreasonable expectation in a dataset with plenty of synonyms and mislabeled annotations. For example, 'man wearing shirt' and 'man in shirt' are similar predictions, but only the unconstrained metric allows both to be included in the ranking. Obviously, the SGGen+ metric above has a similar motivation to removing the graph constraint. Gkanatsios et al. [70] re-formulated the metric as $\mathrm{Recall}_k@K$ ($R_k@K$): $k = 1$ is equivalent to "graph constraints" and a larger $k$ to "no graph constraints", also expressed as $ngR_k@K$. For $n$ examined subject-object pairs in an image, $R_k@K$ keeps the top-$k$ predictions per pair and examines the $K$ most confident out of the $nk$ total.
Given a set of ground truth triplets, $GT$, the image-level R@K is computed as:
$$R@K = \frac{|\mathrm{Top}_K \cap GT|}{|GT|},$$
where $\mathrm{Top}_K$ is the set of top-K triplets extracted from the entire image based on the ranked predictions of a model [169]. However, in the PredCls setting, which is actually a simple classification task, R@K degenerates into the triplet-level Recall@K ($R_{tr}@K$), which is similar to the top-K accuracy. Furthermore, Knyazev et al. [169] proposed the weighted triplet Recall ($wR_{tr}@K$), which computes a recall at each triplet and reweights the average result based on the frequency of the ground-truth triplets in the training set:
$$wR_{tr}@K = \sum_{t=1}^{T} w_t\,[\,\mathrm{rank}_t \le K\,],$$
where $T$ is the number of all test triplets, $[\cdot]$ is the Iverson bracket, $w_t = \frac{1/(n_t+1)}{\sum_{t'} 1/(n_{t'}+1)} \in [0,1]$, and $n_t$ is the number of occurrences of the $t$-th triplet in the training set. It is friendly to infrequent instances, since frequent triplets (with high $n_t$) are downweighted proportionally. To speak for all predicates rather than a few trivial ones, Tang et al. [78] and Chen et al. [31] proposed meanRecall@K (mR@K), which retrieves each predicate separately and then averages R@K over all predicates.
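A minimal sketch of the image-level R@K and mR@K computations (matching here is by exact triplet identity; in practice a predicted triplet must also pass IoU checks against the ground-truth boxes, which is omitted for brevity):

```python
from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k=50):
    """pred_triplets: list of (subj, pred, obj) sorted by confidence (descending).
    gt_triplets: set of ground-truth (subj, pred, obj) for one image."""
    top_k = set(pred_triplets[:k])
    return len(top_k & gt_triplets) / max(len(gt_triplets), 1)

def mean_recall_at_k(preds_per_image, gts_per_image, k=50):
    """Averages per-predicate recall over all predicate classes (mR@K)."""
    per_pred_hits, per_pred_total = defaultdict(int), defaultdict(int)
    for preds, gts in zip(preds_per_image, gts_per_image):
        top_k = set(preds[:k])
        for (s, p, o) in gts:
            per_pred_total[p] += 1
            if (s, p, o) in top_k:
                per_pred_hits[p] += 1
    recalls = [per_pred_hits[p] / per_pred_total[p] for p in per_pred_total]
    return sum(recalls) / max(len(recalls), 1)
```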
Notably, there is an inconsistency in Recall's definition over the entire test set: whether it is a micro- or macro-Recall [70]. Let $N$ be the number of test images and $GT_i$ the ground-truth relationship annotations in image $i$. Then, having detected $TP_i = \mathrm{Top}_{K,i} \cap GT_i$ true positives in image $i$, micro-Recall micro-averages these positives as $\frac{\sum_{i=1}^{N}|TP_i|}{\sum_{i=1}^{N}|GT_i|}$, while macro-Recall macro-averages the detections in terms of images as $\frac{1}{N}\sum_{i=1}^{N}\frac{|TP_i|}{|GT_i|}$. Early works use micro-Recall on VRD and macro-Recall on VG150, but later works often use the two types interchangeably and without consistency.
Zero-Shot Recall@K. Zero-shot relationship learning was proposed by Lu et al. [28] to evaluate the performance of detecting zero-shot relationships. Due to the long-tailed relationship distribution in the real world, it is a practical setting to evaluate the extensibility of a model, since it is difficult to build a dataset with every possible relationship. Besides, a single $wR_{tr}@K$ value can show zero- or few-shot performance linearly aggregated for all $n \geq 0$.
Precision@K. In the video relation detection task, Precision@K (P@K) is used to measure the accuracy of the tagging results for the relation tagging task.
mAP. In the OpenImages VRD Challenge, results are evaluated by calculating Recall@50 (R@50), the mean AP of relationships ($mAP_{rel}$), and the mean AP of phrases ($mAP_{phr}$) [129]. The $mAP_{rel}$ evaluates the AP of $\langle s, p, o \rangle$ triplets where both the subject and object boxes have an IoU of at least 0.5 with the ground truth. The $mAP_{phr}$ is similar, but applied to the enclosing relationship box. mAP penalizes a prediction if that particular ground truth annotation does not exist, so it is a strict metric, because we cannot exhaustively annotate all possible relationships in an image.

Quantitative Performance
We present the quantitative performance on the Recall@K metric of some representative methods on several commonly used datasets in Tables 3 and 4. We preserve the respective task settings and task names for each dataset, though SGGen on VG150 is the same as RelDet on the others. ‡ denotes that the experimental results are obtained under "no graph constraints".
By comparing Table 3 and Table 4, we notice that only a few of the proposed methods have been simultaneously verified on both the VRD and VG150 datasets. The performance of most methods on VG150 is better than that on the VRD dataset, because VG150 has been cleaned and enhanced. Experimental results on VG150 can therefore better reflect the performance of different methods, and several recently proposed methods have adopted VG150 to compare their performance with other techniques.
Recently, two novel techniques, SABRA [197] and HET [195], have achieved SOTA performance for PhrDet and RelDet on VRD, respectively. SABRA enhances the robustness of the training process by subdividing negative samples, while HET follows the intuitive perspective that the more salient the object, the more important it is for the scene graph.
On VG150, excellent performance has been achieved by models that use language priors, especially RiFa [109]. In particular, RiFa achieves good results on unbalanced data distributions by mining the deep semantic information of the objects and relations in triplets. SGRN [53] generates the initial scene graph structure using semantic information, so that its message passing process benefits from the semantic information. Theoretically, commonsense knowledge can greatly improve performance, but in practice, several models that use prior knowledge have unsatisfactory performance. We believe the main reason is the difficulty of extracting and using effective knowledge in the scene graph generation model. GB-Net [95] has paid attention to this problem and achieved good results in PredDet and PhrDet by establishing connections between the scene graph and the knowledge graph, which allows it to effectively use commonsense knowledge.
Due to the long-tail effect of visual relationships, it is hard to collect images for all possible relationships. It is therefore crucial for a model to have the generalization ability to detect zero-shot relationships. The VRD dataset contains 1,877 relationships that only exist in the test set. Some researchers have evaluated the performance of their models on zero-shot learning. A performance summary of zero-shot predicate and relationship detection on the VRD dataset is shown in Table 5.
Compared with the traditional Recall, meanRecall calculates a recall rate for each relation separately and then averages them, so it better describes the performance of a model on each relation. Table 6 shows the meanRecall performance of several typical models. In Table 6, IMP's meanRecall performance in detecting tail relationships is not ideal. In IMP+, due to the introduction of a bidirectional LSTM to extract the characteristics of each object, more attention is paid to the object itself, so there is an improvement. The core idea of VCTREE comes from MotifNet, but it improves the strategy of information transmission by changing the chain structure to a tree structure, making the information transmission between objects more directional. MemoryNet [198] has achieved SOTA results on both PredCls and SGGen; it focuses on the semantic overlap between low- and high-frequency relationships.

Challenges
There is no doubt that many excellent SGG models have achieved good performance on the standard image datasets, such as VRD and VG150. However, several challenges have still not been well resolved. First, the number of possible relationship triplets is enormous. If there are $R$ predicate categories and $N$ objects, there are $R \times A_N^2 = R \times N(N-1)$ possible relationships. Hence, it would be inefficient to first detect all individual objects and then classify all pairs. Meanwhile, classification requires limited object categories, which is not scalable for real-world images. Many works [15], [26], [49], [53], [58], [84], [126], [193] have done much to filter out object pairs with a low probability of interaction from the detected object set. Li et al. [26] and Zhang et al. [84] extended the idea of region proposals in object detection to visual relationships and used ground-truth pair-wise bounding boxes to learn triplet proposals, reducing the number of region pairs. Yang et al. [58] and Liao et al. [53] designed relation proposal modules to sparsify the candidate scene graph. An effective proposal network will definitely reduce the learning complexity and computational cost of the subsequent predicate classification, thus improving the accuracy of relationship detection. However, compared to individual region proposals, learning the interactive intention between two regions is more difficult, since the relative positions of the two bounding boxes are variable and there is more noise around them.
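A minimal sketch of pruning the $N(N-1)$ candidate pairs with a lightweight relatedness scorer before predicate classification, in the spirit of such relation proposal modules (the scorer architecture and the keep-ratio are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PairPruner(nn.Module):
    """Scores all ordered object pairs and keeps only the top-m as
    candidate relationships, shrinking the R x N(N-1) triplet space."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.relatedness = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obj_feats, keep=64):
        # obj_feats: (N, feat_dim) detected object features, assuming N >= 2
        n = obj_feats.size(0)
        idx = [(i, j) for i in range(n) for j in range(n) if i != j]
        pairs = torch.stack(
            [torch.cat([obj_feats[i], obj_feats[j]]) for i, j in idx])
        scores = self.relatedness(pairs).squeeze(-1)        # (N*(N-1),)
        keep = min(keep, scores.numel())
        top = torch.topk(scores, keep).indices
        # return kept (subject, object) index pairs and their scores
        return [idx[t] for t in top.tolist()], scores[top]
```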
The second main challenge comes from the long-tail distribution of visual relationships, which has been mentioned many times in this paper. Since an interaction involves two objects, rare relationships are even more heavily skewed, because object co-occurrence is infrequent in real-world scenarios. Such an uneven distribution makes it difficult for a model to fully learn the properties of rare relationships and triplets. For example, if a model is trained to predict 'on' 1,000 times more often than 'standing on', then, during the test phase, 'on' is more likely to prevail over 'standing on'. This phenomenon, in which the model prefers a simple and coarse relation over the accurate one, is called Biased Scene Graph Generation. Under this condition, even though the model can output a reasonable predicate, it is too coarse and vague to describe the scene. However, for several downstream tasks, accurate and informative pair-wise relations are undoubtedly the most fundamental requirement. Therefore, to perform sensible graph reasoning, we need to distinguish the more fine-grained relationships from the ostensibly probable but trivial ones, which is generally regarded as unbiased scene graph generation. Lu et al. [28] emphasized the importance of zero-shot relationship learning when they first proposed the visual relationship detection task. Since then, many works [24], [56], [63], [68], [90], [103], [123], [125], [126], [168] have provided solutions and tested their models on zero-shot learning. Moreover, Chen et al. [127] and Dornadula et al. [61] designed few-shot predicate classifiers to improve the detection performance for rare relationships. However, this problem has not yet been fundamentally resolved. The performance of existing SGG methods is unsatisfactory when faced with biased data in real-world scenarios. Yang et al. [134] proposed a novel Shuffle-Then-Assemble pre-training strategy, which discards the paired annotation of relationships to alleviate the bias. Furthermore, some researchers recently proposed unbiased SGG [133], [183], [188], [189] to make the tail classes receive more attention in a coarse-to-fine mode. Whether by making full use of counterfactual causality, building a hierarchical cognitive structure from the biased predictions, or constructing a predicate-correlation perception learning scheme, these methods exploit causality or predicate correlation so that the tail relationships receive more attention and an unbiased scene graph is produced.
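As a toy illustration of one common family of remedies (and only an illustration, with made-up counts, not a reproduction of any of the unbiased-SGG methods cited above), inverse-frequency loss reweighting upweights rare predicates such as 'standing on' relative to dominant ones such as 'on':

import numpy as np

# Hypothetical training-set predicate counts with a pronounced long tail.
predicate_counts = {"on": 100_000, "has": 40_000, "standing on": 100, "eating": 50}

counts = np.array(list(predicate_counts.values()), dtype=float)
freq = counts / counts.sum()

# Inverse-frequency class weights, normalized to have mean 1; such weights can
# scale the per-predicate classification loss so tail classes are not ignored.
weights = 1.0 / freq
weights /= weights.mean()

for (name, _), f, w in zip(predicate_counts.items(), freq, weights):
    print(f"{name:12s} freq={f:.4f} loss_weight={w:.2f}")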
The third challenge is that the visual appearance of the same relation varies greatly from scene to scene (Fig. 3a and 3d), which makes the feature extraction phase more difficult. As we described in Section 3.1.2, many methods focus on semantic features, trying to make up for the weakness of visual features. However, we have emphasized that visual relationships are incidental and scene-specific. This requires us to think from the bottom up and extract more discriminative visual features.

Opportunities
The community has published hundreds of scene graph models and has obtained a wealth of research results. We think there are several avenues for future work. On the one hand, from the learning point of view, building a large dataset with fine-grained labels and accurate annotations is necessary and significant. Such a dataset should cover as many scenes as possible and preferably be constructed by computer vision experts. Models trained on it would perform better on visual semantics and develop a broader understanding of our visual world. However, building such a dataset is a very challenging and expensive task. On the other hand, from the application point of view, we can design models that subdivide the scene to reduce the imbalance of the relationship distribution. Obviously, the categories and probability distributions of visual relationships differ across scenarios, and even the types of objects differ. Therefore, we can design relationship detection models tailored to different scenarios.
Another research direction is the 3D scene graph. First, it is necessary to define a unified and effective 3D scene graph structure and to determine what information it should encode. We believe that a unified structure should be constructed for 2D and 3D SGG, with the difference between the two lying only in object detection and feature information. The semantic relations in 3D scene graphs should be largely consistent with those in 2D scene graphs, since a 2D image is just a 2D sample of the 3D world. Moreover, a unified structure would allow 3D SGG to take full advantage of existing 2D SGG methods. Armeni et al. augment the basic scene graph structure with essential 3D information and generate a 3D scene graph that extends the scene graph to 3D space and grounds semantic information there [44], [147]. However, their proposed representation lacks extensibility and generality, which makes it difficult not only to generate a 3D scene graph but also to parse one in applications. Second, because 3D information can be stored in many formats that are fragmented by visual modality (e.g., RGB-D, point clouds, 3D mesh/CAD models, etc.), the representation and extraction of 3D semantic information poses technological challenges.
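As a purely illustrative sketch of what such a unified 2D/3D structure could look like (the field names below are our own assumptions, not a representation proposed in the surveyed works), a node could keep the attributes shared with image-based SGG and make the 3D geometry optional, so that 2D annotations and methods remain directly reusable:

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ObjectNode:
    label: str                                    # semantic class, shared by 2D and 3D SGG
    bbox_2d: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2) in the image
    bbox_3d: Optional[Tuple[float, ...]] = None   # e.g. (cx, cy, cz, w, h, d) in world coordinates
    feature: Optional[List[float]] = None         # appearance / geometry embedding

@dataclass
class RelationEdge:
    subject: int        # index of the subject node
    object: int         # index of the object node
    predicate: str      # e.g. "on", "next to"; vocabulary shared by 2D and 3D graphs

@dataclass
class SceneGraph:
    nodes: List[ObjectNode] = field(default_factory=list)
    edges: List[RelationEdge] = field(default_factory=list)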

CONCLUSION
This paper provides a comprehensive survey of developments in the field of scene graph generation using deep learning techniques. Based on different input modalities, we introduced the representative works on 2D scene graphs, spatio-temporal scene graphs and 3D scene graphs in separate sections. Furthermore, we provided a summary of the most widely used datasets for visual relationship and scene graph generation, grouped into 2D images, video, and 3D representations. The performance of different approaches on these datasets was also compared. Finally, we discussed the challenges, open problems and opportunities in scene graph generation research. We believe this survey can promote more in-depth ideas on SGG.