
1 Introduction

Many appealing real-world applications involve data streams that cannot be well represented in a planar structure but instead live in irregular domains. This is the case for knowledge bases [35], 3D models [18], social media [22], and biological networks [7], which are usually represented as graphs.

In graph representation learning, the key challenge is to learn low-dimensional representations that preserve the structural information among the nodes of the graph. Through graph embedding, each node is represented as a low-dimensional vector. This paves the way for applying machine learning to graph analysis and data mining tasks easily and efficiently, such as node classification [11, 22], link prediction [7], clustering [4], and visualization [30].

Recently, there has been significant interest in graph representation learning, focused mainly on static graphs [5, 7, 8, 11, 22, 29], which has attracted researchers due to its extensive usage in numerous real-world applications. However, a wide range of real-world applications are intrinsically dynamic: the underlying graph structure evolves over time and is usually represented as a sequence of graph snapshots [14].

Learning dynamic graph representations is challenging due to the time-varying nature of graph structures, where nodes and edges are in continuous evolution: new nodes and edges can be introduced or removed at each time step. Consequently, the learned representations must not only preserve the structural information of the graphs but also efficiently capture their temporal variations over time.

Recently, novel methods for learning dynamic graph representations have been proposed in the literature. Some recent work, such as [10, 15, 36, 37], mainly applies temporally regularized weights to enforce the smoothness of node representations across adjacent time steps. However, these methods generally fail to learn effective representations when graph nodes exhibit substantially distinct evolutionary behaviors over time [24].

Trivedi et al. [27] handle the temporal reasoning problem in multi-relational knowledge graphs by employing a recurrent neural network. However, their learned temporal representations are limited to modeling first-order proximity between nodes, ignoring the higher-order proximities among neighborhoods, which are essential for preserving the graph structure as explained in [25, 34].

Recently, the authors in [24] proposed a dynamic graph embedding approach that leverages self-attention networks to learn node representations. This method focuses on learning representations that capture structural properties and temporal evolutionary patterns over time. However, it cannot effectively capture the structural evolution over time, since it applies structural attention layers to each time step separately to generate node representations, followed by temporal attention layers that capture the variations in the generated representations.

Recently, attention mechanisms have achieved great success in NLP and sequential learning tasks [1, 31]. An attention mechanism learns a function that aggregates a variable-sized input while focusing on the parts of the input most relevant to the decision at hand. When the mechanism computes the representation of a single sequence by attending over its own positions, it is commonly referred to as self-attention.

Veličković et al. [29] extend the self-attention mechanism and apply it to static graphs by enabling each node to attend over its neighbors. In this paper, we specifically build on graph attention networks (GATs) [29] because of their effectiveness in addressing the shortcomings of prior methods based on graph convolutions such as [8, 11]. GATs assign different weights to nodes of the same neighborhood through multi-head self-attention layers, which enables a leap in model capacity. Additionally, the self-attention mechanism is applied over all graph edges, and thus it does not depend on upfront access to the global graph structure, which was a limitation of many prior graph representation learning techniques.

Inspired by this recent work, we present a temporal self-attention neural network architecture to learn node representations on dynamic graphs. Specifically, we apply self-attention along structural neighborhoods over temporal dynamics by leveraging a temporal convolutional network (TCN) [2, 20]. We learn dynamic node representations by considering the neighborhood of each node in each time step of the graph evolution, applying a self-attention strategy without violating the ordering of the graph snapshots.

Overall our paper makes the following contributions:

  • We present a novel neural architecture named TemporalGAT to learn representations on dynamic graphs by integrating GAT, TCN, and a statistical loss function.

  • We conduct extensive experiments on real-world dynamic graph datasets and compare against state-of-the-art approaches, which validates our method.

2 Problem Formulation

In this work, we aim to solve the problem of dynamic graph representation learning. We represent a dynamic graph G as a sequence of graph snapshots, \(G_1,G_2,\ldots ,G_\mathcal {T}\), from timestamps 1 to \(\mathcal {T}\). The graph at a specific time t is \(G_t = (V_t,E_t,F_t)\), where \(V_t\), \(E_t\) and \(F_t\) denote the nodes, edges and node features of the snapshot, respectively. The goal of dynamic graph representation learning is to learn effective latent representations for each node \(v \in V\) at each time step \(t = 1,2,\ldots , \mathcal {T}\). The learned node representations should efficiently preserve the graph structure for every node \(v \in V\) at any time step t.
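For concreteness, the following is a minimal sketch of this snapshot-based input (our own illustration with hypothetical names, not part of the paper's implementation):

```python
# Minimal sketch (assumed, not from the paper) of the dynamic graph input:
# a sequence of snapshots G_1, ..., G_T, each holding its nodes, edges, and
# node features.
from dataclasses import dataclass
from typing import List, Set, Tuple

import numpy as np


@dataclass
class GraphSnapshot:
    nodes: Set[int]                  # V_t
    edges: Set[Tuple[int, int]]      # E_t
    features: np.ndarray             # F_t, shape (|V_t|, d_in)


# A dynamic graph G is simply the ordered list of its snapshots.
DynamicGraph = List[GraphSnapshot]
```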

3 TemporalGAT Framework

In this section, we present our proposed TemporalGAT framework, as illustrated in Fig. 1. We propose a novel model architecture that learns representations for dynamic graphs by utilizing GATs and TCNs to strengthen the model's ability to capture temporal evolutionary patterns in a dynamic graph. We employ multi-head graph attention and TCNs as a special recurrent structure to improve model efficiency. TCNs have proven stable and powerful for modeling long-range dependencies, as discussed in previous studies [2, 20]. In addition, this architecture can take a sequence of any length and map it to an output sequence of a specific length, which is very effective for dynamic graphs due to the varying sizes of the adjacency and feature matrices.

Fig. 1. The framework of TemporalGAT.

Each input graph snapshot is fed to a GAT layer with dilated causal convolutions, which ensure that no information leaks from future to past graph snapshots. Formally, for an input vector \(x\in \mathbb {R}^n\) and a filter \(f: \{0,\ldots ,k-1\}\rightarrow \mathbb {R} \), the dilated convolution operation \(Conv_d\) on element u of the vector x is defined as:

$$\begin{aligned} Conv_d(u) = (x *_d f)(u) = \sum _{i=0}^{k-1} f(i)\cdot x_{u-d\cdot i} \end{aligned}$$
(1)

where d is the dilation factor, k is the filter size, and the index \(u-d\cdot i\) points in the direction of past information. With a large dilation factor, the output at the highest level can represent a wider range of inputs, effectively expanding the receptive field [32] of the convolutional network. For instance, by applying dilated convolution operations, it is possible to aggregate input features from previous snapshots towards the final snapshot.
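As a concrete illustration of Eq. (1), the following NumPy sketch transcribes the dilated causal convolution directly (an illustrative example, not the paper's implementation):

```python
import numpy as np


def dilated_causal_conv(x: np.ndarray, f: np.ndarray, d: int) -> np.ndarray:
    """Direct transcription of Eq. (1): Conv_d(u) = sum_i f(i) * x[u - d*i].

    x : 1-D input sequence of length n
    f : filter taps f(0), ..., f(k-1)
    d : dilation factor
    Positions u - d*i before the start of the sequence are treated as zero,
    so no information from future positions is used (causality).
    """
    n, k = len(x), len(f)
    out = np.zeros(n)
    for u in range(n):
        for i in range(k):
            j = u - d * i
            if j >= 0:  # only past (or current) inputs contribute
                out[u] += f[i] * x[j]
    return out


# Example: with k = 3 and d = 2, out[u] depends on x[u], x[u-2], x[u-4],
# so stacking layers with growing dilation widens the receptive field.
x = np.arange(8, dtype=float)
print(dilated_causal_conv(x, np.array([0.5, 0.3, 0.2]), d=2))
```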

The inputs to a single GAT layer are the graph snapshots (adjacency matrices) and the graph features, or one-hot encoded vectors for each node. The output is node representations across time that capture both local structural and temporal properties. The self-attention layer in GAT attends over the immediate neighbors of each node by employing self-attention over the node features. The proposed GAT layer is a variant of GAT [29], with dilated convolutions applied to each graph snapshot:

$$\begin{aligned} h_u =\sigma \left( \sum _{v\in N_u}\alpha _{vu} W_d x_v\right) \end{aligned}$$
(2)

where \(h_u\) is the learned hidden representations of node u, \(\sigma \) is a non-linear activation function, \(N_u\) represents the immediate neighbors of u, \(W_d\) is the shared transformation weight of dilated convolutions, \(x_v\) is the input representation vector of node v, and \(\alpha _{vu}\) is the coefficient learned by the attention mechanism defined as:

$$\begin{aligned} \alpha _{vu} = \frac{\exp \left( \sigma \left( A_{vu} \cdot a^T [W_d x_v \Vert W_d x_u ] \right) \right) }{\sum _{w\in N_u} \exp \left( \sigma \left( A_{wu} \cdot a^T [W_d x_w \Vert W_d x_u] \right) \right) } \end{aligned}$$
(3)

where \(A_{vu}\) is the edge weight in the adjacency matrix between u and v, \(a^T\) is a weight vector parameter of the attention function implemented as a feed-forward layer, and \(\Vert \) is the concatenation operator. \(\alpha _{vu}\) is obtained via a softmax over the neighborhood of each node and indicates the importance of node v to node u at the current snapshot. We use residual connections between GAT layers to avoid vanishing gradients and ensure smooth learning of the deep architecture.
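The following sketch illustrates Eqs. (2) and (3) for a single node u (hypothetical NumPy code under simplified assumptions, not the authors' implementation; in the full model \(W_d\) is the shared weight of the dilated convolutions described above):

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def gat_layer_node(u, A, X, W_d, a, sigma=np.tanh):
    """Sketch of Eqs. (2)-(3) for one node u (names are illustrative).

    A   : (N, N) weighted adjacency matrix of the current snapshot
    X   : (N, d_in) input node representations
    W_d : (d_in, d_out) shared transformation weights
    a   : (2 * d_out,) attention weight vector
    Assumes u has at least one neighbor in the snapshot.
    """
    neighbors = np.nonzero(A[:, u])[0]   # N_u: immediate neighbors of u
    Wx = X @ W_d                          # transform all node features once

    # Eq. (3): unnormalized attention scores over the neighborhood of u
    scores = np.array([
        A[v, u] * (a @ np.concatenate([Wx[v], Wx[u]]))
        for v in neighbors
    ])
    alpha = softmax(sigma(scores))        # softmax over N_u

    # Eq. (2): attention-weighted aggregation followed by a non-linearity
    h_u = sigma((alpha[:, None] * Wx[neighbors]).sum(axis=0))
    return h_u
```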

We then adopt a binary cross-entropy loss function to predict the existence of an edge between a pair of nodes from the learned node representations, similar to [24]. The binary cross-entropy loss for a node v is defined as:

$$\begin{aligned} \mathcal {L}_v = \sum _{t=1}^{\mathcal {T}} \sum _{u \in pos^t}-\log (\sigma (z^t_u \cdot z^t_v )) - W_{neg} \cdot \sum _{g \in neg^t} \log (1-\sigma (z^t_v \cdot z^t_g )) \end{aligned}$$
(4)

where \(\mathcal {T}\) is the number of training snapshots, \(pos^t\) is the set of nodes connected by edges to v at snapshot t, \(neg^t\) is the negative sampling distribution for snapshot t, \(W_{neg}\) is the negative sampling weight, \(\sigma \) is the sigmoid function, and the dot operator denotes the inner product between the representations of the node pair.
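A straightforward transcription of Eq. (4) might look as follows (an illustrative sketch with hypothetical names; the actual negative sampling procedure follows [24]):

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def node_loss(v, Z, pos, neg, w_neg=1.0):
    """Per-node binary cross-entropy loss of Eq. (4) (illustrative sketch).

    Z     : list of T embedding matrices, Z[t] has shape (N, d)
    pos   : pos[t] is the set of nodes linked to v at snapshot t
    neg   : neg[t] is the set of negative samples drawn for snapshot t
    w_neg : negative-sampling weight W_neg
    """
    loss = 0.0
    for t in range(len(Z)):
        z_v = Z[t][v]
        for u in pos[t]:                        # observed edges: push scores up
            loss += -np.log(sigmoid(Z[t][u] @ z_v))
        for g in neg[t]:                        # negative samples: push scores down
            loss += -w_neg * np.log(1.0 - sigmoid(z_v @ Z[t][g]))
    return loss
```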

4 Experiments

In this section, we conduct extensive experiments to evaluate the performance of our method on the link prediction task. We present experimental results of our proposed method against several baselines.

4.1 Datasets

We use real-world dynamic graph datasets for analysis and performance evaluation. An outline of the datasets we use in our experiments is given in Table 1.

Table 1. Dynamic graph datasets used for performance evaluation.

The detailed dataset descriptions are listed as follows:

  • Enron [12] and UCI [21] are online communication network datasets. The Enron dataset is constructed from email interactions between employees, where employees are the nodes and email communications are the edges. The UCI dataset is an online social network where messages sent between users form the edges.

  • Yelp is a rating network (Round 11 of the Yelp Dataset Challenge) in which ratings between users and businesses are collected over a specific time period.

The datasets have multiple graph time steps and were created based on specific interactions in fixed time windows. For more details on the dataset collection and statistics see [24].

4.2 Experimental Setup

We evaluate the performance of the different baselines through a link prediction experiment. We learn dynamic graph representations on snapshots \(S=\{1,2,\ldots ,t-1\}\) and use the links of snapshot \(t-1\) to predict the links at snapshot t. We follow the experimental design of [24]: we classify each node pair into linked and non-linked, use a sampling approach to obtain positive and negative node pairs, and randomly sample 25% of each snapshot's nodes for training, keeping the remaining 75% for testing.

4.3 Parameter Settings

For our method, we train the model using the Adam optimizer and adopt dropout regularization to avoid over-fitting. We trained the model for a maximum of 300 epochs, and the best-performing model on the validation set is chosen for link prediction evaluation. For all datasets, we use 4 TCN blocks, with each GAT layer comprising attention heads computing 32 features each, and we concatenate the output features. The output low-dimensional embedding size of the last fully-connected layer is set to 128.
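For reference, these settings can be summarized in a small configuration dictionary (names are our own; the paper does not publish a configuration file):

```python
# Hypothetical summary of the hyperparameters listed above.
config = {
    "optimizer": "Adam",
    "dropout": True,                 # rate not specified in the text
    "max_epochs": 300,
    "num_tcn_blocks": 4,
    "attention_head_features": 32,   # per-head output features, concatenated
    "embedding_dim": 128,            # output size of the last fully-connected layer
}
```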

4.4 Baseline Algorithms

We evaluate our method against several baseline algorithms, including static graph representation approaches such as GAT [29], Node2Vec [7], GraphSAGE [8], and graph autoencoders [9], with GCN-AE and GAT-AE as autoencoders for link prediction [38], as well as dynamic graph representation learning methods, including Know-Evolve [27], DynamicTriad [36], DynGEM [10] and DySAT [24].

Table 2. Link prediction results on Enron, UCI and Yelp datasets.

4.5 Link Prediction

The task of link prediction is to leverage structural and temporal information up to time step t and predict the existence of an edge between a pair of vertices (u, v) at time \(t + 1\).

To evaluate the link prediction performance of each baseline model, we train a logistic regression classifier similar to [36]. We use the Hadamard operator to compute the element-wise product of the feature representations of the two endpoint nodes as the representation of an edge, as suggested by [7]. We repeat the experiment 10 times and report the average Area Under the ROC Curve (AUC) score.
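The evaluation protocol can be sketched as follows (an assumed illustration using scikit-learn; function and variable names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def edge_features(Z, pairs):
    """Hadamard (element-wise) product of the two endpoint embeddings,
    used as the feature representation of each candidate edge."""
    return np.array([Z[u] * Z[v] for u, v in pairs])


def evaluate_link_prediction(Z, train_pairs, train_labels, test_pairs, test_labels):
    """Sketch of the protocol described above: a logistic regression classifier
    on Hadamard edge features, scored by AUC."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(edge_features(Z, train_pairs), train_labels)
    scores = clf.predict_proba(edge_features(Z, test_pairs))[:, 1]
    return roc_auc_score(test_labels, scores)
```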

We evaluate each baseline at each time step t separately, training the models on snapshots up to t and evaluating the performance at \(t+1\), for every snapshot up to \(\mathcal {T}\). We report the averaged micro and macro AUC scores over all time steps for all methods in Table 2 (baseline results as reported in [24]).

From the results, we observe that TemporalGAT outperforms state-of-the-art methods in both micro and macro AUC scores. Moreover, the results suggest that GAT combined with the TCN architecture outperforms prior graph representation methods with minimal tuning, which validates the effectiveness of TCN in capturing the temporal and structural properties of dynamic graph snapshots.

5 Related Work

5.1 Static Graph Representation Learning

Static graph embedding can be viewed as a dimensionality reduction approach that maps each node into a low-dimensional vector space while preserving vertex neighborhood proximities. Earlier linear (e.g., PCA) and non-linear (e.g., IsoMap) dimensionality reduction methods have been studied extensively in the literature [3, 23, 26].

To improve the scalability of large-scale graph embedding, several approaches have been proposed, such as [6, 7, 22], which adopt random walks and a skip-gram procedure to learn network representations. Tang et al. [25] designed two loss functions to capture the local and global graph structure.

More recently, network embedding approaches have designed models that rely on convolutions to achieve good generalization, such as [8, 11, 19, 29]. These methods usually provide performance gains on network analytic tasks such as node classification and link prediction. However, they are unable to efficiently learn representations for dynamic graphs due to their evolving nature.

5.2 Dynamic Graph Representation Learning

Methods for dynamic graph representation learning are often extensions of static methods with an additional component to model the temporal variation. For instance, in matrix factorization approaches such as [3, 26], the goal is to learn node representations from the eigenvectors of the graph Laplacian decomposition. DANE [16] builds on this idea and updates the eigenvectors of the graph Laplacian matrix over time.

For the methods based on random walks, such as [7, 22], the aim is to model the node transition probabilities of random walks as the normalized inner products of the corresponding node representations. In [33], the authors learn representations by observing graph changes and incrementally re-sampling a few walks in the successive time step.

Another line of work for dynamic graph representation employs temporal regularization as a smoothness factor to enforce embedding stability across time steps [36, 37]. Other recent work learns incremental node representations across time steps [10], applying an autoencoder approach that minimizes the reconstruction loss together with a distance metric between connected nodes in the embedding space. However, this may not guarantee that the model captures long-term proximities.

Another category of dynamic graph representation learning is point processes that are continuous in time [13, 17, 28]. These approaches model the edge occurrence as a point process and parameterize the intensity function by applying the learned node representations as an input to a neural network.

More recently, [24] proposed an approach that leverages the most relevant historical contexts through self-attention layers to preserve graph structure and temporal evolution patterns. Unlike this approach, our framework captures the most relevant historical information by applying a temporal self-attention architecture with TCN and GAT layers to learn dynamic representations for real-world data.

6 Conclusion

In this paper, we introduce a novel end-to-end dynamic graph representation learning framework named TemporalGAT. Our architecture is based on graph attention networks and temporal convolutional networks and operates on dynamic graph-structured data by leveraging self-attention layers over time. Our experiments on various real-world dynamic graph datasets show that the proposed framework is superior to existing graph embedding methods, achieving significant performance gains over several state-of-the-art static and dynamic graph embedding baselines.

There are several challenges for future work. For instance, learning representations for multi-layer dynamic graphs while incorporating structural and feature information is a promising direction.