EGAT: Extended Graph Attention Network for Pedestrian Trajectory Prediction

Pedestrian trajectory prediction improves foresight and supports correct decisions in advance, so it has wide application value in autonomous driving, robot interaction, and safety monitoring. However, most existing methods focus only on the interaction of local pedestrians selected by distance and ignore the influence of distant pedestrians; the range of network input (the receptive field) is therefore small. In this paper, an extended graph attention network (EGAT) is proposed to increase the receptive field. It attends not only to local pedestrians but also to those who are far away, which further strengthens the modeling of pedestrian interaction. In the temporal domain, TSG-LSTM (TS-LSTM and TG-LSTM) and P-LSTM are proposed based on LSTM to enhance information transmission through residual connections. Compared with state-of-the-art methods, EGAT achieves excellent performance on the ETH and UCY public datasets and generates more reliable trajectories.


Introduction
Because of the complexity and uncertainty of the interaction between pedestrians and their environment, human trajectories are difficult to predict. Early methods [1,2] made some progress in the study of human behavior by using hand-crafted energy functions, but these methods generalize poorly and are not suitable for modeling human-human interactions in crowded spaces. In deep learning methods, such as those based on Recurrent Neural Networks (RNN) [3,4] and Generative Adversarial Networks (GAN) [5,6], human interaction is modeled by social pooling. Although the receptive field is improved, the location information of pedestrians is lost. Moreover, the generator of the GAN is designed with an RNN, so methods of pedestrian trajectory prediction based on RNN and GAN are not only inefficient, but also costly.
Graph structure is a natural method to represent human interaction, which is more intuitive and effective than pooling methods. Graph Convolutional Network (GCN) based on graph data shows powerful modeling function, and it has become a new hotspot in the research of pedestrian interaction. In the graph, a node represents a pedestrian, and the connecting edge of two nodes represents the interaction between pedestrians. However, existing methods based on GCN cannot distinguish the importance of nodes because they distribute the weights of nodes equally. Due to the different influence of adjacent pedestrians on the target pedestrian in trajectory prediction, attention mechanism is more helpful to encode potential pedestrian interaction. On this basis, Graph Attention Network (GAT) [7] comes into being and has been widely applied. Kosaraju et al. [8] proposed Social-BiGAT, which relies on a graph to simulate human interaction, but does not make full use of graph representation. Huang et al. [9] and Mohamed et al. [10] introduced a flexible graph attention mechanism to improve social modeling, but only model the local interaction of close pedestrians.
At present, there are several open problems in the field of pedestrian trajectory prediction. Firstly, when pedestrians walk in a real scene, from single walking to group activity [11], social interactions are not only affected by spatial proximity. As shown in Figure 1(a), the trajectory of the blue pedestrian is mainly influenced by the pedestrians shown in black, who are far away, while the purple pedestrian, who is near, has less influence on it.
Secondly, in the temporal domain, when a pedestrian's historical trajectory is modeled with LSTM, the current state of the pedestrian depends only on the hidden state of the previous moment. This ignores information transmission at the current moment and affects the judgment of the pedestrian's intention. See Figure 1(b) for the missing connection in LSTM, drawn in pink.
Thirdly, when the prediction length increases, the prediction accuracy of LSTM-based trajectory prediction models tends to decline.
EGAT is proposed in this paper to solve these problems. First, a Feature Update Mechanism (FUM) is designed in EGAT to explore global influence among pedestrians. FUM can attend to pedestrians who are far away but influential, which increases the receptive field. Because the local interaction between pedestrians is extended to global interaction, the network structure is called EGAT. Next, a pedestrian's movement at the next moment is mainly affected by his intention at the current moment, such as going straight, turning left, or turning right. Therefore, to enhance information transmission at the current time, a residual connection (i.e., the missing connection in Figure 1(b)) is added to LSTM to form TSG-LSTM (TS-LSTM and TG-LSTM). TS-LSTM and TG-LSTM model the temporal correlation of the individual and of the interaction, respectively, which not only simulates the real scene, but also reflects dynamic human movement. Then, P-LSTM predicts the pedestrian trajectory based on the observed trajectory. Different from LSTM, a residual connection is also added to P-LSTM. As the prediction length increases, P-LSTM alleviates the decrease in prediction accuracy.

Related Work
This section mainly introduces the content involved in EGAT, including human-human interactions, trajectory prediction based on RNN or attention mechanism, and application of GCN. The relevant literature of each part is compared, and the advantages of our model are put forward.

Human-Human Interactions.
Early human interaction is defined by [1] as a social force with attraction and repulsion, which is an effective method. Due to the influence of the objective environment, human-human interactions become more complex. The early models are not sufficient to simulate these interactions and adapt poorly to the environment. On this basis, subsequent methods [12,13] consider more manual rules and functions, but these limit the improvement of accuracy. With the development of deep learning, complex group activities have received attention. Bagautdinov et al. [14] proposed to recognize group activity through human-human interactions. Xu et al. [15] defined pedestrian relationships based on spatial affinity. Alahi et al. [3] proposed social LSTM to aggregate interactions through social pooling. These methods only model local pedestrian interaction based on distance. In this paper, EGAT can not only predict the trajectory of a single person, but also capture potential human-human influences. For a target pedestrian, it is not limited to nearby pedestrians, but also attends to all other non-local pedestrians.

Trajectory Prediction Based on RNN.
In recent years, RNN and its variants, LSTM and GRU, have been widely used in the field of trajectory prediction. The models share parameters and show good performance. Liu et al. [16] proposed spatiotemporal RNN, which has a transformation matrix to model spatiotemporal context in each layer. Gupta et al. [5] added adversarial training based on social LSTM to improve performance. Zhang et al. [4] proposed SR-LSTM, which uses the current intention of neighbors to iteratively refine the current state of crowd participants. Li et al. [17] also achieved good results by using GRU. The above research shows that RNN methods are well suited to trajectory prediction. In this work, LSTM is improved and TSG-LSTM is proposed to encode the observed trajectories of pedestrians at different time steps. Based on the observed trajectory, P-LSTM is used to predict the future trajectory of pedestrians.
Computational Intelligence and Neuroscience

Trajectory Prediction Based on Attention Mechanism.
Attention mechanism originates from imitating human vision and has a significant effect on the selection of relevant data [18]. The correlation coefficient between pedestrians and neighbors based on speed is determined by Su et al. [19]. Sadeghian et al. [6] combined attention with CNN to add bidirectional attention for pedestrians. Vemula et al. [20] used the hidden state of EdgeRNN to calculate a soft attention score that reflects the importance of neighbors. However, these methods generally calculate the relationship between the current pedestrian and adjacent pedestrians, ignoring the relationship with other long-distance pedestrians [21]. The purpose of this paper is to pay attention to all nodes in the graph, capture long-distance dependence, and extract more social features.

Application of GCN.
GCN is very effective for data processing in non-Euclidean space. Its core idea is to map nodes or edges to vector space through deep learning methods, and then cluster and classify. GCN is widely used in action recognition [22], scene graph generation [23], video recognition [24], and other fields. Liang et al. [25] designed RNN on spatial graph to encode inductive deviation of pedestrian motion patterns. A directed social graph is dynamically constructed by Zhang et al. [26] to effectively obtain interactions of pedestrians.
The Edge-Enhanced Graph Convolutional Neural Network (EGCN) proposed by Jeon et al. [27] is inherently scalable to graph nodes. In this model, frame sequences are constructed as a fully connected attention graph, in which pedestrian features involve interaction features and spatial location. The main contributions in this paper are summarized as follows:

Methods
The structure of our proposed model is shown in Figure 2. The model consists of an encoder and a decoder. The encoder mainly includes FUM and TSG-LSTM (TS-LSTM and TG-LSTM), and P-LSTM is the decoder. FUM, TSG-LSTM, and P-LSTM are the special designs of this paper. In Figure 2, FUM is shown in the red box, TS-LSTM in the blue box, and TG-LSTM in the pink box. P-LSTM is shown in Figure 3. During encoding, the spatial relationship of pedestrians is encoded by FUM, while TSG-LSTM encodes the historical trajectory of pedestrians in the temporal domain. Before FUM, TS-LSTM encodes each pedestrian individually. After FUM, pedestrians already carry interaction information about other pedestrians. Therefore, TG-LSTM encodes the interaction relationship of pedestrians. In decoding, P-LSTM predicts the future trajectory of pedestrians based on the output of the encoder.

Encoding for a Single Pedestrian by TS-LSTM.
Long Short-Term Memory (LSTM) networks have been shown to learn and infer the attributes of a sequence, which makes them suitable for predicting pedestrian trajectory [3-5, 9, 28]. For observed sequences, one LSTM is denoted as TS-LSTM to encode the change of one pedestrian's movement state at different time steps. For pedestrian $i$, firstly, the coordinate $(x_i^t, y_i^t)$ of the pedestrian at time step $t$ is embedded into a fixed-length vector $v_i^t$ by an embedding function $\omega$. The definition is shown in equation (2). Secondly, the vector is used as the input to TS-LSTM. Thirdly, TS-LSTM is used to calculate the hidden state of the LSTM cell; see equation (3). $W$ is a shared parameter and $h_i^t$ is the output. The difference between TS-LSTM and LSTM is that a residual connection is added after the output. The purpose of this design is to better combine the current position feature of each pedestrian, to ensure that historical information is not lost, and to achieve better information transmission.

FUM for Spatial Interaction Modeling.
During pedestrian movement, the change of trajectory mainly comes from interactions with surrounding pedestrians. Therefore, it is not enough to encode a single person's motion state by TS-LSTM. To share information across pedestrians in a crowded scene, FUM is proposed in the spatial domain to treat pedestrians as nodes of a graph at each time step. FUM consists of FU and GAT. FU is the innovation that computes global interactions of nodes. GAT follows a self-attention mechanism to define the importance of neighbors. The algorithm flow of FUM is shown in Algorithm 1.
Feature Updating (FU). In equation (3), $h_i^t$ only represents the features of a single pedestrian, so the interaction between pedestrians cannot be shared. To achieve global interaction and increase the input range of the graph attention network, FU is defined. For a target pedestrian $i$, the function of FU is to update the interaction features of node $i$ by a weighted fusion of all node features. The weight measures the intimacy between node $i$ and the other nodes. At time $t$, the relevant quantities for node $i$ and node $j$ are defined as follows. $d(h_i^t, h_j^t)$ is a function to calculate the intimacy for any two nodes, so it increases the receptive field of the model. $s(h_j^t)$ is a display function to compute the features of node $j$. The final output $z_i^t$ is defined by a residual connection. $w_s$ and $w_z$ are weight parameters to learn. There are four definitions of $d(h_i^t, h_j^t)$ in equation (8), and the ablation experiments in Section 4.2 verify their effectiveness. The detailed calculation of $Z$ is shown in Figure 4.
Graph Attention Network. In the spatial domain, for a graph $G = (P, L)$, $P$ represents the set of pedestrians and $L$ represents the human-human interactions at time step $t$ ($t = 1, \ldots, T_m$). If there is a connection between two pedestrians, $l_{ij}^t$ equals 1, otherwise 0. The adjacency matrix $A \in \mathbb{R}^{N \times N}$ is constructed according to whether there are connecting edges among pedestrians. Because a fully connected graph is constructed at time $t$ and all pedestrians are assumed to be connected, $A_{ij} = 1$ if node $j$ is a neighbor of node $i$, otherwise 0. In the temporal domain, there are connecting edges between pedestrians with the same ID. Given an observed sequence, this spatial-temporal construction turns the relationships of pedestrians into the spatiotemporal graph in Figure 5. For a spatial graph $G$ at time step $t$, the features of pedestrians are aggregated by graph convolution. Figure 5 also illustrates the process of graph convolution and the distribution of attention.
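The Feature Updating step described earlier in this subsection can be illustrated with a minimal sketch. It assumes the embedded Gaussian form of the intimacy function (a softmax over dot products of two learned projections), and it treats $\theta$, $\phi$, and $s$ as per-node linear maps. The function and parameter names (`feature_update`, `w_theta`, `w_phi`) are hypothetical, and the exact equations of the paper are not reproduced here.

```python
import math

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wk * xk for wk, xk in zip(row, x)) for row in w]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feature_update(h, w_theta, w_phi, w_s, w_z):
    """Update each node with a weighted fusion of ALL node features.

    h is a list of N feature vectors (one per pedestrian). The intimacy
    weights are an assumed embedded Gaussian: softmax_j(theta(h_i) . phi(h_j)).
    """
    theta = [matvec(w_theta, hi) for hi in h]
    phi = [matvec(w_phi, hj) for hj in h]
    s = [matvec(w_s, hj) for hj in h]
    n, dim = len(h), len(s[0])
    z = []
    for i in range(n):
        # Intimacy between node i and every node j (global receptive field).
        scores = [sum(a * b for a, b in zip(theta[i], phi[j])) for j in range(n)]
        weights = softmax(scores)
        fused = [sum(weights[j] * s[j][d] for j in range(n)) for d in range(dim)]
        out = matvec(w_z, fused)
        # Residual connection: add the original feature of node i.
        z.append([o + hd for o, hd in zip(out, h[i])])
    return z
```

Because every node contributes to the weighted sum, the update is not restricted to spatial neighbors, which is the property the text refers to as an enlarged receptive field. Setting `w_z` to zero reduces the update to the identity, which shows the role of the residual path.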
A graph convolution network has many convolution layers. In one layer of graph convolution, suppose $Z^{(l)} \in \mathbb{R}^{N \times D_l}$ represents the feature matrix of $N$ pedestrians at the $l$th layer, where $D_l$ is the feature dimension. The output of graph convolution can be written as equation (9), where $\tilde{A} = A + I$, $I$ is the identity matrix that adds self-connections, and $\sigma$ is an activation function. The function of the trainable weight matrix $W \in \mathbb{R}^{D_l \times D_{l+1}}$ is to transform the dimension.
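One layer of graph convolution as described above can be sketched as follows. The sketch assumes the plain form $\sigma(\tilde{A} Z^{(l)} W)$ with no degree normalization and uses ReLU for $\sigma$; both choices are assumptions, since the text does not state them. The helper names `matmul` and `gcn_layer` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, z, w):
    """One graph convolution layer: ReLU((A + I) Z W).

    adj: N x N adjacency matrix, z: N x D_l features, w: D_l x D_{l+1} weights.
    ReLU and the absence of normalization are assumptions of this sketch.
    """
    n = len(adj)
    # Add self-connections so each node keeps its own features.
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    out = matmul(matmul(a_tilde, z), w)
    return [[max(0.0, v) for v in row] for row in out]
```

Every nonzero entry of the adjacency matrix contributes with the same weight, which is why the text introduces a separate attention matrix to express connection strength.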
It can be seen from equation (9) that the adjacency matrix $A$ is only used to define whether there is a connection between two nodes and cannot express connection strength. Therefore, an attention matrix $B$ needs to be defined to show the connection strength of any two nodes. During the observed period, the node features at each time step $t = 1, \ldots, T_m$ are fed to a graph convolution layer. The attention coefficient of the node pair $(i, j)$ can be computed by equation (10), where $T$ represents transposition, $a \in \mathbb{R}^{2D'}$ is the weight vector of a single-layer perceptron, $W_t \in \mathbb{R}^{D' \times D}$, $\|$ is the concatenation operation, and $N_i$ represents the neighbors of node $i$ in the graph. At time step $t$, the attention coefficients form the attention matrix $B$.
Output of FUM. For the observed sequence, after graph attention convolution, the final output of FUM is given by equation (11), in which a softmax operation is applied for each node $i$ ($i = 1, \ldots, N$) and $\odot$ represents element-wise multiplication. $z_i^t$ in $Z^{(l+1)}$ is concatenated by multi-head attention. The number of attention heads is 4. FUM can have multiple FU blocks, as shown in Figure 6, and the ablation experiments in Section 4.2 determine the number of blocks.
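The attention coefficient can be illustrated with a short sketch that follows the standard GAT formulation [7]: a score $a^T [W h_i \| W h_j]$ is passed through LeakyReLU and normalized by a softmax over the neighbors $N_i$. The LeakyReLU slope of 0.2 and the function name `attention_coefficients` are assumptions; the sketch covers a single attention head only.

```python
import math

def leaky_relu(x, slope=0.2):
    """LeakyReLU; the slope value is an assumption of this sketch."""
    return x if x > 0 else slope * x

def attention_coefficients(h, w, a, neighbors):
    """Single-head GAT attention.

    h: list of N feature vectors of size D, w: D' x D weight matrix,
    a: weight vector of size 2 * D', neighbors: dict mapping i to its N_i.
    Returns a dict {(i, j): alpha_ij} whose values sum to 1 over j for each i.
    """
    # Project every node feature with the shared weight matrix W.
    wh = [[sum(row[k] * hi[k] for k in range(len(hi))) for row in w] for hi in h]
    alpha = {}
    for i, nbrs in neighbors.items():
        # Score each pair from the concatenation [W h_i || W h_j].
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, wh[i] + wh[j])))
                  for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        for j, e in zip(nbrs, exps):
            alpha[(i, j)] = e / total
    return alpha
```

With a zero vector `a`, all neighbors receive equal weight, which corresponds to the equal weighting of a plain graph convolution; a learned `a` lets the coefficients differ between neighbors.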

TG-LSTM for Temporal Interaction Modeling.
After FUM, pedestrians already have interaction information in the spatial domain. However, it is still necessary to encode the historical trajectory of each pedestrian in the temporal domain. Similar to TS-LSTM, TG-LSTM is proposed. In this way, the spatial and temporal information can be fused. The definition of TG-LSTM is shown in equation (12), where $z_i^t$ is the input and comes from equation (11), $W_g$ is a shared weight of TG-LSTM, and $g_i^t$ is the output. Affected by the surrounding complex environment, pedestrian trajectory is uncertain. To simulate pedestrian trajectory in a real environment, during training, noise $u$ is randomly sampled from the standard normal distribution $N(0, 1)$ for each pedestrian. In complex interaction scenarios, trajectory prediction depends not only on the target pedestrian himself, but also on the historical movements of surrounding pedestrians.
Then, the single motion state from TS-LSTM, the interactive state from TG-LSTM, and the noise $u$ are concatenated to complete the encoding. Therefore, at time step $t$, the observed trajectory is finally encoded as defined in equation (13).
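The concatenation described above can be sketched as follows. The sizes (32 for each LSTM state and 16 for the noise) are taken from the implementation details reported later in the paper; the order of concatenation and the function name `encode_state` are assumptions.

```python
import random

def encode_state(h_single, g_interaction, noise_dim=16, rng=None):
    """Concatenate individual state, interaction state, and Gaussian noise.

    h_single: TS-LSTM state, g_interaction: TG-LSTM state (both lists).
    The noise is drawn from a standard normal distribution N(0, 1).
    The concatenation order is an assumption of this sketch.
    """
    rng = rng or random.Random()
    u = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return h_single + g_interaction + u

# Example with 32-dimensional states and 16-dimensional noise.
encoded = encode_state([0.0] * 32, [0.0] * 32, rng=random.Random(0))
```

Under these sizes the encoded vector has 32 + 32 + 16 = 80 entries, and resampling the noise yields a different encoding for the same observed trajectory.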

P-LSTM for Trajectory Prediction.
For pedestrian trajectory prediction, the current state of a pedestrian can reflect his movement intention in the future. To enhance information dependence at the current moment, a residual connection is also required. This can not only improve prediction performance, but also alleviate the problem that the prediction accuracy decreases when the prediction length increases. Figure 3 shows the structure of P-LSTM for the trajectories of three pedestrians. The prediction is defined in equations (14) and (15), where $e_i^{T_m}$ is the initial state of P-LSTM, which is derived from equation (13), $v_i^{T_m}$ is from equation (2), $W_e$ is an updatable weight, $\delta_e$ represents a multilayer perceptron operation, and (x x

Definition of Loss Function.
To make predictions respond to changes in the environment and to improve the accuracy of trajectory prediction, the variety (diverse) loss proposed by Gupta et al. [5] is used to model the multimodal nature of pedestrian movement. The definition of the loss is shown in equation (16). During training, different Gaussian noise $u$ is sampled to produce $k$ results in one prediction. The L2 distance is calculated $k$ times, and the minimum value is taken as the loss. $Y_i$ is the actual trajectory, $\hat{Y}_i$ is the predicted trajectory, and $k$ is a hyperparameter. In this paper, $k = 20$.
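The loss described above can be illustrated with a minimal sketch for one pedestrian. It assumes that the L2 distance is the Euclidean norm of the difference between the whole predicted and actual trajectories; whether the norm is taken per time step or over the whole trajectory is not stated in the text, and the function names are hypothetical.

```python
import math

def l2_distance(pred, truth):
    """Euclidean norm of the difference between two trajectories.

    Each trajectory is a list of (x, y) positions of equal length.
    """
    return math.sqrt(sum((px - tx) ** 2 + (py - ty) ** 2
                         for (px, py), (tx, ty) in zip(pred, truth)))

def variety_loss(samples, truth):
    """Minimum L2 distance over the k sampled predictions."""
    return min(l2_distance(pred, truth) for pred in samples)

# Two sampled predictions (k = 2 here for brevity; the paper uses k = 20).
truth = [(0.0, 0.0), (1.0, 0.0)]
samples = [[(3.0, 4.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 0.0)]]
best = variety_loss(samples, truth)
```

Only the best of the $k$ samples is penalized, so the remaining samples are free to cover other plausible futures.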

Experiments and Results Analysis
In this section, Section 4.1 first introduces the experimental settings. Next, ablation experiments for FUM and residual connection are presented in Section 4.2. Then, our model EGAT is compared with other models in Section 4.3. Finally, experimental results of our proposed model are analyzed in Section 4.4.

Experiment Settings.
The experiment settings include datasets, evaluation metrics, and implementation details.

Datasets.
The model is evaluated on two pedestrian trajectory datasets: ETH [12] and UCY [29]. ETH includes two scenes: ETH and HOTEL. UCY consists of three scenes: ZARA1, ZARA2, and UNIV. The original dataset of each scene is a video shot from an aerial view, which involves many complex situations, such as pedestrians walking, pedestrians standing and talking, and complex environments. These datasets have 2206 human motion trajectories. All the data has been converted to world coordinates and the trajectory is sampled every 0.4 seconds. When training on the five scene datasets, following previous studies [3,5,9], the leave-one-out method is adopted. The model is trained on four scenes and tested on the remaining one. The observed trajectory is 3.2 seconds (8 time steps), and the predicted trajectory is 4.8 seconds (12 time steps).

Figure 4: The structure of FU to calculate Z. H is the initial input from equation (3). b is the sequence length and n is the number of pedestrians. f is the embedded dimension, which is defined as 32 dimensions. 4 represents the number of attention heads. θ, ϕ, and s are convolution operations. θ and ϕ calculate the intimacy between nodes, that is, the weight of adjacent nodes. s is a display function, which is used to calculate the feature of adjacent nodes. + and ⊗ represent addition and multiplication of matrices, respectively. BN denotes batch normalization. Best viewed in color.
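The leave-one-out protocol and the observation and prediction windows of the Datasets paragraph can be sketched as follows. The scene names and the numbers come from the text; the function names and the representation of a track as a list of positions are assumptions.

```python
SCENES = ["ETH", "HOTEL", "UNIV", "ZARA1", "ZARA2"]
OBS_LEN = 8     # observed time steps
PRED_LEN = 12   # predicted time steps
DT = 0.4        # sampling interval in seconds

def leave_one_out(scenes):
    """Return (training scenes, held-out test scene) for every scene."""
    return [([s for s in scenes if s != held], held) for held in scenes]

def split_trajectory(track):
    """Split one track into observed and future parts.

    track is a list of (x, y) positions sampled every DT seconds.
    """
    if len(track) < OBS_LEN + PRED_LEN:
        raise ValueError("track is shorter than OBS_LEN + PRED_LEN")
    return track[:OBS_LEN], track[OBS_LEN:OBS_LEN + PRED_LEN]

# Five train/test splits, each training on four scenes.
splits = leave_one_out(SCENES)
```

With a 0.4 second sampling interval, 8 observed steps span 3.2 seconds and 12 predicted steps span 4.8 seconds, matching the durations quoted in the text.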

Evaluation Metrics.
There are two metrics to evaluate the model's performance: the average displacement error (ADE) and the final displacement error (FDE). Their definitions are shown in equations (17) and (18). Specifically, ADE evaluates the average prediction performance over all predicted time steps, while FDE only considers the accuracy at the final predicted time step. The smaller the value of the two metrics, the better the prediction results.
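The two metrics can be illustrated with a minimal sketch based on their standard definitions: ADE averages the Euclidean distance between predicted and actual positions over all pedestrians and all predicted time steps, and FDE averages the distance at the final predicted time step over all pedestrians. The function names and the list-based data layout are assumptions.

```python
import math

def ade(preds, truths):
    """Average displacement error over all pedestrians and time steps."""
    dists = [math.hypot(px - tx, py - ty)
             for pred, truth in zip(preds, truths)
             for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists)

def fde(preds, truths):
    """Final displacement error, averaged over pedestrians."""
    dists = [math.hypot(pred[-1][0] - truth[-1][0], pred[-1][1] - truth[-1][1])
             for pred, truth in zip(preds, truths)]
    return sum(dists) / len(dists)

# One pedestrian, two predicted steps: errors of 0 and 5 meters.
preds = [[(0.0, 0.0), (3.0, 4.0)]]
truths = [[(0.0, 0.0), (0.0, 0.0)]]
```

In this example the ADE is 2.5 and the FDE is 5.0, which shows how a prediction that drifts over time is penalized more strongly by FDE than by ADE.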

Implementation Details.
The proposed network EGAT is implemented in Python with the PyTorch 1.2 framework and trained with two NVIDIA GeForce GTX-1080 GPUs. The learning rates for the different datasets are shown in Table 1. The Adam optimizer is used and the batch size is 64. TSG-LSTM and P-LSTM have only one layer. The size of the hidden state and of the output of TSG-LSTM is 32 dimensions. The embedded vector $v_i^t$ has 32 dimensions. FUM has two layers, and its input is normalized. The size of the noise $u$ is set to 16 dimensions.

Training Visualization.
The trends of Loss, ADE, and FDE during training are shown in Figure 7. The change of Loss shows that the training process is divided into three stages: the first 15% of epochs are used to encode a single pedestrian by TS-LSTM, the epochs from 15% to 25% are used to train FUM and TG-LSTM, and the remaining epochs decode based on the previous encoding to predict the trajectory. During the first 25% of epochs, the model is only encoding and has not yet predicted a trajectory. In this case, the displacement error between the predicted trajectory and the ground truth, which is measured by ADE and FDE, cannot be calculated. Therefore, ADE and FDE have no curves in Figure 7 for the first 25% of epochs.

Ablation Study.
In this section, the ablation experiments of FUM and residual connection are carried out. For FUM, the intimacy function and FU blocks are studied. For residual connection, the experimental performance of TSG-LSTM and P-LSTM is verified, and the influence of residual connection on the model is compared on all datasets.

Ablation Study of FUM.
To evaluate the effectiveness of FUM, the ablation experiments are as follows.
(1) Baseline. STGAT-20V-20 is directly applied to predict pedestrian trajectory without FUM and LSTM residual connection, and the prediction length is 8.
(2) Intimacy Function.
The results in Table 2, based on the embedded Gaussian function, show that the model performs best when the number of blocks is 4. With more blocks, the performance decreases. This is because node information can be transmitted back and forth over a long distance; after more blocks, the feature information becomes smooth.

Temporal Residual.
In the output of TSG-LSTM and P-LSTM, residual connection is designed separately to enhance the transmission and combination of feature information. Six methods are compared in Table 3.
(1) Baseline. STGAT-20V-20.
In Table 3, the experimental results of TG-LSTM are better than those of TS-LSTM, but the personal information is lost.
(5) FUM + TSG-LSTM. TSG-LSTM is a combination of TS-LSTM and TG-LSTM. It contains not only personal information but also interactive information, so the experimental performance is further improved. The ablation results in Table 3 demonstrate the significance of TSG-LSTM.
(6) FUM + TSG-LSTM + P-LSTM (EGAT). After adding FUM and residual connection, as can be seen from the last row of Table 3, our model EGAT applies P-LSTM to enhance the transmission of each pedestrian's current information during prediction, so the experimental performance is the best.
Table 4 compares the model with designed residual connections (EGAT) and without them (UN-EGAT). Experimental results show that the average values of ADE and FDE are reduced by 20% and 17%, respectively, by adding residual connections. The lower the value, the better the network performance.

Comparison with the State-of-the-Art.
The comparison between EGAT and other models is based on the five scenes of ETH and UCY, using the evaluation metrics ADE and FDE with a prediction length of 12. The experimental results show that the performance of the proposed EGAT model is better than most of the methods.

Evaluation Metrics Analysis.
The proposed model is compared with the state-of-the-art models in Table 5. STGAT-20V-20 is considered as the baseline model. EGAT is superior to STGAT-20V-20 on all datasets. The values of ADE and FDE in ETH and HOTEL, ADE in ZARA2, and AVG ADE are the best among the models listed in Table 5. The other values are close to the optimal values. There are two reasons why the optimal value is not reached. In UNIV, pedestrians are dense, and the environment is complex. The interaction between pedestrians is affected by many factors, such as motion speed, motion direction, motion state, and so on. These factors affect the prediction accuracy of the model. In ZARA1, the trajectory of pedestrians is often affected by the surrounding pedestrians and obstacles, which may change or limit human activities, resulting in the model being unable to capture more social interactions.

Inference Time and Parameters.
The results of all models are obtained on two NVIDIA GeForce GTX-1080 GPUs. As can be seen from Table 6, EGAT is superior to some models. Regarding inference time (in seconds), EGAT uses residual connections to concatenate the individual state and the interactive state, which increases the inference time. As for parameters, EGAT has slightly more parameters than STGAT, because the intimacy with all nodes on a graph needs to be calculated.

In this section, the visualization results of attention and predicted trajectory are analyzed, the existing problems are described, and future research directions are discussed.

Attention Visualization.
It is found that the difference of attention allocation between EGAT and STGAT is mainly reflected in the last four time steps. Therefore, Figure 8 compares the changes of attention in four time steps. The purple star annotates the difference between EGAT and STGAT. Through comparison, it can be found that EGAT can more successfully reflect the importance of pedestrians, which is closer to a real social scene. In (a) and (b), the pedestrian marked by a purple star in EGAT has the greatest impact on the target pedestrian's trajectory, which is more accurate than STGAT. In (c), STGAT pays more attention to a stationary pedestrian, which is contrary to reality. However, EGAT correctly judges the stationary pedestrian (left purple star), allocates little attention to it, and focuses on the movement of adjacent pedestrians (right purple star). The visualization shows that EGAT can expand the receptive field, obtain global feature information, and enhance information transmission.

Predicted Trajectory.
The visual results of trajectory prediction for EGAT, STGAT, and S-LSTM are shown in Figure 9. Four scenarios are compared. It can be seen that the prediction performance of EGAT is the best among the three models. Group (a) compares the movement of two pedestrians. Judging by the overlap of the ground truth and the predicted trajectory, EGAT achieves better prediction whether the two pedestrians walk in parallel or cross. In group (b), both EGAT and STGAT can produce reasonable trajectories to avoid collision. On closer inspection, EGAT's predicted trajectory is closer to the real trajectory, whereas S-LSTM performs poorly. Group (c) shows scenes of group walking, including parallel walking and meeting. Although the scene is complex, EGAT gives a more accurate prediction. Group (d) focuses on scenes of nonlinear walking. In the first three pictures, from top to bottom, the second trajectory turns. EGAT successfully captures the intention of the pedestrian and accurately predicts the turn. STGAT and S-LSTM only model local interaction between pedestrians, and the predicted trajectory is still straight, which makes the results differ from the real trajectory. Similarly, in the following three pictures, the two pedestrians next to the car turn smoothly in EGAT, while STGAT captures the turn to some extent but is not as accurate as EGAT. The trajectory generated by S-LSTM is not satisfactory. In short, the proposed model EGAT can not only predict linear motion successfully, but also capture nonlinear motion reasonably, and its performance is better.

Problems and Research Direction.
First, the structure of GCN is shallow: experimental results show that if GCN has more than two network layers, the performance declines [31]. The reason is that if layers are stacked too deep, the features of each node in the graph become excessively smooth. Therefore, the number of graph convolution layers is usually only two to three, and the network structure cannot be deepened vertically. Second, the extension of attention is not sufficient. Because the datasets contain only a single type of feature information, the model still focuses attention on spatial distance even though it improves the receptive field, and the fusion of information such as the walking direction and speed of pedestrians is not enough. A target pedestrian generally pays more attention to pedestrians in front; however, in Figures 10(a) and 10(b), the pedestrians marked by a red triangle receive more attention although they are all located behind the target pedestrian. This is mainly because the model does not incorporate direction information. Moreover, when there is a great number of pedestrians at the same time, the calculation of node intimacy reduces the difference between pedestrians, and the attention easily becomes uniformly distributed, as shown in Figure 10(c). Therefore, future research will focus on deeper graph convolution, the fusion of additional information, and the improvement of the model's generalization ability.

Conclusion
A novel EGAT framework is proposed in this paper, which can predict pedestrian trajectory in different scenes. EGAT not only improves the receptive field of the model, but also improves the prediction performance when the prediction length increases. During encoding, Graph Attention Network is extended to model human-human interactions in the spatial domain, and the historical trajectory of pedestrians is encoded by TSG-LSTM in the temporal domain. When decoding, P-LSTM predicts the pedestrian trajectory based on observed trajectories. EGAT is superior to STGAT on two public datasets. The experimental results show that EGAT can allocate reasonable weights to pedestrians according to their motion states, and the model can get more accurate trajectories.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.