Abstract

Pedestrian trajectory prediction is an essential but challenging task. Social interactions between pedestrians have an immense impact on trajectories. A better way to model social interactions generally achieves a more accurate trajectory prediction. To comprehensively model the interactions between pedestrians, we propose a multilevel dynamic spatiotemporal digraph convolutional network (MDST-DGCN). It consists of three parts: a motion encoder to capture the pedestrians’ specific motion features, a multilevel dynamic spatiotemporal directed graph encoder (MDST-DGEN) to capture the social interaction features of multiple levels and adaptively fuse them, and a motion decoder to produce the future trajectories. Experimental results on public datasets demonstrate that our model achieves state-of-the-art results in both long-term and short-term predictions for both high-density and low-density crowds.

1. Introduction

The task of pedestrian trajectory prediction is to predict pedestrians' future trajectories given their historical trajectories in the scenario. Pedestrian trajectory prediction plays a notable role in many applications, such as automatic driving [1] and robot navigation [2–5]. Considering only the historical trajectory of the target pedestrian is not enough to predict an accurate trajectory. Other pedestrians' influences on the target pedestrian, which are called "social interaction features," can often help make a better prediction. As the prediction horizon lengthens and crowds become denser, the temporal correlations between current and previous time steps in a trajectory grow weaker, and the impact of interactions on pedestrians' motion grows stronger.

To model social interactions, traditional methods use rule-based functions [6–10]. While rule-based methods can only capture simple interactions, data-driven methods use neural networks to automatically extract social interaction features from data, which exploits interactions more effectively. Many data-driven methods obtain social interaction features through pooling [11–14] or attention mechanisms [1, 15–20]. Graph convolutional neural networks have developed rapidly in recent years, and the graph structure is naturally suitable for directly describing the interactions between pedestrians. As a result, graph convolutional neural networks [21–25] have achieved excellent results in pedestrian trajectory prediction.

Although many methods are based on graph convolutional neural networks, they do not make full use of the graph structure. For example, Social-BiGAT [21] only uses the graph representation as a pooling mechanism on the states of recurrent neural networks. The more recent STGAT [22] and Social-STGCNN [23] construct spatiotemporal graphs to model social interactions and achieve excellent prediction results.

However, they ignore a crucial point: even if the social interactions with nearby pedestrians and distant pedestrians are of the same type, they result in different actions of the target pedestrian. As shown in Figure 1, when the target pedestrian marked with the red circle avoids a nearby pedestrian and a distant pedestrian at two different time steps, his avoidance movements differ. The former is a sudden avoidance producing a trajectory with high curvature, while the latter is an early avoidance producing a trajectory with low curvature. Moreover, as the prediction horizon increases, pedestrians far from the target pedestrian may become more important. Over a short interval, pedestrian B has little impact on the target pedestrian, but over the whole period shown in Figure 1, their merging is the main factor affecting the target pedestrian's trajectory. In other words, the influence of nearby pedestrians is mainly sudden and short-term, while faraway pedestrians have long-term effects on the target pedestrian's movement tendency.

Most previous methods [21–25] use a single graph to model these two types of influences and thus tend to capture "average" social interaction features. However, these two types of influences are better modelled separately at different levels of a multilevel graph. Moreover, many methods [23, 25] build an undirected graph to model social interactions, yet social interactions between pedestrians are asymmetric, so a digraph is more suitable. Other methods [24] build a directed graph by predefined rules, such as inserting edges from all people inside the view area, but predefined rules are incomplete. For example, a pedestrian may slow down to wait for his companion without looking at him. Thus, building directed edges in a data-driven way is preferable.

To address the limitations of these works, we propose a multilevel dynamic spatiotemporal directed graph representation to model the interactions between pedestrians comprehensively. In our graph, different levels model interactions of pedestrians at different distance ranges. As shown in Figure 1, whether there is a spatial edge from a pedestrian to the target pedestrian at a level depends on whether their distance falls within the corresponding distance range. As time passes, the spatial edge between two pedestrians may break at one level and link at another. Even if the edge keeps linking at the same level, the influence of the neighbour also changes dynamically over time. To process the multilevel graph, we propose a multilevel dynamic spatiotemporal digraph convolutional network (MDST-DGCN). At each level of the graph, we use a node aggregator architecture to generate social interaction embeddings by sampling and aggregating features from a node's spatial neighbourhood, as in GraphSAGE [26]. Because social interactions are location independent, we perform an aligning operation before aggregating features, which improves performance significantly. Through the orderly use of sampling, aligning, and aggregating, the aggregator architecture becomes a natural data-driven way to describe a directed edge. For each level of the graph, after the spatial interactions are captured, an LSTM [27] is used to capture the temporal correlations of interactions. MDST-DGCN then fuses the interaction features of all levels adaptively. By modelling social interactions at different levels, MDST-DGCN can fully extract pedestrians' social interaction features.

In summary, our contribution is twofold. First, we propose a multilevel dynamic spatiotemporal graph that separates pedestrian nodes by distance, so that the different effects of near and far pedestrians on the trajectory are modelled at different levels, which aids the extraction of social interaction features. Second, we create an aggregator based on GraphSAGE that converts the original static adjacency graph structure into a dynamic directed graph structure through sampling, aligning, and aggregating, reducing the effect of absolute coordinates on the model and mitigating overfitting. We verify the performance of the model on widely used pedestrian trajectory datasets. The experimental results show that our model achieves state-of-the-art results in both long-term and short-term predictions for both high-density and low-density crowds.

2. Related Work

2.1. Pedestrian Trajectory Prediction

Pedestrian trajectory prediction has become a focal task in recent years, and corresponding solutions have been springing up. Comprehensively modelling the interactions between pedestrians is a crucial point to obtain better prediction results.

Traditionally, researchers created hand-crafted functions [3, 6–10] to predict trajectories, but hand-crafted functions are limited and thus unable to model all types of social interactions. Recently, deep learning-based methods have become popular because they can learn to model various interactions from data.

Some researchers designed their methods based on pooling mechanisms [11–14] to capture dependencies between pedestrians. S-LSTM [11] introduces a "social" pooling layer which allows the LSTMs of spatially proximal sequences to share their hidden states with each other. Group-LSTM [12] adjusts the pooling layer by dropping the information of pedestrians who are moving coherently with the target pedestrian. MX-LSTM [13] has a pooling layer that exploits the Vislet information. The above three pooling methods only consider the pedestrians in a local area and fuse their features by averaging, while SGAN [14] introduces a pooling module that considers all pedestrians in a computationally efficient way and adaptively selects their features with a max-pooling operation.

While most pooling-based methods treat pedestrians equally, attention-based methods [1, 15–20] assign different weights to interactive pedestrians. Most of these methods [1, 11–14, 16–20] assign an LSTM to each pedestrian, and the pooling or attention mechanisms usually operate on the hidden states of the pedestrians' LSTMs to adaptively fuse other pedestrians' motion features with those of the target pedestrian. More recently, STAR [28] captures complex spatiotemporal interactions by interleaving spatial and temporal transformers [29].

As the graph structure is naturally suitable for directly describing the interactions between pedestrians, graph convolutional neural networks have been introduced to this task. Social-BiGAT [21] replaced the pooling mechanisms with a graph attention network, which also operates on the hidden states of LSTMs. In other words, Social-BiGAT did not model the whole duration of the crowds' interactions as a spatiotemporal graph but only used the graph attention network to capture spatial social interactions. Social-STGCNN [23] and STGAT [22] both constructed spatiotemporal graphs to model social interactions. However, the graph of Social-STGCNN is a complete undirected graph, which does not conform to the asymmetry of pedestrian interactions. Zhang et al. [24] built a directed graph by inserting edges from all people inside the view area. However, all of these graphs model all the social interactions at only one level. Instead, we build a multilevel dynamic spatiotemporal directed graph to overcome their limitations.

2.2. Graph Convolutional Neural Network

The graph convolutional neural network is an emerging topic in deep learning research, and it provides a practical approach to processing graph data with non-grid structures. Graph convolutional neural networks can be divided into spectral approaches [30–32] and spatial approaches [26, 33, 34]. Spectral approaches work with a spectral representation of the graphs, while spatial approaches define convolutions directly on the graph, operating on groups of spatially close neighbours. The filters learned by spectral approaches depend on the Laplacian eigenbasis, which in turn depends on the graph structure, so a model trained on a specific structure cannot be directly applied to a graph with a different structure. However, the graph used to model pedestrians' social interactions changes with time. Thus, spectral approaches are not suitable for pedestrian trajectory prediction, and our approach belongs to the spatial approaches.

In fact, our approach follows the methodology of GraphSAGE [26]. However, our graph is a multilevel dynamic spatiotemporal directed graph, while GraphSAGE can only process a fixed spatial graph without multiple levels. ST-GCN [34] built a dynamic spatiotemporal graph to automatically learn both the spatial and temporal patterns of human actions for skeleton-based action recognition. Social-STGCNN [23], a variant of ST-GCN that builds a single-level undirected graph to model all the social interactions, has achieved excellent results in pedestrian trajectory prediction.

3. Methods

3.1. Problem Definition

Given the historical trajectories of all pedestrians in the scenario, the task of trajectory prediction is to predict their future trajectories simultaneously. The notations $p_1, p_2, \ldots, p_N$ represent the $N$ pedestrians in the scenario. The position of a specific pedestrian $p_i$ at any historical time step $t \in \{1, \ldots, T_{\mathrm{obs}}\}$ is defined as $(x_i^t, y_i^t)$. Our goal is to predict the positions of pedestrians at any future time step $t \in \{T_{\mathrm{obs}}+1, \ldots, T_{\mathrm{pred}}\}$; for a specific pedestrian $p_i$, the predicted position is denoted as $(\hat{x}_i^t, \hat{y}_i^t)$, while the ground truth is defined as $(x_i^t, y_i^t)$. The first-order difference trajectory of a pedestrian $p_i$ is defined as $\{(\Delta x_i^t, \Delta y_i^t) \mid t = 2, \ldots, T_{\mathrm{obs}}\}$, where $\Delta x_i^t = x_i^t - x_i^{t-1}$ and $\Delta y_i^t = y_i^t - y_i^{t-1}$.
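For concreteness, the first-order difference input can be computed as below. This is a minimal sketch under the definition above, assuming trajectories stored as NumPy arrays; the function name is ours, not from the paper's released code.

```python
import numpy as np

def first_order_difference(traj):
    """Turn an absolute trajectory of shape (T, 2) into position
    offsets of shape (T - 1, 2), i.e. the pairs (dx_i^t, dy_i^t)."""
    return traj[1:] - traj[:-1]

# A pedestrian walking diagonally at constant speed.
traj = np.array([[0.0, 0.0], [0.4, 0.3], [0.8, 0.6]])
print(first_order_difference(traj))  # [[0.4 0.3] [0.4 0.3]]
```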

3.2. Overall Model

As shown in Figure 2(b), MDST-DGCN consists of three parts: a motion encoder, a multilevel dynamic spatiotemporal directed graph encoder (MDST-DGEN), and a motion decoder. The motion encoder is used to capture the pedestrian-specific motion features, and the MDST-DGEN is used to capture the social interaction features. We construct a multilevel dynamic spatiotemporal digraph processed by the MDST-DGEN to model the social interactions between pedestrians. After the motion features and social interaction features are extracted, they are fed into the motion decoder to predict future trajectories.

3.3. Graph Construction

We construct a multilevel dynamic spatiotemporal directed graph to model the multilevel social interactions between pedestrians. The nodes of the graph are the pedestrians in the scenario. Given the hyperparameter level distance list $(d_1, d_2, \ldots, d_L)$, we construct a graph with $L$ levels. At each time step, if the distance from node $v_j$ to node $v_i$ is more than $d_{l-1}$ and less than $d_l$, a spatial edge from $v_j$ to $v_i$ will exist at the $l$-th level. Specifically, at the first level, a spatial edge exists when the distance is less than $d_1$. For each node at all levels, we add a loop spatial edge. Figure 2(a) shows how to build a two-level spatial graph with a two-entry level distance list at a certain time step. In addition to spatial edges, there are temporal edges, which connect the same pedestrians in consecutive frames. If there is only one level and $d_1 = \infty$, the graph degrades into a complete graph, which has the same structure as that of STGAT. At time step $t$, the attribute of node $v_i$ is the position $(x_i^t, y_i^t)$ of pedestrian $p_i$.
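To illustrate the level assignment, the sketch below builds one boolean adjacency mask per level from pairwise distances at a single time step. It is our own illustration of the rule above (names such as `level_neighbour_masks` are hypothetical), not the authors' code.

```python
import numpy as np

def level_neighbour_masks(positions, distance_list):
    """positions: (N, 2) pedestrian positions at one time step.
    distance_list: increasing thresholds (d_1, ..., d_L).
    Returns a list of (N, N) boolean masks; mask[i, j] is True when
    there is a spatial edge from node j to node i at that level."""
    diff = positions[None, :, :] - positions[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)          # dist[i, j] = ||p_i - p_j||
    masks, lower = [], 0.0
    for upper in distance_list:
        mask = (dist > lower) & (dist <= upper)   # level keeps (d_{l-1}, d_l]
        np.fill_diagonal(mask, True)              # loop spatial edge per node
        masks.append(mask)
        lower = upper
    return masks
```

A single-level list such as `[np.inf]` recovers the complete graph mentioned above.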

3.4. Motion Encoder

The motion encoder is used to extract pedestrian-specific motion features. The input is the first-order difference trajectory $(\Delta x_i^t, \Delta y_i^t)$. The motion encoder is composed of a linear layer and an LSTM. The linear layer transforms $(\Delta x_i^t, \Delta y_i^t)$ into a higher-dimension vector, which is then fed into the LSTM to obtain a motion feature vector. For each pedestrian $p_i$, the process can be formulated as

$$e_i^t = \phi_m(\Delta x_i^t, \Delta y_i^t; W_{\phi_m}), \qquad h_{m,i}^t = \mathrm{LSTM}_m(h_{m,i}^{t-1}, e_i^t; W_m).$$

Here, $W_{\phi_m}$ denotes the trainable weights of the linear layer $\phi_m$, $W_m$ is the trainable weights of the LSTM $\mathrm{LSTM}_m$, and the hidden states of $\mathrm{LSTM}_m$ at the previous time step and the current time step of pedestrian $p_i$ are denoted as $h_{m,i}^{t-1}$ and $h_{m,i}^t$, respectively. At last, the motion encoder obtains each pedestrian's motion feature vector $h_{m,i}^{T_{\mathrm{obs}}}$, which is marked as $m_i$ in the following sections.
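A minimal PyTorch sketch of this encoder follows, with the dimensions from Section 4.2 (a 32-d embedding and a 64-d hidden state); the class and variable names are our own.

```python
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Linear embedding of position offsets followed by an LSTM."""
    def __init__(self, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Linear(2, embed_dim)  # (dx, dy) -> 32-d vector
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, offsets):
        # offsets: (num_peds, T_obs - 1, 2) first-order differences
        e = self.embed(offsets)
        _, (h, _) = self.lstm(e)              # final hidden state
        return h.squeeze(0)                   # (num_peds, 64): m_i per pedestrian
```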

3.5. MDST-DGEN

MDST-DGEN is a crucial component of our model. It processes the multilevel dynamic spatiotemporal directed social graph to obtain the social interaction features. If the graph has $L$ levels, MDST-DGEN contains $L$ DGCN-LSTMs, one per level, and an MSFM (Section 3.7) to fuse the features extracted from all levels. In our implementation, the DGCN-LSTMs share their weights, so increasing the number of levels does not increase the number of parameters of the model.

3.6. DGCN-LSTM

After building the multilevel graph, each level of the graph is fed into a DGCN-LSTM. A DGCN-LSTM consists of a node aggregator architecture to process the spatial edges and an LSTM to process the temporal edges. We follow the design of GraphSAGE [26], which processes graphs by sampling and aggregating. Our node aggregator architecture generates embeddings by sampling, aligning, and aggregating features from a node's spatial neighbourhood at each level.

3.6.1. Sampling

Because the number of pedestrians varies between scenes, we expand each node's neighbourhood to a fixed size $K$ by uniform sampling so that all nodes of different graphs can be processed in parallel. Here, if there is an edge from node $v_j$ to node $v_i$, $v_j$ is a neighbour of $v_i$. We denote the neighbours of any node $v_i$ as the neighbourhood set $\mathcal{N}(v_i)$.

3.6.2. Aligning

For node $v_i$, its attribute is the pedestrian's position $(x_i^t, y_i^t)$, and the attributes of its neighbourhood set can be denoted as $\{(x_j^t, y_j^t) \mid v_j \in \mathcal{N}(v_i)\}$. Social interaction is location independent, so we design an aligning operation to make the node aggregator architecture more generalizable. After aligning, the attributes of any node $v_i$'s neighbourhood set become $\{(x_j^t - x_i^t, y_j^t - y_i^t) \mid v_j \in \mathcal{N}(v_i)\}$. Intuitively, the aligning operation moves the origin of coordinates to the position of node $v_i$.

3.6.3. Aggregating

After the aligning, we aggregate the aligned attributes of $v_i$'s neighbourhood set to obtain the new feature embedding of $v_i$. It can be formulated as follows:

$$g_i^t = \mathrm{MAX}\left(\left\{\phi_a\left(x_j^t - x_i^t,\, y_j^t - y_i^t;\, W_{\phi_a}\right) \mid v_j \in \mathcal{N}(v_i)\right\}\right),$$

where $\mathrm{MAX}$ is the max operator that takes the elementwise max of the transformed attribute vectors and $\phi_a$, with trainable weights $W_{\phi_a}$, is the linear mapping that converts a low-dimension vector to a higher dimension. We implement the max operator with a max-pooling layer. Through the orderly use of sampling, aligning, and aggregating, our model meets the requirement of a directed graph that the relation between two nodes is asymmetric.

After the spatial edges are processed, an LSTM is used to process the temporal edges as follows:

$$h_{g,i}^t = \mathrm{LSTM}_g(h_{g,i}^{t-1}, g_i^t; W_g),$$

where $W_g$ is the trainable weights of the LSTM $\mathrm{LSTM}_g$ and the hidden states of $\mathrm{LSTM}_g$ at the previous and current time steps are denoted as $h_{g,i}^{t-1}$ and $h_{g,i}^t$, respectively. At last, the DGCN-LSTM obtains each pedestrian's social interaction feature vector $h_{g,i}^{T_{\mathrm{obs}}}$ at a certain level; in the following sections, we denote that of the $l$-th level as $s_i^l$.
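The sketch below strings the three spatial steps together for one level and then runs the temporal LSTM over the observation window. It is a simplified reading of the description above, not the released implementation; in particular, sampling indices with replacement via `torch.randint` is our assumption.

```python
import torch
import torch.nn as nn

class DGCNLSTM(nn.Module):
    """One graph level: sample, align, aggregate spatial edges, then an LSTM."""
    def __init__(self, feat_dim=64, num_samples=60):
        super().__init__()
        self.K = num_samples                       # fixed neighbour number
        self.embed = nn.Linear(2, feat_dim)        # aligned (dx, dy) -> 64-d
        self.cell = nn.LSTMCell(feat_dim, feat_dim)

    def spatial_step(self, positions, mask):
        # positions: (N, 2); mask[i, j]: directed edge from node j to node i
        agg = positions.new_zeros(positions.size(0), self.embed.out_features)
        for i in range(positions.size(0)):
            nbr = positions[mask[i]]                   # neighbourhood of node i
            idx = torch.randint(len(nbr), (self.K,))   # uniform sampling to K
            aligned = nbr[idx] - positions[i]          # move the origin to node i
            agg[i] = self.embed(aligned).max(dim=0).values  # elementwise MAX
        return agg

    def forward(self, positions_seq, masks_seq):
        # positions_seq: (T_obs, N, 2); masks_seq: per-step (N, N) boolean masks
        h = positions_seq.new_zeros(positions_seq.size(1), self.embed.out_features)
        c = torch.zeros_like(h)
        for t in range(positions_seq.size(0)):     # walk the temporal edges
            g = self.spatial_step(positions_seq[t], masks_seq[t])
            h, c = self.cell(g, (h, c))
        return h  # (N, 64): interaction feature s_i^l for this level
```

The loop spatial edge added during graph construction guarantees that every neighbourhood set is non-empty before sampling.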

3.7. MSFM

There are $L$ levels in our graph, so there are $L$ DGCN-LSTMs, and node $v_i$'s social interaction feature vectors obtained by them can be denoted as $\{s_i^1, s_i^2, \ldots, s_i^L\}$. We use an MSFM to fuse all levels' social interaction feature vectors of node $v_i$. The MSFM computes the weighted sum of $\{s_i^1, \ldots, s_i^L\}$. The formulations are as follows:

$$w_i^l = \frac{\exp\left(m_i^{\top} s_i^l\right)}{\sum_{k=1}^{L} \exp\left(m_i^{\top} s_i^k\right)}, \qquad s_i = \sum_{l=1}^{L} w_i^l s_i^l.$$

Here, $m_i$ is the motion feature vector of pedestrian $p_i$, $\top$ represents transposition, $s_i^l$ is the corresponding social interaction feature vector at level $l$, the fusion weight $w_i^l$ is a scalar, and $s_i$ is the final fused social interaction feature vector.
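A sketch of this fusion module, assuming the softmax-normalized dot-product weights written above; the tensor layout and function name are ours.

```python
import torch

def msfm(m, s_levels):
    """m: (N, D) motion features; s_levels: (L, N, D) per-level features.
    Returns the (N, D) fused features and the (N, L) fusion weights."""
    scores = torch.einsum('nd,lnd->nl', m, s_levels)  # m_i^T s_i^l per level
    weights = torch.softmax(scores, dim=1)            # scalar weights, sum to 1
    fused = torch.einsum('nl,lnd->nd', weights, s_levels)
    return fused, weights
```

Because the weights are normalized per pedestrian, a value such as the 0.72 reported in Section 4.4 can be read as the share one level contributes to the fused feature.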

3.8. Motion Decoder

The motion decoder is used to predict future trajectories based on the motion features and the fused social interaction features. There are two types of motion decoders: motion decoders without noise and motion decoders with noise. The former makes the whole model deterministic, and the latter makes it stochastic. For the deterministic type, we simply concatenate $m_i$ and $s_i$ as the initial hidden state of an LSTM and train the model with the L1 loss. For the stochastic type, we concatenate $m_i$, $s_i$, and a noise vector $z$ sampled from a standard Gaussian distribution to form the initial hidden state of an LSTM:

$$h_{d,i}^{T_{\mathrm{obs}}} = \left[m_i;\, s_i;\, z\right], \qquad z \sim \mathcal{N}(0, I).$$

Moreover, we train the whole model with the variety loss proposed by SGAN [14] to encourage it to produce diverse samples. At the first prediction time step $T_{\mathrm{obs}}+1$, the decoder takes $(\Delta x_i^{T_{\mathrm{obs}}}, \Delta y_i^{T_{\mathrm{obs}}})$ as the initial input and predicts the next position offset. The predicted position offset is marked as $(\Delta \hat{x}_i^t, \Delta \hat{y}_i^t)$. The formulations which show how the stochastic motion decoder works are as follows:

$$e_{d,i}^t = \phi_d\left(\Delta \hat{x}_i^{t-1}, \Delta \hat{y}_i^{t-1}; W_{\phi_d}\right), \qquad h_{d,i}^t = \mathrm{LSTM}_d\left(h_{d,i}^{t-1}, e_{d,i}^t; W_d\right), \qquad \left(\Delta \hat{x}_i^t, \Delta \hat{y}_i^t\right) = \phi_o\left(h_{d,i}^t; W_{\phi_o}\right),$$

where $W_{\phi_d}$ and $W_{\phi_o}$ are the trainable weights of the corresponding linear layers $\phi_d$ and $\phi_o$, $[\cdot\,;\cdot]$ denotes the concatenating operation, and $W_d$ denotes the trainable weights of the LSTM $\mathrm{LSTM}_d$.
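The autoregressive rollout can be sketched as follows. This is a simplified stochastic decoder under the formulation above; the linear projection used to bring the concatenated vector to the hidden size is our assumption, as are all names.

```python
import torch
import torch.nn as nn

class StochasticMotionDecoder(nn.Module):
    """LSTM decoder seeded with [m_i; s_i; z], rolled out step by step."""
    def __init__(self, feat_dim=64, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.noise_dim = hidden_dim // 2               # half the hidden size
        self.init_h = nn.Linear(2 * feat_dim + self.noise_dim, hidden_dim)
        self.embed = nn.Linear(2, embed_dim)           # previous offset -> 32-d
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)            # hidden -> (dx, dy)

    def forward(self, m, s, last_offset, pred_len=12):
        z = torch.randn(m.size(0), self.noise_dim, device=m.device)
        h = self.init_h(torch.cat([m, s, z], dim=1))   # initial hidden state
        c = torch.zeros_like(h)
        offset, outputs = last_offset, []
        for _ in range(pred_len):
            h, c = self.cell(self.embed(offset), (h, c))
            offset = self.out(h)                       # next predicted offset
            outputs.append(offset)
        return torch.stack(outputs, dim=1)             # (N, pred_len, 2)
```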

4. Experiments

4.1. Datasets, Baseline Methods, and Metrics

4.1.1. Datasets

We evaluate our method on two commonly used datasets, ETH [35] and UCY [36], and on a high-density dataset, the pedestrian walk path dataset [37], referred to as PEDWALK in the rest of the article. ETH and UCY contain 1536 pedestrians' real-world trajectories, while PEDWALK contains the manually labeled trajectories of 12684 pedestrians, with coordinates provided in pixels. ETH and UCY consist of five unique scenes in total: ETH, HOTEL (from ETH), ZARA1, ZARA2, and UNIV (from UCY). For ETH and UCY, we follow the leave-one-out evaluation methodology of SGAN [14], training on four scenes and testing on the remaining one. For PEDWALK, we use 70% of its total frames for training and leave the remaining 30% for evaluation. The interval between trajectory time steps is 0.4 seconds for ETH and UCY and 0.8 seconds for PEDWALK. We take 8 ground truth positions as the observation and predict the trajectories of the following 12 time steps. This means that for ETH and UCY, we observe for 3.2 seconds and predict the following 4.8 seconds (short-term prediction), while for PEDWALK, we observe for 6.4 seconds and predict the following 9.6 seconds (long-term prediction).

4.1.2. Baseline Methods

We compare MDST-DGCN of deterministic type (MDST-DGCN-D) with deterministic models, e.g., LSTM [27], S-LSTM [11], Social Attention [15], and CIDNN [19]. Furthermore, we compare MDST-DGCN of stochastic type (MDST-DGCN-S) with stochastic models, e.g., SGAN [14], SGAN-P [14], SoPhie [16], GAT [21], Social-BiGAT [21], STGAT [22], and Social-STGCNN [23].

4.1.3. Metrics

There are two commonly used metrics: average displacement error (ADE) and final displacement error (FDE). ADE is the average L2 distance between the ground truth and the predicted trajectory over all the predicted time steps, and FDE is the L2 distance between the predicted final position and the actual final position at the end of the prediction period $T_{\mathrm{pred}}$. For stochastic models, similar to prior work [14, 22], 20 samples are generated and the closest sample to the ground truth is selected to compute ADE and FDE. After checking the code of SGAN, STGAT, and Social-STGCNN, we found two different ways of selecting the closest sample: selecting the closest trajectory of each pedestrian across samples, used by Social-STGCNN [23], and selecting the single closest sample, used by SGAN [14] and STGAT [22]. A sample includes all pedestrians' trajectories in the scenario over the whole prediction duration. Following the convention of SGAN and STGAT, we select the closest sample to compute the ADE and FDE of MDST-DGCN-S.
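For reference, both metrics, and the sample-level selection we follow, can be computed as below; the function names are ours.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (N, T_pred, 2) predicted and ground-truth positions.
    ADE averages the L2 error over all pedestrians and time steps;
    FDE takes the L2 error at the final time step only."""
    err = np.linalg.norm(pred - gt, axis=-1)   # (N, T_pred)
    return err.mean(), err[:, -1].mean()

def best_of_k_ade_fde(samples, gt):
    """samples: (K, N, T_pred, 2). Pick the single closest sample
    (the SGAN/STGAT convention), not per-pedestrian trajectories."""
    results = [ade_fde(s, gt) for s in samples]
    return min(results, key=lambda r: r[0])
```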

4.2. Model Configuration and Training Details

For the motion encoder, the output dimension of the linear layer $\phi_m$ is 32 and the hidden state dimension of $\mathrm{LSTM}_m$ is 64. For the MDST-DGEN, the output dimensions of $\phi_a$ and $\mathrm{LSTM}_g$ are 64, and we implement $\phi_a$ with a convolution layer. To process nodes in different scenarios in parallel, the fixed neighbour number $K$ needs to be larger than the maximum number of pedestrians in a sample. The most crowded scene in PEDWALK contains 133 pedestrians, while in ETH and UCY, the most crowded scene contains 57 pedestrians, so we set $K$ to 135 for PEDWALK and 60 for ETH and UCY. For the motion decoder, the output dimension of $\phi_d$ is 32, the hidden state dimension of $\mathrm{LSTM}_d$ is 64, and the output dimension of $\phi_o$ is 2. For MDST-DGCN-S, the dimension of the noise vector is half of the hidden state dimension.

Our implementation is based on the PyTorch library. The model is trained on one NVIDIA GeForce GTX 1080Ti graphics card for 200 epochs. To calculate the variety loss with less GPU memory usage, we generate only 5 possible output predictions for each scene. In training, a batch size of 32 is used, and we use the Adam optimizer with a learning rate of 0.0001. One default level distance list is used for ETH and UCY, and another for PEDWALK, whose coordinates are in pixels.
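The variety loss mentioned above, computed over the 5 generated predictions, can be sketched as follows; this is our reading of the SGAN objective, with hypothetical names.

```python
import torch

def variety_loss(pred_samples, gt):
    """pred_samples: (K, N, T, 2) candidate predictions; gt: (N, T, 2).
    Only the sample closest to the ground truth contributes to the
    gradient, which encourages diverse yet plausible outputs."""
    sq_err = ((pred_samples - gt.unsqueeze(0)) ** 2).sum(dim=-1)  # (K, N, T)
    per_sample = sq_err.mean(dim=(1, 2))                          # (K,)
    return per_sample.min()
```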

4.3. Quantitative Evaluation

To validate the proposed MDST-DGCN, we present the prediction performance for both short-term trajectory prediction on ETH and UCY and long-term trajectory prediction on PEDWALK, as well as for various pedestrian densities. We also conduct an ablation study to validate the effects of our multilevel graph and the aligning operation.

4.3.1. MDST-DGCN-D

As Table 1 shows, MDST-DGCN-D outperforms all deterministic methods and some stochastic methods on ETH and UCY. As Table 2 shows, on PEDWALK, MDST-DGCN-D even outperforms stochastic methods, including STGAT. This shows that our model captures interaction features well, and we think there are three reasons. First, PEDWALK has many more pedestrians in a scene than ETH and UCY, and thus more interaction types and more frequent interaction activities in a sample. Second, high density limits the randomness of pedestrian movement. Third, the prediction horizon on PEDWALK is 9.6 s, while it is 4.8 s on ETH and UCY. When the prediction horizon is short, many movement decisions occur in the observation period and continue into the prediction stage, so many useful cues already exist in pedestrians' motion features and it is less necessary to infer them from interaction information. High density and long-term prediction enhance the impact of interactions on trajectory prediction, and high density reduces the effect of multimodality.

4.3.2. MDST-DGCN-S

As Tables 1 and 2 show, when the best sample of 20 predictions is selected to calculate ADE and FDE, MDST-DGCN-S outperforms all methods on PEDWALK and achieves ADE and FDE comparable to STGAT on ETH and UCY. The reasons why MDST-DGCN-S does not surpass STGAT on ETH and UCY are the same as those stated in Section 4.3.1. When the best trajectory of 20 predictions is selected, MDST-DGCN-S outperforms Social-STGCNN in ADE, but Social-STGCNN obtains better FDEs on several subdatasets, mainly because errors accumulate when an LSTM is used in our model.

4.3.3. Various Pedestrian Densities

Table 2 presents the results on PEDWALK for various pedestrian densities. We use samples with the specified densities to make the comparison. As the density increases, the performance of each method decreases. Both MDST-DGCN-D and MDST-DGCN-S outperform the other methods at all pedestrian densities. When the density is low, the performance gap between SGAN and the other methods is much smaller, which means that when crowds are sparse, the effects of interactions are weaker and models get fewer useful cues to infer pedestrians' future movements, while multimodality works better. This phenomenon also confirms our reasoning in Section 4.3.1.

4.3.4. Different Level Distance Lists

Table 3 presents the ADEs and FDEs of MDST-DGCN-D with different level distance lists. A single-entry level distance list of $(\infty)$ means that MDST-DGCN-D models all social interactions at the same level, which is similar to STGAT and Social-STGCNN. Details about the level distance list are presented in Section 3.3. As shown in Table 3, modelling social interactions with a multilevel graph improves the performance. On UNIV, the multilevel setting yields the highest improvement for MDST-DGCN-D, mainly because UNIV has a higher pedestrian density than the other four subdatasets, so more people walk within one meter of each other, the social comfort distance.

4.3.5. Effects of Aligning Operation

As shown in Table 3, the aligning operation improves the performance on ETH, HOTEL, and UNIV but reduces it on ZARA1 and ZARA2. ZARA1 and ZARA2 are collected in the same place and share the same coordinate system, so when one of them is used as the test set, the other appears in the training set with the same coordinates; a model without aligning can then overfit to these coordinates and still score well on the test set.

4.4. Qualitative Evaluation

We compare the predicted trajectories of MDST-DGCN-D and STGAT in Figure 3. Figure 3(a) shows the target pedestrian walking in the same direction as a nearby pedestrian A before finally merging with a faraway pedestrian B; both STGAT and MDST-DGCN-D successfully predict the merging phenomenon. However, MDST-DGCN-D also succeeds in predicting that the target pedestrian maintains his relative position to nearby pedestrian A, while STGAT does not, so MDST-DGCN-D obtains more accurate predictions. As shown in Figure 3(b), two pedestrians in a group change their directions in advance to avoid collisions with pedestrians standing in the distance. For the target pedestrian, MDST-DGCN-D assigns a weight of 0.72 to the social interaction feature of the third level, which helps avoid possible collisions with distant pedestrians. STGAT, in contrast, successfully predicts the group behaviour but fails to predict the early collision avoidance behaviour. All predictions in Figure 3 indicate that a multilevel graph structure can model social interactions more accurately and comprehensively.

We also visualize the trajectory distributions of MDST-DGCN-S and STGAT in Figure 4. In all three samples, covering pedestrian avoidance, pedestrian following, and pedestrians walking in a group, our model outperforms STGAT.

We also compute the distribution of the fusion weights $w_i^l$ on PEDWALK, which shows that the social interaction features of the first level and the second level are of different importance within a sample. The distribution is shown in Figure 5.

5. Conclusions

In this article, we propose a multilevel dynamic spatiotemporal directed graph representation to model the interactions between pedestrians and introduce MDST-DGCN to process the multilevel graph. Experimental results indicate that our multilevel graph structure can model social interactions more accurately and comprehensively and show that MDST-DGCN outperforms most of the state-of-the-art methods.

Data Availability

Previously reported ETH and UCY data were used to support this study and are available at https://doi.org/10.1109/ICCV.2009.5459260 and https://doi.org/10.1111/j.1467-8659.2007.01089.x. These prior studies are cited at relevant places within the text as references. Previously reported PEDWALK data were used to support this study and are available at https://doi.org/10.1109/CVPR.2015.7298971. The prior study is cited at relevant places within the text as references.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was in part supported by the Major Program of the National Natural Science Foundation of China (91938301) and the National Natural Science Foundation of China (62002345).