Pedestrian trajectory prediction via the Social-Grid LSTM model

In the design of intelligent driving systems, reliable and accurate trajectory prediction of pedestrians is necessary. By predicting pedestrians' trajectories, possible collisions can be avoided, or warnings issued, as early as possible by changing the behaviour of intelligent vehicles. The trajectory prediction problem can be considered as a sequence learning problem, for which the recurrent neural network (RNN) model called long short term memory (LSTM) has been regarded as a promising method. The authors present a new method for predicting pedestrian trajectories, called Social-Grid LSTM, which is based on the RNN architecture. The proposed method combines the human–human interaction model called social pooling with the Grid LSTM network model. The performance of the proposed method is demonstrated on two publicly available datasets and compared with two baseline methods (LSTM and Social LSTM). The experimental results indicate that the authors' proposed method outperforms previous prediction approaches.


Introduction
Recently, research on intelligent vehicles has made significant advances. Pedestrian-vehicle interaction plays an important role in making intelligent vehicles drive safely. However, predicting future pedestrian trajectories from their past trajectories remains challenging. The challenges mostly come from the complex movement patterns of pedestrians; for example, a pedestrian may have to stop or turn right immediately to avoid colliding with other pedestrians or vehicles. Fig. 1 shows a crowd scene of pedestrians. A human observer can predict the position of any individual over the next few seconds from that individual's heading direction and walking speed. However, it is not easy to predict the trajectories of many pedestrians simultaneously. Therefore, intelligent driving systems need to learn to predict pedestrian trajectories from labelled observation data.
Existing pedestrian trajectory prediction methods can be classified into two types: model-based methods and recurrent neural network (RNN)-based methods. Model-based methods usually rely on designed mathematical functions and specifically defined pedestrian properties [1,2]. In general, they can only predict pedestrian trajectories over a short time period, and their predictions are often inaccurate in complex scenes.
With the development of deep learning, long short term memory (LSTM) models [3] have been used to solve sequence learning and prediction tasks. In recent years, applying LSTM-based deep learning methods to trajectory prediction has attracted much attention, and several LSTM-based approaches have been developed to predict pedestrian trajectories. The simplest is to use one LSTM model per pedestrian. Another approach, called Social LSTM [4], incorporates the mutual influence among neighbouring pedestrians. However, the prediction errors of existing methods are usually large in complex scenes. To better avoid collisions between pedestrians and intelligent vehicles, it is important to develop new methods that reduce the trajectory prediction error.
In this paper, a new LSTM-based trajectory prediction method, called Social-Grid LSTM, is proposed to reduce the prediction error for pedestrians. The proposed Social-Grid LSTM builds on the LSTM cell structure and adds a social pooling operation to establish an influence relationship among neighbouring pedestrians. For the basic LSTM cell structure, we adopt the two-dimensional Grid LSTM architecture [5], which differs from the general LSTM structure in its layer-to-layer parameter transfer mechanism. One innovation of the proposed Social-Grid LSTM method is that it integrates the human-human interaction model called social pooling with the two-dimensional Grid LSTM model. The performance of the proposed method was tested on two benchmark datasets and compared with two popular trajectory prediction methods, LSTM and Social LSTM. The experimental results show the advantages of the proposed method.
The remainder of the paper is arranged as follows. The second section introduces the research background, including the problem formulation and related works in pedestrian trajectory prediction. The third section describes the details of our proposed Social-Grid LSTM model. The experiments and the analysis of the results are presented in the fourth section. Finally, the fifth section concludes the paper and discusses some future work.

Problem formulation
We first describe the pedestrian trajectory prediction problem. The inputs are the observed position coordinates and the outputs are the future position coordinates. We assume that the spatial coordinates of all pedestrians in each scene are available at every time instant. At time instant t, the position of the ith pedestrian in the scene is represented as $(x_t^i, y_t^i)$. We first observe the past positions of all pedestrians from $t = 1$ to $t = T_o$, and then predict their future trajectories from $t = T_o + 1$ to $t = T_p$. Therefore, the problem of pedestrian trajectory prediction can be defined as follows:
Inputs: Observed history trajectories.
Outputs: Predicted future trajectories.
Pedestrians moving in a scene act under the influence of their neighbours' behaviour. Therefore, we predict the future trajectories with a social pooling operation. Because a simple LSTM model still produces noticeable prediction errors, we adopt a two-dimensional Grid LSTM model to reduce them.

Related works
In research on intelligent vehicles, the interaction between pedestrians and vehicles is becoming an increasingly important and indispensable problem. For safety, an intelligent vehicle needs to select good driving strategies to avoid collisions with pedestrians and with other static or dynamic obstacles. Therefore, research on pedestrian trajectory prediction is critical for avoiding collisions between pedestrians and vehicles. At present, research on pedestrian trajectory prediction can be roughly categorised into two classes: traditional model-based methods and RNN-based methods. Considering human-human interactions, Helbing and Molnar [1] proposed a social force model to describe pedestrian motion. Antonini et al. [6] proposed a discrete choice framework for modelling human-human interactions. Bonabeau [7] presented an agent-based method to model the behavioural patterns of individuals. Tay and Laugier [8] proposed a Gaussian processes model to predict smooth pedestrian paths. Kooij et al. [9] presented a dynamic Bayesian network for pedestrian trajectory prediction, in which the spatial layout of the environment, the situation and the pedestrian's awareness are combined as latent states on top of a switching linear dynamical system (SLDS) to predict changes in pedestrian dynamics. The traditional model-based approaches normally rely on manually designed energy functions and hand-crafted factors, and they can only predict trajectories over a short term.
Recently, deep learning has received much attention for classification and prediction applications [10]. Although RNN models have been applied to sequence learning tasks [11], training simple RNNs suffers from the vanishing or exploding gradient problem [12]. Therefore, RNN variants including LSTM [13] and Gated Recurrent Units [14] were proposed for sequence learning tasks and obtained better performance. The performance of LSTM has been demonstrated in machine translation, image captioning and other tasks. Park et al. [15] proposed an LSTM-based method to predict vehicle trajectories.
Pedestrian trajectory prediction can be defined as a sequence learning task, so RNN models can be used to solve it. Alahi et al. [4] proposed an LSTM-based network model called Social LSTM. Each pedestrian in a scene is modelled with one LSTM that predicts the trajectory, and the LSTMs share information with each other through social pooling. In addition, a trajectory prediction method called Social Attention has been proposed [16], which captures the relative importance of each pedestrian. The traditional single-direction LSTM architecture considers only the past information in a data sequence. A bidirectional LSTM has also been proposed to predict trajectories, which takes both the past and future context into account [17]. Lee et al. [18] presented an RNN encoder-decoder framework that applies a variational auto-encoder to trajectory prediction.
An LSTM network uses a special gating mechanism [13]: specific parts of the input data are selected by the reading, writing and erasing gates of the memory cell along the sequential direction. However, deep networks, including RNNs, suffer from the problem that the inputs cannot be dynamically selected between layers along the depth direction. The Grid LSTM was proposed in [5] as a network arranged in a grid of more than one dimension, and its architecture has shown good performance in character prediction, machine translation and image classification.
However, as far as we know, the Grid LSTM method has not been applied to trajectory prediction tasks. Furthermore, for this task the Grid LSTM model needs the ability to consider the relationships among neighbouring pedestrians. Therefore, in this paper, a novel method called Social-Grid LSTM is proposed, which combines the two-dimensional Grid LSTM with the social pooling operation so that the influence among neighbouring pedestrians can be estimated effectively.

Social-Grid LSTM model
In the following, in order to improve the prediction accuracy of pedestrian trajectories in complex scenes, we will present a novel Social-Grid LSTM model which combines the social pooling operation and the two-dimensional Grid LSTM network. The details of the method are introduced below.

Overall structure of the Social-Grid LSTM model
LSTM network models have been shown to address sequence learning tasks successfully. Therefore, we integrate a two-dimensional Grid LSTM model with the social pooling operation for the trajectory prediction task. In particular, we use one Grid LSTM for each pedestrian in a scene. Each Grid LSTM model learns the state of its pedestrian by training on history data. To take the interaction of pedestrians in a neighbourhood into account, we adopt the social pooling strategy, which connects the neighbouring Grid LSTM models. An overview of our model is shown in Fig. 2.
In general, pedestrians adjust their motion by taking the movements of neighbouring pedestrians into consideration. They are influenced by others in their current surroundings and change their motion over time. Therefore, we share the states between neighbouring two-dimensional Grid LSTM models. However, the number of neighbours differs from pedestrian to pedestrian in a crowd scene. We solve this problem by adding a social pooling layer, shown at the top right of Fig. 2. At every time step, each Grid LSTM cell receives the pooled hidden state information from the Grid LSTM cells of its neighbours. In this paper, we adopt a 2 × 2 pooling grid to pool the information of the neighbours.
The hidden state of the Grid LSTM of the ith pedestrian at time instant t is denoted $h_t^i$. We represent the socially pooled hidden states of the neighbours with a tensor $H_t^i$. With hidden-state dimension D and neighbourhood grid size $N_0 = 2$, $H_t^i$ is an $N_0 \times N_0 \times D$ tensor, which can be defined as follows:
$$H_t^i(m, n, :) = \sum_{j \in N_i} \mathbf{1}_{mn}\left[x_t^j - x_t^i,\; y_t^j - y_t^i\right] h_{t-1}^j$$
where $h_{t-1}^j$ is the hidden state of the Grid LSTM for the jth pedestrian at time instant $t - 1$, $\mathbf{1}_{mn}[x, y]$ is an indicator function that checks whether $(x, y)$ lies in the $(m, n)$ cell of the $N_0 \times N_0$ grid, and $N_i$ is the set of neighbours of the ith pedestrian.
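The pooling step can be illustrated with a minimal pure-Python sketch. It assumes a square grid centred on pedestrian i, with each cell having side length `cell_size`; the function name `social_pool`, the parameter `cell_size` and the toy values are assumptions made for illustration and are not taken from the paper.

```python
import math

def social_pool(positions, hidden_prev, i, cell_size, grid_size=2):
    """Return the grid_size x grid_size x D pooled tensor for pedestrian i.

    positions:   list of (x, y) at time t, one entry per pedestrian
    hidden_prev: list of length-D hidden states at time t-1
    cell_size:   side length of one grid cell (an assumed parameter)
    """
    dim = len(hidden_prev[i])
    pooled = [[[0.0] * dim for _ in range(grid_size)] for _ in range(grid_size)]
    half = grid_size * cell_size / 2.0
    xi, yi = positions[i]
    for j, (xj, yj) in enumerate(positions):
        if j == i:
            continue  # a pedestrian is not its own neighbour
        # Cell index of neighbour j in a grid centred on pedestrian i.
        m = math.floor((xj - xi + half) / cell_size)
        n = math.floor((yj - yi + half) / cell_size)
        if 0 <= m < grid_size and 0 <= n < grid_size:
            # Indicator is 1: add the neighbour's previous hidden state.
            for d in range(dim):
                pooled[m][n][d] += hidden_prev[j][d]
    return pooled

# Hypothetical toy scene: pedestrian 3 is too far away to be pooled.
positions = [(0.0, 0.0), (0.5, 0.5), (-0.5, 0.5), (5.0, 5.0)]
hidden_prev = [[1.0, 1.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
H = social_pool(positions, hidden_prev, i=0, cell_size=1.0)
```

Because the output always has the fixed shape $N_0 \times N_0 \times D$, it can be fed to the network regardless of how many neighbours a pedestrian has.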
The entire implementation process of the proposed Social-Grid LSTM method is shown in Algorithm 1 (see Fig. 3). It mainly includes two steps: the first is model creation and the second is the training phase. Next, we introduce some details of the two-dimensional Grid LSTM network model used in the proposed method.

Two-dimensional Grid LSTM for trajectory prediction
In this subsection, as a major step of the proposed Social-Grid LSTM method, we apply the two-dimensional Grid LSTM developed in [5] to train the model for pedestrian trajectory prediction. An LSTM network that solves a sequential problem processes a series of input and output pairs, which can be represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$. Pedestrian trajectory position data can be represented in this form. For each pair $(x_i, y_i)$, the output $y_i$ is produced from both the hidden value of the previous time step, $h_{t-1}$, and the current input $x_i$. The hidden value $h_{t-1}$ is in turn determined by all the previous inputs $x_1, x_2, \ldots, x_{i-1}$.
Meanwhile, as shown in Fig. 4, the state of the cell in the network, stored as the memory vector $m_t$, is determined by the previous inputs $x_1, x_2, \ldots, x_{i-1}$ and the current input $x_i$. For each LSTM cell, the hidden state vector $h_t$ is determined by the forget gate $G_f$, the input gate $G_i$ and the output gate $G_o$. The computation at every step in the cell follows the definition in [13], in which σ is the logistic sigmoid activation function and $U_{gf}$, $U_{gi}$, $U_{gc}$, $U_{go}$ are weight matrices. Each cell computation outputs a new hidden vector $h_t$ and a new memory vector $m_t$, and the output is computed from the hidden vector $h_t$. The whole computation can be viewed as a function LSTM(), where U is the concatenation of the weight matrices $U_{gf}$, $U_{gi}$, $U_{gc}$, $U_{go}$. However, the Grid LSTM deploys cells along more than one dimension, including the depth. In the proposed Social-Grid LSTM method, we use the two-dimensional Grid LSTM, in which the cells are deployed along two dimensions: the vertical dimension runs along the depth and the temporal dimension runs along time. The two-dimensional Grid LSTM can be viewed as a parameter transferring mechanism in which the values in the cells cannot grow combinatorially.
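For reference, the gate computation can be sketched in the symbols used here. The block below follows the form commonly given for Grid LSTM cells; it is an assumption rather than a reproduction of the original equations. In particular, the candidate vector $G_c$, the input vector $H$ (the current input concatenated with the previous hidden vector), the element-wise product $\odot$ and the omission of bias terms are all introduced for illustration. Many LSTM formulations instead compute the hidden vector as $h_t = G_o \odot \tanh(m_t)$.

```latex
\begin{aligned}
G_f &= \sigma(U_{gf} H), \\
G_i &= \sigma(U_{gi} H), \\
G_o &= \sigma(U_{go} H), \\
G_c &= \tanh(U_{gc} H), \\
m_t &= G_f \odot m_{t-1} + G_i \odot G_c, \\
h_t &= \tanh(G_o \odot m_t), \\
(h_t, m_t) &= \mathrm{LSTM}(H, m_{t-1}, U).
\end{aligned}
```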
Each block in a two-dimensional Grid LSTM receives two memory vectors $m_1$, $m_2$ and two hidden vectors $h_1$, $h_2$ as inputs. The two incoming hidden vectors, one from each dimension, are first concatenated into a vector H:
$$H = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$$
The block then applies two transform functions LSTM(), one for each dimension, to obtain the outputs:
$$(h_1', m_1') = \mathrm{LSTM}(H, m_1, U_1), \qquad (h_2', m_2') = \mathrm{LSTM}(H, m_2, U_2)$$
where the $U_j$ of each transform comprises distinct weight matrices $U_{gf}^j$, $U_{gi}^j$, $U_{gc}^j$, $U_{go}^j$. Each transform function LSTM() applies the basic LSTM mechanism described above across one of the two dimensions. A block thus takes memory and hidden vectors as input on two sides of the grid and outputs memory and hidden vectors on the other two sides for the next step. However, since a block has no separate data input, the input data enter along one side of the grid as a pair of input memory and hidden vectors.
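A minimal pure-Python sketch of one two-dimensional block is given below. It assumes the gate form in which the new hidden vector is the tanh of the output gate multiplied element-wise by the new memory, without bias terms; the helper names (`matvec`, `lstm_transform`, `grid_lstm_block_2d`, `random_weights`) and the random initial weights are hypothetical and serve only to make the example self-contained.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def tanh(values):
    return [math.tanh(v) for v in values]

def lstm_transform(H, m, U):
    """One LSTM() transform: returns (h_new, m_new) from H, memory m, weights U.

    U is a dict holding the four weight matrices 'gf', 'gi', 'gc', 'go'.
    """
    g_f = sigmoid(matvec(U["gf"], H))   # forget gate
    g_i = sigmoid(matvec(U["gi"], H))   # input gate
    g_o = sigmoid(matvec(U["go"], H))   # output gate
    g_c = tanh(matvec(U["gc"], H))      # candidate memory content
    m_new = [f * mo + i * c for f, mo, i, c in zip(g_f, m, g_i, g_c)]
    h_new = tanh([o * mn for o, mn in zip(g_o, m_new)])  # assumed output form
    return h_new, m_new

def grid_lstm_block_2d(h1, m1, h2, m2, U1, U2):
    """Two-dimensional block: both transforms read the same concatenated H."""
    H = h1 + h2  # list concatenation of the two incoming hidden vectors
    out1 = lstm_transform(H, m1, U1)  # outputs along the first dimension
    out2 = lstm_transform(H, m2, U2)  # outputs along the second dimension
    return out1, out2

def random_weights(rng, dim):
    """Hypothetical initialiser: four dim x 2*dim matrices with small values."""
    return {
        name: [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
        for name in ("gf", "gi", "gc", "go")
    }

rng = random.Random(0)
dim = 3
U1, U2 = random_weights(rng, dim), random_weights(rng, dim)
ones = [1.0] * dim
(h1_new, m1_new), (h2_new, m2_new) = grid_lstm_block_2d(ones, ones, ones, ones, U1, U2)
```

Each dimension keeps its own memory vector and its own weights, while both transforms share the concatenated hidden input, which is what lets information flow along depth as well as along time.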

Datasets and metrics
In this section, we carry out experiments on two public pedestrian-trajectory datasets: ETH [19] and UCY [20]. ETH contains two scenes and has two components: the UNIV dataset and the HOTEL dataset. UCY also has two scenes, but it is split into three smaller datasets: ZARA-01, ZARA-02 and UNIV. We evaluate our proposed model on these five datasets. As shown in [19,20], the five datasets exhibit many complex interactions between pedestrians, such as crossing each other, walking together, collision avoidance, and groups forming and dispersing. Each dataset is provided as a series of four-element records containing the frame number, pedestrian ID, x-coordinate and y-coordinate. The datasets are recorded at 0.4 s per frame. In [4], the LSTM-based methods perform better than traditional approaches such as Linear Model [1], Social Force [2] and Iterative Gaussian Process [21]. Therefore, we chose the LSTM and Social LSTM methods as baselines for performance comparison.
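As a minimal sketch of handling the four-element records described above, the following Python groups rows of (frame number, pedestrian ID, x, y) into one time-ordered trajectory per pedestrian. The function name `group_by_pedestrian`, the in-memory row format and the toy records are assumptions for illustration; the actual file layout of the dataset releases may differ.

```python
from collections import defaultdict

def group_by_pedestrian(rows):
    """Group (frame, ped_id, x, y) rows into per-pedestrian trajectories.

    Returns a dict mapping ped_id -> list of (frame, x, y), sorted by frame.
    """
    tracks = defaultdict(list)
    for frame, ped_id, x, y in rows:
        tracks[ped_id].append((frame, x, y))
    for ped_id in tracks:
        tracks[ped_id].sort(key=lambda rec: rec[0])
    return dict(tracks)

# Hypothetical toy records, not taken from the datasets.
rows = [
    (10, 1, 0.0, 0.0),
    (0, 1, -0.5, 0.0),
    (0, 2, 3.0, 1.0),
    (10, 2, 2.5, 1.0),
]
tracks = group_by_pedestrian(rows)
```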
To evaluate the performance of the different methods, we adopt the following two metrics:
• Average displacement error: the mean Euclidean distance between the predicted trajectory and the true trajectory over all predicted points, that is, at every predicted time instant.
• Final displacement error: the mean Euclidean distance between the final points of the predicted trajectory and the true trajectory after $T_p$ time steps.
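The two metrics can be computed as in the following minimal sketch, which assumes each trajectory is a list of (x, y) points and that predicted and true trajectories have equal length. The function names `displacement_errors` and `mean_errors` and the toy trajectories are illustrative assumptions.

```python
import math

def displacement_errors(predicted, truth):
    """Return (average, final) displacement error for one trajectory."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(predicted, truth)]
    average_error = sum(distances) / len(distances)
    final_error = distances[-1]
    return average_error, final_error

def mean_errors(pairs):
    """Average both errors over several (predicted, truth) trajectory pairs."""
    results = [displacement_errors(p, t) for p, t in pairs]
    n = len(results)
    return (sum(a for a, _ in results) / n, sum(f for _, f in results) / n)

# Hypothetical toy trajectories: the prediction drifts upwards over time.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
```

In this toy case the point-wise distances are 0, 1 and 2, so the average error is 1 and the final error is 2. When all trajectories have the same prediction length, averaging per trajectory and then over trajectories equals averaging over all predicted points.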

Implementation details
In our proposed method, we adopt the Grid LSTM of two dimensions as shown in Fig. 2. The hidden state dimension is set to be 128 for the Grid LSTM which has two layers in the depth. Additionally, the embedding layers in the network model embed the input data into a 64-dimensional vector with rectified linear units non-linearity.
In training, the batch size is set to 16 and the network model is trained with a learning rate of 0.003. RMS-prop [22] is used as the optimiser. In addition, before training, the position coordinates in the different datasets are interpolated so that adjacent positions are separated by the same number of frames.
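The interpolation step could look like the following sketch, which resamples one trajectory onto evenly spaced frames. Linear interpolation is an assumption here, since the text states only that interpolation is used; the function name `resample_track` and the toy track are likewise hypothetical.

```python
def resample_track(track, step):
    """Linearly interpolate a track of (frame, x, y) onto frames spaced by `step`.

    `track` must be sorted by frame with strictly increasing frame numbers.
    """
    if len(track) < 2:
        return list(track)  # nothing to interpolate between
    last = track[-1][0]
    out, k, frame = [], 0, track[0][0]
    while frame <= last:
        # Advance to the segment [track[k], track[k + 1]] containing `frame`.
        while k + 1 < len(track) - 1 and track[k + 1][0] <= frame:
            k += 1
        (f0, x0, y0), (f1, x1, y1) = track[k], track[k + 1]
        w = (frame - f0) / (f1 - f0)  # position of `frame` inside the segment
        out.append((frame, x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
        frame += step
    return out

# Hypothetical track annotated every 10 frames, resampled to every 5 frames.
track = [(0, 0.0, 0.0), (10, 1.0, 2.0), (20, 3.0, 2.0)]
resampled = resample_track(track, 5)
```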
To take advantage of all five datasets when training the models, we adopt a leave-one-out approach: we train the model on four of the datasets and test it on the remaining one, and we repeat this for each of the five datasets. The two other methods used for comparison were trained and tested by the same procedure.
The frame interval is 0.4 s. During the testing stage, we observe eight frames and predict the next 12 frames, which corresponds to observing a trajectory for 3.2 s and predicting the trajectory for the following 4.8 s. The two metrics above are used to compare the performance of the different methods.
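The leave-one-out procedure can be sketched as follows. The dataset labels in `DATASETS` and the function name `leave_one_out_splits` are hypothetical identifiers chosen for illustration.

```python
# Hypothetical labels for the five datasets used in the experiments.
DATASETS = ["ETH-UNIV", "ETH-HOTEL", "UCY-ZARA-01", "UCY-ZARA-02", "UCY-UNIV"]

def leave_one_out_splits(names):
    """Yield (train_names, test_name), holding each dataset out exactly once."""
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield train, held_out

splits = list(leave_one_out_splits(DATASETS))
for train_names, test_name in splits:
    # A real experiment would train on train_names and evaluate on test_name.
    print("train:", train_names, "| test:", test_name)
```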
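The observation and prediction windows can be illustrated with a short sketch. It assumes a trajectory is a list of (x, y) points sampled every 0.4 s; the constant and function names and the toy track are illustrative only.

```python
FRAME_INTERVAL_S = 0.4   # seconds per frame, as stated above
OBS_LEN = 8              # number of observed frames
PRED_LEN = 12            # number of predicted frames

def split_track(track, obs_len=OBS_LEN, pred_len=PRED_LEN):
    """Split one trajectory (list of (x, y)) into observed and target parts."""
    if len(track) < obs_len + pred_len:
        raise ValueError("trajectory is shorter than obs_len + pred_len")
    return track[:obs_len], track[obs_len:obs_len + pred_len]

# Hypothetical straight-line track with 20 frames.
track = [(0.1 * t, 0.0) for t in range(20)]
observed, target = split_track(track)
obs_seconds = OBS_LEN * FRAME_INTERVAL_S    # duration of the observed window
pred_seconds = PRED_LEN * FRAME_INTERVAL_S  # duration of the predicted window
```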

Result analysis
The prediction errors of all the methods on the five datasets, in both metrics, are presented in Table 1. The best results are shown in bold. As can be seen, the independent LSTM network model has large prediction errors because it does not consider pedestrian-pedestrian interactions. However, in the evaluation on the UCY-Univ dataset, the independent LSTM method slightly outperforms the other two methods. Our proposed Social-Grid LSTM method outperforms both the Social LSTM method and the independent LSTM network model on all the five datasets in the two metrics.
Compared with the other methods, improving the LSTM network structure in depth with the two-dimensional Grid LSTM reduces the trajectory prediction error. In Fig. 5, we plot the true and predicted trajectories for all pedestrians in three of the scenes. The lines with 'o' nodes are the true trajectories, and the lines with 'x' nodes are the predicted trajectories. The different colours of the trajectories represent different pedestrians in the scene. The figure shows that the displacement errors between the true and predicted trajectories are very small and that the errors arise mainly towards the end of the prediction period.

Conclusion
In this paper, a novel pedestrian trajectory prediction method called Social-Grid LSTM is proposed, which integrates the social pooling layer and the two-dimensional Grid LSTM network model. The social pooling layer learns the relative influence of each pedestrian in the crowded scenes by sharing the information between each network model corresponding to one pedestrian. We also analysed how the information is transformed between two layers in the depth of the Grid LSTM. It was demonstrated that the proposed method outperformed the Social LSTM method and the independent LSTM network model on two public datasets.
In future work, we may take the next step of incorporating attention mechanisms [23,24] to handle the trajectory prediction task. In addition, we will apply the proposed method to the advanced driver assistance system [25] of an intelligent vehicle to predict pedestrians' trajectories.