Learning a Memory-Enhanced Multi-Stage Goal-Driven Network for Egocentric Trajectory Prediction

We propose a memory-enhanced multi-stage goal-driven network (ME-MGNet) for egocentric trajectory prediction in dynamic scenes. Our key idea is to build a scene layout memory inspired by human perception in order to transfer knowledge from prior experiences to the current scenario in a top-down manner. Specifically, given a test scene, we first perform scene-level matching based on our scene layout memory to retrieve trajectories from visually similar scenes in the training data. This is followed by trajectory-level matching and memory filtering to obtain a set of goal features. In addition, a multi-stage goal generator takes these goal features and uses a backward decoder to produce several stage goals. Finally, we integrate the above steps into a conditional autoencoder and a forward decoder to produce trajectory prediction results. Experiments on three public datasets, JAAD, PIE, and KITTI, and a new egocentric trajectory prediction dataset, Fuzhou DashCam (FZDC), validate the efficacy of the proposed method.


Introduction
The prediction of future trajectories of surrounding agents in egocentric view is a crucial task in robotics and intelligent driving [1][2][3][4][5][6]. Such capacity plays an important role in improving the safety, reaction speed, and decision-making abilities of mobile robots and autonomous vehicles. As humans, we are able to quickly understand the behaviors of other moving agents such as pedestrians and cars in a dynamic scene, as shown in Figure 1. In particular, our perception of visual stimuli is influenced by the surrounding environment or context, and our prediction ability comes from adapting prior experiences to the current scene. Furthermore, our understanding of future movements is not limited to the trajectories of other agents, but also includes series of planned actions known as intentions [7].
In recent years, the rapid advancement of deep learning has greatly contributed to the accuracy of trajectory prediction. This progress has spawned a large amount of related research, leveraging a series of methods such as attention mechanisms [8][9][10][11], graph neural networks [12][13][14], generative models [15][16][17], and goal-driven networks [18][19][20][21][22] to better capture complex spatio-temporal relationships and understand movement patterns. However, state-of-the-art trajectory prediction methods employ either a parametric network that aims to encode the past trajectory and visual features [23][24][25][26], or a memory module that uses the past trajectory as a key to read future encodings from memory [27][28][29]. In addition, there is another line of research that adopts hybrid algorithms for trajectory prediction [30][31][32]. Unlike humans, the methods above lack an understanding of the overall scene layout and are therefore less interpretable, which is undesirable in safety-critical applications such as mobile robots and autonomous vehicles.
Inspired by human perception, we seek to exploit the scene context for trajectory prediction in this work. However, it is challenging to design such a strategy for two reasons. Firstly, scene contexts are highly diverse, so it is important to build a memory module that is effective, efficient, and representative of typical scene layouts. Secondly, what should be retrieved from previously observed scenes that are similar to the current scene remains an open question. A single end goal may be insufficient, as the uncertainty in trajectory prediction increases with the length of the prediction time span. Another possibility is to read previously observed trajectories directly (e.g., [28]). However, due to the diverse nature of the scenes one would encounter, these trajectories have to be adapted to the current scene, which makes the complete previous trajectories redundant. In our work, we strike a balance between the two strategies above by applying the notion of intentions [7], i.e., a series of intermediate goals, to guide a multi-stage goal generator to produce multi-stage goals.
More specifically, we first match a test scene to our scene layout memory to obtain the trajectories that belong to visually similar training scenes. Next, we use a past trajectory encoder to obtain a feature representation of the currently observed trajectory and compare it to trajectories from the visually similar scenes seen during training. The closest trajectory in the feature space is then retrieved, along with its future goals, which pass through a memory filter before serving as the input to the multi-stage goal generator, whose backward decoder produces the multi-stage goals. The multi-stage goals are then carried over to a forward decoder, which takes the output of a conditional variational autoencoder (CVAE), a component commonly used in state-of-the-art trajectory prediction methods, to produce the final trajectory predictions. Overall, our final model is semi-parametric in nature, comprising a parametric CVAE and bidirectional decoders as well as a non-parametric retrieval-based memory module. This design choice allows us to borrow the advantages of both types of methods, obtaining state-of-the-art performance on a number of egocentric vision-based trajectory prediction datasets.
This paper extends our previous work [26] in several important ways. Firstly, we build a scene layout memory bank to encode typical scene layouts, which is a significant new enhancement in our approach. Secondly, by comparing the test-time scene to the bank of typical scene layouts, we retrieve a set of intermediate goal features from similar scenes. This allows for a second step of matching the trajectories with a past trajectory encoder, using the currently observed trajectory as the key to retrieve the most similar trajectory seen during training. Thirdly, the retrieved trajectory is used to query its own goal features, which pass through a memory filter and then a multi-stage goal generator. In terms of experimental evaluation, we conduct extensive experiments on two more datasets (i.e., KITTI and Fuzhou DashCam, FZDC) to demonstrate the superior performance of our method, and the new method also outperforms our previous work on the two existing datasets (i.e., JAAD and PIE). In particular, we introduce a new egocentric trajectory prediction dataset, FZDC, that is publicly available to facilitate further research in this area.
In summary, our main contributions are as follows:
• We present a memory-enhanced multi-stage goal-driven network (ME-MGNet) for trajectory prediction. Given a test scene, we first propose a scene layout memory module inspired by human perception to borrow knowledge from previously seen similar scenes. The scene-level matching to previous experiences is followed by trajectory-level matching using the encoded features of past trajectories.
• Using the future goals associated with the retrieved past trajectory, we further build a joint reconstruction autoencoder that produces a series of goals. A memory filter is also presented to decide whether these goals are close enough to the current scenario to be used.
• We integrate the above steps into a multi-stage goal generator that uses a backward decoder to produce multi-stage goals, and these goals are fed into a forward decoder that takes the output of a conditional variational autoencoder to obtain the final trajectory prediction results.
• Experimental results on four publicly available egocentric trajectory prediction datasets demonstrate the superior performance of our method compared to the state-of-the-art. We have created a new egocentric trajectory prediction dataset, FZDC, that will be made available to the research community. Moreover, ablation studies verify the efficacy of the various components of the ME-MGNet.
The rest of this paper is organized as follows. Section 2 reviews recent literature on trajectory prediction, in particular goal-driven and memory-based methods. Section 3 describes the proposed method in detail, followed by experimental evaluations in Section 4 and closing remarks in Section 5.

Trajectory Prediction in Dynamic Video Scenes
At present, the majority of trajectory prediction methods operate using either an egocentric perspective or a bird's-eye view. The egocentric camera perspective is generally regarded as the most intuitive viewpoint for observing the surrounding environment. Nonetheless, it presents additional challenges due to its restricted field of view and the effects of ego-motion. Many studies have tackled these challenges by converting the perspective into a bird's-eye view with the aid of 3D sensors [33][34][35][36][37][38]. Although this approach is viable, it is prone to measurement errors and complications in multimodal data processing, particularly when using LiDAR and stereo sensors.
In this study, we focus on trajectory prediction from the egocentric perspective. A number of studies have been proposed to address this problem. For example, Bhattacharyya et al. [39] used Bayesian long short-term memory (LSTM) networks to model observation uncertainty, integrating these models with ego-motion to predict the distribution of potential future positions. Yagi et al. [23] used information such as pose, locations, scales, and past ego-motion to predict the future trajectory of a person. Chandra et al. [40] captured the interrelationships between nearby heterogeneous objects to predict trajectories. Yao et al. [41] proposed a multi-stream RNN encoder-decoder structure that captures both object location and appearance. Makansi et al. [42] estimated a reachability prior for objects based on the semantic map and projected this information into the future. Unlike the above methods, we propose a scene layout memory inspired by human perception, and we use the trajectories observed in visually similar scenes during training to aid in test-time trajectory prediction. In particular, we retrieve a set of intermediate goals from the memory, which is a representation that is neither overly simple nor redundant.

Goal-Driven Methods for Trajectory Prediction
Goal-driven networks are widely used in trajectory prediction under the egocentric view. For instance, Mangalam et al. [19] proposed a long-range multi-modal trajectory prediction method by inferring distant trajectory endpoints. Rhinehart et al. [43] introduced a generative multi-agent forecasting method that conditions on agent goals and models the relationships between individual agent goals. Zhao et al. [44] predicted an agent's potential future goal states by encoding its interactions with the environment and other agents, subsequently generating a trajectory state sequence conditioned on the goal. Yao et al. [24] proposed a bidirectional decoder on the predicted goal to improve the accuracy of long-term trajectory prediction. Wang et al. [25] predicted a series of stepwise goals at various temporal scales and integrated them into both encoders and decoders for trajectory prediction. The main contribution that sets our work apart from previous studies is that we predict future goals as intentions and build an intention memory bank of diverse scene layouts. Specifically, by comparing the currently observed trajectory to historical trajectories in the feature space, relevant intention features are retrieved from the memory bank to guide trajectory prediction.

Memory-Based Methods for Trajectory Prediction
A considerable body of research has introduced neural networks with memory functionality to address trajectory prediction. These networks can store and retrieve crucial information from sequential data, thereby improving trajectory prediction accuracy based on explicit memory information. One of the main advantages of these memory-based methods is their interpretability, and these methods are also considered complementary to the commonly used parametric neural networks. For example, Marchetti et al. [27] proposed a memory-augmented neural network to predict the trajectories of multiple targets. Specifically, they used recurrent neural networks to capture past and future trajectory features, leveraging a memory component to store and retrieve these features. Further, Xu et al. [28] proposed a sample-based learning framework incorporating a retrospective memory mechanism, storing samples from the training set into a pair of memory banks for matching relevant motion patterns during inference, thus improving the prediction accuracy. Furthermore, Huynh et al. [29] devised an adaptive learning framework that utilizes similarities between trajectory samples encountered during the testing process to enhance prediction accuracy. Different from the aforementioned methods, our work introduces the contextual information of the scene to build a scene layout memory bank, and we perform a two-level matching process (i.e., scene-level and trajectory-level) to retrieve a set of concise goal features for trajectory prediction. More importantly, we integrate the memory module into a conditional variational autoencoder, resulting in a semi-parametric final model that enjoys the benefits of both parametric and non-parametric methods.

Trajectory Prediction Using Clustering Methods
Lastly, as we will discuss in Section 3, our scene layout memory module involves a clustering step to encode the typical scene layouts. Therefore, we also include a brief discussion on trajectory prediction methods with a clustering component. For example, Akopov et al. [45] propose a cluster-based optimization of an evacuation process using a parallel bi-objective real-coded genetic algorithm (P-RCGA) based on the dynamic interactions of distributed processes that exchange the best potential decisions through a global population. Alam et al. [46] present a vessel trajectory prediction method that uses historical AIS data to cluster route patterns for each vessel type, thereby improving prediction accuracy. In addition, Sun et al. [47] propose a multimodal trajectory prediction method that involves a clustering step based on deep historical and future representations. Furthermore, Xue et al. [48] present a pedestrian trajectory prediction method using long short-term memory with route class clustering that captures pedestrian movement patterns. Unlike existing methods, the clustering that we perform in this work serves the purpose of identifying visually similar scene layouts, which has not been explored previously.

Memory-Enhanced Multi-Stage Goal-Driven Network
In this section, we present the proposed memory-enhanced multi-stage goal-driven network (ME-MGNet) in detail. Specifically, we first introduce our research objectives and hypotheses in Section 3.1, followed by the formulation of the egocentric trajectory prediction problem in Section 3.2 and an overview of the proposed ME-MGNet in Section 3.3. Next, we describe the four main components of our method, i.e., scene layout classification in Section 3.4, the memory bank in Section 3.5, the conditional variational autoencoder in Section 3.6, and the multi-stage goal generator in Section 3.7. Finally, we present the overall learning objective and loss functions in Section 3.8.

Research Objectives and Hypotheses
The objective of egocentric trajectory prediction is to forecast the trajectory of a target object in a dynamic environment from a first-person perspective over a period of time. In the field of intelligent driving, accurate trajectory prediction enables vehicles to proactively respond to the movements of other vehicles and pedestrians on the road, facilitating safe and reasonable driving decisions. For mobile robots, trajectory prediction aids in more intelligent path planning, obstacle avoidance, and efficient task completion. In this paper, the main objective is to devise an effective strategy based on a memory module and multi-stage goal prediction for egocentric trajectory prediction in intelligent vehicles.
Based on the literature discussed in Section 2, we can readily see that memory modules and goal-driven models have been widely used for trajectory prediction. Unlike existing work, however, the two main research hypotheses in this paper are as follows. Firstly, we conjecture that building a memory of typical scene layouts would benefit trajectory prediction, because in two similar road scenarios, the behavior patterns of vehicles and pedestrians exhibit certain similarities. For example, at intersections, pedestrians often wait by the roadside before crossing the road. On straight roads, pedestrians tend to walk along the road or sidewalk. Therefore, scene-level appearance can be used to classify target trajectories with similar behavior patterns. Secondly, by predicting multiple intermediate goals, the trajectory prediction process can be more effectively guided, thereby reducing cumulative errors during the inference process and improving long-term prediction performance. This is similar to real-life scenarios, where road users typically need to plan a series of target positions to guide their direction before reaching their destination. Multiple intermediate goals are more detailed and can more accurately represent the movement intentions than a single goal.

Problem Formulation
The goal of trajectory prediction is to predict the future sequence of positions of a target within a scene based on the observed sequence of past positions. At time step $t$, we use $X_t = [X_t^1, X_t^2, \ldots, X_t^n]$ to represent the past trajectories of $n$ objects. In addition,

Overview
As shown in Figure 2, the memory-enhanced multi-stage goal-driven network (ME-MGNet) is mainly built around four key components: the scene layout classification, the memory bank, the conditional variational autoencoder, and the multi-stage goal generator. Before we move on to discuss the details of each of these components, let us begin with a high-level overview of how the ME-MGNet works.
The encoding stage of the network can be divided into two parts. In the first part, which is shown in the upper portion of Figure 2, we first search the scene layout memory for the scene class that is visually closest to the test scene with image-level features, and we use the past encoder to encode the past trajectory in order to retrieve the most similar entry from the past memory. Next, we map the past memory to its corresponding intention memory in the memory bank, then we concatenate the encoded past trajectory feature and the intention feature before feeding them into the joint reconstruction decoder to predict the motion intention of the target. In the second part, which is shown in the lower portion of Figure 2, following a popular practice (e.g., [19,24,25]), a conditional variational autoencoder (CVAE) is adopted as the encoder to learn the distribution of future trajectories. Different from existing approaches, we add a goal generation network that takes the latent representation $Z$ from the CVAE as input and outputs the motion intention represented by a series of goals. In the decoding stage of the network, the motion intentions generated by the upper and lower branches in the encoding stage are fed into the memory filter, which compares the two to decide whether to use the motion intention output from the memory module. Then, the output intention from the memory filter is used as the hidden state in the backward decoding process of the multi-stage goal generator. Finally, the forward decoder is connected to the hidden state vector of the backward decoder to predict the trajectory coordinates at each time step.
In the following, we outline the key steps of our method during inference. This will provide readers with a clear workflow of the ME-MGNet.
Step 1. At the current time step, given the input image, we use a feature extraction network in our scene layout classification module to obtain the scene layout features from the image.
Step 2. We compare the extracted scene-level features with the $K$ scene layout features obtained using the K-means clustering algorithm during training to identify the most similar scene layout category. Subsequently, we select the corresponding memory bank from the scene layout memory according to the category.
Step 3. For each object in the scene, we encode its past trajectory using the past encoder to obtain feature $d_t$. Next, we compare $d_t$ with each of the features in the past memory bank $M_p$ in order to find the feature $k_i$ that is the most similar to $d_t$. Since the features in $M_p$ have a one-to-one correspondence to the features in the intention memory bank $M_f$, the future goal feature $v_i$ in $M_f$ is then retrieved, and together with $d_t$, it is fed into the joint reconstruction decoder to obtain the future goals $\bar{G}_t$.
Step 4. The past trajectory is used as the input to the conditional variational autoencoder (CVAE) module to obtain the predicted future goals $\hat{G}_t$.
Step 5. By comparing and integrating the goals obtained in Step 3 and Step 4, i.e., $\bar{G}_t$ and $\hat{G}_t$, the memory filter eliminates the $\bar{G}_t$ that deviates significantly from $\hat{G}_t$, and outputs the combined result $G'_t$.
Step 6. The filtered future goals $G'_t$ are input into the multi-stage goal generator to produce multi-stage goals, then they are combined with the forward decoding process of the CVAE to predict the trajectory coordinates at each time step.

Scene Layout Classification
Human perception almost always reflects an integration of top-down and bottom-up processes [49,50]. In particular, humans can quickly understand the gist of a scene, which is primarily a top-down process. Most existing methods in trajectory prediction, however, employ a bottom-up process and therefore lack an understanding of the overall appearance of the scene. In the context of trajectory prediction, the object movement patterns in similar scenes often have certain similarities. By classifying the scene layout, the environment in which the trajectory data are located can be divided into different scene categories to better learn the data distribution and improve the accuracy and robustness of trajectory prediction.
In this paper, we use the K-means clustering algorithm [51] to cluster the road scenes in the training set, grouping similar scene layouts into the same category based on scene-level visual features. Figure 3 shows the overall procedure of the scene layout classification module in both training and prediction. In the training stage, we use the ResNet-50 network [52], pre-trained on the ImageNet dataset [53], to extract the features from each image. Specifically, we utilize the 2048-dimensional features from the layer preceding the global average pooling layer of ResNet-50. This feature representation captures the high-level semantic information of the images while maintaining a relatively low dimensionality. Then, $K$ feature vectors are randomly selected from the extracted features to serve as the initial cluster centroids. For each feature vector, its distances to the $K$ cluster centroids are calculated, and the vector is assigned to the cluster with the closest centroid. The Euclidean distance is used to measure the distance between two feature vectors. Finally, the clusters are iteratively updated so that similar scene layouts fall into the same category. As shown in Figure 4, each column represents a cluster and corresponds to a specific scene layout. These clusters exhibit similar road layout features, such as crosswalks, through streets, and intersections. After training, we can divide the images in the training set into $K$ different clusters, each with its corresponding cluster centroid. Each cluster represents a typical scene layout.
During the prediction stage, we continue to use the pre-trained ResNet-50 network to extract the 2048-dimensional features from the current test image. By comparing the test image features with the cluster centroids using the Euclidean distance, we can identify the most similar cluster and thereby determine the scene layout category of the current test image. In addition, we construct a memory bank for the trajectory data under each type of scene layout to learn the behavior patterns of objects in similar scenes. In other words, we build $K$ different memory banks to handle trajectory prediction in various scenes, as further elaborated in the next section.
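The clustering and nearest-centroid assignment described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`euclidean`, `nearest_centroid`, `kmeans`), the fixed iteration count, and the toy 2-D vectors standing in for the 2048-dimensional ResNet-50 features are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to `feature` (the scene layout category)."""
    return min(range(len(centroids)), key=lambda c: euclidean(feature, centroids[c]))


def kmeans(features, k, iterations=20, seed=0):
    """Plain K-means: k randomly chosen samples as initial centroids,
    then alternate between assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):  # fixed iteration budget (an assumption)
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_centroid(f, centroids)].append(f)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Toy stand-ins for scene features: two well-separated groups.
train_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids = kmeans(train_features, k=2)

# Prediction stage: assign a test feature to its closest cluster,
# which would then select one of the K memory banks.
test_label = nearest_centroid([4.8, 5.2], centroids)
```

In this sketch, `test_label` plays the role of the scene layout category that selects the memory bank used for retrieval.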

The Memory Bank
Next, we discuss the design of our memory bank. The primary goal of the memory bank is to establish a one-to-one correspondence between the past memory and the intention memory, effectively linking the past trajectory to the future intentions. Specifically, each memory bank is composed of a related past memory bank and intention memory bank. The past memory bank stores past trajectory features, while the intention memory bank stores corresponding future goal features, linking past trajectories with future goals through key-value pairs. Following existing literature, the term intention refers to a series of goals, and each goal is a future coordinate of the target. Unlike the dense future trajectory, the series of goals is sparse and concise; for example, the total trajectory prediction span is 45 time steps on the JAAD dataset, and the goals correspond to the coordinates at the 15th, 30th, and 45th time steps, respectively. More formally, assume the past memory bank is $M_p = \{k_i \mid i = 1, 2, 3, \cdots, M\}$, where $k_i$ represents the instance at the $i$-th memory address, recording the past trajectory features extracted from the $i$-th training sample. Correspondingly, the intention memory bank is $M_f = \{v_i \mid i = 1, 2, 3, \cdots, M\}$, where $v_i$ represents the instance at the $i$-th memory address, recording the future goal features extracted from the $i$-th training sample. The two memory banks both store $M$ instances, which is the total number of trajectories in a scene category. For each scene category, the total number of trajectories varies, and we omit the cluster membership subscript for notational simplicity. The past trajectory feature $k_i$ in the past memory bank is the key, and the corresponding value, which is the future goal feature $v_i$, can be found in the intention memory bank through the key.
The joint reconstruction autoencoder. As shown in Figure 5a, we propose a joint reconstruction autoencoder to generate the feature representations stored in the memory bank. At time $t$, two encoders are used to encode the past trajectory $X_t$ and the future goal $G_t$, respectively, to obtain the past trajectory feature $k_i$ and the future goal feature $v_i$. Here, a one-dimensional convolutional layer and a GRU unit are used as encoders to encode the temporal information. The encoding process can be written using the following formulation:
The decoder consists of a three-layer perceptron and takes the concatenated past trajectory features and future goal features as inputs, outputting reconstructions of the past trajectory $X_t$ and the future goals $G_t$. Unlike other memory models that construct feature representations, the future goals here are specifically composed of the positions of a series of three goals in the future trajectory. As a result, the reconstructed stage goals can more accurately describe the target's behavioral intention. Therefore, the loss function of the joint reconstruction autoencoder can be defined as follows:
where $\alpha$ is a weighting parameter. After training, the autoencoder can take the past trajectory and future goals as inputs, perform encoding and decoding operations, and thus reconstruct a past trajectory and future goals similar to the input. At test time, when the future goals are not available, we use the goal features retrieved from the memory bank, followed by the decoder part of the joint reconstruction autoencoder, to produce the predictions of future goals. This process is illustrated in the upper-right portion of Figure 2.
Memory bank initialization. The joint reconstruction autoencoder can map the past trajectory and the future goals into a meaningful pair of representations. As shown in Figure 5b, during the memory bank initialization process, the two encoders in Figure 5a are used to encode the past trajectory and the future goals, respectively. This process generates the past trajectory feature $k_i$ and the future goal feature $v_i$ in the memory bank, with the two features having a one-to-one correspondence in the form of key-value pairs.
Memory retrieval. After the memory bank initialization, it is also necessary to perform a retrieval operation on the memory bank to obtain the future goal features corresponding to the past trajectories in the memory bank that are most similar to the current test data. This helps to better predict future trajectories by combining the encoded features. More specifically, the past encoder in Figure 5 is used to encode the observed test trajectory $X_t$ at the current time step $t$ to obtain the encoded feature $d_t$. Next, the cosine similarity between the encoded feature $d_t$ and all the keys $k_i$ in the memory bank is calculated to perform retrieval and to find the past trajectory features that are the most similar. The similarity is the standard cosine similarity, $\cos(d_t, k_i) = \frac{d_t \cdot k_i}{\lVert d_t \rVert \, \lVert k_i \rVert}$. After calculating the pairwise similarities above, we sort the similarities of all keys, then select the key with the highest similarity and retrieve its corresponding value (i.e., one of the goal features $v_i$). This goal feature is concatenated with the observed trajectory feature $d_t$ and sent to the joint reconstruction decoder to output the future goals.
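The key-value retrieval step can be illustrated with a minimal sketch. It assumes the memory banks are plain Python lists of feature vectors with matching indices; the function names and the toy 2-D keys and 3-D values are hypothetical and chosen only to make the example readable, not taken from the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_goal_feature(d_t, past_memory, intention_memory):
    """Find the key k_i most similar to the query d_t and return
    its index together with the paired value v_i."""
    best = max(range(len(past_memory)),
               key=lambda i: cosine_similarity(d_t, past_memory[i]))
    return best, intention_memory[best]


# Toy memory bank: keys (past trajectory features) and values (goal features)
# share the same index, mirroring the one-to-one correspondence.
past_memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
intention_memory = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

index, v = retrieve_goal_feature([0.1, 0.9], past_memory, intention_memory)
```

Here the query points mostly along the second key, so the second value is returned; in the full model this value would be concatenated with $d_t$ and passed to the joint reconstruction decoder.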

Memory filter. Relying solely on the trajectory information in the memory bank is not enough to cope with the diverse nature of scenes. In particular, test scenes may deviate greatly from the training scenes, and in such cases, the retrieved future goals may negatively impact the trajectory prediction. Therefore, we propose a memory filter to decide whether the future goals $\bar{G}_t$ retrieved from the memory bank deviate greatly from the current test situation. If they do, combining the predicted goals obtained from the memory often worsens the result. More specifically, our memory filter compares the prediction results $\hat{G}_t$ obtained by the goal generation network in the conditional variational autoencoder (described in Section 3.6) to the future goals generated by the joint reconstruction decoder in order to decide whether to discard the latter. This process is illustrated in the upper-right portion of Figure 2. When the goal $\bar{G}_t$ is filtered out, the filter outputs $\hat{G}_t$; otherwise, it outputs the average of the goals $\bar{G}_t$ and $\hat{G}_t$. Therefore, the formulation for the final output result $G'_t$ of the filter is as follows:
where $s_t(\hat{G}_t, \bar{G}_t)$ is the filtering function proposed in this paper, which estimates a binary score based on the predicted goals $\hat{G}_t$ obtained with the goal generation network and the predicted goals $\bar{G}_t$ retrieved from the memory bank. The score is calculated as $s_t = \mathbb{1}(o_t > \delta)$, where $\mathbb{1}(o_t > \delta)$ represents an indicator function. When $o_t$ exceeds a pre-defined threshold $\delta$, the function outputs 1; otherwise, it outputs 0. In this paper, the threshold $\delta$ is set to 0.5 by default. The calculation formulation of $o_t$ is given as follows:
where sig and FC denote a sigmoid layer and a fully connected layer, respectively. We train the layers in the memory filter following the method proposed in the certainty-based selector in [29]. In summary, the final score $s_t$ is 0 or 1, which determines whether to filter out the goals predicted by the memory bank.
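The gating behavior of the memory filter can be sketched as follows. Two points are assumptions made for this example rather than statements from the text: the score $o_t$ is passed in as a ready-made number (in the model it comes from the learned sigmoid and fully connected layers), and $s_t = 1$ is taken to mean that the memory goals are kept, since the text does not state which value of $s_t$ corresponds to filtering. The function name `memory_filter` is also hypothetical.

```python
def memory_filter(g_hat, g_bar, o_t, delta=0.5):
    """Combine CVAE goals g_hat with memory goals g_bar.

    g_hat, g_bar: lists of (x, y) stage goals of equal length.
    o_t: score in [0, 1], supplied directly here for illustration.
    Assumption: s_t = 1 keeps the memory goals, s_t = 0 filters them out.
    """
    s_t = 1 if o_t > delta else 0  # indicator 1(o_t > delta)
    if s_t == 1:
        # Memory goals kept: output the element-wise average of both sets.
        return [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(g_hat, g_bar)]
    # Memory goals filtered out: fall back to the CVAE goals alone.
    return [list(p) for p in g_hat]


g_hat = [[0.0, 0.0], [2.0, 2.0]]
g_bar = [[2.0, 2.0], [4.0, 4.0]]

kept = memory_filter(g_hat, g_bar, o_t=0.9)      # averaged goals
filtered = memory_filter(g_hat, g_bar, o_t=0.2)  # CVAE goals only
```

If the opposite polarity is intended, only the `if s_t == 1` branch needs to be swapped; the two possible outputs stay the same.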

The Conditional Variational Autoencoder
Following recent work [19,24,25], we also use a conditional variational autoencoder (CVAE) to encode the past trajectory sequence to derive a latent variable Z, thereby learning and generating an approximate distribution of future trajectories.More specifically, our CVAE consists of the following modules: (1) A recognition network Q ϕ (Z q |X t , Y t ), which captures the correlation between the latent variable Z and the actual trajectory Y t .(2) A conditional prior network P θ (Z p |X t ), which models the latent variable Z based on past observed trajectories X t .(3) A goal generation network P ω ( Ĝt |X t , Z), which encodes input features and generates multi-stage goals.Here, ϕ, θ, ω denote the parameters of the corresponding networks.Each of these three networks consists of a three-layer multi-layer perceptron.In contrast to prior approaches, our goal generation network outputs several stage goals in the future trajectory, which more accurately represent the target's behavioral intention and thus better guide the subsequent decoding process.
In the training stage, we encode the past trajectory $X_t$ and the ground truth future trajectory $Y_t$ separately using gated recurrent unit encoders (i.e., the GRU encoders in Figure 2), yielding feature vectors $h_X$ and $h_Y$, respectively. To capture the dependence between the past trajectories and the ground truth future trajectories, we concatenate the feature vectors $h_X$ and $h_Y$ and feed them into the recognition network to predict the distribution mean $\mu_z^q$ and standard deviation $\sigma_z^q$ of future trajectories. The conditional prior network uses only $h_X$ to predict the distribution mean $\mu_z^p$ and standard deviation $\sigma_z^p$, without knowledge of the ground truth future trajectory. Next, we sample $Z_q$ from $\mathcal{N}(\mu_z^q, \sigma_z^q)$, concatenate it with $h_X$, and ultimately use the goal generation network to obtain the goals $\hat{G}_t$ required by the memory filter. During the test phase, the ground truth future trajectories are not available. Therefore, unlike the training phase, $Z_p$ is sampled from $\mathcal{N}(\mu_z^p, \sigma_z^p)$ to generate the goals $\hat{G}_t$.
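The difference between the training and test phases comes down to which network supplies the Gaussian parameters for the latent variable. The following Python sketch illustrates this switch. It assumes a diagonal Gaussian sampled in the reparameterized form $Z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, which is a common choice for CVAEs but is not stated in the text; the function names and the tuple layout are hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw Z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    ASSUMPTION: diagonal Gaussian with reparameterized sampling.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def choose_latent(prior, recognition, training, rng):
    """Sample Z_q during training and Z_p at test time.

    prior:       (mu_p, sigma_p) from the conditional prior network.
    recognition: (mu_q, sigma_q) from the recognition network, or None
                 at test time, when the ground truth future is unavailable.
    """
    if training:
        if recognition is None:
            raise ValueError("recognition parameters are required for training")
        mu, sigma = recognition
    else:
        mu, sigma = prior
    return sample_latent(mu, sigma, rng)
```

The sampled vector would then be concatenated with $h_X$ and passed to the goal generation network.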

The Multi-Stage Goal Generator
The network architecture of the multi-stage goal generator is shown in Figure 6. The stage goals output by the memory filter are used as inputs, guiding the generation of lower-level stage goals in a top-down manner. Specifically, the three input stage goals are passed through the fully connected layers to obtain three goal features. At the same time step, the three goal features are connected with the GRU hidden features in a backward recursive process of the lower layer, thereby guiding the generation of hidden features $h^g_{t+j\rho/m}$ of several stage goals from time $t + \rho$ to $t + 1$. The number of stage goal features $m$ can be adaptively chosen according to the performance profile of the final model in a specific domain, which will be further discussed in Section 4. The output of the multi-stage goal generator can be defined as follows:

$$h^g_{t+j\rho/m} = f_{\mathrm{MSGE}}(G'_t), \quad j = 1, 2, \ldots, m.$$

Here, $G'_t$ represents the stage goals produced by the memory filter, $j \in \{1, 2, 3, \ldots, m\}$ denotes the index of the $j$-th stage goal, $\rho$ is the number of time steps to be predicted, and $m \in [1, \rho]$ represents the division of the future trajectory into $m$ stage goals. In addition, the function $f_{\mathrm{MSGE}}$ denotes our multi-stage goal generator.
During the final decoding stage, the stage goal features obtained using the multi-stage goal generator are concatenated with the hidden state feature from the forward recursive inference to predict the final trajectory for that moment. It is worth noting that, due to the variable number of stage goal features output by the multi-stage goal generator, the connection between the generator and the forward recursive inference in the right portion of Figure 2 is represented with dashed lines.
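To make the indexing $t + j\rho/m$ concrete, the short Python sketch below lists the future time offsets at which the $m$ stage goals fall. It is illustrative only: the function name is hypothetical, the horizon values in the usage note are example numbers rather than settings taken from the paper, and rounding down when $\rho$ is not divisible by $m$ is an assumption.

```python
def stage_goal_offsets(rho, m):
    """Return the offsets j * rho / m for j = 1..m (relative to time t).

    rho: number of future time steps to predict.
    m:   number of stage goals, with 1 <= m <= rho.
    ASSUMPTION: offsets are rounded down when rho is not divisible by m.
    """
    if not 1 <= m <= rho:
        raise ValueError("m must satisfy 1 <= m <= rho")
    return [j * rho // m for j in range(1, m + 1)]
```

With a hypothetical horizon of 45 steps, three stage goals fall at offsets 15, 30, and 45, and setting $m = \rho$ yields one stage goal per predicted time step.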

Loss Functions
The overall loss function of the model in this paper consists of three parts: the trajectory prediction loss, the goal generation loss, and the KL divergence (KLD) loss. Specifically, the trajectory prediction loss quantifies the error between the model-predicted trajectory $Y'_t$ and the true future trajectory $Y_t$. The goal generation loss measures the error between the future goals $G'_t$ produced by the memory filter and the true future goals $G_t$. Their calculation can be given as follows: In addition, the KLD loss is used to measure the closeness between the distribution $\mathcal{N}(\mu_z^q, \sigma_z^q)$ output by the recognition network of the CVAE module and the distribution $\mathcal{N}(\mu_z^p, \sigma_z^p)$ output by the prior network in the same module. The total losses, including the KLD losses, are as follows: where $Z_q$ and $Z_p$ are latent variables sampled from the distributions $\mathcal{N}(\mu_z^q, \sigma_z^q)$ and $\mathcal{N}(\mu_z^p, \sigma_z^p)$, respectively.
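For reference, the KL divergence between two diagonal Gaussians has a well-known closed form, which the following Python sketch evaluates. The direction $\mathrm{KL}(\mathcal{N}(\mu_z^q, \sigma_z^q) \,\|\, \mathcal{N}(\mu_z^p, \sigma_z^p))$ is the usual choice in CVAE training and is assumed here, since the text does not specify it; the function name is hypothetical, and any weighting of this term in the total loss is left out.

```python
import math


def kld_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians.

    All arguments are equal-length lists; sigma values are standard
    deviations and must be positive.
    ASSUMPTION: the divergence is taken from the recognition distribution
    (q) to the prior distribution (p).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (
            math.log(sp / sq)
            + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
            - 0.5
        )
    return total
```

The divergence is zero when the two distributions coincide and grows as the recognition and prior networks disagree, which is what pulls the prior towards the recognition network during training.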

Experiments

Datasets
To better demonstrate the efficacy and robustness of ME-MGNet, we evaluate our method using three public datasets, namely JAAD, PIE, and KITTI, and a new dataset, Fuzhou DashCam (FZDC), that we create in this paper.
JAAD dataset [54]. The JAAD dataset is designed for studying trajectory prediction from an ego-vehicle perspective, focusing on the behavior of pedestrians and drivers on urban roads. It comprises 346 richly annotated short video clips (5-10 s long each) extracted from over 240 h of vehicle camera footage, with a video resolution of 1920 × 1080 and a video capture frequency of 30 Hz. The videos were collected at multiple locations in North America and Eastern Europe, covering various weather conditions, lighting conditions, and traffic scenarios. The JAAD dataset annotates video frames at a downsampled frequency of 30 Hz, providing a detection bounding box and a unique ID for each object. In accordance with the literature [55], we divide the 346 short video clips in the JAAD dataset into three groups in our experiments: a training set, a validation set, and a test set, containing 50%, 10%, and 40% of the data, respectively.
PIE dataset [55]. The PIE dataset is also designed for studying motion prediction from a vehicle-mounted perspective. It contains over 6 h of vehicle-mounted video footage of typical traffic scenes, divided into 53 video sequences, each approximately 10 min long. Additionally, the dataset provides vehicle information (vehicle speed, direction, and GPS coordinates) from OBD sensors. All the video recordings are in high-definition format (1920 × 1080 pixels) at 30 frames per second. The PIE dataset annotates video frames at a downsampled frequency of 30 Hz, provides a detection bounding box and a unique target ID for each target, and includes rich scene and behavior annotations. Following the standard practice [55], the PIE dataset is also divided into three groups for experiments: a training set, a validation set, and a test set, containing 50%, 10%, and 40% of the data, respectively.
KITTI dataset [56]. The KITTI dataset is one of the most commonly used datasets in the field of autonomous driving and computer vision research. The dataset primarily contains various sensor data collected in real urban environments and can be used for tasks such as trajectory prediction, object detection and tracking, depth estimation, and optical flow estimation. Additionally, the dataset includes real image data from different scenes such as urban areas, rural areas, and highways, sampled and annotated at a frequency of 10 Hz, with an image resolution of 1238 × 374. Since the official KITTI dataset does not provide test set annotations, we use the training set from the object tracking task for trajectory prediction, which includes 21 video sequences, each ranging from 10 to 100 s long. Each sequence contains high-resolution RGB images, detection bounding boxes, unique target IDs, LiDAR point clouds, and the corresponding vehicle motion trajectories. We also divide the 21 video sequences into three groups: a training set, a validation set, and a test set, accounting for 50%, 10%, and 40% of the total data, respectively.
FZDC dataset. For trajectory prediction from the egocentric perspective, this paper proposes the Fuzhou DashCam (FZDC) dataset, which utilizes on-board cameras to record at 25 frames per second across multiple road sections in Fuzhou City, Fujian Province, China. The video resolution is 1280 × 720, capturing various road scenes such as urban main roads, highways, crosswalks, crossroads, and intersections. The collected and processed dataset contains a total of 66 video sequences, each lasting 20-30 s, with video frames annotated at a downsampled frequency of 10 Hz. Additionally, the dataset includes various object categories (e.g., vehicles, pedestrians, and cyclists). Each object is annotated with a bounding box, a unique ID, and a potential collision risk category, making the dataset suitable not only for trajectory prediction tasks but also for collision risk assessment tasks. An example of the trajectory annotation in the FZDC dataset is shown in Figure 7. The 66 video sequences in the dataset are divided into three groups: a training set, a validation set, and a test set, which account for 50%, 10%, and 40% of the total data volume, respectively. We note that the average driving behavior in China is somewhat different from the driving behaviors in more developed countries [57][58][59][60], and the FZDC dataset presents a unique and interesting challenge to accurate trajectory prediction. The FZDC dataset is publicly available from https://github.com/wxe999/FZDC_dataset (accessed on 29 June 2024) for research purposes.

Implementation Details
We conducted all experiments using a desktop server with an Ubuntu 20.04 OS, equipped with a 4.00 GHz Intel Core i9-9900KS CPU, 64 GB of RAM, and a single NVIDIA GeForce RTX 3090 GPU. Our model uses gated recurrent units (GRUs) as the backbone for both the encoder and decoder, with the hidden size set to 256. For the memory bank construction, the feature dimensions of the past memory bank and the intention memory bank are both set to 64. We use the Adam optimizer to train our model, starting with an initial learning rate of 0.001, which is dynamically adjusted according to the validation loss. We use the rectified linear unit (ReLU) as the activation function, and to mitigate overfitting, we integrate batch normalization and dropout layers. Our end-to-end optimization is conducted with a batch size of 128, and the training process concludes after 100 epochs.
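One common way to adjust a learning rate according to the validation loss is to reduce it whenever the loss stops improving. The Python sketch below illustrates that idea only. The text does not specify the adjustment rule, so the reduce-on-plateau strategy itself, the class name, and the `factor` and `patience` values are all assumptions; only the initial learning rate of 0.001 comes from the text.

```python
class PlateauSchedule:
    """Reduce the learning rate when the validation loss stops improving.

    ASSUMPTION: the rule, `factor`, and `patience` are illustrative and
    are not taken from the paper.
    """

    def __init__(self, lr=0.001, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the schedule with one epoch's validation loss."""
        if val_loss < self.best:
            # Improvement: remember the new best loss and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # No improvement for `patience` epochs: shrink the rate.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation loss returns the learning rate to use for the next epoch.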

Evaluation Metrics
In this paper, we follow the standard practice in recent work (e.g., [24,25]) and primarily assess the performance of our proposed approach using the mean squared error (MSE) between the positions of the upper-left and lower-right corners of the bounding box. The formulation for MSE is as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,$$

where $n$ denotes the number of predicted objects, $y_i$ represents the ground truth, and $\hat{y}_i$ denotes the predicted value generated by the model.
We also report two additional metrics: the center mean squared error ($C_{MSE}$) and the center final mean squared error ($CF_{MSE}$). $C_{MSE}$ measures the accuracy of the entire trajectory, while $CF_{MSE}$ measures only the accuracy of the endpoint of the trajectory. Both metrics are computed similarly to MSE, but they are based on the centroid of the bounding box. All the metrics are measured in pixels.
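The three metrics can be illustrated for a single predicted object with the Python sketch below. It assumes boxes in `(x1, y1, x2, y2)` corner format, averages the box error over all time steps and coordinates, and uses the squared Euclidean distance between box centers for the center-based metrics. These conventions and the function names are assumptions for illustration; the exact averaging used in the benchmark code may differ.

```python
def box_mse(pred, gt):
    """MSE over all corner coordinates and time steps.

    pred, gt: lists of (x1, y1, x2, y2) boxes, one per future time step.
    """
    sq = [(p - g) ** 2 for pb, gb in zip(pred, gt) for p, g in zip(pb, gb)]
    return sum(sq) / len(sq)


def center(box):
    """Centroid of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def center_mse(pred, gt):
    """C_MSE: mean squared center distance over the whole trajectory."""
    errs = []
    for pb, gb in zip(pred, gt):
        (px, py), (gx, gy) = center(pb), center(gb)
        errs.append((px - gx) ** 2 + (py - gy) ** 2)
    return sum(errs) / len(errs)


def center_final_mse(pred, gt):
    """CF_MSE: squared center distance at the final time step only."""
    return center_mse(pred[-1:], gt[-1:])
```

In practice, each value would additionally be averaged over all predicted objects in the test set.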

Results
Quantitative comparison. To ensure the reliability of the experimental results, we report the model output as the average of three experiments. As presented in Table 1, we compare our model against other state-of-the-art trajectory prediction algorithms on the JAAD and PIE datasets, with the best result under each metric indicated in bold. The proposed method achieves the best performance across all the metrics on the JAAD dataset. Compared to our previous work, MGNet, which ranks second, our method achieves an average improvement of 3.2% in MSE and 3.1% and 3.5% in $C_{MSE}$ and $CF_{MSE}$, respectively. Similarly, the proposed method also achieves the best performance under all metrics on the PIE dataset. MGNet ranks second here as well, and our method shows an average improvement of 2.1% in MSE and 2.3% and 4.2% in $C_{MSE}$ and $CF_{MSE}$, respectively. In addition, we evaluate our method on the KITTI and FZDC datasets. As shown in Table 2, our method achieves the best performance across all the metrics on the KITTI dataset. Compared with MGNet, which ranks second, ME-MGNet achieves an average improvement of 7.3% in MSE and 14.3% and 10.2% in $C_{MSE}$ and $CF_{MSE}$, respectively. ME-MGNet also achieves the best performance for all the metrics on the FZDC dataset. Compared to the second-ranked algorithm, ME-MGNet achieves an average improvement of 2.6% in MSE and 3.1% and 3.0% in $C_{MSE}$ and $CF_{MSE}$, respectively. In summary, the results on all four datasets demonstrate that ME-MGNet improves the accuracy of trajectory prediction.
Exploration study. As shown in Table 3, we examine the performance impact on the JAAD and PIE datasets of adjusting the number of stage goal features output by the multi-stage goal generator. In the table, outputting three stage goals means that the model does not use the multi-stage goal generator but only uses the three future stage goals output by the memory filter to guide trajectory generation. The results show that on the JAAD dataset, the best performance is obtained when we set the number of stage goals to 15, while on the PIE dataset, the best results are obtained when 9 stage goals are used. This indicates that the optimal number of output stage goal features depends on the dataset and must be tuned accordingly. Additionally, the results are the worst when the multi-stage goal generator is not used, that is, when only three goal features are used. This finding verifies the effectiveness of the multi-stage goal generator proposed in this paper.
We also explore how the number of scene layout categories used in K-means clustering affects the prediction performance. The results are shown in Table 4. When the number of scene layout categories is 1, the model does not classify the scene layout and only generates a single memory bank to assist in subsequent trajectory prediction. On the JAAD dataset, the best results are obtained when the number of scene layout categories is set to 20, and the prediction performance gradually decreases as the number of categories increases further. On the PIE dataset, the best results are obtained when the number of scene layout categories is set to 25, and the prediction performance likewise decreases gradually as the number increases further. We observe that when the number of scene layout categories is too large, originally similar scene layouts are further subdivided, resulting in less trajectory data in the memory bank of the corresponding scene layout. Consequently, it becomes impossible to retrieve effective goal features from the memory bank. This indicates that the number of scene layout categories should not be too large and needs to be adjusted according to the dataset. Additionally, the results are the worst when scene layout classification is not used. This finding verifies the effectiveness of the scene memory bank proposed in this paper.
Ablation study. Next, we systematically remove various components from our model on the JAAD dataset to evaluate their impact on the experimental results. The results are shown in Table 5, where "BL" represents the baseline model, which only uses the CVAE to encode and output a single goal to guide the forward recursive reasoning of trajectory prediction, without the scene memory bank module and the multi-stage goal generator. "MB" represents the memory bank module, and "MSG" represents the multi-stage goal generator. The table shows that including either the scene memory bank module or the multi-stage goal generator has a positive impact on the results. Of the two, the memory bank module yields the larger gain, with an average performance improvement of 8.8% in MSE and 8.6% and 5.9% in $C_{MSE}$ and $CF_{MSE}$, respectively. When the multi-stage goal generator is used, the average performance is improved by 5.9% in MSE and 7.1% and 4.1% in $C_{MSE}$ and $CF_{MSE}$, respectively. Finally, the best results are achieved by combining both the memory bank and the multi-stage goal generator, with an average performance improvement of 13.8% in MSE and 13.2% and 10.9% in $C_{MSE}$ and $CF_{MSE}$, respectively. These results demonstrate the efficacy of the memory bank and the multi-stage goal generator proposed in this paper.
Qualitative results. In Figure 8, we present example prediction results on the JAAD, PIE, KITTI, and FZDC datasets to qualitatively demonstrate the prediction performance of our method. In the examples, the blue paths and bounding boxes represent the past trajectory, the red paths and bounding boxes represent the ground truth, and the green paths and bounding boxes represent the prediction results obtained using ME-MGNet. The test scenes contain real-life traffic scenarios, such as urban main roads, crosswalks, and intersections. The trajectories predicted by our method are largely accurate when compared to the ground truths, and they clearly capture the intentions of other road users. Furthermore, as the prediction horizon lengthens, the errors between the predicted trajectories and the ground truth increase, but our method still predicts the intentions relatively accurately even when the exact trajectories deviate, which is essential for road safety applications.

Discussion
In this paper, we propose two hypotheses based on existing work to improve model performance. First, to address the issue that existing trajectory prediction methods based on memory-augmented networks lack an understanding of the scene context, we redesign the memory bank within the memory-augmented network. Specifically, we divide the trajectory features in the memory bank into several clusters based on their scene layouts. The trajectory features stored in each cluster originate from similar road scenes and exhibit similar behavior patterns. This allows for accurate and efficient retrieval of historical trajectory information to guide future predictions. Second, unlike existing goal-driven networks that guide the generation of future trajectories by predicting a single long-term goal, we propose guiding future trajectory generation by predicting several stage-wise goals. Compared to a single long-term goal, multi-stage goals can more accurately reflect the behavioral intentions of the agent, thereby more precisely guiding the generation of future trajectories. Finally, we compare the proposed model with existing methods on four datasets, JAAD, PIE, KITTI, and FZDC, to verify its effectiveness. The first three datasets are widely used benchmarks for trajectory prediction, and the favorable experimental results demonstrate the superiority and generality of our method. We also create a new dataset, FZDC, to facilitate future research in this area. We conduct ablation experiments on the JAAD dataset, and the results validate the two hypotheses we proposed. We note that our method can be readily integrated into existing methods based on a conditional autoencoder (e.g., [24,26]) and other trajectory prediction approaches lacking an understanding of the scene-level context to further enhance their performance.
Limitations. Figure 9 presents examples of prediction failures of the proposed method in some special cases, highlighting areas for improvement in future research. Specifically, in Figure 9a, the past trajectory changes little, but the vehicle suddenly accelerates, so the change in the future trajectory increases abruptly, resulting in a large deviation between the prediction result and the actual situation. In Figure 9b, despite minimal changes in the past trajectory, the bumps and the increased speed of the vehicle cause the future trajectory to fluctuate abnormally, making accurate prediction by the model almost impossible. The pedestrian's distance from the camera further adds to the difficulty. In Figure 9c, the model incorrectly predicts that the pedestrian intends to cross the road, while the pedestrian decides not to do so. We note, however, that in the context of autonomous driving, it is crucial to place greater emphasis on the intentions of pedestrians crossing the street, as incorporating defensive driving can significantly enhance safety. In Figure 9d, due to vehicle bumps, both the past and future trajectories of the distant pedestrian exhibit large fluctuations, making it difficult for the model to accurately predict the trajectory. In general, typical challenging scenarios arise when there is a sudden change in the movement pattern of either the ego-vehicle or the other agents, as well as when dealing with small and distant objects.

Figure 1. Example trajectory prediction results using our method on the JAAD dataset. The blue path represents the past trajectory, and the red and green paths and bounding boxes represent the ground truths and predictions, respectively.

Figure 2. Overview of our ME-MGNet architecture. Arrows in orange, green, and black denote connections during training, inference, and both training and inference, respectively.

Figure 3. Illustration of the scene layout classification. The green dashed box represents the training process of the module, during which K-means clustering is performed on the scene-level image features. The orange dashed box represents the prediction process, during which the test image feature is compared to the cluster centroids to find the closest cluster as the classification result.

Figure 4. Scene layout clustering results. Each column shows example images from a cluster we obtained using K-means clustering, corresponding to a distinct scene layout.

Figure 5. Memory bank initialization. We first train a joint reconstruction autoencoder, as shown in (a), and then use the two encoders in the joint reconstruction autoencoder to encode all the past trajectories and future goals in the scene cluster to initialize the memory bank, as shown in (b). The asterisks (*) indicate reconstruction results, and the stars indicate stage goals.

Figure 6. Detailed structure of the multi-stage goal generator. The stage goals output by the memory filter are used as inputs, guiding the generation of lower-level stage goals in a top-down manner.

Figure 7. Example images from the Fuzhou DashCam (FZDC) dataset with trajectory annotations overlaid. The ground truth future trajectory is shown in red, and the past trajectory is shown in blue.

Figure 8. Qualitative results of trajectory prediction on the JAAD, PIE, KITTI, and FZDC datasets. Red indicates ground truth, green indicates predictions, and blue indicates past trajectories. Best viewed in color.

Figure 9. Examples of prediction failures on the JAAD dataset. Subfigures (a-d) are four typical failure examples. Red indicates ground truth, green indicates predictions, and blue indicates past trajectories. Best viewed in color.

Table 1. Quantitative results on the JAAD and PIE datasets. Lower values are better. The bold text indicates the best results.

Table 2. Quantitative results on the KITTI and FZDC datasets. Lower values are better. The bold text indicates the best results.

Table 3. Impact of varying the number of stage goals in the multi-stage goal generator on the performance on the JAAD and PIE datasets. Lower values are better. The bold text indicates the best results.

Table 4. Impact of varying the number of scene layout categories on the performance on the JAAD and PIE datasets. Lower values are better. The bold text indicates the best results.

Table 5. Ablation study of our method on the JAAD dataset. Lower values are better. BL: the baseline model, MB: memory bank module, MSG: multi-stage goal generator. The checkmarks indicate whether the corresponding components of our method are activated. The bold text indicates the best results.