Pyramidal Predictive Network: A Model for Visual-frame Prediction Based on Predictive Coding Theory

Visual-frame prediction is a pixel-dense prediction task that infers future frames from past frames. Lack of appearance detail, low prediction accuracy and high computational overhead remain major problems of current models and methods. In this paper, we propose a novel neural network model inspired by the well-known predictive coding theory to address these problems. Predictive coding provides an interesting and reliable computational framework, which we combine with other findings, such as the observation that different levels of the cerebral cortex oscillate at different frequencies, to design an efficient and reliable predictive network model for visual-frame prediction. Specifically, the model is composed of a series of recurrent and convolutional units forming the top-down and bottom-up streams, respectively. The update frequency of the neural units in each layer decreases as the network level increases, so that higher-level neurons can capture information over longer time spans. According to the experimental results, this model shows better compactness than, and comparable predictive performance with, existing works, implying lower computational cost and higher prediction accuracy. Code is available at https://github.com/Ling-CF/PPNet.


Introduction
The idea that brains are essentially prediction machines is one of the unifying theories in cognitive science. It holds that brain functions such as perception, motor control and memory are all formed and modulated by prediction. In particular, it yields a sensorimotor framework (predictive coding) for understanding how humans act on the basis of predictions. It proposes that most functions in the brain operate in a predictive manner, expressed through the brain's internal model. The brain can therefore continuously predict and form our perception, on the basis of which we can also execute motor actions. This internal predictive model, shaped by the neurons' representations, is also constantly learning and updating itself in order to better predict the changing environment. This idea, if properly implemented in learning architectures, will also be useful in practical applications such as video-frame prediction.
Video-frame prediction is the task of predicting future visual frames from given context frames. From the perspective of applications, being able to predict the future is of great significance. Adaptive systems that can predict how future scenes will unfold, based on an internal model learned from the context, open up many possibilities. For example, predictive ability allows robots to foresee the future and even understand human intentions by analysing their movements and actions, so as to act correctly ahead of time (Figure 1). Self-driving cars can anticipate forthcoming situations and make judgments beforehand [1]. Moreover, there are a number of other applications, such as anticipating activities and events [2], long-term planning, predicting pedestrian trajectories in traffic [3] and precipitation forecasting [4]. With predictive ability, applications can become more efficient: they can foresee a changing future and react accordingly in advance, making their behavior smoother and more energy efficient. The methods used may differ subtly between domains (for instance, in autonomous driving the scene may be more complex, so a larger and deeper neural network or other effective pre-processing or post-processing methods may be needed), but the overall framework of the model should remain unchanged.
Figure 1. A robot prediction system. Given the context image sequences, the robot predicts future frames with a predictive model and takes corresponding actions beforehand based on the predictions.
arXiv:2208.07021v3 [cs.CV] 15 Nov 2022
Although a number of models and methods for visual-frame prediction have been proposed, building on the success of deep learning, the accuracy of the predicted frames is still far from what applications require. This problem is more severe in long-term prediction and when predicting visual sequences with large changes between frames. Moreover, given the large computational overhead of existing models, making the computation more efficient, so that the algorithms become easier to deploy, is another promising direction.
Therefore, in this work, we propose to combine the theoretical framework of predictive coding with deep learning methods to design a more efficient network model for visual-frame prediction. This cognition-inspired framework is a hierarchical processing model that mimics the hierarchical processing structure of the cerebral cortex. One of the main advantages of such a predictive coding model is that the internal model is updated by a combination of bottom-up and top-down information streams instead of relying only on outside information. This provides a possible framework for simulating and predicting the environment, which is also the essence of what early works tried to implement as computational models [5,6].
The main contributions of this work are as follows: 1) We propose and construct a novel hierarchical artificial neural network model, which we call the pyramidal predictive network (PPNet). It is a modification of the generic framework proposed by predictive coding. As the name suggests, the update rate of neurons decreases as the network level increases, which mimics the lower oscillation frequencies observed in higher areas of the visual cortex and, as a result, makes the model encode information at various temporal and spatial scales. 2) We improve the loss function to match the video prediction task. Inspired by the attention mechanism (for example, when the prediction differs greatly from reality, the brain reacts more strongly), we introduce an adaptive weight into the loss function: the greater the prediction error, the greater the weight it is given. According to the results, the proposed methods achieve better predictions at lower computational cost with a more compact and more time-dependent architecture. Our methods and their basis are introduced in detail below.
The rest of this article is organized as follows. Section 2 briefly reviews related work on "predictive brains" and existing visual-frame prediction models. Section 3 introduces the network structure and methods in detail. Section 4 presents the experimental results, with quantitative and qualitative evaluations of our methods against the baseline. Section 5 gives a brief discussion of the proposed method. Finally, Section 6 concludes the paper and presents our thoughts on future studies.

Related Work
In order to better integrate predictive coding theory into neural networks, we need a detailed review of both aspects. In this section, the conceptual models of predictive coding and its related learning frameworks, as well as the state-of-the-art methods for visual-frame prediction from the perspective of machine learning will be reviewed.
Predictive coding, a computational model of cognition, asserts that our perception mostly comes from the brain's own internal inference model, which combines sensory information with expectations. Those expectations can come from the current context, from an internal model in memory, or from an ongoing prediction over time. As a theoretical ancestor, Helmholtz first proposed the concept of unconscious inference taking place in the predictive brain [7]. For example, an identical image can be perceived in different ways. Since the image formed on the retina does not change, perception must be the result of an unconscious process that deduces the cause of sensory information in a top-down manner. Later, in the 1940s, Bruner demonstrated through empirical psychological studies that perception results from the interaction between sensory stimuli (bottom-up, as a recognition model) and conceptual knowledge (top-down, as a generative model) [8]. Bar proposed a cognitive framework in which learned representations are used to generate predictions rather than passively "waiting" to be activated by sensory input [9]. From the neuroscience perspective, Blom et al. also argued that predictions drive neural representations of visual events ahead of incoming sensory information [10], which suggests that neural representations are driven by predictions generated by the brain rather than by the actual inputs.
To describe the predictive framework more rigorously, the term "predictive coding" was imported from the field of signal processing. It is an algorithm-based cognitive model that aims to explain human cognition within the predictive framework. It has been applied to build computational models that explain different perceptual and neurobiological phenomena of the visual cortex [11]. Specifically, it describes a simple hierarchical computational framework: neurons at higher levels propagate predictions down, while neurons at lower levels propagate prediction errors up (as shown in Figure 2). The entire model is updated through a combination of bottom-up and top-down information flows, so it does not rely solely on external information. Besides, the propagation of prediction errors constitutes effective feedback, allowing the model to perform self-supervised learning. These characteristics make the predictive coding framework applicable and valuable in the field of signal processing. For example, Whittington et al. proposed that a network developed in the predictive coding framework can efficiently perform supervised learning with simple local Hebbian plasticity. The activity of the prediction error node is similar to the error term in the backpropagation algorithm, so the weight change required by the backpropagation algorithm can be approximated by a simple Hebbian plasticity of connections in the predictive coding network [12].
Figure 2. A general framework of predictive coding. The visual cortex receives sensory inputs from outside world or signal errors from lower level to produce a local representation, which is then compared with the prediction made by predictive model. (b): Hierarchical network model for predictive coding proposed by Rao and Ballard [13]
In the field of visual-frame prediction, there is also a lot of work based on predictive coding. One of the most successful applications is the PredNet model proposed by Lotter et al. [14]. It is a ConvLSTM-based model that stacks several ConvLSTMs vertically to generate a top-down propagation of predictions, while the bottom-up propagation delivers the error values. This model achieved state-of-the-art performance in several tasks such as video-frame prediction. Elsayed et al. [15] implemented a novel ConvLSTM-based network called Reduced-Gate ConvLSTM, which gives better performance. However, although these works strictly follow the predictive coding style, the details are not well taken into account. The predictive coding computational framework only roughly explains how the brain works, and some details, such as transmission delay, are ignored. Transmission delay has been discussed in detail by Hogendoorn et al. [16]. They pointed out that the predictive coding model can be regarded as a temporal prediction model only when the concept of transmission delay is added. In addition, other neuroscientific phenomena, such as the different oscillation frequencies at different levels of the cortex, are equally important. Therefore, we designed a video prediction method that takes the different pieces of biological evidence mentioned above into comprehensive consideration.
Apart from the above methods, more predictive models have been proposed building on the recent success of deep learning. The early state-of-the-art machine learning techniques are usually based on encoder-decoder training. Trained end to end, they use consecutive frames as inputs and outputs to learn visual offsets or their semantic coherence. On the basis of the encoder-decoder network and LSTM, Villegas et al. proposed a novel method (MCNet) that decomposes motion and content [17], encoding the local dynamics and the spatial layout separately so as to simplify the prediction task. However, the motion referred to is obtained simply by subtracting x_{t−1} from x_t, so it describes changes at the pixel level only. Jin et al. [18] also explored inter-frame variations in a way similar to MCNet. Their innovation is the use of GDL (Gradient Difference Loss) regularization in the loss function to sharpen predictions. In addition, Shi et al. implemented a CNN-LSTM based model for precipitation nowcasting [19]. Unlike the previous two works, they embed convolutions directly into the LSTM, which captures spatial-temporal correlations better and is also adopted in our network architecture.
Besides, training in an adversarial fashion is another popular approach, since the GAN (Generative Adversarial Network) shows excellent performance in image generation for prediction. For example, Aigner et al. [20] proposed FutureGAN in 2018, based on the concept of PGGAN (Progressive Growing of GANs). They extended this concept to visual-frame prediction using a 3D convolutional encoder-decoder model to capture the spatial-temporal information. However, 3D convolution undoubtedly consumes more computation than other methods. Before PredNet, Lotter et al. also proposed a GAN-based model named Predictive Generative Network (PGN) [21], which is trained with a weighted MSE and adversarial loss for visual-frame prediction.
In summary, two main problems remain: 1) There is still room for improvement in network structure and training strategy. For instance, the Encoder-RNN-Decoder network performs prediction only in the high-level semantic space, so most of the low-level details are ignored. 2) The computational cost is too large and consumes a lot of resources (especially during training). How to reduce the computational overhead through reasonable pruning is also important. We have introduced above the characteristics of predictive coding and the related theories, which provide an efficient and reliable theoretical computing framework. Therefore, in order to reduce resource consumption and achieve sustainable artificial intelligence, we suggest combining this efficient cognitive framework with advanced data-driven machine learning methods to design an efficient predictive network model that not only improves predictive accuracy but also reduces computational cost. Next, we introduce our model in detail.

Network Model and Methods
In this section, we introduce the cognition-inspired model, which is specialised for visual-frame prediction. As its name (PPNet) suggests, its pyramid-like architecture is beneficial for predicting visual frames: the neurons at the lower levels encode and predict the actual frames, while the neurons at the top encode the scenarios, which usually change only once every few visual frames (Figure 3). We explain this idea in the next subsection. The detailed architecture and the algorithm are introduced in the subsections that follow.

Efficiency in Pyramid Architecture
In this work, we mainly refer to the design concept of PredNet [14] when building the network structure. As early as 2016, Lotter et al. proposed this typical predictive coding model, which strictly follows the dual-way flow at every time step and achieves outstanding performance. Nevertheless, its processing of information can be improved in at least two aspects.
First, according to the predictive processing framework, at least two kinds of neurons are required: an internal representation neuron for generating predictions and an error calculation neuron for computing prediction errors. In the PredNet model, the bottom-up inputs at each level serve only as targets of the error calculation neurons, to be compared with the top-down predictions to generate prediction errors, and the only information carried in the upward propagation is the prediction error itself. However, we argue that it is necessary to use the past and present sensory information (represented here as video frames) as inputs of the representation neuron to generate predictions with higher accuracy. The memory formed in this way can be formulated in a Bayesian framework and is necessary for generating predictions. By learning such a Bayesian model, we can maximize the marginal likelihood or the entropy [22]. Second, as this is a cognition-inspired model, we suggest that prediction and sensory input can be implemented as at least two separate information streams in a hierarchical manner. This is not only inspired by our nervous system, but is also a way to integrate inputs from different network layers to obtain more spatiotemporal information, which has been widely used in deep learning architectures such as ResNet and DenseNet.
Based on the above assumptions, we propose and design a predictive model in which the update rate of neurons can differ between levels. Alternatively, this can be interpreted as a delay in information transmission. In general, it takes time for information to be transmitted from a lower level to a higher level, so there is a transmission delay between different layers. However, neurons at the bottom layer do not passively wait for information transmitted from the top layer before making a prediction. The changes in biological synapses are determined only by the activity of presynaptic and postsynaptic neurons [12]. Therefore, in PPNet, once the prediction unit (ConvLSTM) receives a sensory input (green), it immediately combines it with the prediction from the higher level (if any) to make a prediction. As mentioned in Section 2, the delay of information transmission has been discussed in detail by Hogendoorn et al. [16]. They argue that traditional predictive coding models, such as the one first proposed by Rao and Ballard [13], do not predict the future but hierarchically predict what is happening. When the concept of transmission delay is added, the predictive coding model changes from hierarchical prediction to temporal prediction.
As a result, PPNet can be regarded as an equivalent of the large-scale brain network (LSBN), in which the higher cognitive functions are carried out at the higher levels of the deep learning network. According to neuroscientific evidence, such cognitive functions, which are processed in the PFC (prefrontal cortex), can also be used to predict the situated scenarios in our visual-frame prediction application for an agent. Therefore, our model is built to balance biological evidence against computational efficiency.

Network Architecture
In this section, we introduce our network model in detail. The architecture of our model is shown in Figure 3. For ease of reading, we first state the meaning of the symbols in the figure before making a detailed comparison and analysis:
• A^l_t: the green one, which represents the sensory input at level l and time step t.
• P^l_t: the orange one, which represents the prediction at level l and time step t. Its prediction target is the sensory input at level l and time step t + 1 (A^l_{t+1}).
• E^l_t: the red one, which represents the prediction error at level l and time step t. It is calculated from the previous prediction P^l_{t−1} and the current sensory input A^l_t.
Inspired by PredNet, PPNet also uses ConvLSTM as its basic component, as it provides prediction flows with long-term dependency. Similarly, each layer of the network can be roughly divided into three parts:
• a predictive unit, which is made up of a recurrent convolutional network (ConvLSTM). It receives a sensory input A^l_t and a prediction P^{l+1}_t from the higher level (if any) to generate a local prediction P^l_t of the next time step.
• a generative unit, which consists of a convolutional layer and a pooling layer. This unit is responsible for turning the local input A^l_t and the prediction error E^l_{t+1} into the input A^{l+1}_{t+1} of the next level.
• an error representation layer, which is split into separate rectified positive (A^l_t − P^l_t) and negative (P^l_t − A^l_t) error populations.
In order to process the prediction only when it is necessary, we show that the dual-way propagation can be done in a more efficient way. For better understanding and comparison, Figure 4 shows how information propagates in our model and in the PredNet model.
First, the computation of our model begins at the lowest layer after receiving the first sensory input. This is consistent with the design concept described in Section 3.1 and differs from PredNet, which starts at the top level by generating a prediction without any prior information. Second, in our model, the bottom-up input of a higher-level unit comes from the combination of information from lower-level units at two time steps. Specifically, the current input A^l_t is fed into the internal representation neuron (ConvLSTM) to generate the local prediction P^l_t at time step t, which is then compared with the next input A^l_{t+1} to generate the prediction error E^l_{t+1}. In other words, A^l_{t+1} is not only the bottom-up sensory input of the internal representation neuron at time step t + 1 but also the target of the previous step t, which differs from PredNet (where A^l_{t+1} serves only as a target at time step t + 1). Note that PPNet can generate a prediction error for upward propagation only when both the prediction (P^l_t) and the target (A^l_{t+1}) are available. That is, at least two consecutive sensory inputs A^l_t and A^l_{t+1} are required to generate a prediction error for upward propagation, of which the former serves as the input that produces the prediction and the latter serves as the target. As a result, neurons at different levels are not updated synchronously, and the update frequency of neurons decreases as the network level increases, which is consistent with the biological evidence that deep neurons oscillate at a lower frequency [23]. For this reason, the bottom-up input of the top level contains information from multiple time steps at the bottom level, which gives PPNet a stronger temporal correlation in structure, rather than relying solely on the temporal correlation of the LSTM. In addition, it allows PPNet to reduce the computational load, since higher-level neurons need not be updated at every time step.
Figure 5. The relationship between hyper-parameter p and the loss, where p = 0 indicates that no weight is added and the original error is directly used as the loss value. The red circle marks the threshold between p = 0 and p = 2, indicating that when p = 2, more attention is paid to errors larger than the current threshold while less attention is paid to errors smaller than the threshold.
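The timing described above can be illustrated with a minimal NumPy sketch for a single level. It is not the authors' implementation: `split_error`, `level_errors` and the copy-last-frame predictor are hypothetical names introduced here, and a trivial predictor stands in for the ConvLSTM. The sketch only shows that each input A^l_{t+1} acts both as the target of the prediction P^l_t and as the next input, so no error exists before two consecutive inputs have arrived.

```python
import numpy as np

def split_error(prediction, target):
    """Rectified positive (A - P) and negative (P - A) error populations."""
    positive = np.maximum(target - prediction, 0.0)
    negative = np.maximum(prediction - target, 0.0)
    # Stack along the channel axis: (c, h, w) -> (2c, h, w).
    return np.concatenate([positive, negative], axis=0)

def level_errors(inputs, predict):
    """Map t+1 -> E_{t+1}, computed from P_t = predict(A_t) and the target A_{t+1}."""
    errors = {}
    for t in range(len(inputs) - 1):
        prediction = predict(inputs[t])  # P_t, made from the input A_t
        errors[t + 1] = split_error(prediction, inputs[t + 1])
    return errors

# Toy sequence of three 1x2x2 "frames" with constant intensity 0, 1 and 3.
frames = [np.full((1, 2, 2), value) for value in (0.0, 1.0, 3.0)]
# Hypothetical stand-in predictor: assume the next frame equals the current one.
errors = level_errors(frames, predict=lambda frame: frame)
```

With three inputs, errors exist only for time steps 1 and 2, and a single input produces no error at all, which is why the level above cannot be updated at the first time step.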

Training Loss and Adaptive Weight
The training loss in our model is defined as the concatenation of positive and negative errors (Eq. 1):

E_t = concat[ReLU(Ŷ − Y), ReLU(Y − Ŷ)]    (1)

where Ŷ denotes the prediction and Y the target. ReLU denotes the rectified linear activation function, which is defined in Eq. 2:

ReLU(x) = max(0, x)    (2)

and concat means concatenating two multidimensional matrices together (for example, concatenating two matrices of dimension (b, c, h, w) into a matrix of dimension (b, 2c, h, w)). Eq. 1 indicates that the error population in the neurons incorporates both positive and negative errors [13]. Furthermore, to sharpen the predictions, we introduce an adaptive weight into the loss function, inspired by the attention mechanism.
At the beginning of a visual sequence, the error is usually quite large, since it drives the top-down prediction to minimize the error. That is, the greater the prediction error, the stronger the brain response. We argue that the brain's response can be seen as a weighting of the prediction error. Based on this idea, we propose to add more weight to increase the contribution of prediction errors with higher values (for example, at the beginning of sequences), while the contribution of errors with lower values is reduced. A set of experiments conducted by Kutas & Hillyard [24] has shown that when a prediction is seriously inconsistent with the environment, the brain reacts more strongly. Higher accuracy means less uncertainty, which is reflected in a higher gain on the relevant error units to perform the update. In other words, the error units become more adaptive in driving learning and plasticity if they are given an increasing weight. Therefore, we introduce adaptive weights into our model, where a higher prediction error results in a higher weight.
The adaptive weight for every time step is calculated by directly multiplying the error itself by a coefficient (shown in Eq. 3), where E_t denotes the prediction error at time step t and p is an adjustable hyper-parameter. The training loss is then defined in Eq. 4, where T denotes the length of the input sequences and λ_t denotes the weighting factor for time step t. Note, however, that an error with a value less than 1/p becomes smaller after being weighted. Figure 5 shows the relationship between p and the loss. When the error is greater than the threshold (e.g., the intersection marked by the red circle), it is enlarged; when it is less than the threshold, it is reduced. From an attention mechanism perspective, we pay more attention to errors larger than the threshold and less attention to errors smaller than the threshold, so the choice of threshold is extremely important. We further explore the influence of the hyper-parameter p in the following experiments.
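As one concrete reading of the description above, the following NumPy sketch assumes that the adaptive weight is W_t = p · E_t and that the loss is the time-weighted sum of the mean of W_t ⊙ E_t, i.e. L = Σ_t λ_t · mean(p · E_t²). This form is inferred from the text (it reproduces the 1/p threshold) rather than taken from the authors' code, and the function names `split_error` and `adaptive_loss` are hypothetical.

```python
import numpy as np

def split_error(prediction, target):
    """Eq. 1: concat[ReLU(pred - target), ReLU(target - pred)] on the channel axis."""
    positive = np.maximum(prediction - target, 0.0)
    negative = np.maximum(target - prediction, 0.0)
    return np.concatenate([positive, negative], axis=1)  # (b, c, h, w) -> (b, 2c, h, w)

def adaptive_loss(predictions, targets, p, time_weights):
    """Assumed loss form: sum_t lambda_t * mean(W_t * E_t), with W_t = p * E_t."""
    total = 0.0
    for prediction, target, lam in zip(predictions, targets, time_weights):
        error = split_error(prediction, target)
        weight = p * error  # assumed adaptive weight, grows with the error
        total += lam * float(np.mean(weight * error))
    return total

rng = np.random.default_rng(0)
preds = [rng.random((2, 3, 4, 4)) for _ in range(3)]
targs = [rng.random((2, 3, 4, 4)) for _ in range(3)]
lambdas = [0.5, 1.0, 1.0]  # first time step weighted 0.5, the rest 1
loss = adaptive_loss(preds, targs, p=1.0, time_weights=lambdas)
```

Under this assumed form, a single error value e contributes p · e², which is smaller than e exactly when e < 1/p. With p = 1 the loss reduces to a squared error; because the concatenation doubles the number of elements, the mean is half of the usual per-pixel MSE.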

Algorithm
In this section, we introduce the algorithm that implements the above model, based on the architecture and computation process described in Section 3.2. To better serve the following description, we reiterate the definition of each parameter as follows: The complete algorithms are listed in Eqs. 5 to 9. The model is trained to minimize the training loss defined by Eq. 4, and our implementation is described in Algorithm 1. The information flows pass through two streams: 1) a top-down stream, in which the hidden state H^l_t of the ConvLSTM is updated and the local prediction P^l_t is generated; 2) a bottom-up stream, in which the prediction error E^l_{t+1} is calculated and propagated up to the higher level along with the local input A^l_t. Due to the pyramid design, the computation of our network starts at the lowest layer (i.e. layer 0) at the first time step. However, for convenience of programming, we follow the programming method of PredNet, in which we first perform the calculation of the top-down information flow (lines 2-11 in Algorithm 1) and then calculate the prediction error and update the sensory input of the higher level (lines 12-19 in Algorithm 1). Unlike PredNet, if there is no sensory input A^l_t at time step t and level l, the calculation of this predictive unit is skipped without generating any prediction, and the hidden state H^l_t of the ConvLSTM stays the same.
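The skip rule at the end of the paragraph above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Algorithm 1 itself: scalars stand in for feature maps, `toy_cell` is a hypothetical stand-in for the ConvLSTM, and treating a missing top-down prediction as zero is a choice made here for simplicity.

```python
from typing import Callable, Optional, Tuple

def predictive_unit_step(
    hidden: float,
    sensory_input: Optional[float],
    top_down: Optional[float],
    cell: Callable[[float, float, float], float],
) -> Tuple[float, Optional[float]]:
    """Update one predictive unit, or skip it when there is no sensory input.

    Returns the (possibly unchanged) hidden state and the local prediction.
    """
    if sensory_input is None:
        # No A^l_t at this level and time step: keep H^l_t, emit no prediction.
        return hidden, None
    # Assumption: a missing top-down prediction (e.g. at the top level) counts as zero.
    prediction_from_above = 0.0 if top_down is None else top_down
    new_hidden = cell(hidden, sensory_input, prediction_from_above)
    # In this toy version the prediction is read directly from the hidden state.
    return new_hidden, new_hidden

def toy_cell(hidden: float, sensory_input: float, top_down: float) -> float:
    """Hypothetical stand-in for a ConvLSTM update (a fixed linear mix)."""
    return 0.5 * hidden + 0.25 * sensory_input + 0.25 * top_down

skipped = predictive_unit_step(1.0, None, 2.0, toy_cell)
updated = predictive_unit_step(0.0, 2.0, 4.0, toy_cell)
```

Here `skipped` leaves the hidden state untouched and returns no prediction, while `updated` runs the cell once. Skipping higher-level units in this way is where the reduction in computation comes from.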

Experiments
In this section, several experiments are presented to show the performance of PPNet on human-action and driving datasets. We first introduce the features and pre-processing methods of three datasets that are commonly used in visual-frame prediction: KTH, Caltech Pedestrian and KITTI. The training details and the evaluations comparing PPNet with other state-of-the-art models are presented in the following subsections.

Dataset and pre-processing
All the aforementioned datasets have to be processed into sequences before they can be used for training. In this part, we introduce the features of these datasets as well as the pre-processing methods.
• KTH: The KTH dataset is an older dataset created in 2004 for human action recognition. However, it is still very popular in visual-frame prediction research because of its simple scenarios and events.
• KITTI: The KITTI dataset is one of the most widely used datasets for autonomous driving. Various processed versions are available, but we directly download its raw images for training. Approximately 35K frames are used for training and 4.5K for testing. The frames are center-cropped and resized to 128 × 160 pixels in the same way as in PredNet. Compared to the other two datasets, its inter-frame variations are greater.
• Caltech Pedestrian: This dataset was originally designed for pedestrian detection and is also suitable for visual-frame prediction. The frames are directly resized to 128 × 160 pixels, the same as for KITTI. Its inter-frame variations are much smaller than those of KITTI, which might lead the model to learn to repeat frames instead of predicting them.

Training Setting
We implemented PPNet in PyTorch and trained it on a GeForce RTX 3070 GPU. The length of the input sequence is set to 10 and the number of layers in the network is set to 6. Other hyper-parameters are shown in Table 1. To account for the influence of initialization, the time weight λ_t of the prediction error generated at the first time step is set to 0.5, while the rest are set to 1.
In order to pick a suitable value for the hyper-parameter p proposed previously, we performed two sets of experiments using part of the KITTI dataset and part of the Caltech Pedestrian dataset to explore its influence. The results are shown in Figure 6. The horizontal lines indicate the results without any weight. According to Eq. 3 and Eq. 4, when the value of p is set to 1, the loss function is equivalent to the mean square error loss. Evidently, dividing the error into positive and negative errors is indeed beneficial. Better results than those without any weight can be observed when the value of p is greater than 5 (or 6). The training loss (mean error) decreases as p increases, and we obtained almost the best result when p is close to 10^3. However, increasing p further might degrade performance. We therefore chose a value around 10^3 in the subsequent experiments.

Evaluation Results
In this section, we use SSIM [25], PSNR [26] and LPIPS [27] for quantitative evaluation. SSIM is an early measure of image similarity that compares two images in terms of brightness, contrast, and structure. PSNR is also an image-quality metric; it measures the degree of image distortion as the ratio of the maximum signal to the background noise. However, both metrics share the same problem: their results may not match the judgment of the human visual system [28]. To address this, Zhang et al. proposed the LPIPS metric, which attempts to simulate the evaluation of the human visual system. Higher values indicate better results for SSIM and PSNR, while lower values indicate better results for LPIPS.

Figure 7. The visual presentation of predicted frames on the KTH dataset. We take 10 frames as input and predict the next 30 frames.

Table 2 shows the quantitative comparison with state-of-the-art methods on the KTH dataset. Similar to previous work, we report the average results over the future 10 frames (10 → 20) and 30 frames (10 → 30) respectively, with 10 input frames. Our method achieves better or comparable accuracy relative to the state-of-the-art works. However, in video prediction, purely quantitative evaluation is sometimes weak, so we also visualized the predicted results. Figure 7 shows examples predicted by our method and by other methods. Our method also achieves good results from the perspective of human visual evaluation, whereas Conv-TT-LSTM [33], which attains outstanding quantitative scores, performs poorly in visual presentation (it also performed poorly in [34]). This is a common problem in video prediction tasks. Unlike image classification or semantic segmentation, video prediction has no accurate and uncontroversial evaluation metric. As a result, quantitative and qualitative evaluation need to be combined to make a better comparison.
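As a concrete illustration of the PSNR description above, the following minimal sketch computes PSNR from the mean squared error and averages it over a predicted sequence. It is not taken from the PPNet code: the function names, the flat-list frame representation, the pixel range `max_value=1.0`, and the choice to average per-frame scores are all assumptions made for illustration.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_value=1.0):
    """PSNR in dB: 10 * log10(max_value^2 / MSE). Higher is better."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical frames: no distortion
    return 10.0 * math.log10(max_value ** 2 / err)

def mean_psnr(pred_frames, target_frames, max_value=1.0):
    """Average the per-frame PSNR over a predicted sequence (assumed protocol)."""
    scores = [psnr(p, t, max_value) for p, t in zip(pred_frames, target_frames)]
    return sum(scores) / len(scores)
```

SSIM and LPIPS are considerably more involved (LPIPS requires a pretrained network), so in practice they would be taken from an existing library rather than reimplemented.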

Results on Caltech and KITTI dataset
We also validate our method on the Caltech and KITTI datasets, which contain more complex scenes and events. Table 3 shows the quantitative evaluation results. Even though we only evaluate the predicted frames of the next 5 time steps, the results are still much worse than those on KTH. This is related to how complex and varied the scenes are: the more complex the scene and the greater the variation, the more difficult it is to predict. Figure 8 visualizes the inter-frame variation of the three datasets. Caltech has a similar level of scene complexity to KITTI, but KITTI is more variable than Caltech, and therefore the methods perform worse on KITTI. Prediction in complex scenes remains an urgent problem in current video prediction tasks.
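The notion of inter-frame variation can be made concrete with a simple measure. The sketch below uses the mean absolute pixel difference between consecutive frames; this is an assumed measure chosen for illustration, since the exact quantity plotted in Figure 8 is not specified here, and the function name and flat-list frame representation are hypothetical.

```python
def inter_frame_variation(frames):
    """Mean absolute pixel difference between each pair of consecutive frames.

    `frames` is a list of equally sized flat pixel sequences.
    Returns one value per transition, i.e. len(frames) - 1 values.
    """
    variations = []
    for prev, curr in zip(frames, frames[1:]):
        diff = sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr)
        variations.append(diff)
    return variations

# Toy sequence of three 2-pixel frames.
toy_frames = [[0.0, 0.0], [0.5, 0.5], [0.5, 1.0]]
toy_variation = inter_frame_variation(toy_frames)
```

Under this measure, a dataset whose frames change quickly (as described for KITTI) would yield larger values than one with slow changes (as described for KTH).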

Comparison with PredNet
As mentioned above, PredNet strictly follows the computational style of the traditional predictive coding framework, and the network structures of PPNet and PredNet are similar (for example, both use ConvLSTM as the backbone; the PredNet model is redrawn in the same way as our model in Appendix A). It is therefore straightforward to retrain PredNet with the same settings, such as network depth and width, to make a fair and clear comparison, which can be regarded as an ablation study highlighting the rationality and superiority of our model. Table 4 shows the next-frame prediction performance on KTH, Caltech and KITTI respectively. Our method is superior to PredNet in both prediction accuracy and computational overhead, which shows that the pyramid style is effective: by reducing the oscillation frequency, higher-level neurons not only obtain longer-term information but also reduce the computational cost. Figure 9 visualizes the long-term predictions on each dataset at different predicted time steps. In general, our results are better than those of PredNet. First, the figure shows that the inter-frame variation of the KITTI dataset is much larger than that of the other two, and both PPNet and PredNet produce blurry predictions on it. However, PPNet still makes better predictions in the first few steps, while PredNet makes blurry predictions and then merely reproduces them. This kind of replication is more obvious on the Caltech dataset: although it generates clearer frames than our method, PredNet is just reproducing previous frames instead of making predictions. In contrast, PPNet is still able to capture the motion information in the input sequences and make authentic predictions. On the KTH dataset, PredNet does eventually capture the motion information, but it learns only the person's direction and approximate speed, while subtler movements, such as the actions of the person's arms and legs, are lost.

In summary, we have presented several experimental results showing the remarkable performance of our method, which is superior to PredNet in terms of prediction accuracy and computational overhead. In addition, our results are equal to or better than those of the other state-of-the-art methods, indicating the superiority of our method.
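To illustrate why reducing the update frequency at higher levels lowers the cost, the following sketch counts how many times each level is updated over a sequence. The schedule is an assumption made purely for illustration (level l updates once every `period_base**l` time steps); the actual schedule used by PPNet may differ, and the count ignores the fact that levels differ in resolution and channel width.

```python
def update_counts(num_levels, num_steps, period_base=2):
    """Count how often each level updates over `num_steps` time steps.

    Assumed (hypothetical) schedule: level l updates at every step t
    with t % period_base**l == 0, so level 0 updates at every step.
    """
    counts = []
    for level in range(num_levels):
        period = period_base ** level
        counts.append(sum(1 for t in range(num_steps) if t % period == 0))
    return counts

# Pyramid-like schedule versus a schedule where every level updates at every step.
pyramidal = update_counts(num_levels=4, num_steps=16)                 # [16, 8, 4, 2]
uniform = update_counts(num_levels=4, num_steps=16, period_base=1)    # [16, 16, 16, 16]
```

With four levels and 16 steps, the assumed pyramidal schedule performs 30 unit updates in total against 64 for the uniform one, which conveys the direction of the saving but not the measured figures in Table 4.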

Propagation of Weighted Error
Additional experiments were performed to explore the influence of how prediction errors are propagated. As mentioned above, the prediction errors are propagated upward to the higher levels. This raises a question: which errors should be passed up, the original errors or the weighted errors? Note that the results in Figure 6 were obtained with the original errors transmitted upward and the weighted errors only propagated backward. When we propagated the weighted errors, after normalization, both upward and backward, the results were worse (Table 5). As Corlett [38] and Fletcher et al. [39] have speculated, errors might become "false" after being weighted, and waves of persistent, highly weighted "false errors" propagated upward would make profound corrections to our model of the world. Using the adaptive weights proposed in Section 3.3, we provide possible evidence for this assumption from the perspective of artificial neural networks.

The Efficiency of Pyramid-like Architecture
A set of priors is often already active at a higher level of the cognitive hierarchy, poised to influence the processing of new sensory inputs without further delay once the context information is in place. Similarly, in our model there is a delay in the upward flow of information at the beginning, but it disappears once the information reaches the highest level, which might result in only a trivial reduction of computational cost when the input sequence is long enough. However, longer sequences are not required. LSTM networks may capture spurious long-term dependencies present in the training data and hence learn inadequate causal models [40]. To support this point, we performed a set of experiments on both PPNet and PredNet by processing the same data into sequences of different lengths (note that the total number of video frames is constant). As shown in Figure 10, the length of the input sequence has little effect on prediction accuracy, but less time is required when using shorter sequences in our proposed PPNet. Therefore, we can process the data into shorter sequences during training to reduce resource consumption and achieve sustainable artificial intelligence.
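The preprocessing used in this comparison, cutting the same frames into sequences of different lengths, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name is hypothetical, and the use of non-overlapping windows with incomplete trailing windows dropped is an assumption.

```python
def split_into_sequences(frames, seq_len):
    """Split a list of frames into consecutive, non-overlapping sequences.

    Trailing frames that do not fill a whole sequence are dropped.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    usable = len(frames) - len(frames) % seq_len
    return [frames[i:i + seq_len] for i in range(0, usable, seq_len)]

# The same 20 frames arranged as 2 sequences of 10 or 4 sequences of 5.
frames = list(range(20))  # integers stand in for frames
long_sequences = split_into_sequences(frames, 10)
short_sequences = split_into_sequences(frames, 5)
```

When the sequence length divides the number of frames, both arrangements contain exactly the same frames, so any difference in training time comes from the sequence length rather than from the amount of data.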

Conclusions
In this paper, we have presented a pyramidal predictive network for visual-frame prediction based on the predictive coding concept, with a much more efficient computational scheme. The model encodes information at various temporal and spatial scales, with a top-down propagation of predictions and a bottom-up propagation of the combination of sensory input and prediction error. It has a stronger temporal correlation in its structure and requires less computation. We analyzed the rationality of the model in detail from the perspectives of predictive processing and machine learning. Importantly, according to the experimental results, the proposed model achieves remarkable performance compared to state-of-the-art models.
Nevertheless, there is still room for improvement in the proposed model. In long-term forecasting, false "prediction errors" may cause the model to average the possible futures into a single, blurry forecast, which is an urgent problem in most predictive models. In addition, directly predicting natural visual frames is still a challenging task due to the curse of dimensionality. Therefore, in the future, we are going to reduce the prediction space to high-level representations, such as semantic and instance segmentation and depth space, to simplify the prediction task, which will make it easier for intelligent robots to predict and perform advanced actions.

Figure A1. The network structures of PPNet and PredNet, where PredNet is redrawn in the same way as PPNet for a better comparison.

[...] Section 3.1: Efficiency in Pyramid Architecture), so in our model, the lowest-level neurons first receive sensory input and make predictions, and the information is passed up only after the prediction error is obtained by comparing the current prediction with the sensory input of the next time step. Besides, the information we transmit upward includes not only the prediction error but also the sensory input; the reasons have also been explained in Section 3. The above is the main difference between our model and PredNet in terms of network structure.