1 Introduction

The recent exponential growth in data collection capabilities and the use of supervised deep learning approaches have helped to make tremendous progress in computer vision. However, learning good representations for the analysis and understanding of dynamic scenes, with limited or no supervision, remains a challenging task. This is in no small part due to the complexity of the changes in appearance and of the motions that are observed in video sequences of natural scenes. Yet, these changes and motions provide powerful cues to understand dynamic scenes such as the one shown in Fig. 1(a), and they can be used to predict what is going to happen next. Furthermore, the ability to anticipate the future is essential for making decisions and taking action in critical real-time systems such as autonomous driving. Indeed, recent approaches to video understanding [17, 22, 31] suggest that being able to accurately generate/predict future frames in video sequences can help to learn useful features with limited or no supervision.

Predicting future frames to anticipate what is going to happen next requires good generative models that can make forecasts based on the available past data. Recurrent Neural Networks (RNN) and in particular Long Short-Term Memory (LSTM) networks have been widely used to process sequential data and make such predictions. Unfortunately, RNNs are hard to train due to the exploding and vanishing gradient problems. As a result, they can easily learn short-term but not long-term dependencies. On the other hand, LSTMs and the related Gated Recurrent Units (GRU) address the vanishing gradient problem and are easier to use. However, their design is ad hoc, with many components whose purpose is not easy to interpret [13].

Fig. 1.

(a) Dynamics and motion provide powerful cues to understand scenes and predict the future. (b) DYAN’s architecture: Given T consecutive \(H \times W\) frames, the network uses a dynamical atoms-based encoder to generate a set of sparse \(N \times HW\) features that capture the dynamics of each pixel, with \(N \gg T\). These features can be passed to its dynamical atoms-based decoder to reconstruct the given frames and predict the next one, or they can be used for other tasks such as action classification.

More recent approaches [20, 22, 35, 37] advocate using generative adversarial network (GAN) learning [7]. Intuitively, this is motivated by reasoning that the better the generative model, the better the prediction will be, and vice versa: by learning how to distinguish predictions from real data, the network will learn better models. However, GANs are also reportedly hard to train, since training requires finding a Nash equilibrium of a game, which might be hard to reach using gradient descent techniques.

In this paper, we present a novel DYnamical Atoms-based Network, DYAN, shown in Fig. 1(b). DYAN is similar in spirit to LSTMs, in the sense that it also captures short- and long-term dependencies. However, DYAN is designed using concepts from dynamic systems identification theory, which help to drastically reduce its size and provide easy interpretation of its parameters. By adopting ideas from atom-based system identification, DYAN learns a structured dictionary of atoms to exploit dynamics-based affine invariants in video data sequences. Using this dictionary, the network is able to capture actionable information from the dynamics of the data and map it into a set of very sparse features, which can then be used in video processing tasks, such as frame prediction, activity recognition, semantic segmentation, etc. We demonstrate the power of DYAN's autoencoding by using it to generate future frames in video sequences. Our extensive experiments using several standard video datasets show that DYAN can predict future frames more accurately and efficiently than current state-of-the-art approaches.

In summary, the main contributions of this paper are:

  • A novel auto-encoder network that captures long and short term temporal information and explicitly incorporates dynamics-based affine invariants;

  • The proposed network is shallow, with very few parameters. It is easy to train and the learned model requires very little disk space to store.

  • The proposed network is easy to interpret and it is easy to visualize what it learns, since the parameters of the network have a clear physical meaning.

  • The proposed network can predict future frames accurately and efficiently without introducing blurriness.

  • The model is differentiable, so it can be fine-tuned for another task if necessary. For example, the front end (encoder) of the proposed network can be easily incorporated at the front of other networks designed for video tasks such as activity recognition, semantic video segmentation, etc.

The rest of the paper is organized as follows. Section 2 discusses related previous work. Section 3 gives a brief summary of the concepts and procedures from dynamic systems theory, which are used in the design of DYAN. Section 4 describes the design of DYAN, its components and how it is trained. Section 5 gives more details of the actual implementation of DYAN, followed by Sect. 6 where we report experiments comparing its performance in frame prediction against state-of-the-art approaches. Finally, Sect. 7 provides concluding remarks and directions for future applications of DYAN.

2 Related Work

There exists an extensive literature devoted to the problem of extracting optical flow from images [10], including recent deep learning approaches [5, 12]. Most of these methods focus on Lagrangian optical flow, where the flow field represents the displacement between corresponding pixels or features across frames. In contrast, DYAN can also work with Eulerian optical flow, where the motion is captured by the changes at individual pixels, without requiring finding correspondences or tracking features. Eulerian flow has been shown to be useful for tasks such as motion enhancement [33] and video frame interpolation [23].

State-of-the-art algorithms for action detection and recognition also exploit temporal information. Most deep learning approaches to action recognition use spatio-temporal data, starting with detections at the frame level [27, 29] and linking them across time by using very short-term temporal features such as optical flow. However, using such a short horizon misses the longer term dynamics of the action and can negatively impact performance. This issue is often addressed by following up with some costly hierarchical aggregation over time. More recently, some approaches detect tubelets [11, 15] starting with a longer temporal support than optical flow. However, they still rely on a relatively small number of frames, which is fixed a priori, regardless of the complexity of the action. Finally, most of these approaches do not provide explicit encoding and decoding of the involved dynamics, which, if available, could be useful for inference and generative problems.

In contrast to the large volume of literature on action recognition and motion detection, there are relatively few approaches to frame prediction. Recurrent Neural Networks (RNN) and in particular Long Short-Term Memory (LSTM) networks have been used to predict frames. Ranzato et al. [28] proposed an RNN to predict frames based on a discrete set of patch clusters, where an average of 64 overlapping tile predictions was used to avoid blockiness effects. In [31] Srivastava et al. used instead an LSTM architecture with an \({\ell }_2\) loss function. Both of these approaches produce blurry predictions due to the averaging involved. Other LSTM-based approaches include the work of Luo et al. [21] using an encoding/decoding architecture with optical flow and the work of Kalchbrenner et al. [14] that estimates the probability distribution of the pixels.

In [22], Mathieu et al. used generative adversarial network (GAN) [7] learning together with a multi-scale approach and a new loss based on image gradients to improve image sharpness in the predictions. Zhou and Berg [37] used a similar approach to predict future state of objects and Xue et al. [35] used a variational autoencoder to predict future frames from a single frame. More recently, Luc et al. [20] proposed an autoregressive convolutional network to predict semantic segmentations in future frames bypassing pixel prediction. Liu et al. [18] introduced a network that synthesizes frames by estimating voxel flow. However, it assumes that the optical flow is constant across multiple frames. Finally, Liang et al. [17] proposed a dual motion GAN architecture that combines frame and flow predictions to generate future frames. All of these approaches involve large networks, potentially hard to train.

Lastly, DYAN’s encoder was inspired by the sparsification layers introduced by Sun et al. in [32] to perform image classification. However, DYAN’s encoder is fundamentally different since it must use a structured dictionary (see (6)) in order to model dynamic data, while the sparsification layers in [32] do not.

3 Background

3.1 Dynamics-Based Invariants

The power of geometric invariants in computer vision has been recognized for a long time [25]. On the other hand, dynamics-based affine invariants have been used far less. These dynamics-based invariants, which were originally proposed for tracking [1], activity recognition [16], and chronological sorting of images [3], tap into the properties of linear time invariant (LTI) dynamical systems. As briefly summarized below, the main idea behind these invariants is that if the available sequential data (i.e. the trajectory of a target being tracked or the values of a pixel as a function of time) can be modeled as the output of some unknown LTI system, then this underlying system has several attributes/properties that are invariant to affine transformations (i.e. viewpoint or illumination changes). In this paper, as described in detail in Sect. 4, we propose to use this affine invariance property to reduce the number of parameters in the proposed network, by leveraging the fact that multiple observations of one motion, captured under different conditions, can be described using one single set of these invariants.

Let \(\mathcal S\) be a LTI system, described either by an autoregressive model or a state space model:

$$\begin{aligned} y_k = \sum _{i=1}^{n} a_i y_{k-i} \qquad \qquad \% \text { Autoregressive Representation} \end{aligned}$$
(1)
$$\begin{aligned} \mathbf {x}_{k+1} = \mathbf {A}\mathbf {x}_k; \quad y_k = \mathbf {C}\mathbf {x}_k \qquad \% \text { State Space Representation} \\ \text {with} \quad \mathbf {x}_k = \begin{bmatrix} y_{k-n} \\ \vdots \\ y_k \end{bmatrix}, \quad \mathbf {A}=\begin{bmatrix} 0 & 1 & \ldots & 0 \\ \vdots & \ddots & \ddots & 0 \\ 0 & 0 & \ldots & 1 \\ a_n & a_{n-1} & \ldots & a_1 \end{bmatrix}, \quad \mathbf {C}=\begin{bmatrix} 0 & \ldots & 0 & 1 \end{bmatrix} \nonumber \end{aligned}$$
(2)

where \({{y}}_k\)Footnote 1 is the observation at time k, and n is the (unknown a priori) order of the model (memory of the system). Consider now a given initial condition \(\mathbf {x}_o\) and its corresponding sequence \(\mathbf {x}\). The Z-transform of a sequence \(\mathbf {x}\) is defined as \(X(z) = \sum _{k=0}^\infty x_k z^{-k}\), where z is a complex variable \(z = r e^{j\phi }\). Taking Z transforms on both sides of (2) yields:

$$\begin{aligned} z (\mathbf {X}(z)-\mathbf {x}_o)=\mathbf {A}\mathbf {X}(z) \; \Rightarrow \; \mathbf {X}(z)=z(z\mathbf {I}-\mathbf {A})^{-1}\mathbf {x}_o, \; Y(z)=z\mathbf {C}(z\mathbf {I}-\mathbf {A})^{-1}\mathbf {x}_o \end{aligned}$$
(3)

where \(\mathcal{G}(z) \doteq z\mathbf {C}(z\mathbf {I}-\mathbf {A})^{-1}\) is the transfer function from initial conditions to outputs. Using the explicit expression for the matrix inversion and assuming non-repeated poles, leads to

$$\begin{aligned} Y(z)=\frac{z\mathbf {C}_{\text {adj}}(z\mathbf {I}-\mathbf {A})\mathbf {x}_o}{\text {det}(z\mathbf {I}-\mathbf {A})}\doteq \sum _{i=1}^n \frac{z c_i}{z-p_i} \; \iff y_k = \sum _{i=1}^n c_i p_i^{k}, \; k=0,1,\ldots \end{aligned}$$
(4)

where the roots of the denominator, \(p_i\), are the eigenvalues of \(\mathbf {A}\) (i.e. the poles of the system) and the coefficients \(c_i\) depend on the initial conditions. Consider now an affine transformation \(\varPi \). Then, substitutingFootnote 2 in (1) we have \( y'_k \doteq \varPi ( y_k) = \varPi (\sum _{i=1}^{n} a_i{y}_{k-i}) = \sum _{i=1}^{n} a_i \varPi (y_{k-i}) \). Hence, the order n and the model coefficients \(a_i\) (and hence the poles \(p_i\)) are affine invariant, since the sequence \(y'_k\) is explained by the same autoregressive model as the sequence \(y_k\).
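As a quick sanity check of (4), the following NumPy sketch (with illustrative coefficients and initial conditions) rolls out a second-order autoregressive model, computes the poles as the eigenvalues of the companion matrix \(\mathbf {A}\), fits the coefficients \(c_i\) from the initial conditions, and verifies that \(y_k=\sum _i c_i p_i^k\):

```python
import numpy as np

# Sketch verifying Eq. (4): the output of the model in (1)-(2) can be written as
# y_k = sum_i c_i * p_i^k, where the p_i are the eigenvalues of A. All numbers
# below (AR coefficients, initial conditions) are illustrative only.
a = np.array([1.5, -0.56])                      # a_1, a_2 (poles at 0.8 and 0.7)
n = len(a)
y = [0.3, 1.0]                                  # initial outputs y_0, y_1
for _ in range(18):                             # roll y_k = a_1 y_{k-1} + a_2 y_{k-2}
    y.append(a[0] * y[-1] + a[1] * y[-2])
y = np.array(y)

A = np.array([[0.0, 1.0], [a[1], a[0]]])        # companion matrix of Eq. (2)
p = np.linalg.eigvals(A)                        # poles p_i

V = np.vander(p, N=len(y), increasing=True).T   # column i holds [1, p_i, p_i^2, ...]
c = np.linalg.solve(V[:n], y[:n])               # c_i fixed by the n initial conditions
assert np.allclose(V @ c, y)                    # y_k == sum_i c_i p_i^k for every k
```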

3.2 LTI System Identification Using Atoms

Next, we briefly summarize an atoms-based algorithm [36] to identify an LTI system from a given output sequence.

First, consider a set with an infinite number of atoms, where each atom is the impulse response of a first order (or second order) LTI system with a single real pole p (or two complex conjugate poles, p and \(p^*\)). Their transfer functions can be written as:

$$ \mathcal{{G}}_p(z) = \frac{w z}{z - p} \ \text {and} \ \mathcal{{G}}_p(z) = \frac{w z}{z - p} + \frac{w^* z}{z - p^*} $$

where \(w \in \mathbb {C}\), and their impulse responses are given by \(\mathbf {g}_p = w[1, p, p^2, p^3, \dots ]'\) and \(\mathbf {g}_p = w[1, p, p^2, p^3, \dots ]' + w^* [1, p^*, {p^*}^2, {p^*}^3, \dots ]'\), for first and second order systems, respectively.

Next, from (3), every proper transfer function can be approximated to arbitrary precision as a linear combination of the above transfer functionsFootnote 3:

$$ \mathcal{{G}}(z) = \sum _{i} c_i \mathcal{{G}}_{p_i}(z) $$

Hence, low order dynamical models can be estimated from output data \(\mathbf {y} = [y_1, y_2,y_3,y_4, \dots ]'\) by solving the following sparsification problem:

$$ \min _{\mathbf {c} = \{c_i\}} \Vert \mathbf {c}\Vert _o\quad \text {subject to: } \Vert \mathbf {y} - \sum _i {c}_i\mathbf {g}_{p_i}\Vert _2^2 \le \eta ^2 $$

where \(\Vert .\Vert _o\) denotes cardinality and the constraint imposes fidelity to the data. Finally, note that solving the above optimization is not trivial since minimizing cardinality is an NP-hard problem and the number of poles to consider is infinite. The authors in [36] proposed to address these issues by (1) using the \({\ell }_1\) norm relaxation for cardinality, (2) using impulse responses of the atoms truncated to the length of the available data, and (3) using a finite set of atoms with uniformly sampled poles in the unit disk. Then, using these ideas one could solve instead:

$$\begin{aligned} \min _{\mathbf {c}} \frac{1}{2}\Vert \mathbf {y}_{1:T} - {D^{(T)} \mathbf {c}}\Vert _2^2 + \lambda \Vert \mathbf {c}\Vert _1 \end{aligned}$$
(5)

where \(\mathbf {y}_{1:T} = [y_1, y_2, \dots , y_T]'\), \({D^{(T)}}\) is a structured dictionary matrix with T rows and N columns:

$$\begin{aligned} {D}^{(T)} = \left[ \begin{array}{cccc} p_1^0 & p_2^0 & \dots & p_N^0 \\ p_1 & p_2 & \dots & p_N \\ p_1^2 & p_2^2 & \dots & p_N^2 \\ \vdots & \vdots & & \vdots \\ p_1^{T-1} & p_2^{T-1} & \dots & p_N^{T-1} \end{array} \right] \end{aligned}$$
(6)

where each column corresponds to the impulse response of a pole \(p_i\), \(i=1,\dots ,N\) inside or near the unit disk in \(\mathbb {C}\). Note that the dictionary is completely parameterized by the magnitude and phase of its poles.
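For illustration, the following sketch builds a small dictionary of the form (6) from real poles sampled near the unit circle, normalizes its columns, and recovers a sparse code for a synthetic output sequence. The pole grid, the value of \(\lambda \), and the use of scikit-learn's Lasso as a stand-in solver are illustrative choices only, not the method of [36]:

```python
import numpy as np
from sklearn.linear_model import Lasso

T = 12
poles = np.linspace(0.5, 1.1, 25)                # candidate poles p_1, ..., p_N (real, for simplicity)
D = np.vander(poles, N=T, increasing=True).T     # D^(T): column i is [p_i^0, p_i, p_i^2, ...]
D = D / np.linalg.norm(D, axis=0)                # normalize columns to unit norm (cf. Sect. 5)

y = 0.8 * D[:, 4] - 0.3 * D[:, 17]               # synthetic output built from two atoms
c = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50000).fit(D, y).coef_
print(np.flatnonzero(np.abs(c) > 1e-3))          # indices of the atoms selected to explain y
```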

4 DYAN: A Dynamical Atoms-Based Network

In this section we describe in detail the architecture of DYAN, a dynamical atoms-based network. Figure 1(b) shows its block diagram, depicting its two main components: a dynamics-based encoder and a dynamics-based decoder. Figure 2 illustrates how these two modules work together to capture the dynamics at each pixel, reconstruct the input data, and predict future frames.

The goal of DYAN is to capture the dynamics of the input by mapping them to a latent space, which is learned during training, and to provide the inverse mapping from this feature space back to the input domain. The implicit assumption is that the dynamics of the input data should have a sparse representation in this latent space, and that this representation should be enough to reconstruct the input and to predict future frames.

Fig. 2.

DYAN identifies the dynamics for each pixel, expressing them as a linear combination of a small subset of dynamics-based atoms from a dictionary (learned during training). The selected atoms and the corresponding coefficients are represented using sparse feature vectors, found by a sparsification step. These features are used by the decoder to reconstruct the input data and predict the next frame by using the same dictionary, but with an extended temporal horizon. See text for more details.

Following the ideas from dynamic system identification presented in Sect. 3, we propose to use as latent space the space spanned by a set of atoms that are the impulse responses of a set of first order (single real pole) and second order (pair of complex conjugate poles) LTI systems, as illustrated in Fig. 2. However, instead of using a set of random poles in the unit disk as proposed in [36], the proposed network learns a set of “good” poles by minimizing a loss function that penalizes poor reconstruction and prediction quality.

The main advantages of the DYAN architecture are:

  • Compactness: Each pole in the dictionary can be used by more than one pixel, and affine invariance makes it possible to re-use the same poles even if the data was captured under different conditions from the ones used in training. Thus, the total number of poles needed to have a rich dictionary, capable of modeling the dynamics of a wide range of inputs, is relatively small. Our experiments show that the total number of parameters of the dictionary, which are the magnitudes and phases of its poles, can be below two hundred and the network still produces high quality frame predictions.

  • Adaptiveness to the dynamics complexity: The network adapts to the complexity of the dynamics of the input by automatically deciding how many atoms it needs to use to explain them. The more complex the dynamics, the higher the order of the model needed, i.e. the more atoms will be selected and the longer-term memory of the data will be used by the decoder to reconstruct and predict frames.

  • Interpretability: Similarly to CNNs, which learn sets of convolutional filters that can be easily visualized, DYAN learns a basis of very simple dynamic systems, which are also easy to visualize by looking at their poles and impulse responses.

  • Performance: Since pixels are processed in parallel, independently of each otherFootnote 4, blurring in the predicted frames and computational time are both reduced.

4.1 DYAN’s Encoder

The encoder stage takes as input a set of T consecutive \(H\times W\) frames (or features), which are flattened into HW, \(T\times 1\) vectors, as shown in Fig. 1(b). Let one of these vectors be \(\mathbf {y}_l\). Then, the output of the encoder is the collection of the minimizers of HW sparsification optimization problems:

$$\begin{aligned} \mathbf {c}_l^* = \arg \min _{\mathbf {c}} \frac{1}{2} \Vert \mathbf {y}_l - {D}^{(T)}\mathbf {c} \Vert ^2_2 + \lambda \Vert \mathbf {c}\Vert _1 \qquad l = 1, \dots ,HW \end{aligned}$$
(7)

where \({D}^{(T)}\) is the dictionary with the learned atoms, which is shared by all pixels, and \(\lambda \) is a regularization parameter. Thus, using a \(T \times N\) dictionary, the output of the encoder stage is a set of HW sparse \(N \times 1\) vectors, which can be reshaped into \(H\times W \times N\) features.

In order to avoid working with complex poles \(p_i\), we use instead a dictionary \(D^{(T)}_{\rho ,\psi }\) with columns corresponding to the real and imaginary parts of increasing powers of the poles \(p_i = \rho _i e^{j\psi _i}\) in the first quadrant (\(0\le \psi _i\le \pi /2\)), of their conjugates and of their mirror images in the third and fourth quadrantFootnote 5: \(\rho _i^k \cos (k\psi _i)\), \(\rho _i^k \sin (k\psi _i)\), \((-\rho _i)^k \cos (k\psi _i)\), and \((-\rho _i)^k \sin (k\psi _i)\) with \(k=0,\dots ,T-1\). In addition, we include a fixed atom at \(p_i = 1\) to model constant inputs.

$$\begin{aligned} {D^{(T)}_{\rho ,\psi }} = \left[ \begin{array}{ccccc} 1 & 1 & 0 & \dots & 0 \\ 1 & \rho _1\cos \psi _1 & \rho _1\sin \psi _1 & \dots & -\rho _N\sin \psi _N \\ 1 & \rho _1^2 \cos 2\psi _1 & \rho _1^2 \sin 2\psi _1 & \dots & (-\rho _N)^2\sin 2\psi _N \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & \rho _1^{T-1}\cos (T-1)\psi _1 & \rho _1^{T-1}\sin (T-1)\psi _1 & \dots & (-\rho _N)^{T-1}\sin (T-1)\psi _N \end{array} \right] \end{aligned}$$
(8)
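
For illustration, such a dictionary can be assembled as follows (a PyTorch sketch with illustrative pole values; the exact ordering of the columns is immaterial):

```python
import torch

def build_dictionary(rho, psi, T):
    # Assemble D^(T)_{rho,psi} as in (8): a constant atom at p = 1, plus, for each
    # first-quadrant pole rho_i * exp(j * psi_i), the real/imaginary parts of its
    # powers (covering the pole and its conjugate) and of the powers of its mirror
    # image -rho_i. Sketch only.
    k = torch.arange(T, dtype=rho.dtype).unsqueeze(1)              # powers k = 0, ..., T-1
    mag = rho.unsqueeze(0) ** k                                    # rho_i^k
    ang = k * psi.unsqueeze(0)                                     # k * psi_i
    sign = (1.0 - 2.0 * (torch.arange(T, dtype=rho.dtype) % 2)).unsqueeze(1)   # (-1)^k
    cols = [torch.ones(T, 1, dtype=rho.dtype),                     # fixed atom at p = 1
            mag * torch.cos(ang), mag * torch.sin(ang),            # first-quadrant pole + conjugate
            sign * mag * torch.cos(ang), sign * mag * torch.sin(ang)]  # mirror images
    return torch.cat(cols, dim=1)                                  # T x (4 * num_poles + 1)

rho = torch.tensor([0.95, 1.02])        # illustrative magnitudes
psi = torch.tensor([0.3, 1.1])          # illustrative phases (radians)
D = build_dictionary(rho, psi, T=9)     # 9 x 9 here; the 40 poles of Sect. 5 give N = 161
```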

Note that while Eq. (5) finds one \(\mathbf {c}^*\) (and a set of poles) for each feature \(\mathbf {y}\), it is trivial to process all the features in parallel with significant computational time savings. Furthermore, (5) can be easily modified to force neighboring features, or features at the same location but from different channels, to select the same poles by using a group Lasso formulation.

Algorithm 1. FISTA

In principle, several sparse recovery algorithms could be used to solve Problem (7), including LARS [9], ISTA and FISTA [2], and LISTA [8]. Unfortunately, the structure of the dictionary needed here does not admit a matrix factorization of its Gram kernel, making the LISTA algorithm a poor choice in this case [24]. Thus, we chose to use FISTA, shown in Algorithm 1, since very efficient GPU implementations of this algorithm are available.
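For reference, a generic batched FISTA sketch for Problem (7) is shown below (PyTorch); the fixed iteration count and the step size computed from the spectral norm of D are illustrative choices:

```python
import torch

def fista(D, Y, lam, num_iters=100):
    # Solve min_C 0.5 * ||Y - D C||_F^2 + lam * ||C||_1, one lasso per column of Y
    # (i.e. per pixel). Generic FISTA sketch; the step size is 1/L with L the largest
    # eigenvalue of D^T D, and the iteration count is fixed rather than adaptive.
    L = torch.linalg.matrix_norm(D, ord=2) ** 2
    C = torch.zeros(D.shape[1], Y.shape[1], dtype=D.dtype)
    Z, t = C.clone(), 1.0
    DtD, DtY = D.t() @ D, D.t() @ Y                      # precomputed Gram matrix / correlations
    for _ in range(num_iters):
        C_prev = C
        grad = DtD @ Z - DtY                             # gradient of the smooth term at Z
        C = torch.nn.functional.softshrink(Z - grad / L, lambd=float(lam / L))
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        Z = C + ((t - 1.0) / t_next) * (C - C_prev)      # Nesterov momentum step
        t = t_next
    return C                                             # N x HW sparse codes

# e.g. C = fista(D, Y, lam=0.01), with Y the T x HW matrix of flattened frames
```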

4.2 DYAN’s Decoder

The decoder stage takes as input the output of the encoder, i.e. a set of HW sparse \(N\times 1\) vectors, and multiplies them by the encoder dictionary, extended with one more row:

$$\begin{aligned} \left[ \begin{array}{ccccc} 1 & \rho _1^{T}\cos (T\psi _1) & \rho _1^{T}\sin (T\psi _1) & \dots & (-\rho _N)^{T}\sin (T\psi _N) \end{array} \right] \end{aligned}$$
(9)

to reconstruct the T input frames and to predict the \((T+1)\)-th frame. Thus, the output of the decoder is a set of HW \((T+1)\times 1\) vectors that can be reshaped into \((T+1)\) \(H \times W\) frames.
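
In code, the decoder thus reduces to a single matrix product with the extended dictionary (a toy-sized sketch with stand-in tensors for the dictionary and the codes):

```python
import torch

T, N, H, W = 9, 161, 128, 160
D_ext = torch.randn(T + 1, N)                  # stands in for the (T+1)-row dictionary, Eq. (8) plus row (9)
C = torch.zeros(N, H * W)                      # stands in for the sparse codes from the encoder
Y_hat = D_ext @ C                              # (T+1) x HW
reconstruction = Y_hat[:T].reshape(T, H, W)    # reconstructions of the T input frames
prediction = Y_hat[T].reshape(H, W)            # predicted frame T+1
```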

4.3 DYAN’s Training

The parameters of the dictionary are learned using Steepest Gradient Descent (SGD) and the \(\ell _2\) loss function. The back propagation rules for the encoder and decoder layers can be derived by taking the subgradient of the empirical loss function with respect to the magnitudes and phases of the first quadrant poles and the regularization parameters. Here, for simplicity, we give the derivation for \(D^{(T)}_p\), but the one for \(D^{(T)}_{\rho ,\psi }\) can be derived in a similar manner.

Let \(\mathbf {c}^*\) be the solution of one of the minimization problems in (5), where we dropped the subscript l and the superscript (T) to simplify notation, and define

$$ \mathcal{F} = \frac{1}{2}\Vert \mathbf {y} - {D \mathbf {c}^*}\Vert _2^2 + \lambda \sum _{i=1}^N c_i^* \text {sign}(c_i^*) $$

Taking subgradients with respect to \(\mathbf {c}^*\):

$$ \frac{\partial \mathcal{F}}{\partial \mathbf {c}^*} = -D^T(\mathbf {y} - D\mathbf {c}^*) + \lambda \mathbf {v} = 0 $$

where \(\mathbf {v} = \left[ \begin{array}{ccc}v_1&\dots&v_N\end{array}\right] ^T\), \(v_i = \) sign\((c_i^*)\) if \(c_i^*\ne 0\), and \(v_i = g\), where \(-1 \le g \le 1\), otherwise. Then,

$$ \mathbf {c}^* = (D_\varLambda ^TD_\varLambda )^{-1}\left[ D^T_\varLambda \mathbf {y} - \lambda \mathbf {v}\right] $$

and

$$ \left. \frac{\partial {{\mathbf {c}^*}}}{\partial {D_{ij}}}\right| _\varLambda = (D^T_\varLambda D_\varLambda )^{-1} \left[ \frac{\partial {D^T_\varLambda \mathbf {y}}}{\partial {D_{ij}}} - \frac{\partial {D^T_\varLambda D_\varLambda }}{\partial {D_{ij}}}{\mathbf {c}^*}\right] $$

where the subscript \(\left. . \right| _\varLambda \) denotes the active set of the sparse code \(\mathbf {c}\), \(D_\varLambda \) is composed of the active columns of D, and \(\mathbf {c}_\varLambda \) is the vector with the active elements of the sparse code. Using the structure of the dictionary, we have

$$ \frac{\partial {\mathbf {c}^*_\varLambda }}{\partial {p_k}} = \sum _{i=1}^M (i-1)p_k^{i-2} \frac{\partial {\mathbf {c}^*_\varLambda }}{\partial {D_{ik}}}; \ \frac{\partial {\mathbf {c}^*_\varLambda }}{\partial {y_j}}= (D^T_\varLambda D_\varLambda )^{-1} \frac{\partial {D^T_\varLambda \mathbf {y}}}{\partial {y_j}};\ \frac{\partial {\mathbf {c}^*_\varLambda }}{\partial {\lambda }} = -(D^T_\varLambda D_\varLambda )^{-1}\text {sign}(\mathbf {c}^*_\varLambda ) $$
Fig. 3.

Temporal evolution of a dictionary trained with the KITTI dataset.

Figure 3 shows how a set of 160 uniformly distributed poles within a ring around the unit circle move while training DYAN with videos from the KITTI video dataset [6], using the above back propagation and an \(\ell _2\) loss function. As shown in the figure, after only 1 epoch the poles have already moved significantly, and after 30 epochs they move more and more slowly.

5 Implementation Details

We implementedFootnote 6 DYAN using PyTorch version 0.3. A DYAN trained using raw pixels as input produces nearly perfect reconstructions of the input frames. However, predicted frames may exhibit small lags at edges due to changes in pixel visibility. This problem can be easily addressed by training DYAN using optical flow as input. Therefore, given a video with F input frames, we use coarse-to-fine optical flow [26] to obtain \(T = F-1\) optical flow frames. Then, we use these optical flow frames to predict the next optical flow frame with DYAN, and use it to warp frame F into the predicted frame \(F+1\). The dictionary is initialized with 40 poles, uniformly distributed on a grid of \(0.05 \times 0.05\) in the first quadrant within a ring around the unit circle defined by \(0.85 \le \rho \le 1.15\), their 3 mirror images in the other quadrants, and a fixed pole at \(p=1\). Hence, the resulting encoder and decoder dictionaries have \(N = 161\) columnsFootnote 7 and T and \(T+1\) rows, respectively. Each of the columns in the encoding dictionary was normalized to have norm 1. The maximum number of iterations for the FISTA step was set to 100.
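
For illustration, a minimal backward-warping sketch is shown below (written against a modern PyTorch API rather than version 0.3; the sign convention assumed for the predicted flow and the interpolation settings are illustrative choices):

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    # Backward-warp `frame` (1, C, H, W) using `flow` (1, 2, H, W) given in pixels.
    # The sign convention for the predicted flow is an assumption in this sketch.
    _, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=frame.dtype),
                            torch.arange(W, dtype=frame.dtype), indexing="ij")
    grid_x = 2.0 * (xs + flow[0, 0]) / (W - 1) - 1.0        # normalize to [-1, 1]
    grid_y = 2.0 * (ys + flow[0, 1]) / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)   # (1, H, W, 2), (x, y) order
    return F.grid_sample(frame, grid, align_corners=True)

# e.g. predicted_frame = warp(frame_F, predicted_flow)
```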

6 Experiments

In this section, we describe a set of experiments using DYAN to predict the next frame and compare its performance against state-of-the-art video prediction algorithms. The experiments were run on widely used public datasets, and they illustrate the generative and generalization capabilities of our network.

Fig. 4.

Qualitative results for our model trained on the KITTI dataset and tested on the Caltech dataset, without fine tuning. The figure shows examples from Caltech test set S10, sequence V010, with ground truth on the top row and predicted frames below. As shown in the figure, our model produces sharp images and fully captures the motion of the vehicles and the camera.

6.1 Car Mounted Camera Videos Dataset

We first evaluate our model on street view videos taken by car-mounted cameras. Following the experimental settings in [17], we trained our model on the KITTI dataset [6], including 57 recording sessions (around 41k frames) from the City, Residential, and Road categories. Frames were center-cropped and resized to \(128\times 160\) as done in [19]. For these experiments, we trained our model with 10 input frames (\(F=10, T=9\)) and \(\lambda = 0.01\) to predict frame 11. Then, we directly tested our model, without fine tuning, on the testing partition (4 sets of videos) of the Caltech Pedestrian dataset [4], which consists of 66 video sequences. At testing time, each sequence was split into sequences of 10 frames, and frames were also center-cropped and resized to \(128\times 160\). Also following [17], the quality of the predictions for these experiments was measured using MSE [19] and SSIM [34] scores, where lower MSE and higher SSIM indicate better prediction results.
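
For reference, a per-frame sketch of these two metrics (using a recent scikit-image; it assumes frames scaled to [0, 1] and omits the dataset-level averaging of the protocol in [17, 19]):

```python
import numpy as np
from skimage.metrics import structural_similarity

def mse_ssim(pred, gt):
    # Per-frame MSE and SSIM for H x W x 3 images with values in [0, 1].
    mse = float(np.mean((pred - gt) ** 2))
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    return mse, ssim
```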

Qualitative results on the Caltech dataset are shown in Fig. 4, where it can be seen that our model accurately predicts sharp future frames. Also note that even though in this sequence there are cars moving in opposite directions and occluding each other, our model predicts all motions well. We compared DYAN's performance against three state-of-the-art approaches: DualMoGAN [17], BeyondMSE [22] and Prednet [19]. For a fair comparison, we normalized our image values between 0 and 1 before computing the MSE score. As shown in Table 1, our model outperforms all other algorithms, even without fine tuning on the new dataset. This result shows the superior predictive ability of DYAN, as well as its transferability.

For these experiments, the network was trained on 2 NVIDIA TITAN XP GPUs, using one GPU for each of the optical flow channels. The model was trained for 200 epochs and it only takes 3 KB to store it on disk. Training only takes 10 s/epoch, and it takes an average of 230 ms (including warping) to predict the next frame, given a sequence of 10 input frames. In comparison, [17] takes 300 ms to predict a frame.

Table 1. MSE and SSIM scores of next frame prediction test on the Caltech dataset after training on the KITTI dataset.
Fig. 5.

Qualitative results for next frame prediction test on UCF-101. For each sequence, the first row shows the 4 input frames, while the ground truth and our prediction are shown on the second row. We also enlarge the main moving portion inside each frame to show how similar our predictions are to the ground truth.

6.2 Human Action Videos Dataset

We also tested DYAN on generic videos from the UCF-101 dataset [30]. This dataset contains 13,320 videos from 101 different action categories with an average length of 6.2 s. Input frames are \(240 \times 320\). Following the state-of-the-art algorithms [18] and [22], we trained using the first split, with \(F= 4\) frames as input to predict the 5th frame. For testing, we adopted the test set provided by [22] and the evaluation script and masks provided by [18] to mask in only the moving object(s) within each frame, resized to \(256 \times 256\). There are in total 378 video sequences in the test set: every 10th video sequence was extracted from the UCF-101 test list and then 5 consecutive frames were used, 4 for input and 1 for ground truth. Quantitative results with PSNR [22] and SSIM [34] scores, where higher scores indicate better predictions, are given in Table 2 and qualitative results are shown in Fig. 5. These experiments show that DYAN predictions achieve superior PSNR and SSIM scores by identifying the dynamics of the optical flow instead of assuming it is constant as DVF does.
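
For illustration, a minimal helper computing PSNR over the masked moving regions is sketched below (hypothetical; the actual evaluation script provided by [18] may differ in detail):

```python
import numpy as np

def masked_psnr(pred, gt, mask, data_range=255.0):
    # PSNR restricted to the pixels selected by a binary motion mask,
    # for images with values in [0, data_range].
    err = (pred[mask > 0] - gt[mask > 0]) ** 2
    return 10.0 * np.log10(data_range ** 2 / err.mean())
```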

Fig. 6.

Results for our model trained on the UCF-101 dataset with \(F = 4\). Scores for the other methods were obtained by running the code provided by the respective authors. All scores were computed using the masks from [18].

Finally, we also conducted a multi-step prediction experiment in which we applied our \(F = 4\) model to predict the next three future frames, where each prediction was used as a new available input frame. Figure 6 shows the results of this experiment, compared against the scores for BeyondMSE [22] and DVF [18], where it can be seen that the PSNR scores of DYAN’s predictions are consistently higher than the ones obtained using previous approaches.

For these experiments, DYAN was trained on 2 NVIDIA GeForce GTX GPUs, using one GPU for each of the optical flow channels. Training takes around 65 min/epoch, and predicting one frame takes 390 ms (including warping). Training converged after 7 epochs for \(F=4\). In contrast, DVF takes several days to train. DYAN's saved model only takes 3 KB on disk.

Table 2. PSNR and SSIM scores of next frame prediction on UCF-101 dataset. Results for [18, 22] were obtained by running the code provided by the respective authors.

7 Conclusion

We introduced a novel DYnamical Atoms-based Network, DYAN, designed using concepts from dynamic systems identification theory, to capture dynamics-based invariants in video sequences and to predict future frames. DYAN has several advantages compared to architectures previously used for similar tasks: it is compact; it is easy and fast to train; it is easy to visualize and interpret; it produces high quality predictions quickly; and it generalizes well across domains. Finally, the high quality of DYAN's predictions shows that the sparse features learned by its encoder do capture the underlying dynamics of the input, suggesting that they will be useful for other unsupervised learning and video processing tasks such as activity recognition and video semantic segmentation.