Continual Spatio-Temporal Graph Convolutional Networks

Graph-based reasoning over skeleton data has emerged as a promising approach for human action recognition. However, the application of prior graph-based methods, which predominantly employ whole temporal sequences as their input, to the setting of online inference entails considerable computational redundancy. In this paper, we tackle this issue by reformulating the Spatio-Temporal Graph Convolutional Neural Network as a Continual Inference Network, which can perform step-by-step predictions in time without repeated frame processing. To evaluate our method, we create a continual version of ST-GCN, CoST-GCN, alongside two derived methods with different self-attention mechanisms, CoAGCN and CoS-TR. We investigate weight transfer strategies and architectural modifications for inference acceleration, and perform experiments on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets. Retaining similar predictive accuracy, we observe up to a 109× reduction in time complexity, on-hardware accelerations of 26×, and reductions in maximum allocated memory of 52% during online inference.


I. INTRODUCTION
A human action can be described by a temporal sequence of human body poses, each of which is represented by a set of spatial joint coordinates forming a body skeleton. Accordingly, skeleton-based action recognition methods process a sequence of skeletons (instead of an image sequence) to recognize the performed action. Compared with predicting actions from videos, a sequence of skeleton data not only gives the spatial and temporal features of the body poses, but also provides robustness against different background variations and context noise [1]. The estimation of such skeletal data has become a staple in the human action recognition toolkit thanks to publicly available toolboxes such as OpenPose [2].
Fig. 1: Continual and prior methods during online inference. Numbers denote streams for each method. One stream contains the joint modality; two streams add the bone modality; and four streams add joint and bone motion. * Architecture modification with stride one and no padding.

Early deep learning methods for skeleton-based action recognition either rearrange the body joint coordinates of each skeleton to form a pseudo-image used to train a CNN model [3,4,5], or concatenate the human body joints into a sequence of feature vectors used to train an RNN model [6,7,8]. However, these methods cannot take advantage of the non-Euclidean structure of the skeletons. Recently, Graph Convolutional Networks (GCNs) have shown prowess in the modeling of skeleton data [9]. ST-GCN [10] was the first GCN-based method proposed for skeleton-based action recognition. It uses spatial graph convolutions to extract the per time-step features of each skeleton and employs temporal convolutions to capture time-varying dynamics throughout the skeleton sequence. Since its publication, several methods have sprung from ST-GCN, enhancing feature extraction or optimizing the structure of the model. 2s-AGCN [11] proposed to learn the graph structure in each GCN layer adaptively based on input graph node similarity, and also utilized an attention method which highlights both the existing spatial connections in the graph (bones) and new potential connections between joints. MS-AAGCN [12] extended 2s-AGCN by proposing a multi-stream framework which uses four different data streams for training the model. Moreover, it enhanced the adaptive graph convolution in 2s-AGCN with a spatio-temporal channel attention module to highlight the most important skeletons, nodes in each skeleton, and features of each node. Following the idea of utilizing different data streams, another multi-stream framework was proposed by [13], which constructs a spatio-temporal view-invariant model (STVIM) to capture the spatial and temporal dynamics of joints and bones in skeletons using Geometric Algebra. MS-G3D [14] proposed multi-scale graph convolutions for long-range feature extraction, and DGNN [15] modeled the spatial connections between the graph nodes with a directed graph, utilizing both node features and edge features simultaneously. Similarly, HSR-TSL [16] is a model with hierarchical spatial reasoning and temporal stack learning, which employs a hierarchical residual graph neural network to capture two-level spatial features and a temporal stack learning network (TSLN) composed of multiple skip-clip LSTMs to capture the temporal dynamics of skeleton sequences. Recently, STF-Net [17] proposed to capture robust movement patterns from the skeleton joint and part topology structures as well as temporal dependencies, using a multi-grain contextual focus module (MCF) and a temporal discrimination focus module (TDF) integrated into a GCN network.
Unfortunately, the high computational complexity of these GCN-based methods makes them infeasible for real-time applications and resource-constrained online inference settings. Multiple approaches have recently been explored to increase the efficiency of skeleton-based action recognition: GCN-NAS [18] and PST-GCN [19] are neural architecture search based methods which try to find an optimized ST-GCN architecture to increase the efficiency of the classification task; Tripool [20] is a novel graph pooling method which optimizes a triplet pooling loss to learn an optimized graph topology by removing redundant nodes and learning a hierarchical graph representation.
ShiftGCN [21] replaces graph and temporal convolutions with a zero-FLOPs shift graph operation and point-wise convolutions as an efficient alternative to the feature-propagation rule for GCNs [22]; ShiftGCN++ [23] boosts the efficiency of ShiftGCN further via progressive architecture search, knowledge distillation, explicit spatial positional encodings, and a Dynamic Shift Graph Convolution; SGN [24] utilizes semantic information such as joint type and frame index as side information to design a compact semantics-guided neural network (SGN) for capturing both spatial and temporal correlations at the joint and frame levels; TA-GCN [25] makes inference more efficient by selecting, from a sequence, a subset of key skeletons which hold the most important features for action recognition to be processed by the spatio-temporal convolutions.
Yet, none of the above-described GCN-based methods are tailored to online inference, where the input is a continual stream of skeletons and step-by-step predictions are required. During online inference, these methods would need to rely on sliding window-based processing, i.e., storing the T − 1 prior skeletons, appending the newest skeleton to obtain a sequence of length T, and then performing their prediction on the whole sequence.
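The sliding-window scheme described above can be sketched as follows; `predict` stands in for any clip-based model, and the class and method names are our own:

```python
from collections import deque

class SlidingWindowPredictor:
    """Online inference for a clip-based model: cache the T - 1 prior
    skeletons and re-process the full length-T sequence for every new
    frame. Every skeleton is thus processed up to T times in total."""

    def __init__(self, predict, T):
        self.predict = predict          # clip-based model; takes a length-T list
        self.window = deque(maxlen=T)   # oldest skeleton is evicted automatically

    def step(self, skeleton):
        self.window.append(skeleton)
        if len(self.window) < self.window.maxlen:
            return None                 # not enough frames for a full clip yet
        return self.predict(list(self.window))
```

With a clip length of T, each new frame triggers a full forward pass over T skeletons; this is exactly the redundancy that the Continual formulation removes.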
In this paper, we reduce such redundant computations by reformulating the ST-GCN and its derived methods as Continual Inference Networks, which process skeletons one by one and produce updated predictions for each time-step without the need to include past skeletons in every input, as is the case for the prior GCN-based methods. This is achieved by using Continual Convolutions in place of regular ones for aggregating temporal information, leading to a greatly reduced number of floating point operations (see Figure 1). In particular, we propose the Continual Spatio-Temporal Graph Convolutional Network (CoST-GCN), CoAGCN, and CoS-TR and evaluate them on the skeleton-based action recognition datasets NTU RGB+D 60 [26], NTU RGB+D 120 [27], and Kinetics Skeleton 400 [28] with striking results: Our continual models achieve up to 108× FLOPs reduction, 26× speedup, and 52% reduction in max allocated GPU memory compared to the corresponding non-continual models. The remainder of the paper is structured as follows: Section II provides an introduction to skeleton-based action recognition and the related methods, from which we derive continual counterparts, Section II-D describes Continual Inference Networks, and Section III presents our proposed Continual Spatio-temporal Graph Convolutional Networks. Experiments on weight transfer strategies, performance benchmarks, and comparisons with prior works are offered in Section IV, and a conclusion is given in Section V.

A. Spatio-Temporal Graph Convolutional Network
GCN-based models for skeleton-based action recognition [10,19,25] operate on sequences of skeleton graphs. The spatio-temporal graph of skeletons G = (V, E) has the human body joint coordinates as nodes V and the spatial and temporal connections between them as edges E. Figure 2 (right) illustrates such a spatio-temporal graph, where the spatial graph edges encode the human bones and the temporal edges connect the same joints in subsequent time-steps. We model this graph as a tensor X ∈ R^(C^(0)×T×V), where C^(0) is the number of input channels of each joint, T denotes the number of skeletons in a sequence, and V is the number of joints in each skeleton. A binary adjacency matrix A ∈ R^(V×V) encodes the skeleton structure with ones in positions connecting two vertices in a skeleton and zeros elsewhere.
The ST-GCN [10] and AGCN [11] methods refine the spatial structure of each skeleton by employing a partitioning method which categorizes the neighboring nodes of each body joint into three subsets: (1) the root node itself, (2) the root's neighboring nodes which are closer to the skeleton's center of gravity (COG) than the root itself, and (3) the remaining neighboring nodes of the root node. An example of this subset partitioning is shown in Figure 2 (left). Accordingly, the graph structure of each skeleton is represented by three normalized binary adjacency matrices {Â_p ∈ R^(V×V) | p = 1, 2, 3}, each of which is defined as

    Â_p = D_p^(−1/2) A_p D_p^(−1/2),    (1)

where D_p denotes the degree matrix of the neighboring subset p. Inspired by the GCN aggregation rule [22], the spatial graph convolution receives the hidden representation of the previous layer H^(l−1) as input, where H^(0) = X, and performs the following graph convolution (GC) transformation:

    GC(H^(l−1)) = σ( BN( Σ_{p=1}^{3} (M_p^(l) ⊙ Â_p) H^(l−1) W_p^(l) ) ),    (2)

where σ(·) denotes a ReLU non-linearity, W_p^(l) is the weight matrix which transforms the features of the neighboring subset p, and BN(·) denotes batch normalization. Moreover, a learnable matrix M_p^(l) ∈ R^(V×V) is multiplied element-wise (⊙) with its corresponding adjacency matrix Â_p as an attention mechanism that highlights the most important connections in each spatial graph. In order to retain the model's stability, the input to a layer is added to the transformed features through a residual connection Res(H^(l−1)), which is defined as

    Res(H^(l−1)) = H^(l−1) W^(l),    (3)

where W^(l) is a learnable mapping matrix which transforms the layer's input to have the same channel dimension as the layer's output.
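The normalized adjacency and the graph convolution of Eq. (2) can be sketched as follows (a minimal NumPy illustration for a single time-step; batch normalization is omitted, and the function names are our own):

```python
import numpy as np

def normalize_adjacency(A_p):
    """Symmetric normalization of a partition's adjacency matrix,
    D_p^(-1/2) A_p D_p^(-1/2); an epsilon guards against zero-degree nodes."""
    deg = A_p.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-6))
    return d_inv_sqrt[:, None] * A_p * d_inv_sqrt[None, :]

def graph_conv(H, A_parts, M_parts, W_parts):
    """Spatial graph convolution over V nodes for one time-step.
    H: (V, C_in) node features; A_parts, M_parts: lists of (V, V) matrices
    for the three partitions; W_parts: list of (C_in, C_out) weights."""
    out = sum((M * normalize_adjacency(A)) @ H @ W   # element-wise attention M
              for A, M, W in zip(A_parts, M_parts, W_parts))
    return np.maximum(out, 0.0)  # ReLU non-linearity
```

With the identity adjacency, an all-ones attention matrix, and identity weights, the operation reduces to a plain ReLU over the node features, which makes the sketch easy to verify.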
The graph convolution block is followed by a temporal convolution, TC(·), which propagates the features of the graph nodes through different time steps to capture the motions taking place in an action. In the temporal graph, each node has only two fixed neighbors, which are its corresponding nodes in the previous and next skeletons. The adjacency matrices and partitioning process are not involved in temporal feature propagation. In practice, the temporal convolution is a standard 2D convolution which receives the output of the graph convolution obtained in Eq. (2) and performs a transformation with a kernel of size C^(l) × K × 1 to keep the node feature dimension unchanged and aggregate the features over K consecutive time steps.
The whole spatio-temporal convolution block has the form

    H^(l) = σ( Res(H^(l−1)) + BN(TC(GC(H^(l−1)))) ).    (4)

The ST-GCN model is composed of multiple such spatio-temporal convolutional blocks. A global average pooling and a fully connected layer perform the final classification.
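Assuming a per-time-step spatial convolution like the one above, the block of Eq. (4) composes it with a temporal convolution and a residual. A minimal sketch with an identity residual and a 'valid' (unpadded) temporal convolution, helper names our own:

```python
import numpy as np

def temporal_conv(H, W_t, stride=1):
    """'Valid' temporal convolution over a (T, V, C_in) feature map.
    W_t: (K, C_in, C_out) kernel aggregating K consecutive time-steps."""
    T = H.shape[0]
    K = W_t.shape[0]
    return np.stack([
        sum(H[t - K + 1 + k] @ W_t[k] for k in range(K))
        for t in range(K - 1, T, stride)
    ])

def st_gcn_block(H, gc, W_t):
    """One spatio-temporal block, Eq. (4): ReLU(Res(H) + TC(GC(H))).
    gc: per-time-step spatial graph convolution; the residual is the identity,
    so C_in must equal C_out here. Batch norm omitted for brevity."""
    G = np.stack([gc(H[t]) for t in range(H.shape[0])])  # spatial conv per step
    Y = temporal_conv(G, W_t)
    res = H[W_t.shape[0] - 1:]  # align identity residual with 'valid' TC output
    return np.maximum(res + Y, 0.0)
```

With an identity `gc` and a length-one identity temporal kernel, the block reduces to ReLU(2H), which serves as a quick sanity check of the composition.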

B. Adaptive Graph Convolutional Neural Networks
The fixed graph structure used in Eq. (2) is defined based on natural connections in the human body skeleton, which restricts the model's capacity and flexibility in representing different action classes. However, for some action classes such as "touching head" it makes sense to model a connection between hand and head even though such a connection is not naturally present in the skeleton. AGCN [11] allows for such possibilities by adopting an adaptive graph convolution which utilizes a data-dependent graph structure as follows:

    GC(H^(l−1)) = σ( BN( Σ_{p=1}^{3} M_p^(l) H^(l−1) W_p^(l) ) ),    (5)

where M_p^(l) is defined as

    M_p^(l) = Â_p + B_p^(l) + C_p^(l).    (6)

The attention matrix in this definition is composed of two learnable matrices which are optimized along with the other model parameters in an end-to-end manner. B_p^(l) ∈ R^(V×V) is a square matrix that can be unique for each layer and each sample, and C_p^(l) ∈ R^(V×V) is a similarity matrix whose elements determine the strength of the pair-wise connections between nodes. This matrix is computed by first transforming the feature matrix H^(l−1) with two embedding transformations W_θ^(l) and W_φ^(l). The obtained feature maps are then reshaped to C_e T × V and multiplied to obtain the C_p^(l) matrix as follows:

    C_p^(l) = softmax( (W_θ^(l) H^(l−1))ᵀ (W_φ^(l) H^(l−1)) ),    (7)

where softmax normalizes the matrix values. The additive attention mechanism in Eq. (6) thus lets the adaptive graph convolution in Eq. (5) model the skeleton structure as a fully connected graph.
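The similarity matrix of Eq. (7) can be sketched as follows (NumPy, single partition; `C_e` denotes the embedding dimension and all names are our own):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_matrix(H, W_theta, W_phi):
    """Data-dependent node-similarity matrix C of Eq. (7), sketched.
    H: (T, V, C) features; W_theta, W_phi: (C, C_e) embedding weights.
    Embedded features are flattened to (C_e*T, V) before the pairwise product."""
    T, V, C = H.shape
    theta = (H @ W_theta).transpose(2, 0, 1).reshape(-1, V)  # (C_e*T, V)
    phi = (H @ W_phi).transpose(2, 0, 1).reshape(-1, V)      # (C_e*T, V)
    return softmax(theta.T @ phi, axis=-1)  # (V, V); each row sums to one
```

Because of the softmax, each row of the resulting (V, V) matrix is a normalized attention distribution over all nodes, i.e. the skeleton is effectively treated as a fully connected graph.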

C. Skeleton-based Spatial Transformer Networks
S-TR [29] is an attention-based method which models dependencies between body joints at each time step using the self-attention operation found in Transformers [30]. In this method, a Spatial Self-Attention (SSA) module is designed to adaptively learn data-dependent pairwise body joint correlations using multi-head self-attention.
The SSA module at each layer l applies trainable query, key, and value transformations W_q^(l), W_k^(l), and W_v^(l) to the feature vector h_it^(l−1) of node i at time step t to obtain the query, key, and value vectors q_it = W_q^(l) h_it^(l−1), k_it = W_k^(l) h_it^(l−1), and v_it = W_v^(l) h_it^(l−1). The correlation weight for each pair of nodes i, j at time t is obtained using a query-key dot product

    α_ijt = q_itᵀ k_jt.    (8)

The updated feature vector of node i at time t has size C^(l) and is obtained using a weighted aggregation of the value vectors:

    h̄_it = Σ_j softmax_j(α_ijt) v_jt.    (9)

For each attention head, the feature transformation is performed with a different set of learnable parameters, while the transformation matrices are shared across all the nodes. The output features of the SSA module are finally computed by applying a learnable linear transformation W_o^(l) to the concatenated features from the S attention heads:

    SSA(H^(l−1)) = concat(h̄^(1), …, h̄^(S)) W_o^(l).    (10)

SSA has similarities to a graph convolution operation on a fully connected graph for which the node connection weights are learned dynamically. The first three layers of the S-TR model extract features with GC and TC blocks as defined in Eq. (4), while in the remaining layers of the model SSA substitutes GC.
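A single attention head of SSA can be sketched as follows (NumPy; function and variable names are our own):

```python
import numpy as np

def spatial_self_attention(H_t, W_q, W_k, W_v):
    """One head of Spatial Self-Attention over the V joints of one time-step.
    H_t: (V, C) joint features; W_q, W_k, W_v: (C, d) projections.
    Returns the (V, d) updated joint features."""
    Q, K, Vv = H_t @ W_q, H_t @ W_k, H_t @ W_v
    scores = Q @ K.T                                # pairwise joint correlations
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over joints j
    return w @ Vv                                   # weighted value aggregation
```

A degenerate case makes the mechanics clear: with an all-zero key projection, every correlation weight becomes uniform and each joint's output is simply the mean of all value vectors.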

D. Continual Inference Networks
First introduced in [31] and subsequently formalized in [32], Continual Inference Networks are Deep Neural Networks that can operate efficiently both on fixed-size (spatio-)temporal batches of data, where the whole temporal sequence is known up front, and on continual data, where new input steps are collected continually and inference needs to be performed efficiently in an online manner for each received frame.

Definition (Continual Inference Network). A Continual Inference Network is a Deep Neural Network, which
• is capable of continual step inference without computational redundancy,
• is capable of batch inference corresponding to a non-continual Neural Network,
• produces identical outputs for batch inference and step inference given identical receptive fields,
• uses one set of trainable parameters for both batch and step inference.
Recurrent Neural Networks (RNNs) are a common family of Deep Neural Networks which possess the above-described properties. 3D Convolutional Neural Networks (3D CNNs), Transformers, and Spatio-Temporal Graph Convolutional Networks are not Continual Inference Networks, since they cannot make predictions time-step by time-step without considerable computational redundancy; they need to cache a sliding window of prior input frames and assemble them into a fixed-size sequence that is subsequently passed through the network to make a new prediction during online inference.
Recently, Continual 3D CNNs were made possible through the proposal of Continual 3D Convolutions [31]. Likewise, shallow Continual Transformers based on Continual Dot-product Attention were introduced in [32]. We continue this line of work by extending Spatio-Temporal Graph Convolutional Networks (ST-GCNs) with a Continual formulation as well. To do so, let us first present and expand on the theory of Continual Convolutions.

III. CONTINUAL SPATIO-TEMPORAL GRAPH CONVOLUTIONAL NETWORKS

In this section, we present and expand the theory on Continual Convolutions with notes on temporal stride. Then, we describe how the Continual Spatio-Temporal Graph Convolutional Networks are constructed.

A. Continual Convolution
The Continual Convolution operation produces the exact same output as the regular convolution does, but performs the computation in a streaming fashion while caching intermediary results.
Consider a single-channel 2D convolution over an input X ∈ R^(T×V) with temporal dimension T and a dimension of V vertices. Given a convolutional kernel with weights W ∈ R^(K×V), where K is the temporal kernel size, and a bias w_0, a regular convolution would compute the output y^(t) for time-step t ∈ {K, …, T} as

    y^(t) = w_0 + Σ_{k=1}^{K} Σ_{v=1}^{V} W_{k,v} X_{t−K+k, v}.    (11)

Considering this computation in the context of online processing, where T → ∞ and one input slice X^(t) is revealed in each time step, we find that the K − 1 previous slices, i.e. (K − 1) · V values, need to be stored between time-steps.
An alternative computational sequence is used in Continual Convolutions. Here, the input slice X^(t) is convolved with the kernel W in the same time-step it is received:

    a_k^(t) = Σ_{v=1}^{V} W_{k,v} X_v^(t),    k = 1, …, K.    (12a)

The intermediate results are then cached in memory m (K − 1 values stored between time-steps) and aggregated according to

    y^(t) = w_0 + Σ_{k=1}^{K} a_k^(t−K+k).    (12b)
A graphical representation of this is shown in Fig. 3.
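Eqs. (12a)–(12b) can be sketched as a stateful layer that is numerically identical to the regular 'valid' convolution once the receptive field is filled (class and variable names are our own):

```python
import numpy as np

class ContinualConv:
    """Continual temporal convolution over (V,)-slices, following
    Eqs. (12a)-(12b): partial products are cached between steps and
    aggregated once K slices have been seen."""

    def __init__(self, W, w0=0.0):
        self.W = W          # (K, V) kernel
        self.K = W.shape[0]
        self.w0 = w0
        self.cache = []     # partial products a^(s) for the newest steps

    def step(self, x):
        a = self.W @ x      # (K,): products of the new slice with each kernel row
        self.cache.append(a)
        if len(self.cache) < self.K:
            return None     # receptive field not yet filled
        # a computed at step t-K+k contributes its k-th entry to y^(t)
        y = self.w0 + sum(self.cache[k][k] for k in range(self.K))
        self.cache.pop(0)   # keep only the K-1 newest partial products
        return y
```

Running this step-by-step over a sequence reproduces the outputs of the regular convolution exactly, which is the defining property of a Continual Inference Network.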

B. Delayed Residual
The temporal convolutions of regular Spatio-Temporal Graph Convolution blocks usually employ zero-padding to ensure equal temporal shape for input and output feature maps. This zero-padding is discarded for Continual Convolutions to avoid continual redundancies [31]. To retain weight compatibility between the regular and continual networks, a delay of the residual connection is necessary. This delay amounts to

    delay = d_T · (k_T − 1) − p_T    (13)

steps, where k_T, d_T, and p_T are respectively the temporal kernel size, dilation, and zero-padding of the corresponding regular convolution.
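The delayed residual can be sketched as a FIFO buffer whose length follows Eq. (13) (a sketch under our reading of the delay formula; class and parameter names are our own):

```python
from collections import deque

class DelayedResidual:
    """FIFO delay line keeping the residual branch temporally aligned with a
    continual temporal convolution: delay = d_t * (k_t - 1) - p_t steps."""

    def __init__(self, k_t, d_t=1, p_t=0):
        self.delay = d_t * (k_t - 1) - p_t
        self.buffer = deque(maxlen=self.delay + 1)

    def step(self, x):
        self.buffer.append(x)
        if len(self.buffer) <= self.delay:
            return None  # buffer still filling
        return self.buffer[0]  # oldest element, i.e. x delayed by `delay` steps
```

For a typical ST-GCN temporal kernel (k_T = 9, d_T = 1, p_T = 4 in the regular network), the residual must be delayed by four steps.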

C. Temporal Stride
In Section III-A, it is assumed that one output is produced for each input received. However, many spatio-temporal networks, including ST-GCN [10], AGCN [11], and S-TR [29], use a temporal stride larger than one in their temporal convolutions. For offline computation, this has the beneficial effect of reducing the computational and memory complexity, but in the online computational setting, it also reduces the prediction rate. This is illustrated in Fig. 4. For a neural network with L layers, each with a temporal stride s_l, the effective network stride is given by

    s_eff = Π_{l=1}^{L} s_l,    (14)

and the corresponding network prediction rate is

    rate = 1 / s_eff.    (15)

Since an ST-GCN network has two layers with stride two, the corresponding Continual ST-GCN (CoST-GCN) has a prediction rate one fourth of the input rate.
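The effective stride and prediction rate amount to a product over per-layer strides; a minimal sketch (function names and the example stride layout are our own, assuming a ten-block ST-GCN-like stack with stride two in two blocks):

```python
from math import prod

def effective_stride(per_layer_strides):
    """Effective network stride: the product of the per-layer temporal strides."""
    return prod(per_layer_strides)

def prediction_rate(per_layer_strides):
    """Predictions per input step during continual inference (the reciprocal)."""
    return 1.0 / effective_stride(per_layer_strides)
```

With two stride-two layers among otherwise stride-one layers, the effective stride is 4 and the network produces one prediction per four input skeletons.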

D. Continual ST-GCN construction
Many well-performing methods for skeleton-based action recognition, including the ST-GCN [10], AGCN [11], and S-TR [29], share a common block structure, which can be described by Eq. (4). Here, the main difference between methods lies in how the graph information is processed, i.e. in their definition of GC(·).
The regular skeleton-based methods successively extract complete spatio-temporal skeleton features from the whole sequence with each block before classifying an action. Considering one block in isolation, the spatio-temporal feature extraction is given by a spatial (graph) convolution followed by a regular temporal convolution. Here, graph convolutions operate locally within a time-step¹, whereas the temporal convolution does not. Since the next block l takes as input H^(l−1), the output of the prior block and thereby its temporal convolution, the output of the next spatial (graph) convolution becomes a function of multiple prior time-steps. With regular temporal convolutions, features produced by multiple blocks cannot be trivially disentangled and cached in time. Accordingly, online operation with per-skeleton predictions can be attained by caching the T − 1 prior skeletons, concatenating these with the newest skeleton, and performing regular spatio-temporal inference. However, this comes with significant computational redundancy, where the complexity of online frame-wise inference is the same as for clip-based inference.

Fig. 4: Temporal stride in a Continual Convolution layer with temporal stride larger than one (right) reduces the prediction rate compared to a layer with stride one (left). The rate reduction is inherited by subsequent layers.
To alleviate this issue, we propose to employ Continual Convolutions in the temporal modeling of Spatio-temporal Graph Convolutional Networks. By restricting the GC(·) function to only operate locally within a time-step, we can define a Continual Spatio-Temporal block by replacing the original temporal 2D convolution with a continual one. To retain weight-compatibility with regular (non-continual) networks, we moreover need to delay the residual to keep temporal alignment. Given H^(l−1), Delay(Res(H^(l−1))) outputs the delayed residual in a first-in-first-out manner corresponding to the delay of the Continual Temporal Convolution as computed by Eq. (13). A graphical illustration of such a block is seen in Fig. 5. It should be noted that the restriction of temporal locality does influence the computations of some skeleton-based action recognition methods. For example, the AGCN originally computes one vertex attention weighting based on the whole spatio-temporal feature-map, whereas a Continual AGCN (CoAGCN) computes separate vertex attentions for each time-step.
The resulting Continual Spatio-temporal Graph Convolutional Network is defined by stacking multiple such blocks², followed by Continual Global Average Pooling [31] and a fully connected layer. The Continual Inference Networks retain the same computational complexity as regular networks during clip-based inference, but can perform online frame-by-frame predictions much more efficiently, as detailed in Section III-E. We should note that all methods which share the same block structure of Eq. (4) can be converted in this manner.

F. Limitations
While the computational complexity can be greatly reduced for spatio-temporal GCNs during online processing, i.e., where a prediction is made each time a new skeleton is estimated in a live system, no acceleration occurs during offline processing compared to the original model. When the whole skeleton sequence is available beforehand, the inference results and computational complexity are identical to prior works. The benefits of the Continual ST-GCN augmentation are thus limited to stream processing for networks which employ temporal convolutions. Accordingly, some networks, such as AGCN, whose attention was originally based on the whole spatio-temporal sequence, may need modification to avoid peeking into the future.

IV. EXPERIMENTS
A. Datasets

a) NTU RGB+D 60 [26]: A large indoor-captured dataset which is widely used for evaluating skeleton-based action recognition methods. This dataset contains 56,880 action clips and their corresponding 3D skeleton sequences captured by three Microsoft Kinect v2 cameras from three different views. The clips are performed by 40 different subjects and constitute 60 action classes. The NTU RGB+D 60 dataset comes with two benchmarks, Cross-View (X-View) and Cross-Subject (X-Sub). The X-View benchmark provides 37,920 skeleton sequences from camera views #2 and #3 as training data and 18,960 skeleton sequences from the first camera view as test data. The X-Sub benchmark provides 40,320 skeleton sequences from 20 subjects as training data and 16,560 skeleton sequences from the other 20 subjects as test data. In this dataset, each skeleton has 25 body joints with three channels each, and each action clip comes with a sequence of 300 skeletons.
b) NTU RGB+D 120 [27]: An extension of the NTU RGB+D 60 dataset containing an additional 57,600 skeleton sequences from 60 additional classes. NTU RGB+D 120 is currently the largest dataset providing 3D body joint coordinates for skeletons; in total, it contains 114,480 skeleton sequences from 120 action classes. The action clips in this dataset are performed by 106 subjects, and 32 different camera setups are used for capturing the videos. This dataset comes with two benchmarks: Cross-Subject (X-Sub) and Cross-Setup (X-Set). The X-Sub benchmark provides the skeleton sequences of 53 subjects as training data and the skeleton sequences from the other 53 subjects as test data. In the X-Set benchmark, the skeleton sequences with even camera setup IDs are provided as training data, and the test data contains the remaining skeleton sequences with odd camera setup IDs.

c) Kinetics Skeleton 400 [28]: A widely used dataset for action recognition containing 300,000 video action clips of 400 different classes collected from YouTube. Skeletons were extracted from each frame of these video clips using the OpenPose toolbox [2]. Each skeleton is represented by 18 body joints, and each body joint contains spatial 2D coordinates and the estimation confidence score as its three features. In our experiments, we use the dataset version provided by [10], which contains 240,000 skeleton sequences as training data and 20,000 skeleton sequences as test data.

B. Experimental Settings
All models were implemented within the PyTorch framework [33] using the Ride library [34]. Models were trained using an SGD optimizer with a learning rate of 0.1 at batch size 64, momentum of 0.9, and a one-cycle learning rate policy [35] with a cosine annealing strategy. For models which could not fit a batch size of 64 on an Nvidia RTX 2080 Ti, the learning rate was adjusted following the linear scaling rule [36]. Our source code is available at www.github.com/lukashedegaard/continual-skeletons.

C. Conversion and Fine-tuning Strategies
Though regular and Continual CNNs are weight-compatible, the direct transfer of weights is imperfect if the regular CNN was trained with zero-padding [31]. As in most CNNs, it is common practice to utilize padding in skeleton-based spatio-temporal networks to retain the temporal feature size in consecutive layers (even though temporal shrinkage is not a concern given the long input clips).
Another common design choice, which has a significant impact on the performance of Continual Inference Networks, is the utilization of a temporal stride larger than one. For regular networks, this has the benefit of reducing the computational complexity per clip prediction. In Continual Inference Networks, however, it reduces the prediction rate and actually increases the complexity per prediction (see Section III-C). In the continual case, it would thus be computationally beneficial to reduce the stride of all layers to one. However, this results in a stride-inflicted model-shift.
Thus far, the model-shift inflicted by padding removal and stride reduction, as well as how to best perform the conversion from a regular CNN to a Continual CNN in such cases, has not been studied. In this set of experiments, we explore strategies on how to best convert and fine-tune regular networks to achieve good frame-by-frame performance. We use a standard ST-GCN [10] trained on joints only as our starting point, and explore the accuracy achieved by:
1) converting from the regular network with equal padding and stride four (Reg_{p=eq,s=4}) to a Continual Inference Network where zero-padding is omitted (Co_{p=0,s=4});
2) reducing the network stride to one without fine-tuning (Co_{p=0,s=1});
3) fine-tuning the Co_{p=0,s=1} network (= Co*);
4) fine-tuning a conversion-optimal regular network which has no zero-padding and a stride of one (Reg_{p=0,s=1});
5) converting from Reg_{p=0,s=1} to Continual (= Co*).
As seen in Table I, the direct transfer of weights was found to have a modest negative impact on the accuracy (−0.3%) due to the removal of zero-padding. This is considerably less than was found in [31]. Our conjecture is that the smaller amount of zeros relative to clip size used in skeleton-based recognition (8 zeros per 300 frames, or 2.67%) compared to video-based recognition (e.g., 2 zeros per 16 frames, or 12.5%) makes the removal of zero-padding less detrimental, since zeros contribute relatively less to the downstream features. Lowering the stride to one and removing zero-padding reduced accuracy by a substantial amount, but allowed the Continual Inference Network to operate at much lower FLOPs. This accuracy drop is alleviated equally effectively by either (a) initializing Co_{p=0,s=1} with standard weights and fine-tuning in the continual regime, or (b) first fine-tuning the conversion-optimal regular network (Reg_{p=0,s=1}) and subsequently converting it to a Continual Inference Network, though the latter had lower training times in practice.
We fine-tuned the networks using the settings described in Section IV-B. As visualised in Fig. 6, we found that 20 epochs of fine-tuning suffice to recover accuracy on NTU RGB+D 60, with additional training yielding only marginal differences. Following this approach, the (zero padding, stride one) optimized Continual ST-GCN (CoST-GCN*) achieves a similar prediction accuracy while reducing the computational complexity by a factor of 107.7× relative to the original ST-GCN.

D. Conversion of Attention Architectures
As we explored in Section IV-C, the ST-GCN network architecture can easily be modified and fine-tuned to achieve high accuracy for frame-by-frame predictions with exceptionally low computational complexity. A natural follow-up question is whether this conversion is equally successful for more complicated spatio-temporal architectures that employ attention mechanisms. To investigate this, we conduct a similar transfer for two recent ST-GCN variants, the Adaptive GCN (AGCN) [11] and the Spatial Transformer Network (S-TR) [29]. While S-TR is easily converted to a Continual Inference Network (CoS-TR) by replacing convolutions, residuals, and pooling operators with Continual ones, the AGCN requires additional care. In the original version of AGCN, the vertex attention matrix C_p (see Eq. (7)) is computed from the global representations in the layer over all time-steps. Since this operation would be acausal in the context of a Continual Inference Network, we restrict it to utilize only the frame-specific subset of features. As a fine-tuning strategy, we first make the conversion from the regular network to a conversion-optimal network, and subsequently convert and evaluate the continual version.
Our results are presented in Table II. Here we see that all three architectures can be successfully converted to continual versions. The fine-tuned conversion-optimal models (marked by *) generally exhibit a higher computational complexity than their source models due to their stride decrease. While ST-GCN* attained increased performance by lowering stride, AGCN* and S-TR* suffer slight accuracy deterioration. This may be due to the smaller receptive fields of their attention mechanisms, which likely benefit from observing a larger context. Unlike the transfer from the original models with padding and stride four to continual models, the continual models with weights from ST-GCN*, AGCN*, and S-TR*, i.e. CoST-GCN*, CoAGCN*, and CoS-TR*, attain the exact same accuracy as their source models on both the X-Sub and X-View benchmarks, with two orders of magnitude fewer FLOPs per prediction during online inference.

Table II: NTU RGB+D 60 transfer accuracy and performance benchmarks. Noted is the top-1 validation accuracy using joints as the only modality. Max mem. is the maximum allocated memory on GPU during inference, noted in megabytes. Max mem., FLOPs, and throughput on CPU account for one new prediction with batch size one, while throughput on GPU uses the largest fitting power of two as batch size. Parentheses indicate the improvement / deterioration relative to the original model.

E. Speed and Memory
Diving deeper into the differences between regular and continual networks, we conduct throughput benchmarks on a MacBook Pro 16" with a 2.6 GHz 6-Core Intel Core i7 CPU and an NVIDIA RTX 2080 Ti GPU. Here, we measure the prediction time as the time it takes to transfer an input from CPU to GPU (if applicable), perform inference, and transfer the results back to CPU. On CPU, a batch size of one is used, while on GPU, the largest fitting power of two is employed (i.e. {128, 64, 256, 256} for the {Reg, Reg*, Co, Co*} models). We measure the maximum allocated memory during inference on GPU with batch size one.
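A measurement protocol of this kind can be sketched as a small timing harness; the function names below are illustrative and not the paper's benchmark code, and the GPU transfer steps would happen inside the `predict` callable:

```python
import time
import statistics

def benchmark(predict, make_input, warmup=10, runs=100):
    """Return the median per-prediction latency in milliseconds.

    `predict` runs one forward pass (including any host-device copies);
    `make_input` produces one input batch. A warm-up phase runs first so
    that caches and lazy initialization do not skew the timings.
    """
    for _ in range(warmup):
        predict(make_input())
    times = []
    for _ in range(runs):
        x = make_input()
        t0 = time.perf_counter()
        predict(x)
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)
```

Using the median rather than the mean makes the measurement robust to occasional scheduler hiccups on a shared machine.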
As seen in Table II, the changes in speed relative to the original models follow a similar trend to those seen for FLOPs. The non-continual stride one variants (denoted by *) exhibit roughly half the speed of the original models, while the continual models enjoy more than an order of magnitude speed-up on both CPU and GPU. As expected, the continual stride one models (Co*) attain the largest inference throughput. These relative speed-ups are lower than the relative FLOPs reductions due to the reads/writes of internal intermediary features in the Continual Convolutions: these are not accounted for by the FLOPs metric but still add to the runtime. This gap could be reduced on hardware with in- or near-memory computing.
Considering the maximum allocated memory at inference, we find that the continual models reduce memory by 20-52%. While the Continual Convolution and Pooling layers add some internal state to the memory consumption, the intermediary features passed between network layers are much smaller, i.e. one frame instead of 75 to 300 frames.
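The trade-off between internal state and intermediary feature size can be made concrete with a minimal sketch (our own simplification, not the paper's implementation) of a continual temporal convolution: a ring buffer of the last `k` frames replaces the full (C, T, V) feature map, so each step emits and passes on only a single (C, V) frame:

```python
import numpy as np

class ContinualTemporalConv:
    """Sketch of a per-step temporal convolution over skeleton features.

    Internal state is a (C_in, k, V) ring buffer of the k most recent
    frames, while inputs and outputs are single (C, V) frames.
    """
    def __init__(self, weights):          # weights: (C_out, C_in, k)
        self.w = weights
        self.k = weights.shape[2]
        self.buf = None                   # lazily allocated ring buffer
        self.i = 0                        # number of frames seen so far

    def step(self, frame):                # frame: (C_in, V)
        if self.buf is None:
            c_in, v = frame.shape
            self.buf = np.zeros((c_in, self.k, v))
        self.buf[:, self.i % self.k] = frame
        self.i += 1
        if self.i < self.k:
            return None                   # not enough history yet (no padding)
        # Gather the last k frames from the ring buffer in temporal order.
        order = [(self.i - self.k + j) % self.k for j in range(self.k)]
        window = self.buf[:, order]       # (C_in, k, V)
        return np.einsum('oik,ikv->ov', self.w, window)
```

During offline inference the same weights could be applied to a whole sequence at once, which is what makes the continual and regular formulations weight-compatible.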

F. Comparison with Prior Works
Most current state-of-the-art methods for skeleton-based action recognition are not able to efficiently perform frame-by-frame predictions in the online setting, since they are constrained to operate on whole skeleton sequences. Some RNN-based methods, e.g. Deep-LSTM [26] and VA-LSTM [7], can be used for redundancy-free frame-wise predictions, but their reported accuracy has been sub-par relative to newer methods that descended from ST-GCN. The recently proposed AGC-LSTM [38] does report results on par with CNN-based methods, and might also be able to provide redundancy-free frame-wise results, but we cannot validate this due to the lack of publicly available source code and details in the published paper.
While ShiftGCN and ShiftGCN++ offer impressively low FLOPs, it should be noted that the shift operation is not accounted for by the FLOPs metric. The FLOP counts of shift-based methods are therefore low compared to other methods and may not reflect on-hardware performance. Due to their temporal shifts both backward and forward in time, ShiftGCN and ShiftGCN++ cannot be easily transformed into Continual Inference Networks in their current form, though a Continual Shift operation could be devised in which temporal shifts only occur backwards in time. Nevertheless, ShiftGCN++ offers a remarkable accuracy/FLOPs trade-off.

Many works have shown that the inclusion of multiple modalities leads to increased accuracy [10,11,14,15,21]. In our context, these modalities amount to joints, which are the original coordinates of the body joints, and bones, which are the differences between connected joints. Additional joint motion and bone motion modalities can be retrieved by computing the differences between temporally adjacent frames in the joint and bone streams, respectively. Models are trained individually on each stream and combined by adding their softmax outputs prior to prediction. We evaluate and compare our proposed continual models, CoST-GCN, CoAGCN, and CoS-TR, with prior works on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets, as presented in Tables III, IV, and V. The CoST-GCN and CoS-TR models transfer well across all datasets, both with (*) and without padding and stride modifications. For CoAGCN, we find that the change to stride one deteriorates accuracy. We surmise that the attention matrix in Eq. (7) may need a larger receptive field (basing the attention on more nodes, as in AGCN) to provide beneficial adaptations; a per-step change in attention might provide more noise than clarity in middle and late network layers.
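The modality derivations and the softmax ensemble described above can be summarized in a short numpy sketch; the `(child, parent)` pairs encoding the skeleton topology are a hypothetical input, not the datasets' actual bone lists:

```python
import numpy as np

def to_bones(joints, pairs):
    """Bone modality: vector differences between connected joints.
    joints: (C, T, V) coordinates; pairs: list of (child, parent) indices."""
    bones = np.zeros_like(joints)
    for child, parent in pairs:
        bones[:, :, child] = joints[:, :, child] - joints[:, :, parent]
    return bones

def to_motion(x):
    """Motion modality: differences between temporally adjacent frames.
    The final frame has no successor and is left as zeros."""
    motion = np.zeros_like(x)
    motion[:, :-1] = x[:, 1:] - x[:, :-1]
    return motion

def ensemble(stream_scores):
    """Combine per-stream class scores by summing their softmax outputs."""
    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return sum(softmax(s) for s in stream_scores)
```

A two-stream model would ensemble the joint and bone streams; a four-stream model would additionally include `to_motion` applied to each.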
As found in prior works, the multi-stream approach with ensemble predictions gives a meaningful boost in accuracy across all experiments.
The Continual Skeleton models provide competitive accuracy at multiple orders of magnitude reduction of FLOPs per prediction in the online setting compared to the original non-continual models. While none of our results beat prior state-of-the-art accuracy in absolute terms, this was never the intent of the method. Rather, we have successfully shown that online inference can be greatly accelerated for models in the ST-GCN family, with state-of-the-art accuracy/complexity trade-offs to follow. For instance, our one- and two-stream CoS-TR* achieve Pareto optimal results on all subsets of the NTU RGB+D 60 and NTU RGB+D 120 datasets, meaning that no other model improves on either accuracy or FLOPs without worsening the other. Pareto optimal models have been highlighted in Tables III, IV, and V accordingly. Our approach may be used similarly to accelerate other architectures for skeleton-based human action recognition with temporal convolutions.
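The Pareto optimality criterion used for the highlighting can be stated precisely in a few lines; the model names and numbers below are placeholders, not values from our tables:

```python
def pareto_front(models):
    """Return the names of Pareto optimal models.

    models: dict mapping name -> (accuracy, flops). A model is dominated
    if some other model is at least as accurate with at most as many
    FLOPs, and strictly better on at least one of the two.
    """
    front = []
    for name, (acc, flops) in models.items():
        dominated = any(
            a >= acc and f <= flops and (a > acc or f < flops)
            for other, (a, f) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

With placeholder entries such as `{"A": (0.90, 10), "B": (0.85, 5), "C": (0.80, 20)}`, model C is dominated by A (lower accuracy and higher FLOPs), while A and B form the front.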

V. CONCLUSION
In this paper, we proposed Continual Spatio-Temporal Graph Convolutional Networks, an architectural enhancement for networks operating on time-dependent graph structures, which augments prior methods with the ability to perform predictions frame-by-frame during online inference while retaining weight compatibility for batch inference. We re-implement and benchmark three prominent methods for skeleton-based action recognition, ST-GCN, AGCN, and S-TR, as novel Continual Inference Networks, CoST-GCN, CoAGCN, and CoS-TR, and propose architectural modifications to maximize their frame-by-frame inference speed. Through experiments on three widely used human skeleton datasets, NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400, we show up to 26× on-hardware throughput increases, 109× reductions in FLOPs per prediction, and 52% reductions in maximum allocated memory during online inference, with accuracy similar to that of the original networks. During offline inference on full spatio-temporal sequences, the models operate identically to prior works. Our proposed architectural modifications are generic in nature and can be used for a variety of problems involving time-dependent graph structures, such as traffic control. Proposing methods based on Continual ST-GCNs that address the challenges posed by such problems is an interesting direction for future research. It is our hope that this innovation will make online processing of time-varying graphs viable on resource-constrained devices and systems with real-time requirements.