Are 3D convolutional networks inherently biased towards appearance?

Introduction
Human activity recognition from videos is a long standing research problem in computer vision.
Due to the effectiveness of 3D convolutional networks trained on large-scale video data, a number of works have performed empirical analyses to gain insight into the workings of such networks. One common finding is a bias in 3D convolutional networks towards scenes and objects (Hiley et al., 2020; Huang et al., 2018), which suggests that activity recognition happens not through actor-centric dynamics but by recognizing the context in which activities occur. Others suggested various methods to measure and separate motion in videos and analyzed model performance with respect to the observed dynamics (Sevilla-Lara et al., 2021; Bertasius et al., 2021; Yang et al., 2020). In contrast, we look directly at the motion and appearance aspects of 3D models themselves and define new metrics to analyze the temporality of the models and their layers.
Specifically, we question whether 3D convolutional networks are inherently biased towards static appearances and investigate whether such networks are able to capture both appearance and dynamic aspects given the right data.
As a starting point for our investigation, we confirm the clear connection between activities and their static context by considering a large-scale dataset of human actions, Kinetics (Kay et al., 2017), and one of locations, Places365 (Zhou et al., 2017). We provide observations about the sparse relation between activity and location classes and the natural imbalance between the action and location distributions. Additionally, we show the direct effect of activity-location exclusivity on the ability of the network to recognize an action class: a model better recognizes an action that takes place in a unique and consistent location.
Second, we propose new layer-wise and kernel-wise measurements that are designed to reveal the temporal properties of 3D convolutional models. The measurements use the weights of a network to determine how dynamically receptive individual kernels and layers are when trained on large-scale video data. We observe a consistent drop of temporality for deeper convolutional layers in all 3D models trained on Kinetics. These findings indicate that the trained models either reserve later layers for appearance only, or that the models adapt to the (lack of) discriminative motion statistics in the data.
Third, aiming to resolve the described disjunction, we investigate our hypothesis that the bias towards appearance and away from temporal dynamics is not inherent to 3D convolutional networks. To that end, we propose a new method to generate motion trajectories and construct two datasets in which motion and appearance are explicitly decoupled. Decoupling allows us to control the relation between these aspects in the temporality experiments. We find that 3D convolutional networks are not biased towards appearance by design. On the contrary, they successfully manage to learn complex motion patterns. Upon applying the proposed temporality measures to the trained networks, we find that increasing the variability of motion patterns in the training data results in a significant rise in temporality, especially in the deeper layers of the network. Additionally, increasing appearance variance penalizes model performance more strongly than increasing motion variance. Our investigation sheds new light on deep learning for video activity recognition by showing that the appearance bias in modern 3D models reflects data properties rather than architectural limitations. We demonstrate that models can adapt to both the motion and appearance content of the data, effortlessly learning complex movement trajectories.
In short, our contributions are as follows:
• we confirm the clear connection between activities and their locations in 3D convolutional networks;
• we define new temporality measurements, designed to reveal the temporal properties of 3D convolutional networks;
• we present two new datasets, in which we explicitly decouple motion and appearance;
• with the new datasets and measurements, we reveal new findings on the motion-appearance relation in 3D networks.

Background and related work
Initially, human activity recognition datasets focused on specific domains such as sports, cooking, household chores and outdoor activities; the background environment in such datasets therefore generally remained consistent across different action categories (Damen et al., 2018; Soomro and Zamir, 2014; Zhou et al., 2018b; Yoshikawa et al., 2018). Deep video understanding requires a network that captures the visual aspects of a video (e.g. actors, objects and locations) together with its temporal dynamics, and perhaps the most dominant trend in video understanding of the last decade is the successful conversion of 2D image networks into 3D video networks. Well-known approaches include recurrent networks over frame-level features (Srivastava et al., 2015), tracking the relations of interest points (Kalita et al., 2018; Yi et al., 2020), focusing on human body poses and gestures, and localizing key frames (Ng et al., 2015; Wang et al., 2011; Laptev, 2005; Aihara and Aoki, 2014; Ke et al., 2018; El-Ghaish et al., 2018; Agahian et al., 2020).
With the quantitative success of deep 3D networks came the need for model interpretation and explainability. Visualization and analysis of deep networks helps to understand the nature of learned features, their hierarchy, level of abstraction and generalization. Due to the hybrid nature of 3D spatiotemporal features, their study and interpretation is not straightforward, and many approaches have been proposed to address this. Gradient-based methods provide pixel-wise visual explanations, highlighting regions of high importance for the classification task, for example Guided Backpropagation (Springenberg et al., 2015), GradCAM (Selvaraju et al., 2017) and Deconvolution (Zeiler, 2014). Similarly, visualizations of video-based solutions compute spatio-temporal heatmaps for input voxels. T. Nagarajan (2019) focused on visualizing human-object interactions, Meng et al. (2019) proposed an interpretable attention mechanism that distinguishes the most relevant parts of the video, and the Class Feature Pyramids method discovers 3D kernels that are informative for a specific class (Stergiou et al., 2019). Such methods reveal local spatiotemporal properties of a given video that are relevant for the classification task. Shuffling and removing frames is another technique to test how deep models capture motion information: Zhou et al. (2018a) discover static and temporal action categories depending on the frame order, while others report performance changes when reversing the order of frames and show that the importance of the 'arrow of time' varies between datasets. Huang et al. (2018) propose to measure the amount of temporal information by isolating only the appearance in a video and training a model to generate temporal dynamics in a class-agnostic fashion. Sevilla-Lara et al. (2021) split videos into static and temporal categories using human annotations and observe statistical differences between the action categories. Choi et al.
(2019) observe the scene-bias problem and propose new training methods to mitigate it. Weinzaepfel and Rogez (2019) analyze popular 3D models and observe a superior performance of shallow human-centered models in contextless activity recognition.
Where current works focus on the temporality of action categories, we look at 3D networks themselves and focus on their key building blocks: 3D kernels. We find that 3D networks copy biases from biased datasets, but can adapt to varying amounts of motion in the data.

I3D is biased towards appearance on action datasets
We first try to answer the following video understanding question: how is the temporality of videos encoded in a network's parameters? Throughout our experiments, we use the I3D model pre-trained on the Kinetics dataset (Kay et al., 2017) as the main backbone, due to its canonical nature and widespread use in video recognition. All models are trained on four Tesla V100-PCIE-16GB GPUs, using the PyTorch library (Paszke et al., 2019). Unless specified otherwise, we follow the original data preprocessing and training setup.
As a starting point, we quantify the appearance bias in 3D models in Appendix A, before investigating whether this bias is inherent. Choi et al. (2019) and Girdhar and Ramanan (2019) have observed scene and object bias in action datasets: actions and static surroundings are highly correlated. The results in the appendix show three things: (i) more than half of the Kinetics samples take place in 16 location categories, (ii) I3D performs worse on video samples that take place in more common locations compared to more unique locations, and (iii) representations trained on actions also learn relevant scene representations. A network trained on Kinetics can be used directly to obtain effective representations for scene recognition; in fact, fine-tuning the entire network on training scenes does not improve results.
These experiments support that the Kinetics dataset extensively covers scenes and enables learning features useful for scene recognition.
Our next step is to find out which layers of 3D convolutional networks are responsible for temporal modeling rather than appearance modeling. We perform an examination at the layer level and the individual filter level and propose new measures to quantify the temporality of each convolutional kernel. We use these measures to analyze how much each layer is responsible for modeling dynamic representations of videos.

New measures for temporal dynamics in 3D filters
3D convolutions jointly learn visual and temporal representations, which raises the following question: how can we find correlates of temporal modeling in a 3D network and its parameters? We propose two novel metrics that aim to show the temporality of a convolutional layer and its kernels.
3D filters responsible for recognizing temporal changes should have weights receptive to those changes. Under this assumption, we define the first measure of the temporality of a layer as the average standard deviation of its filters across the temporal dimension (ASDT), see Fig. 1. A 3D convolutional filter is defined by a 5-dimensional tensor of size $(h, w, t, c_{in}, c_{out})$, denoting respectively height, width, temporal span, the number of channels in the previous layer, and the number of channels in the next layer. As an example, the second 3D convolutional layer of the I3D model is of size (3, 3, 3, 64, 192). If $W$ is the weight tensor and $i, j, k, l, m$ are the indices that cover the aforementioned dimensions, ASDT is defined as:

$$\mathrm{ASDT}(W) = \frac{1}{h\,w\,c_{in}\,c_{out}} \sum_{i,j,l,m} \sqrt{\frac{1}{t} \sum_{k=1}^{t} \left(W_{i,j,k,l,m} - \bar{W}_{i,j,l,m}\right)^2},$$

where $\bar{W}_{i,j,l,m} = \frac{1}{t}\sum_{k=1}^{t} W_{i,j,k,l,m}$ is the mean over the time dimension. ASDT can be applied both to fully convolutional networks (such as I3D) and to spatiotemporal-separable 3D convolutional networks, as it only measures the variance of weights across the temporal dimension. For example, we can similarly compute ASDT for the S3D model (Xie et al., 2017) by taking the weights from its 1D temporal convolutional layers. For more details on the calculation of ASDT for different architectures, please refer to Appendix B.
ASDT characterizes the temporality of 3D convolutional networks at the level of individual layers. To look at the feature maps of the model, we need to consider the individual kernels of the convolution operations. For that purpose we propose a second measurement that aims to estimate which feature maps are more receptive to dynamics in the video and which are more responsible for appearance; in other words, which channels of a layer are temporal and which are non-temporal. Each output channel represents a feature map that reflects a certain visual or dynamical property of the input; it is constructed as a sum of filtered feature maps of the previous layer. We aim to define an individual temporality measure for every 3D filter, such that the filters together represent the temporality of the feature map. We notice that computing the central self-convolution $\mathrm{sc}(\cdot)$ of a filter, or strictly speaking, the sum of Hadamard products of its 2D temporal slices (Horn, 1990), satisfies this requirement, see Fig. 2. Indeed, a fully non-temporal filter consists of a number of identical 2D kernels stacked together and has a high self-convolution value. On the opposite end, filters with high temporal discrepancy have low (even negative) self-convolution values. If $K \in \mathbb{R}^{h \times w \times t}$ is a single kernel of a layer with 2D temporal slices $K_{:,:,k}$, the self-convolution is defined as:

$$\mathrm{sc}(K) = \sum_{k_1 < k_2} \sum_{i,j} \left( K_{:,:,k_1} \odot K_{:,:,k_2} \right)_{i,j},$$

and for any two matrices $A$ and $B$ the Hadamard product is defined as $(A \odot B)_{i,j} = A_{i,j} B_{i,j}$. Without loss of generality, and for simplicity of visualization, we can define a kernel to be temporal if $\mathrm{sc}(K) < \tau$ and static if $\mathrm{sc}(K) > \tau$, where $\tau$ is a threshold parameter. This way, every feature map is constructed by a varying proportion of temporal and static kernels, which allows us to compare different layers of a network based on their distribution of temporal and static feature maps. For demonstration purposes we refer to feature maps whose kernels are in the majority temporal as temporal channels, and vice versa for non-temporal channels.
This way the proposed measure classifies every channel into one of the two categories; we call it the self-convolution measurement.
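A minimal sketch of the self-convolution measure follows, under one concrete reading of sc(.) as the sum of Hadamard products over all pairs of temporal slices; the exact pairing used by the original measure may differ, so treat this as illustrative. The majority vote over a channel's kernels mirrors the channel classification described above.

```python
import numpy as np
from itertools import combinations

def self_convolution(kernel: np.ndarray) -> float:
    """Sum of Hadamard products between the 2D temporal slices of a
    single kernel of shape (t, h, w); one reading of the sc(.) measure."""
    t = kernel.shape[0]
    return float(sum((kernel[a] * kernel[b]).sum()
                     for a, b in combinations(range(t), 2)))

def channel_is_temporal(filters: np.ndarray, tau: float = 0.0) -> bool:
    """A feature map (output channel) counts as 'temporal' when the
    majority of its kernels, shape (c_in, t, h, w), fall below tau."""
    scores = [self_convolution(k) for k in filters]
    temporal = sum(s < tau for s in scores)
    return temporal > len(scores) / 2

static_kernel = np.ones((3, 3, 3))   # identical slices -> high sc
flipped = static_kernel.copy()
flipped[2] *= -1                     # sign flip over time -> negative sc

print(self_convolution(static_kernel))  # 27.0 (3 slice pairs * 9 ones)
print(self_convolution(flipped))        # -9.0
```

A static kernel scores highly positive, while a kernel whose last slice opposes the others scores negative, matching the intuition in the text.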

Analysis of temporal dynamics
Using the two introduced measurements, we investigate the temporality of the 20 layers in I3D with convolutions across the temporal dimension. Fig. 3 visualizes the ASDT as a function of layer depth for the I3D model: we can see a consistent drop when going deeper into the network. We observe similar trends for other 3D convolutional models trained on Kinetics, as shown in Appendix B. This suggests that if sensitivity to the temporal dynamics of an input is reflected in the model parameters, the kernels of the first layers are more receptive to such dynamics. In the architecture of I3D, the number of feature maps grows with the number of layers, hence the absolute amount of temporal information carried by the layers' weights may be constant. The observed decrease of the average temporality might thus still be achieved with a constant number of temporal feature maps, assuming that the extra maps in the deeper layers are predominantly static. To test that hypothesis, we use the second metric to evaluate the feature maps of the layers. For every 3D layer, we classify each channel as temporal if more than half of its kernels are temporal, and as static otherwise, as defined by the self-convolution measurement (we use $\tau = 0$). In Fig. 4, we plot the distribution of the output channels and observe a similar pattern: the number of temporal channels drops towards the later segments of the model. This finding clarifies the ASDT measurement of I3D (the average drop of temporality cannot be explained by the lateral growth of the layers), and both measurements suggest that motion-related features are more likely to be learned in the earlier layers of I3D when trained on the Kinetics dataset. To summarize, the location experiments showed how the I3D model relies on static information when classifying Kinetics actions.
The ASDT and self-convolution metrics show that the average temporality of the I3D layers decreases with depth, and that the absolute number of temporal feature maps also drops with layer depth. In short: I3D is biased towards scenes and static appearance in general when trained on Kinetics.

I3D is not inherently biased towards appearance
The results of the previous sections show that the I3D model trained on Kinetics is biased towards appearance and that temporal dynamics are only marginally relevant, especially in the final 3D layers. In the following section, we show that this bias is not inherent to the I3D model itself, but rather reflects the properties of the training data. For this purpose we need to decouple motion and appearance in the videos, so as to directly control the variability of movement and background information and its effect on the resulting model. To deepen the analysis of temporality in 3D models, we introduce two new datasets with controlled differentiation of motion and appearance.

Decoupled motion-appearance datasets
BrushMNIST is a synthetic dataset that we composed to study the ability of 3D convolutional networks to learn motion and appearance. Variable appearances are achieved by using different instances of the MNIST dataset that move in the video clip, and variable motion patterns are introduced as different drawing trajectories of the MNIST digits. Simply put, we use one digit as a brush and another digit as a trajectory template, the shape. This yields 100 different classes: 10 brushes × 10 shapes. Each sample is a black-and-white video clip of 224 × 224 pixels and 150 frames. In every video clip a random MNIST digit (28 × 28 pixels) moves along a trajectory that corresponds to another random MNIST digit, resized to cover the 224 × 224 frame (see Fig. 5 for examples). The generation of random trajectories from a digit sample is described in detail in Appendix D. Since MNIST consists of 60,000 training samples, the BrushMNIST dataset theoretically consists of 60,000 × 60,000 training samples. Our dataset differs from Moving MNIST (Srivastava et al., 2015), which consists of short, unlabeled video clips of groups of digits moving in straight lines along geometrical trajectories and bouncing off the frame edges. Moving MNIST is used for predicting future trajectories by extrapolating learned representations, whereas BrushMNIST is designed for a video classification task, where the trajectories' shapes define the video categories.
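A toy generator for BrushMNIST-style clips might look as follows. Whether the brush "paint" accumulates across frames, and the exact placement logic, are assumptions of this sketch rather than details taken from the dataset.

```python
import numpy as np

def render_clip(brush, trajectory, size=224):
    """Render a BrushMNIST-style clip: paste the 28x28 `brush` at each
    (x, y) position of `trajectory`, one frame per step. The paint is
    accumulated across frames (an assumption of this sketch)."""
    h, w = brush.shape
    canvas = np.zeros((size, size), dtype=np.float32)
    frames = []
    for x, y in trajectory:
        patch = canvas[y:y + h, x:x + w]
        canvas[y:y + h, x:x + w] = np.maximum(patch, brush)
        frames.append(canvas.copy())
    return np.stack(frames)  # shape: (len(trajectory), size, size)

brush = np.ones((28, 28), dtype=np.float32)
trajectory = [(i * 10, i * 10) for i in range(15)]  # a stand-in diagonal path
clip = render_clip(brush, trajectory)
print(clip.shape)  # (15, 224, 224)
```

In the real dataset both the brush and the trajectory template come from MNIST; here a constant square and a diagonal path stand in for them.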
Places+Faces+MNIST (P+F+M) is similar to BrushMNIST, but uses Places365 images as backgrounds to represent appearance variance, LFW faces as brushes, and MNIST samples as trajectories (see Fig. 6 for examples). We define different instances of the dataset P+F+M$_{A,M}$, where $A$ is the number of place categories and $M$ is the number of trajectories; the total number of classes is $A \times M$. Trajectories beyond 10 are constructed by drawing a number's digits one after another. For better viewing, the datasets and sample GIFs are available at https://github.com/Petr-Byv/Brush_MNIST. The benefit of using these datasets is two-fold: we know exactly how motion and appearance are tied for each sample, and we can control the complexity by balancing dynamics and appearances.

Analysis of temporal dynamics on decoupled datasets
In the following section, we study the behavior of the I3D model on the previously defined decoupled datasets, BrushMNIST and Places+Faces+MNIST. Utilizing the temporality measures, we show how varying the amount of dynamics in the training data affects the model parameters.

BrushMNIST experiments
For the experiments on BrushMNIST we train an I3D model with randomly initialized weights, changing the number of input channels from 3 to 1 to accommodate black-and-white videos.

First, we conduct 3 validation experiments:
• Basic setup - Table 1(a). It shows motion- and appearance-related errors: motion is learned successfully and the majority of errors comes from appearance misclassification. This indicates that the motion patterns of writing digits add information that helps to better recognize BrushMNIST shapes.
• Footprint test - Table 1(b). To clarify the importance of motion information, we removed the temporal information and fed only the footprint of the motion. This modification of BrushMNIST worsened the performance: errors are now equally distributed across brush and shape digits.
• Frame order - Table 1(c). It shows that, opposite to Kinetics, when feeding reversed samples of BrushMNIST the recognition of motion drops to chance level, 11.3% (the brush digits are still recognized correctly).
Fig. 7 shows how the model learns temporal dynamics. Each row shows the footprints of samples with the same shape but different brushes. Red highlights the frames (and therefore individual small numbers) that correspond to the maximum activation in the feature space along the time dimension. This shows the consistency of learned motion patterns across shapes: the model learns to attend to the signature segments of the number shapes, for example the left-to-right stroke in the middle of a '5' or the intersection of the lines of a '4'. Fig. 8 shows the ASDT measure for I3D trained on BrushMNIST. For this dataset the temporality of layers does not continuously drop with depth, which shows that directly infusing dynamics into the training samples changes the temporal responsibility of the model's layers. The appearance bias observed in previous experiments is caused by the training data itself and not by the model architecture.

Places+Faces+MNIST experiments
We can tune the motion and appearance aspects of a P+F+M dataset by changing the number of background classes and the number of motion patterns; the total number of classes is the product of these two numbers. For consistency, we experiment with all combinations of $2^1, \ldots, 2^8$ motion and appearance classes. This way the simplest dataset has 4 classes ($2^1 \times 2^1$) and the most complex has $2^8 \times 2^8 = 65{,}536$ classes. This allows us to answer the question: does I3D scale better with motion or with appearance variability? Table 2 shows the behavior of I3D for different motion/appearance factors of a P+F+M set: we plot the model's accuracy depending on the number of motion and appearance classes. The table shows that the model can deal with the growing number of motion patterns, while, vice versa, performance deteriorates faster when adding more appearance classes. This finding suggests that the I3D model is not intrinsically biased towards appearance, but rather flexible: dynamical motion information is learned to the extent presented by the training dataset.
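The combinatorial grid of this scaling study can be enumerated directly (pure Python; the tuple layout is ours, for illustration):

```python
# Enumerate the P+F+M scaling grid: every pairing of 2^1 ... 2^8
# appearance classes with 2^1 ... 2^8 motion classes.
sizes = [2 ** k for k in range(1, 9)]            # [2, 4, 8, ..., 256]
grid = [(a, m, a * m) for a in sizes for m in sizes]

print(len(grid))                   # 64 dataset instances
print(min(c for _, _, c in grid))  # 4     (2 x 2 classes)
print(max(c for _, _, c in grid))  # 65536 (256 x 256 classes)
```

Each of the 64 instances yields one cell of Table 2.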
We also analyze the temporality of I3D trained on P+F+M compared to training on Kinetics, with the help of the measures proposed in Section 3.1. Fig. 10 shows an example ASDT measure for different layers after training on P+F+M with 16 motions and 8 locations. Compared to I3D trained on Kinetics, we observe a very different picture: the temporality of a layer does not decrease with depth; the opposite trend is clear. Similarly, compared to I3D trained on Kinetics, the self-convolution measure shows a drastic difference: more temporal channels are observed in all layers, with the deeper ones affected the most (Fig. 9). This result demonstrates the utility of the proposed measurements, as they reveal how the nature of the training data affects the temporality of a model's parameters.

Does pre-training on millions of videos resolve the appearance bias in 3D convolutional networks?
So far, we have found that the ASDT profile of a model reflects the bias towards or away from temporality as a function of layer depth in 3D convolutional networks. In the Appendix we also analyze state-of-the-art models pre-trained on Kinetics, such as R(2+1)D, S3D, ip-CSN-152 and I3D; all show similar behavior, with their ASDT temporality dropping with layer depth, as shown in Fig. B.13. We also analyzed HMDB51 (the Human Motion Database), a dataset with a direct focus on the motion aspect. Nonetheless, this motion focus does not resolve the appearance bias: as seen in Fig. 11(a), the temporality drop in the later layers persists. While Kinetics remains the main video dataset for pre-training, a few datasets with millions of video clips have recently been introduced, including HowTo100M (Miech et al., 2019) and IG-65M. 1 They have shown competitive results when used for pre-training deep 3D models. To see whether training on millions of videos alleviates the bias towards static appearance, we analyze the ASDT profiles of 3D models trained on these datasets in Fig. 11. Interestingly, natural video collections can induce a different temporality profile in 3D models: for example, HowTo100M leads to a flat temporality across all layers of the S3D model (Miech et al., 2020). This confirms the authors' assumption that the HowTo100M dataset holds more dynamical information, which makes it stand out from both Kinetics and IG-65M.

Table 2. I3D model performance on Places+Faces+MNIST datasets. We vary the amount of motion M and appearance A (columns and rows, respectively). Values and heatmap correspond to the model's performance for the given combination of motion and appearance classes. I3D performance is more sensitive to the increase of appearance classes; the model effectively handles the growing number of motion variants.
We conclude that the ASDT measure helps to estimate the temporality of future video datasets and to track progress towards more dynamical video representations. Based on the ASDT measures, we recommend pre-training 3D convolutional networks on HowTo100M if an appearance bias needs to be avoided.

Conclusions
In this paper we study the motion and appearance aspects of 3D convolutional models for video analysis. We propose measurements to estimate the temporality of a network's weights, and these measures reveal a common pattern for different models trained on Kinetics: a clear decrease of temporality as a function of layer depth. Moreover, our experiments point to the training data, and not the 3D architecture itself, as responsible for the previously observed appearance/scene bias in video classification solutions. With the proposed decoupled motion-appearance datasets we observe that 3D convolutions adapt to the visual statistics of videos and are powerful enough to effectively recognize complex motion patterns. By modifying the amount of motion in the training data we show the validity of the proposed temporality measures. We believe that the presented results and proposed measures give useful insights to computer vision researchers, which can help in the construction of future video datasets and the engineering of deep video models. We will make the code and measures publicly available.

Fig. 11. ASDT for state-of-the-art video models. Where training on HMDB51 or IG-65M still induces an appearance bias in later layers, this bias is mostly gone when training on HowTo100M. (See Xie et al., 2017; Kuehne et al., 2011; Miech et al., 2020; Tran et al., 2019.)

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Scene bias
We quantified the bias towards scenes in I3D using a source and a target dataset. Kinetics-400 (source) is a collection of approximately 300,000 video URLs from a video hosting service, covering general human action classes; we use the first version of the dataset with 400 classes. Places365 (target) is a collection of 18 million images with 365 scene categories (Zhou et al., 2017).
Both Kinetics and Places365 are built upon extensive language corpora and aim to cover the visual world fully and precisely. For this reason we expect the scene categories from the Places365 dataset to adequately describe the locations of human activities in Kinetics; to test this, we study the relations between actions from the Kinetics dataset and environments from the Places365 dataset.
First, we examine the distribution of Places instances in the activities' samples. Each sample of the video set is a sequence of frames that belongs to a certain activity category from Kinetics. We extract the place information of every frame in that sequence using a VGG model trained on Places365 (Zhou et al., 2017). This way each video sample produces a list of location predictions. Among all the frames we count the most common location prediction and assign it to the video sample. We call the proportion of frames that belong to this most common location the common place score; it can be computed for every video sample in the Kinetics dataset. Fig. A.12 shows the percentile graph of the most common location scores for Kinetics video samples. For example, for 80% of the video samples the common place score is 0.25 or greater, indicating that 25% or more of the frames of each video belong to the same place category. We conclude that location categories for the video clips are reasonably consistent, meaning that from the Places365 perspective the environment of a single Kinetics sample does not usually change. Table A.3 shows the distribution of places over all samples from Kinetics; for this calculation we select the most dominant location category for every video sample. We see that more than 50% of the samples from Kinetics were captured in 16 location categories. Among the most popular are house, discotheque, outdoor cabin, library, barn, junkyard, garden and general store.
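The common place score described above reduces to a majority count over per-frame place predictions; a minimal stdlib sketch (the frame predictions here are stand-ins, not real VGG outputs):

```python
from collections import Counter

def common_place_score(frame_predictions):
    """Return the video-level place label (the most frequent per-frame
    prediction) and the fraction of frames carrying that label."""
    place, count = Counter(frame_predictions).most_common(1)[0]
    return place, count / len(frame_predictions)

# Stand-in per-frame predictions; in the paper these come from a
# Places365-trained VGG model applied to every frame of a clip.
preds = ["kitchen"] * 6 + ["bar"] * 3 + ["garden"]
place, score = common_place_score(preds)
print(place, score)  # kitchen 0.6
```

A clip with score 0.6 would count toward the 80% of Kinetics samples whose score is 0.25 or greater.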
The I3D model performs better on samples with a more exclusive place-activity relation: 56.3% classification accuracy for the most common places versus 63.2% for less common ones. This means that the model can effectively utilize distinctive surroundings to recognize the activity category, while videos that share the same environment are more likely to be confused.
The short study above indicates a relation between activities and scenes. We further tested whether representations learned on actions transfer well to scenes. We trained a classification layer on top of fixed I3D features to classify categories from the Places dataset and compared the performance to that of I3D's 2D backbone, the Inception model. The video model takes a sequence of frames as input, so to obtain a prediction we repeat the frames: every input is a fully static video that represents a single place. We test two training regimes: fixed weights, where all the parameters up to the last convolutional layer are non-trainable, and non-fixed weights, where we train all

B.1. ASDT calculation
In Section 3.1 we define the ASDT, which measures the temporality of the temporal layers. It takes the deviation across the temporal component; for more recent architectures such as ir-CSN or S3D, where the spatial and temporal convolutional components are separated, we consider only the temporal subunits. Algorithm 1 describes the model-agnostic ASDT calculation procedure.

B.2. ASDT for different 3D architectures
We plot the ASDT measure with respect to layer number (considering only layers with a temporal component) for I3D, R3D and ir-CSN-152 in Fig. B.13. The pattern is similar across different models: temporality tends to drop towards later layers. Interestingly, for an S3D trained on the HowTo100M dataset (Miech et al., 2020) the pattern changes (see Fig. 11(b)) towards more temporal deeper layers. We also calculate and show ASDT for ir-CSN-152 trained on the IG-65M and Sports-1M datasets 2 (Figs. 11(c), B.14).

Appendix C. Self-convolution measure
In Section 3.1 we defined the self-convolution measure for fully 3D convolutional layers. It uses the Hadamard product to estimate the agreement between a kernel's 2D faces. Algorithm 2 describes the self-convolution calculation procedure that classifies every feature map of a layer as TEMPORAL or STATIC. Fig. C.15 shows self-convolution profiles for models trained on different PFM instances.

Appendix D. Generating trajectories from MNIST digits

In Section 4.1 we introduce datasets with decoupled motion and appearance. The brush samples of BrushMNIST are sampled directly from the MNIST training set. To generate random classifiable motion patterns, the shapes, we also use MNIST digits as movement templates. For a given random digit we define a function that maps a 28 × 28 pixel image to a list of coordinates $[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)]$, where $n$ is the number of frames in the desired video; the trajectory is represented by an ordered list of 2D vectors $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$. We also define an anchor starting point $(x_0, y_0)$ for every digit from 0 to 9. We used the OpenCV (Bradski, 2000) function skeletonize to generate pixel-wide MNIST shapes. Algorithm 3 describes the procedure for generating the list of pixel positions: