Multi-Cue Gate-Shift Networks for Mouse Behavior Recognition

ABSTRACT Automatic identification of mouse behavior plays an important role in the study of disease or treatment, especially regarding the short-term actions of mice. Existing three-dimensional (3D) convolutional neural networks (CNNs) and two-dimensional (2D) CNNs have different limitations when addressing the task of mouse behavior recognition: 3D CNNs incur a high computational cost, while 2D CNNs cannot capture motion information. To solve these problems, an efficient, low-computation multi-cue gate-shift network (MGSN) was developed. First, to capture motion information, a multi-cue feature switching module (MFSM) was designed to utilize RGB and motion information. Second, an adaptive feature fusion module (AFFM) was designed to adaptively fuse the features. Third, we used a 2D network to reduce the amount of computation. Finally, we performed an extensive evaluation of the proposed modules to study their effectiveness in mouse behavior recognition, achieving state-of-the-art accuracy on the Jiang database and comparable results on the Jhuang database. An absolute improvement of +5.41% over the benchmark gate-shift module was achieved on the Jiang database.


Introduction
Mice are widely used in biomedical science research and their responses to disease or treatment are often measured by recording their behavior patterns. In most cases, the recordings are manually tagged. Annotating mouse recordings manually can be challenging, so having a reliable and automated behavior recognition system to complete the task using computers would be beneficial. With a high-performance system, we can solve the problem of manual annotation and improve efficiency. Several animal motion recognition systems have been proposed as a result of the existing research. These systems are mainly divided into two types: traditional methods based on manual feature extraction and deep learning methods using neural networks.
Several mouse behavior recognition systems are based on the traditional approach of manual feature extraction. Dollár et al. (2005) classified sparse spatio-temporal features to identify mouse behavior. Jhuang et al. (2010) proposed a system for automatically analyzing the behaviors of caged mice. This system combined motion information between adjacent frames with mouse speed and position information, then fed this input to a support vector machine hidden Markov model (SVMHMM) to obtain the classification results. In 2012, another study applied AdaBoost with spatio-temporal and trajectory features to classify mouse behavior (Burgos-Artizzu et al. 2012).
Constructing a complex model from manually designed features can no longer meet the requirements of high precision and speed, but the introduction of deep learning brings a new development direction for animal behavior recognition. For example, Kramida et al. (2016) proposed the use of VGG features and LSTM networks to identify mouse movement. In 2019, embedded networks were used to extract features for rats and scene contexts participating in social behavior events, and LSTM networks were then used for behavior recognition (Zhang, Yang, and Wu 2019). Nguyen et al. (2019) proposed using I3D and R(2+1)D models to address challenges in mouse behavior recognition. These were among the most advanced deep learning models for human action recognition at that time, and they played a significant role in mouse behavior recognition.
Deep neural networks have made significant progress in human action recognition (Feichtenhofer 2020; Feichtenhofer et al. 2019; Tran et al. 2015a; Wang et al. 2016a; Zhu et al. 2017). Temporal modeling is also important for capturing motion information in video for action recognition. Currently, mainstream action recognition methods rely on one of two mechanisms. The more common one learns motion features from RGB frames, either with 3D CNNs (Hara, Kataoka, and Satoh 2018; Karpathy et al. 2014; Stroud et al. 2020; Tran et al. 2015a, 2015b) or implicitly with temporal convolution (Li et al. 2021; Qiu, Yao, and Mei 2017; Tran et al. 2018; Wu et al. 2020; Xie et al. 2018). However, 3D CNNs often require a large amount of computation and perform poorly when sufficiently large datasets are not available. The other mechanism uses a two-stream convolutional network (Carreira and Zisserman 2017; Feichtenhofer, Pinz, and Zisserman 2016; Shi et al. 2019; Simonyan and Zisserman 2014), in which one stream extracts spatial information from RGB frames, while the other stream extracts motion information from optical flow. This method can effectively improve the accuracy of action recognition and performs well on small datasets.
Inspired by human action recognition methods, we applied a human action recognition deep learning model to mouse behavior recognition. This study uses a human action recognition model with the Gate-Shift Module (GSM) (Sudhakaran, Escalera, and Lanz 2020) as the baseline model. The GSM is a lightweight module that can transform a 2D CNN into an efficient extractor of spatiotemporal features. Our network is two-stream and adds two modules: a multi-cue feature switching module (MFSM) and an adaptive feature fusion module (AFFM). MFSM is a feature-switching module for RGB and optical flow; its purpose is to replace useless features with features of the other cue. AFFM adaptively fuses the features after feature switching. After fusion, the features change from two-stream to single-stream; thus, the two-stream convolutional network becomes a single-stream convolutional network again. As a result, the accuracy of behavior recognition can be effectively improved with only a small increase in computation.
The contributions of the proposed method are summarized as follows: (1) We propose a new MFSM that can replace feature maps with those of other cues that have a better effect on the final result for mouse behavior recognition; (2) We propose an AFFM that adaptively fuses the features after feature switching; (3) We perform extensive ablation experiments on the proposed modules to study their effectiveness in mouse behavior recognition; (4) We achieve improved results on the Jiang database and competitive results on the Jhuang database, with only a small increase in parameters and floating-point operations (FLOPs).

Two-Stream Networks
The basic principle of the two-stream model structure is to first calculate the dense optical flow between every two frames in the video sequence to obtain temporal information. A convolutional neural network (CNN) model is then trained on the video frames (spatial stream) and on the optical flow (temporal stream), and the two branches of the network each predict the action category. Finally, the results from the two networks are directly fused to obtain the final classification result. The two-stream convolutional network architecture offers high precision, but it is slow. Feichtenhofer, Pinz, and Zisserman (2016) followed the architecture of two-stream convolutional network fusion for video action recognition. To make better use of the spatiotemporal information from the two-stream model, the authors improved the fusion strategy of the spatiotemporal networks. They proposed five different fusion schemes for the fusion of spatial and temporal networks and three methods for the fusion of temporal networks. Wang, Qiao, and Tang (2015) listed the accuracy of a two-stream network using several of the latest CNN architectures. Wang et al. (2016b) identified two deficiencies in previous research: it accounted only for short-term actions, with an insufficient understanding of the temporal structure of long-term actions, and it relied on small training samples. Therefore, a sparse time-sampling strategy and a video supervision strategy were used. The video was segmented in the time domain and randomly selected segments were used to compensate for the first deficiency, while cross training, regularization, and data expansion were used to compensate for the second deficiency. This network structure is called a Temporal Segment Network (TSN). Building on the successful application of residual networks (ResNet) (He et al. 2016) in deep learning, Feichtenhofer et al. proposed a novel spatio-temporal residual network model, which combines ResNet and a two-stream model (Christoph and Pinz 2017). The temporal and spatial characteristics of behavior are hierarchically learned through residual connections between the spatial and temporal streams.
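The sparse time-sampling idea behind TSN can be illustrated with a short sketch. It assumes that the video is split into equal-length segments and that one frame index is drawn uniformly at random from each segment; the function name `sample_segment_indices` and its parameters are hypothetical, and the original TSN implementation may differ in details such as snippet length.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of `num_segments` equal segments.

    Illustrative sketch only: assumes num_frames >= num_segments.
    """
    rng = rng or random.Random()
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices


# Example: 8 frame indices from an 80-frame clip, one per 10-frame segment.
print(sample_segment_indices(80, 8, random.Random(0)))
```

Because each index comes from a different segment, the sampled frames cover the whole clip while the number of frames processed stays fixed.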
In the early stage of feature extraction, we use a two-stream network that combines an RGB image and an optical flow image. The purpose is to use the complementary advantages of the two cues to conduct mouse behavior recognition, which further improves the accuracy of prediction. After feature fusion, the two-stream network is transformed into a single-stream network, which can effectively control computing cost and the number of parameters, and improve recognition performance.

Feature Fusions
In many studies, the fusion of features from different modalities is an important method for improving accuracy. Combining the features of different modalities can achieve a better recognition effect by exploiting their complementarity. Chaaraoui, Padilla-Lopez, and Florez-Revuelta (2013) proposed a method combining 2D shape-based human pose estimation with bone features. Integrating effective 2D contours and 3D bone features can yield visual features with high discrimination value, and the additional discriminative data provided by the contour can be used to make human action recognition more robust to errors. Sanchez-Riera et al. (2016) combined RGB features with depth features for gesture recognition and general object recognition, then evaluated the two schemes of early and late fusion. Li, Leung, and Shum (2016) proposed a multi-feature sparse fusion model that extracts multiple features of human body parts from skeleton and depth data. When sparse regularization is used to automatically identify the feature structure of key parts, the learned weighted features are more discriminative for multi-task classification. Chen, Jafari, and Kehtarnavaz (2014) extracted depth image features and RGB video features of human actions using a depth camera and an inertial body sensor to evaluate two recognition frameworks: feature-level and decision-level fusion.
In the current paper, we propose an effective AFFM that enables the network to directly learn how to filter the features of different modalities so that only useful information is retained for combination. At each spatial location, the features of the different modalities are adaptively fused: some features may be filtered out because they carry conflicting information at that location, while others may dominate.

Methods
In this section, we present Multi-cue Gate-Shift Networks (MGSNs) for mouse behavior recognition, which include two modules: MFSM and AFFM. We first introduce the two modules and then outline how they are integrated into MGSNs.

Multi-Cue Feature Switching Module
The MFSM relies on batch normalization (BN) layers (Ioffe and Szegedy 2015), so we introduce the BN layer first. The function of the BN layer is to enhance generalization and speed up network training and convergence. We use $x_{m,c}$ to represent the $c$-th feature map in the $m$-th feature network. After normalization in the BN layer, $x_{m,c}$ undergoes an affine transformation:

$$x'_{m,c} = \gamma_{m,c} \frac{x_{m,c} - \mu_{m,c}}{\sqrt{\sigma_{m,c}^{2} + \epsilon}} + \beta_{m,c}, \qquad (1)$$

where $\mu_{m,c}$ and $\sigma_{m,c}$ represent the mean and standard deviation over all pixel positions of feature map $c$ in the $m$-th feature network, respectively; $\gamma_{m,c}$ and $\beta_{m,c}$ are the trainable scaling factor and offset, respectively; and $\epsilon$ is a small constant that prevents division by zero. The factor $\gamma_{m,c}$ evaluates the correlation between $x'_{m,c}$ and $x_{m,c}$ during training. If $\gamma_{m,c} \to 0$, the gradient of the loss with respect to $x_{m,c}$ approaches zero, which means that $x_{m,c}$ loses its influence on the result and becomes redundant. Inspired by Wang et al. (2020), we replace the feature maps that have small $\gamma_{m,c}$ with those of other feature networks, because these feature maps would otherwise lose their influence on the result and become redundant. To this end, we propose the following formula:

$$x'_{m,c} = \begin{cases} \gamma_{m,c} \dfrac{x_{m,c} - \mu_{m,c}}{\sqrt{\sigma_{m,c}^{2} + \epsilon}} + \beta_{m,c}, & \text{if } \gamma_{m,c} > \theta, \\[2ex] \dfrac{1}{M-1} \sum\limits_{m' \neq m} \left( \gamma_{m',c} \dfrac{x_{m',c} - \mu_{m',c}}{\sqrt{\sigma_{m',c}^{2} + \epsilon}} + \beta_{m',c} \right), & \text{otherwise,} \end{cases} \qquad (2)$$

where $M$ is the number of feature networks (cues). In Equation (2), if the scaling factor $\gamma_{m,c}$ of feature map $c$ is less than the threshold $\theta$ (we set $\theta$ to 1e-2 in the experiments), the current feature map $c$ is replaced with the average of feature map $c$ of the other feature networks. In other words, if one feature map of a cue loses its influence on the result, it is replaced by the average of the corresponding feature maps of the other feature networks. In our implementation, we apply the above formula during feature extraction, and each cue switches feature maps after convolution and nonlinear activation. We represent the scaling factor that must be switched as $\gamma_{m',c}$ and apply a sparsity constraint on $\gamma_{m',c}$ to avoid unnecessary switching. This not only enables the replacement of useless feature maps, but also avoids useless switching.

We divide the entire set of feature maps into $M$ equal sub-parts and, for each cue, only perform feature switching in a different sub-part. We denote the scaling factors that can be replaced by $\gamma_m$. The switching in Equation (2) is therefore a directed process within only one sub-part of the feature maps, which ideally not only retains cue-specific propagation in the other $M - 1$ sub-parts, but also avoids unavailing switching. Figure 1 illustrates the feature-switching process.
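The switching rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the BN outputs have already been computed, represents each feature map as a flat list of pixel values, and assumes that cue m may only switch channels in its own contiguous sub-part of the channel range. The name `switch_features` and the data layout are hypothetical.

```python
def switch_features(bn_out, gammas, theta=1e-2):
    """Replace low-gamma feature maps with the mean of the other cues' maps.

    bn_out[m][c] : BN output of cue m, channel c, as a flat list of pixel values.
    gammas[m][c] : BN scaling factor of cue m, channel c.
    Assumption: cue m may only switch channels inside its own contiguous
    sub-part, i.e. channels [m * C / M, (m + 1) * C / M), and M >= 2.
    """
    num_cues = len(bn_out)
    num_channels = len(bn_out[0])
    part = num_channels // num_cues
    result = []
    for m in range(num_cues):
        cue_maps = []
        for c in range(num_channels):
            in_own_part = m * part <= c < (m + 1) * part
            if gammas[m][c] > theta or not in_own_part:
                cue_maps.append(list(bn_out[m][c]))  # keep the cue's own map
                continue
            # Average the same channel over all other cues (original BN outputs).
            others = [bn_out[k][c] for k in range(num_cues) if k != m]
            cue_maps.append([sum(vals) / len(others) for vals in zip(*others)])
        result.append(cue_maps)
    return result


# Two cues (RGB, optical flow), two channels, two pixels per feature map.
rgb = [[1.0, 2.0], [3.0, 4.0]]
flow = [[10.0, 20.0], [30.0, 40.0]]
switched = switch_features([rgb, flow], [[0.001, 0.001], [0.5, 0.001]])
```

In this toy example, channel 0 of the RGB cue and channel 1 of the optical-flow cue fall below the threshold inside their own sub-parts, so each is replaced by the corresponding map of the other cue; the low-gamma RGB channel 1 is left untouched because it lies outside the RGB sub-part.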

Adaptive Feature Fusion Module
Figure 1. An illustration of our multi-cue fusion strategy. The sparsity constraints on scaling factors are applied to disjoint regions of different cues. If a feature map's scaling factor is lower than the specified threshold, the feature map will be replaced by that of other cues at the same position.

After reviewing different feature fusion methods, we adopted the adaptive feature fusion method (Liu, Huang, and Wang 2019) to design our AFFM. In contrast to previous fusion methods based on element-wise summation or concatenation, in our AFFM, for each pixel $(i, j)$ on the fused feature map, the weights of the features from each cue at that pixel position are adaptively learned.

Let $x^{m}_{ij}$ be the feature vector at pixel position $(i, j)$ of the feature map of the $m$-th cue. The feature fusion method proposed in this study is as follows:

$$y_{ij} = \alpha_{ij} \, x^{1}_{ij} + \beta_{ij} \, x^{2}_{ij}, \qquad (3)$$

where $\alpha_{ij}$ and $\beta_{ij}$ represent the normalized weight coefficients of the two cues when the features are adaptively fused at pixel position $(i, j)$ of the fused feature map. The value $y_{ij}$ of the feature vector after fusion at pixel position $(i, j)$ can be calculated using Equation (3). In our method, a 1 × 1 convolution layer was added after the feature maps of the two cues, and two convolution maps were obtained. The values $a_{ij}$ and $b_{ij}$ at pixel position $(i, j)$ in the two convolution maps were taken as the weight coefficients of the features of the two cues and then normalized by the softmax function. Inspired by Wang, Wang, and Lin (2019), we force each weight coefficient to lie in the interval $[0, 1]$ and the weight coefficients to sum to one, that is, $\alpha_{ij} + \beta_{ij} = 1$ and $\alpha_{ij}, \beta_{ij} \in [0, 1]$. The normalized weight coefficients $\alpha_{ij}$ and $\beta_{ij}$ for the fusion of the two cues are therefore given by:

$$\alpha_{ij} = \frac{e^{a_{ij}}}{e^{a_{ij}} + e^{b_{ij}}}, \qquad (4)$$

$$\beta_{ij} = \frac{e^{b_{ij}}}{e^{a_{ij}} + e^{b_{ij}}}. \qquad (5)$$

With this method, $\alpha_{ij}$ and $\beta_{ij}$ can be learned through standard backpropagation, so that the features of the two cues are adaptively aggregated at each pixel position. The AFFM is shown in Figure 2.
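The per-pixel fusion described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: feature maps are H × W × C nested lists, each cue has its own single-output 1 × 1 convolution (modeled as a per-pixel dot product with hypothetical weight vectors `w1` and `w2`), and the two resulting logits are normalized with a two-way softmax before the weighted sum.

```python
import math


def conv1x1(feat, weight, bias=0.0):
    """1 x 1 convolution with one output channel on an H x W x C nested list."""
    return [[sum(w * v for w, v in zip(weight, px)) + bias for px in row]
            for row in feat]


def adaptive_fuse(x1, x2, w1, w2):
    """Fuse two cues pixel by pixel with softmax-normalized weights."""
    a = conv1x1(x1, w1)  # logits for cue 1
    b = conv1x1(x2, w2)  # logits for cue 2
    fused, alpha_map = [], []
    for i in range(len(x1)):
        fused_row, alpha_row = [], []
        for j in range(len(x1[i])):
            top = max(a[i][j], b[i][j])  # subtract the max for numerical stability
            ea = math.exp(a[i][j] - top)
            eb = math.exp(b[i][j] - top)
            alpha = ea / (ea + eb)
            beta = eb / (ea + eb)  # alpha + beta == 1 by construction
            fused_row.append([alpha * u + beta * v
                              for u, v in zip(x1[i][j], x2[i][j])])
            alpha_row.append(alpha)
        fused.append(fused_row)
        alpha_map.append(alpha_row)
    return fused, alpha_map


# One pixel, two channels: equal logits give an equal-weight average.
fused, alpha = adaptive_fuse([[[1.0, 0.0]]], [[[0.0, 1.0]]], [0.0, 0.0], [0.0, 0.0])
```

In a trained network the convolution weights would be learned by backpropagation; here they are fixed inputs so that the weighting behavior is easy to inspect.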

Overview
MGSNs use MFSM and AFFM to perform feature map switching and adaptive fusion of the features of the two cues, so that the complementary advantages of the two cues can be fully utilized. The output features are then recognized using Gate-Shift Networks. Thus, the output of the GSM can be viewed from the perspectives of spatiotemporal information, channel interdependence, and motion-sensing information. Figure 3 shows the MGSN architecture for InceptionV3.

The Architecture of Multi-Cue Gate-Shift Networks
We used TSN as the reference architecture for behavior recognition; it uses a C2D backbone and performs temporal pooling of frame-level features. We chose BN-Inception and InceptionV3 as the backbone options for TSN, but we made a few modifications to the feature extraction part at the front of the backbone: we changed the input to a two-stream input of RGB and optical flow and inserted the MFSM and AFFM. Subsequently, we inserted GSM into the backbone.

Algorithm Pseudocode
Algorithm 1 Multi-cue Feature Switching
Require: The whole feature maps of a cue F, the weight parameter γ of the BN layer, and the threshold θ that decides whether to switch
1: while traversing F do
2:   if γ > θ then
3:     Perform the affine transformation for the part of F that satisfies γ > θ, based on Equation (2)
4:   else
5:     Perform multi-cue feature switching for the remaining part of F, based on Equation (2)
6:   end if
7: end while
8: return F

Algorithm 2 Adaptive Feature Fusion
Require: The whole feature maps of two cues F_1, F_2
1: Pass the two cues through a convolution layer, a batch normalization layer, and an activation function to obtain weight_1 and weight_2
2: Concatenate weight_1 and weight_2 along the channel dimension to obtain weight_v
3: Reduce the number of channels of weight_v through a 1 × 1 convolution to obtain weight
4: Apply the softmax function along the second dimension, based on Equations (4) and (5), to get α and β in Equation (3)
5: Perform adaptive feature fusion based on Equation (3) to obtain F′
6: return F′

Datasets
Our paper uses two datasets, namely, the Jhuang et al. (2010) and Jiang et al. (2018) datasets. The Jhuang dataset includes eight behavioral categories: drink (drink from the water supply), eat (take food from the feeding door), groom (the mouse combs its fur), hang (the mouse hangs on the top of the cage), head (slight movement of the limbs or head), rear (standing position, forelimb off the ground), rest (the mouse stays stable or sleeps), and walk (the mouse walks or runs in the cage). In addition to the above public data set, the Jiang dataset was also used, which included six behavior categories: dig (lift wood chips with forelimbs or head), eat (the rat gets food from the food box), groom (forelimbs sweep across the face or torso), rear (standing position, forelimbs off the ground), head (slight movement of limbs or head), and walk (movement). Sample video frames from the Jiang and Jhuang databases are shown in Figure 4. The number of frames in the two datasets is shown in Figure 5.

Implementation Details
In our experiments, BN-Inception and InceptionV3 were chosen as the CNN backbones. MFSM and AFFM were added to BN-Inception and InceptionV3. All models used for the comparisons were initialized using ImageNet pretrained weights. We trained the entire network end-to-end using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01 and momentum of 0.9. A cosine learning rate schedule was used to adjust the learning rate. The network was trained for 100 epochs on the Jiang and Jhuang databases. The first 10 epochs were used for gradual warmup. The batch size was 16 for both databases. The classification layer for both databases applies dropout at a rate of 0.5. We applied random scaling, cropping, and flipping to augment data during training. The dimensions of the input were 224 × 224 for BN-Inception and InceptionV3. We used the center crop during inference.
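The learning rate settings above can be made concrete with a small sketch. The text gives the initial learning rate (0.01), the number of epochs (100), and the warmup length (10 epochs), but not the exact form of the schedule, so the linear warmup, the per-epoch update, and the decay to zero below are assumptions; the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.01, total_epochs=100, warmup_epochs=10):
    """Learning rate for a zero-based epoch index.

    Assumed shape: linear warmup up to base_lr, then cosine decay toward zero.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Print the rate at a few epochs to see the warmup and the decay.
for epoch in (0, 9, 10, 55, 99):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate rises to 0.01 by the end of warmup, is halved at the midpoint of the cosine phase, and is close to zero in the last epoch.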

Descriptions of Existing Methods Used for Comparison
We compare our proposed MGSN with other methods that use motion information or temporal modeling. Results are shown in Tables 1 and 2. These methods all use ResNet as the backbone and 16 frames as input. TDN was proposed to extract multi-scale temporal information. ACTION-Net proposed a plug-and-play ACTION module that can extract appropriate spatio-temporal patterns, channel-wise features, and motion information to recognize actions. The Temporal Adaptive Module (TAM) (Liu et al. 2021) proposes an adaptive temporal modeling method, while the Temporal Excitation and Aggregation (TEA) block proposes to use both short- and long-range information. These methods use motion information and are among the most advanced methods available.

Table 1 shows the performance comparison between MGSNs and the most advanced methods on the Jiang database, together with the accuracy obtained with different backbones. Eight frames were used as input in the experiment. We ran the behavior recognition methods listed in Table 1 on the Jiang database and compared them with our method.

Table 1. Comparison to state-of-the-art accuracy in the Jiang database (red text denotes the best accuracy, blue the second-best accuracy, and green the third-best accuracy).

As can be seen in the confusion matrix in Figure 6(b), BN-Inception is more accurate in all categories except for the main category. As shown in Table 1, MGSNs have a maximum absolute gain of 5.41% (83.11% vs. 77.70%) over the baseline GSM. Under the same settings, the two backbones show different degrees of gain. The top three recognition accuracies in the table belong to MGSNs that use different modules. In addition, a state-of-the-art recognition accuracy of 83.11% was achieved with InceptionV3, which is a larger network than BN-Inception. The TDN, PAN, ACTION-Net, and TAM methods use ResNet-50 as the backbone. TDN achieved good recognition accuracy: with ResNet-101, it reached the previous highest accuracy of 81.31%, which is higher than the accuracy of GSM based on BN-Inception (80.79%). However, the accuracy of our MGSN based on BN-Inception exceeded that of all previous methods, at 82.70%. The recognition accuracy of our MGSN based on InceptionV3 was higher still, reaching 83.11% and exceeding that of all the current methods. Our method also requires much less computation than the aforementioned methods and can obtain good results at low computational cost.

Table 2 shows the performance comparison between MGSNs and the most advanced methods on the Jhuang database. We trained the network using eight frames and sampled two clips. We ran the behavior recognition methods listed in Table 2 on the Jhuang database and compared them with our method. Among the methods in Table 2, our method attained a high degree of accuracy (marked in green as the third-best accuracy). Because all of the methods showed very high accuracy on the Jhuang database, our method has no obvious advantage, but it still exceeds many of the most advanced methods available. Our method is only lower than TDN and ACTION-Net. The reason our method is lower is that the FLOPs of our method are much lower than those of TDN and ACTION-Net. Our method has a significant advantage in terms of computation and can achieve good results with low computation. The confusion matrices for the two backbones are shown in Figure 6(c,d).

Table 2. Comparison to state-of-the-art accuracy in the Jhuang database.

Method                                                   Backbone       FLOPs (G)   Accuracy
TDN (CVPR2021)                                           ResNet-50      36.00       98.90%
PAN (TIP2020)                                            ResNet-50      35.70       95.62%
ACTION-Net (CVPR2021)                                    ResNet-50      34.75       98.91%
TAM (ICCV2021) (Liu et al. 2021)                         ResNet-50      82.00       97.18%
TDN (CVPR2021)                                           ResNet-101     66.00       98.03%
TAM (ICCV2021) (Liu et al. 2021)                         ResNet-101     82.00       97.81%
TEA (CVPR2020)                                           Res2Net-50     35.00       98.12%
GSM (CVPR2020) (Sudhakaran, Escalera, and Lanz 2020)     InceptionV3    26.82       98.44%
GSM (CVPR2020) (Sudhakaran, Escalera, and Lanz 2020)     BN-Inception   16.56       98.28%
SFV-SAN pipeline (Jiang et al. 2018)                     N/A            N/A         96.50%
JHuang (Jhuang et al. 2010)

Ablation Studies
In this section, we summarize the ablation analysis performed on the Jiang database. Exploration studies were performed on the Jiang database to investigate whether MFSM and AFFM showed performance improvement from the baseline GSM. The specific implementation details are described in Section 5.2.

Study on the Impact of Different Backbones
Different backbones were used to explore their impact. For overall accuracy, InceptionV3 performed best when the two modules were inserted together, resulting in a 0.41% higher accuracy than BN-Inception and a 5.41% absolute gain over the baseline GSM. Table 4 presents the accuracy, parameters, and FLOPs of the two backbones. InceptionV3 has better accuracy but is a larger model.

Exploring Whether to Insert MFSM and AFFM
We then compared the performance improvement obtained by inserting the MFSM and AFFM into InceptionV3 and BN-Inception. Table 4 shows the ablation results. The baseline was the standard GSM architecture, with accuracies of 77.70% and 80.79% for InceptionV3 and BN-Inception, respectively. Inserting MFSM improved the recognition performance by 2.37% and 0.18% for InceptionV3 and BN-Inception, respectively. Inserting AFFM improved the recognition performance by 2.03% and 0.97%, respectively. The final model, in which both MFSM and AFFM are inserted into InceptionV3 and BN-Inception, reached recognition accuracies of 83.11% and 82.70%, that is, +5.41% and +1.91% absolute improvements over the GSM baseline. This came with only 0.04% and 5.2% overhead in parameters and complexity, respectively, for InceptionV3, and similarly with only 0.1% and 7.0% overhead in parameters and complexity, respectively, for BN-Inception.

Table 5 reports the comparison of our AFFM with three methods using the same backbone: addition, concatenation, and self-attention. For a fairer comparison, all experiments were conducted under the same experimental conditions, and the three methods were inserted at the same location. Our method outperformed the other fusion methods in accuracy. While self-attention attains the closest performance to our method (82.56% vs. 83.11%), our method has fewer fusion parameters and calculations.

Table 3 shows the per-class and overall recognition accuracy on the Jiang database. As can be seen in the BN-Inception column in Table 3, dig, eat, and walk have the best precision when MFSM and AFFM are inserted simultaneously, while head has the best precision when only AFFM is inserted. This is because the MFSM is more sensitive to motion information, while the head behavior is relatively static, so AFFM alone is more accurate. As can be seen in the InceptionV3 column in Table 3, inserting our modules resulted in better precision for all classes except groom and walk. Our modules improve the recognition of most categories of behavior. The benchmark GSM was more accurate for the groom and walk classes; with our modules, accuracy was lower for these two classes. We tallied the predictions and found that both classes were most often misidentified as head. By analyzing our modules and the prediction results, we conclude that the motion information in the optical flow has a significant influence on the MFSM. Because the motion information of groom, walk, and head is very similar, the recognition of these two classes is not as good as that of the baseline GSM. On the Jiang database, the recognition performance of most classes was improved, and inserting both modules jointly performed better than inserting either module alone, which again supports the inference of a synergistic effect.
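The absolute gains quoted in this section follow directly from the reported accuracies, as the short check below shows. It only uses the numbers stated in the text; the dictionary names are hypothetical.

```python
# Accuracies (%) on the Jiang database as reported in the text.
baseline_gsm = {"InceptionV3": 77.70, "BN-Inception": 80.79}
full_mgsn = {"InceptionV3": 83.11, "BN-Inception": 82.70}

# Absolute gain of the full model over the GSM baseline, per backbone.
gains = {name: round(full_mgsn[name] - baseline_gsm[name], 2)
         for name in baseline_gsm}

# Accuracy gap between the two backbones for the full model.
backbone_gap = round(full_mgsn["InceptionV3"] - full_mgsn["BN-Inception"], 2)

print(gains, backbone_gap)
```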

Conclusion
In this study, we proposed an MGSN for mouse behavior recognition. The core contribution of the MGSN is the inclusion of MFSM and AFFM, which make full use of the complementary advantages of the two cues with little overhead. We performed an extensive evaluation to study the MGSN's effectiveness in mouse behavior recognition, achieving state-of-the-art accuracy on the Jiang database and competitive results on the Jhuang database. When MFSM and AFFM were inserted into the GSM baseline for InceptionV3, an absolute gain of +5.41% in recognition accuracy was obtained on the Jiang database with only 0.1% and 7.0% overhead in parameters and FLOPs, respectively.

Disclosure statement
No potential conflict of interest was reported by the author(s).