A spatio-temporal attention fusion model for students behaviour recognition

Student behaviour analysis can reflect students' learning situation in real time, which provides an important basis for optimizing classroom teaching strategies and improving teaching methods. An important task for the smart classroom is to explore how to use big data to detect and recognize student behaviour. Traditional recognition methods have defects such as low efficiency, blurred edges and long running time. In this paper, we propose a new student behaviour recognition method based on a spatio-temporal attention fusion model. It makes full use of the key spatio-temporal information of the video and thereby addresses the problem of spatio-temporal information redundancy. Firstly, the channel attention mechanism is introduced into the spatio-temporal network, and the channel information is calibrated by modeling the dependency relationship between feature channels, which improves the expression ability of the features. Secondly, a temporal attention model based on a convolutional neural network (CNN) is proposed, which uses fewer parameters to learn the attention score of each frame and focuses on the frames with obvious behaviour amplitude. Meanwhile, a multi-spatial attention model is presented to calculate the attention score of each position in each frame from different angles, extract several saliency areas of behaviour, and fuse the spatio-temporal features to further enhance the feature representation of the video. Finally, the fused features are input into the classification network, and the behaviour recognition results are obtained by combining the two output streams according to different weights. Experimental results on the HMDB51 and UCF101 datasets and on eight typical classroom behaviours of students show that the proposed method can effectively recognize the behaviours in videos. The accuracy on HMDB51 is higher than 90%, and the accuracies on UCF101 and on the real data are also higher than 90%.


Introduction
Artificial intelligence technology and big data technology have promoted the transformation of the modern education system [1,2]. Adaptive personalized learning driven by artificial intelligence is the most promising application scenario in the field of education. As the main place of classroom teaching in colleges and universities, the multimedia classroom has been gradually upgraded to the smart classroom. The classroom is also the main battlefield of "golden course" construction, in which teachers play a decisive role. How to carry out fusion innovation, how to effectively improve the quality of "golden course" construction, and how to effectively analyze and evaluate dynamically generated classroom teaching data have attracted wide attention from education experts and front-line teachers. At present, research focuses on the theoretical analysis, technical application and value discussion of the dynamically generated content. There is little research on the recording of teaching and learning data, data analysis and teaching application of the dynamically generated content. The key difficulty of such research lies in the automatic detection and recognition of students' classroom behaviour.
Behaviour recognition [3] has been widely used in many fields, such as video surveillance, smart home, video retrieval and intelligent human-computer interaction. Video is characterized by complex environments and large variations in viewing angle and human behaviour, so the feature representation of a video contains a lot of redundant spatio-temporal information. Therefore, it is very important for behaviour recognition to effectively utilize the information of key areas on the frames with obvious behaviour amplitude in the video.
Behaviour recognition methods for video can be divided into traditional methods [4,5] and deep learning-based methods [6,7]. Traditional methods have made some progress in the field of behaviour recognition, but they rely heavily on hand-crafted feature design, and the generalization ability of the algorithms is insufficient. Deep learning-based methods can automatically learn the features of videos for classification. In particular, the dual-stream method [8] can effectively combine the spatio-temporal information in videos and has relatively better performance. Dai et al. [9] proposed the dual-stream model for the first time, which feeds a single-frame image and multi-frame dense optical flow images into the spatial stream and the temporal stream, respectively, and then fuses and classifies the features of the two streams. Wang et al. [10] proposed the temporal segment network (TSN), using sparse sampling and video-level supervision strategies to further improve the recognition accuracy. However, the dual-stream method cannot effectively utilize the key spatio-temporal information of the video, and it ignores the information difference between channels when extracting video features. In order to obtain the information of saliency regions in the video, references [11,12] used object detection or posture estimation to extract multiple key regions or body parts in the video, and then input them into the network for behaviour recognition. However, performing object detection or posture estimation in advance increases the overall calculation cost. Moreover, the quality of the detection and estimation results affects the recognition performance.
The behaviour recognition method based on the attention mechanism [13] can automatically learn the key information in the video. Hu et al. [14] designed a channel attention network to model features from channels to highlight key channel information. Sharma et al. [15] proposed the spatial attention model to highlight the saliency areas in each frame. Du et al. [16] used a temporal attention model designed with a recurrent neural network (RNN) to assign corresponding weights to different frames, which could effectively utilize the key frames of the video. Yang et al. [17] used bidirectional LSTM to design a spatio-temporal attention model. The above methods have the following deficiencies:
a) The temporal attention model designed with RNN or LSTM has many parameters. RNN has a fixed serial structure, so video frames must be processed in temporal order, and the recognition efficiency is low.
b) When extracting spatial saliency information, using only one spatial attention model to extract multiple behaviour regions of a frame leads to inaccurate information in the extracted regions.
To solve the above problems, this paper proposes a new student behaviour recognition method based on a spatio-temporal attention fusion model. The main contributions of this paper are as follows.
1) Channel attention is integrated into the spatio-temporal network, and the channel information of the features is recalibrated while considering the spatio-temporal features, which enhances the expression ability of the features.
2) A temporal attention model based on CNN is proposed to focus on the most discriminative frames in the temporal domain. Like the RNN-based temporal attention model, it calculates the attention score of each frame along the temporal dimension of the video, but it has fewer parameters and a smaller calculation cost. It can process multiple frames in parallel and improves the overall operation efficiency.
3) A multi-spatial attention model is proposed, which uses multiple models to learn the weights within each frame from different angles and to obtain multiple discriminative behaviour regions, which reduces the interference of background information.
4) The temporal and spatial features are fused to further enhance the feature representation of the video. Experimental results on the UCF101 and HMDB51 datasets and on eight typical classroom behaviours of students show that the proposed model is an end-to-end and efficient behaviour recognition model.

Spatio-temporal attention mechanism for behaviour recognition
A video can be regarded as a combination of a spatial part and a temporal part. Spatially, the RGB images contain the appearance information of the scenes and objects. Temporally, the optical flow images contain the behaviour information of the objects. In this paper, the appearance flow with RGB images and the behaviour flow with optical flow images are used as the design basis. A new behaviour recognition model is proposed to enhance the feature representation, distinguish the features of different channels, and focus on the multiple saliency areas of behaviour in the frames with strong discriminative power, so as to realize behaviour recognition. The overall structure of the proposed recognition model is shown in figure 1. In order to obtain appropriate input fragments, the new model performs sparse sampling on the video. The implementation is as follows: the video is divided into N segments at equal intervals, one frame is sampled randomly from each segment, and the RGB image and optical flow image are input into the spatio-temporal network.
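The sparse sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sparse_sample` and the handling of segment boundaries by integer division are assumptions.

```python
import random


def sparse_sample(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of num_segments equal-length segments.

    Hypothetical helper: segment boundaries are computed by integer division,
    which is an assumption, not a detail given in the text.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers the frame indices [start, end).
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # A 300-frame video divided into N = 6 segments gives six frame indices.
    print(sparse_sample(num_frames=300, num_segments=6, rng=random.Random(0)))
```

Each returned index would then select one RGB frame and the corresponding optical flow image as network input.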

SE-BN-Inception module
Multi-channel feature vectors are generated when the features of video frames are extracted using convolutional networks. Each channel of the vector describes the current frame in a specific way, and different channels carry information of varying importance. However, previous deep learning-based feature extraction methods ignored the differences between channels, resulting in poor feature representation capability. The channel attention mechanism can learn the importance of each feature channel, strengthen the channel features that are useful for the current recognition according to their importance, and suppress the channel features with weak recognition power. This paper introduces the channel attention network SE-net (Squeeze-and-Excitation network) into BN-Inception [18]. The resulting SE-BN-Inception module calibrates the information of different channels and enhances the expression ability of the video features. SE-net is shown in figure 2.
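As a rough illustration of channel recalibration in the squeeze-and-excitation style, the following NumPy sketch pools each channel to one value, passes the result through two fully connected layers (ReLU, then sigmoid) and rescales the channels. The function name, the random weights and the 7×7 map size are assumptions for illustration only; the reduction ratio of 16 follows the description of SE-net in this paper.

```python
import numpy as np


def se_recalibrate(x, w1, b1, w2, b2):
    """Channel recalibration for one feature map x of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    reduction ratio. All weights here are illustrative placeholders.
    """
    z = x.mean(axis=(1, 2))                     # squeeze: one value per channel
    h = np.maximum(w1 @ z + b1, 0.0)            # first FC layer followed by ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # second FC layer followed by sigmoid
    return x * s[:, None, None], s              # rescale every channel by its weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    channels, reduction = 32, 16
    hidden = channels // reduction
    x = rng.standard_normal((channels, 7, 7))
    w1 = rng.standard_normal((hidden, channels)) * 0.1
    w2 = rng.standard_normal((channels, hidden)) * 0.1
    out, weights = se_recalibrate(x, w1, np.zeros(hidden), w2, np.zeros(channels))
    print(out.shape, weights.shape)
```

In a trained network the two weight matrices would be learned; here they only demonstrate the shapes and the order of operations.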

Spatio-temporal attention module
The spatio-temporal attention module is composed of a CNN-based temporal attention model [19], a multi-spatial attention model and the fusion of spatio-temporal features. The temporal attention model and the multi-spatial attention model focus on key frames and multiple saliency behaviour regions from the temporal and spatial dimensions of the video, respectively. The fusion of spatio-temporal features effectively combines the extracted key spatio-temporal information, further enhances the feature representation of the video, and improves the accuracy of behaviour recognition.

CNN-based temporal attention model
Behaviour is a process of constant change. Different frames in a video contribute differently to behaviour recognition, so the frames with rich information and obvious behaviour changes should be selected for classification. The temporal attention model can give more attention to the key frames. However, previous temporal attention models are designed and implemented based on RNNs, which have many network parameters and complex structures and cannot be parallelized over time.

In the notation used for the video features, $N$ represents the number of frames selected from the video, $C$ represents the feature dimension, and $H \times W$ represents the number of grid cells of the feature map. The feature vector $x_i$ of the $i$-th frame of the video is first linearly mapped through the fully connected layer, giving the mapped feature $\hat{x}_i$. The linear mapping of the frames of the same video uses the same parameters, as shown in equation (1).
Here conv represents the convolution operation.
It considers the importance of each selected frame in the video.

Multi-spatial attention model
A video consists of sequential images, and each frame can be divided spatially into regions with salient behaviour and other regions. In behaviour recognition videos, the saliency behaviour areas are usually the moving parts of the human body and the positions of the moving objects. For example, the behaviour of drinking water can be accurately recognized by using the features of the arm, the head area and the cup. Therefore, the focus should be placed on the areas with significant behaviour in each frame. Generally, object detection [20], posture estimation [21] and other methods are used to extract the information of key regions for behaviour recognition, which results in a large workload and a complex implementation.
The spatial attention mechanism can solve the above problems. However, in references [22,23], only one spatial attention model is used to extract the information of different saliency regions, and some of the extracted saliency regions are inaccurate. In order to accurately extract the spatial information of the different regions of the frame that interact with the behaviour, this paper proposes a multi-spatial attention model, whose structure is shown in figure 4. Here $w_2$, $w_3$, $b_2$ and $b_3$ are the learnable parameters of the network. The convolution kernel of the second convolution layer has a size of 5×5 and a stride of 1.
Here $l$ denotes the number of spatial attention models.

Spatio-temporal feature fusion
Spatio-temporal feature fusion is used to judge the categories of human behaviours by combining the temporal and spatial features extracted from the video. The fused spatio-temporal features represent the changes in the saliency areas of behaviour on the key frames, which further enhances the expression ability of the features and allows more accurate behaviour recognition. For example, when playing golf, the frames with an obvious swing will receive more attention through the temporal attention model. Through the spatial attention model, the arm, golf club, ball and other key areas are extracted. The spatial and temporal features can then be fused to focus on several saliency motion areas on the frames with an obvious swing action, so as to better recognize the behaviour. The fusion of features is shown in figure 5. For each video, $l$ spatial features $f_s^j$ and one temporal feature $f_t$ are obtained. First, each spatial feature is mapped to the temporal feature. That is, $l$ features $F_j$ ($j = 1, \dots, l$) are obtained by adding each spatial feature $f_s^j$ of the video to the temporal feature $f_t$ of the video. Then these $l$ features are concatenated to obtain the spatio-temporal feature $F$ of the video, $F = \mathrm{concate}(F_1, \dots, F_l)$, where concate denotes the concatenation operation.
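The fusion step described above (add the temporal feature to each of the l spatial features, then concatenate the l sums) can be sketched in a few lines of NumPy. The function name `fuse_spatio_temporal` and the toy feature values are assumptions for illustration.

```python
import numpy as np


def fuse_spatio_temporal(spatial_feats, temporal_feat):
    """Fuse l spatial features of shape (l, C) with one temporal feature of shape (C,).

    Each spatial feature is added to the temporal feature, and the l sums are
    concatenated into one vector of length l * C.
    """
    fused = spatial_feats + temporal_feat   # broadcasting adds the temporal feature to every row
    return fused.reshape(-1)                # concatenate the l fused features


if __name__ == "__main__":
    spatial = np.array([[1.0, 2.0], [3.0, 4.0]])   # l = 2 spatial features, C = 2
    temporal = np.array([10.0, 20.0])
    print(fuse_spatio_temporal(spatial, temporal))
```

With l spatial attention models and C-dimensional features, the fused video feature therefore has l × C dimensions.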

Experimental data sets and evaluation criteria
The data sets used in this paper are the two publicly available video data sets UCF101 and HMDB51 [24], together with recordings of real classroom behaviours. The UCF101 data set contains 101 behaviours and 13320 videos. The data set has a strong diversity in behaviour acquisition, including camera motion, object appearance motion, attitude change and background change. The behaviour categories are divided into five groups: human-object interaction, body movement, person-to-person interaction, playing musical instruments and sports. The data set has problems such as large intra-class differences and small inter-class differences. The HMDB51 data set contains 6676 videos and 51 types of actions. The video samples are mainly from public sources such as movies, YouTube and Google video, and many of the videos are of poor quality. Therefore, it is challenging to perform behaviour recognition on the two data sets. For the two data sets, this paper adopts the official division method, that is, each data set is divided into three splits, in which 70% of the videos form the training set and 30% form the testing set.
In this paper, 60 students majoring in software engineering in 2020 at one university are selected as the research subjects. The two courses involved are "Fundamentals of Programming" and "Data Structure". Two complete lectures are recorded for each course. The analysis algorithm in this paper takes video data as its input, and the camera uses the PAL television standard at 25 frames per second. There are four classroom teaching videos, each of which lasts 50 minutes. One classroom teaching video of each course is used for training, giving two training sets. According to the various manifestations of students' classroom behaviour, we focus on the basic behaviour categories that reflect students' basic states and constitute complex learning activities. In this study, eight classroom behaviours are recognized and analyzed: concentration, interaction, bowing the head, playing with mobile phones, sleeping, reading, writing and mind wandering. To evaluate the performance of the proposed algorithm, the training sets and testing sets must be annotated, so the four videos are coded manually.
In this paper, top-1 recognition accuracy is adopted as the evaluation standard. The recognition accuracy on each data set is obtained as the weighted average of the recognition accuracies on its three splits.
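The evaluation criterion can be illustrated with a short sketch. The paper does not state which weights are used for averaging over the three splits; weighting each split by its number of test videos is an assumption made here, and both function names are hypothetical.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and of equal length")
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


def weighted_mean_accuracy(accuracies, weights):
    """Weighted average of per-split accuracies (weights are an assumption, e.g. split sizes)."""
    return sum(a * w for a, w in zip(accuracies, weights)) / sum(weights)


if __name__ == "__main__":
    scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]   # toy class scores for three videos
    labels = [1, 0, 0]
    print(top1_accuracy(scores, labels))
    print(weighted_mean_accuracy([0.5, 1.0], [1, 3]))
```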

Experimental Analysis
In this paper, the performance of behaviour recognition is compared under different numbers of video segments, different numbers of spatial attention models and different fusion weights. Then the performance of behaviour recognition with the channel attention network is analyzed experimentally. Finally, the effectiveness of the proposed method is analyzed by comparing it with state-of-the-art methods.

Performance analysis of behaviour recognition in different video segments
In this paper, the sparse sampling method is used to sample frames from the video and take them as the input data of the network. To analyze the influence of the number of video segments on behaviour recognition performance, a comparative experiment is carried out on the first split of the HMDB51 data set. The videos are sparsely sampled with 3, 4, 5 and 6 segments for behaviour recognition, and the experimental results obtained on the appearance flow are shown in figure 6. The results show that the recognition accuracy increases with the number of video segments. When the number of video segments is 6, the network has the highest recognition accuracy, because the network can learn more information from an increasing number of samples. As can be seen from figure 6, when the number of video segments is greater than 5, the rise in recognition accuracy gradually slows down as the number of segments increases. Moreover, due to the limited GPU memory, more segments cannot be tested. In this paper, each video is divided into six segments for the subsequent experiments.

Performance analysis of behaviour recognition under different spatial attention models
The multi-spatial attention model proposed in this paper can extract multiple saliency behaviour regions for behaviour recognition. As the number of spatial attention models increases, the number of extracted saliency areas of behaviour also increases. In order to analyze the impact of the number of spatial attention models on behaviour recognition performance, a comparative experiment is carried out on the first split of HMDB51, and the results are shown in figure 7. As can be seen from figure 7, when the number of spatial attention models is less than 4, the recognition accuracy gradually improves as the number of models increases. When the number of spatial attention models is 4, the performance of behaviour recognition is the best. When the number of spatial attention models is 5, the recognition rate decreases. Due to the limited GPU memory, the experiment cannot run when the number of spatial attention models is greater than 5. Therefore, this paper adopts four spatial attention models for the subsequent experiments.

Performance analysis of behaviour recognition with different fusion weights
The influence of different fusion weights of the appearance flow and the motion flow on behaviour recognition performance is analyzed through experiments, and the results are shown in table 1. As can be seen from table 1, the recognition accuracy of the motion flow alone is higher than that of the appearance flow alone, and the fused flow is better than either single flow. When the appearance flow and the motion flow are combined with weights of 1/4 and 3/4, respectively, the behaviour recognition results are the best. Therefore, in this paper, the fusion weight of the appearance flow and the motion flow is set to 1:3 for the subsequent experiments.
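The weighted combination of the two flows can be sketched as below. The default weights of 1/4 and 3/4 are the values selected above; the function name and the toy class scores are assumptions for illustration.

```python
def fuse_streams(appearance_scores, motion_scores, w_appearance=0.25, w_motion=0.75):
    """Combine per-class scores of the appearance and motion flows with fixed weights."""
    if len(appearance_scores) != len(motion_scores):
        raise ValueError("both flows must score the same number of classes")
    return [w_appearance * a + w_motion * m
            for a, m in zip(appearance_scores, motion_scores)]


if __name__ == "__main__":
    appearance = [0.6, 0.4]   # toy class probabilities from the appearance flow
    motion = [0.2, 0.8]       # toy class probabilities from the motion flow
    fused = fuse_streams(appearance, motion)
    predicted = max(range(len(fused)), key=fused.__getitem__)
    print(fused, predicted)
```

In this toy case the appearance flow alone would choose class 0, but the more heavily weighted motion flow moves the fused decision to class 1.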

Comparison analysis with state-of-the-art behaviour recognition methods
In order to further verify the proposed method, we compare it with some classical behaviour recognition methods, and the results are shown in table 4. As can be seen from table 4, compared with the traditional method IDT [25], the proposed method has a higher recognition accuracy, indicating that the proposed spatio-temporal attention model can effectively extract the key spatio-temporal information in the video and improve behaviour recognition. The end-to-end structure of the proposed method also makes the calculation more concise. Compared with the dual-stream model [7] and the temporal segment network (TSN) [10], the proposed method improves the recognition accuracy by 3.2% and 0.8% on the UCF101 data set, and by 6.5% and 2.5% on the HMDB51 data set, respectively. This shows that the spatio-temporal attention model can effectively extract more behaviour features on key frames, and that the behaviours in the video can be recognized more accurately with this information. The proposed method also has a better recognition effect than TDD [26], the deeply trained C3D network [27], the spatio-temporal residual model ST-ResNet [28], the spatio-temporal pyramid model [29], ARTNet [30] and TSM [31]. The proposed method takes both the dual-flow features and the recalibration of channel features into account, which highlights the key channel information. The proposed spatio-temporal attention model fully mines the key spatio-temporal information of the video, obtains video features with enhanced expression ability, and establishes a comprehensive behaviour description.

Comparison analysis of behaviour recognition methods using attention mechanism
In order to verify the validity of the spatio-temporal attention model proposed in this paper, the proposed algorithm without SE-net is compared with other behaviour recognition methods that use an attention mechanism. The results are shown in table 5. It can be seen from table 5 that the proposed method has a higher accuracy. Compared with the temporal attention model [32] generated by the RNN method, the accuracy of the proposed algorithm without SE-net on the HMDB51 dataset is improved by 6.3%. This is because temporal attention only extracts the key frames, while the proposed method not only extracts the key frames but also pays attention to the saliency areas of motion in the spatial dimension, indicating that the combination of temporal and spatial information can effectively improve the recognition accuracy. The recognition effect of the proposed algorithm without SE-net is better than that of RSTAN [16] and ISTPAN [33], which indicates that, with the same backbone, the spatial and temporal attention model proposed in this paper is simple in structure but can effectively extract the key spatial and temporal information of the video. Compared with attention cluster [34], Bi-LSTM attention [17] and R-STAN [35], the proposed algorithm without SE-net has better performance. References [34,17,35] all use ResNet as the backbone for behaviour recognition, and ResNet performs better than BN-Inception. However, this paper uses BN-Inception as the backbone and still obtains a good recognition effect. This shows that the spatio-temporal attention model proposed in this paper can effectively make up for the deficiency of BN-Inception: it can accurately extract the key spatio-temporal information in the video and improve the accuracy of behaviour recognition. After adding SE-net, the recognition accuracy of the proposed method on the three data sets is further improved, indicating that the proposed method can improve the performance of behaviour recognition by calibrating the information of feature channels with the channel attention network.

Conclusion
Traditional behaviour recognition methods ignore the differences in channel information and cannot distinguish redundant video frames, background and similar content, which results in poor feature expression ability and a low recognition rate. In order to improve the efficiency of students in class, this paper proposes a new student behaviour recognition method based on a spatio-temporal attention fusion model. Channel attention is first integrated into the spatio-temporal structure, and channel information is calibrated through the modeling of channel features to improve the expression ability of the video features. A CNN-based temporal attention model and a multi-spatial attention model are presented to focus on multiple saliency areas of behaviour on the key frames and to further enhance the feature representation of the video. Comparison experiments are carried out on the UCF101 and HMDB51 data sets and on real classroom behaviours. Compared with advanced methods, the proposed method achieves a higher recognition accuracy. In the future, we will apply more advanced deep learning methods to student behaviour recognition.

Figure 1. Structure of the proposed spatio-temporal network

As shown in figure 2(a), SE-net first applies global pooling to the input features, which yields one descriptor per channel. The dependencies between the channels are then modeled through two fully connected layers. The first fully connected layer reduces the channel dimension to 1/16 of the input dimension to reduce computation, and nonlinearity is then added by the ReLU activation function. The second fully connected layer restores the channel dimension to its original size. The normalized weights are obtained by a sigmoid function. Finally, the features of each channel are multiplied by the corresponding weight through a feature rescaling operation. As shown in figure 2(b), SE-BN-Inception consists of nine Inception operations, and the SE-net is added after each Inception operation. Because the output of a fully connected layer is not sufficiently sensitive to spatial position, whereas the output of a convolution layer preserves the spatial structure of the image to a certain extent, BN-Inception is retained only up to its last convolution layer.

Figure 2. Structure of SE-net and SE-BN-inception

To solve the problem that RNN-based temporal attention cannot be parallelized over time, this paper proposes a temporal attention model based on CNN, which uses a CNN to generate the attention score of each frame. The attention score determines the importance of each frame in the video for behaviour recognition, so that the model selectively focuses on the key frames. The video feature representation is thereby further enhanced in the temporal dimension. The temporal attention model designed in this paper not only has fewer parameters and a simple structure, but can also calculate the attention scores of all frames in parallel, which makes full use of the advantages of GPU hardware. The CNN-based temporal attention model is shown in figure 3.

Figure 3. Temporal attention model based on CNN

EAI Endorsed Transactions on Scalable Information

Here $w_1$ and $b_1$ are the learnable parameters of the model. The mapped feature of the whole video is passed through a convolution layer with a kernel size of 1×1, which changes the video feature dimension to $1 \times N$. The softmax function is then applied along the time dimension of the video frames to obtain the temporal attention score $\alpha_i^t$ of each frame in the video, which represents the contribution of the $i$-th frame to the recognition. After the attention score $\alpha_i^t$ of the $i$-th frame is obtained, the temporal feature of the $i$-th frame is obtained by multiplying the score with the frame features. The temporal features of all frames are summed to obtain the temporal feature $f_t$ of the whole video.
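The temporal attention computation (linear mapping of each frame feature, one score per frame, softmax over the time dimension, weighted sum of frame features) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: each frame is represented by a single pooled vector rather than an H×W grid, the 1×1 convolution is written as a dot product with a hypothetical vector `w_score`, and the attention scores are applied to the original frame features.

```python
import numpy as np


def temporal_attention(x, w1, b1, w_score, b_score):
    """Attention over N frame features x of shape (N, C).

    w1 (D, C) and b1 (D,) form the shared linear mapping; w_score (D,) and
    b_score play the role of a 1x1 convolution that yields one score per frame.
    """
    x_hat = x @ w1.T + b1                      # mapped frame features, shape (N, D)
    scores = x_hat @ w_score + b_score         # one scalar score per frame, shape (N,)
    scores = scores - scores.max()             # stabilise the softmax numerically
    alpha = np.exp(scores) / np.exp(scores).sum()
    f_t = (alpha[:, None] * x).sum(axis=0)     # weighted sum over frames, shape (C,)
    return f_t, alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_frames, c, d = 6, 8, 4
    x = rng.standard_normal((n_frames, c))
    f_t, alpha = temporal_attention(
        x, rng.standard_normal((d, c)), np.zeros(d), rng.standard_normal(d), 0.0
    )
    print(alpha.round(3), f_t.shape)
```

Because every frame is scored by the same small set of parameters and no frame depends on the previous one, all N scores can be computed in one batched operation.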

In formula (5), $v_i$ and $u_i$ are the input and output signals, $\alpha$ and $\beta$ are trainable parameters, and $m$ and $var$ represent the mean and variance. Since $l$ spatial attention models are used, $l$ spatial features can be extracted per frame. The $j$-th spatial features of the selected frames of each video are summed to obtain the $j$-th spatial feature $f_s^j$ of the whole video.
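The idea of using l spatial attention models, each producing one spatial feature per frame that is then summed over the selected frames, can be sketched as follows. This is a deliberately simplified illustration, not the network of figure 4: each attention model is reduced to a per-position linear scoring (equivalent to a 1×1 convolution) followed by a softmax over the H×W positions, both of which are assumptions, and the function name is hypothetical.

```python
import numpy as np


def multi_spatial_attention(x, w, b):
    """x: (N, C, H, W) frame features; w: (l, C) and b: (l,) for l attention models.

    Returns the l video-level spatial features, shape (l, C), and the
    attention maps, shape (N, l, H, W).
    """
    n, c, h, wd = x.shape
    flat = x.reshape(n, c, h * wd)                                   # (N, C, P)
    scores = np.einsum("lc,ncp->nlp", w, flat) + b[None, :, None]    # (N, l, P)
    scores -= scores.max(axis=2, keepdims=True)                      # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=2, keepdims=True)                          # weights over positions
    frame_feats = np.einsum("nlp,ncp->nlc", attn, flat)              # (N, l, C)
    return frame_feats.sum(axis=0), attn.reshape(n, w.shape[0], h, wd)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 8, 7, 7))          # 6 frames, 8 channels, 7x7 grid
    feats, maps = multi_spatial_attention(x, rng.standard_normal((4, 8)), np.zeros(4))
    print(feats.shape, maps.shape)
```

Each of the l attention maps can concentrate on a different region of the frame, which is the motivation for using several models instead of one.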
The experiments are performed on a GPU with PyTorch. The backbone used in this article is BN-Inception. The BN-Inception model is an upgraded version of the GoogLeNet model and has a good balance between accuracy and efficiency. The network is initialized with model parameters pre-trained on the ImageNet dataset. To keep the optical flow data consistent with the RGB data, this paper first adopts the TV-L1 algorithm to obtain the optical flow data and then quantizes them to [0,255] through a linear transformation.
a) Training stage. First, the input frame is resized to 240×320, and then a 224×224 region is cropped using fixed-corner cropping and horizontal flipping. A Dropout layer is added before the fully connected layer of the classification network, with dropout values of 0.8 and 0.7 for the appearance flow and the behaviour flow, respectively. The parameters are optimized by mini-batch stochastic gradient descent with a batch size of 32, a weight decay coefficient of 0.0005 and a momentum of 0.9. The appearance flow starts with a learning rate of 0.001, which is reduced to 1/10 after 30 epochs and again after 60 epochs; a total of 80 epochs are trained. The initial learning rate of the behaviour flow is 0.001, which is reduced to 1/10 after 190 epochs and again after 300 epochs; a total of 340 epochs are trained.
b) Test stage. 25 frames are selected from each sample using uniform sampling. Each frame image is augmented by cropping and flipping to obtain 10 test samples. The classification result is obtained by averaging the output category probabilities of the 10 samples.
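The step learning-rate schedules described for training can be written as a small function. The sketch assumes that each reduction multiplies the current learning rate by 1/10 and that epochs are counted from 0; both are interpretations, and `step_lr` is a hypothetical helper rather than a PyTorch API.

```python
def step_lr(epoch, base_lr=0.001, milestones=(30, 60), gamma=0.1):
    """Learning rate at a given 0-based epoch under a step schedule.

    The rate is multiplied by gamma once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Appearance flow: 80 epochs, reductions after 30 and 60 epochs.
    for epoch in (0, 29, 30, 59, 60, 79):
        print("appearance", epoch, step_lr(epoch))
    # Behaviour flow: 340 epochs, reductions after 190 and 300 epochs.
    for epoch in (0, 190, 300, 339):
        print("behaviour", epoch, step_lr(epoch, milestones=(190, 300)))
```

In PyTorch, a schedule of this shape is usually obtained with `torch.optim.lr_scheduler.MultiStepLR`.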

Figure 7. Comparison of recognition accuracy with different numbers of spatial attention models

Table 2. Comparison of recognition accuracy between TSN and TSN+SE-net on UCF101 and HMDB51

Table 3. Comparison of recognition accuracy between TSN and TSN+SE-net on real classroom behaviour

Table 4. Comparison of average recognition accuracy with other methods/%

Table 5. Comparison of average recognition accuracy with attention-based methods/%