MSTA-SlowFast: A Student Behavior Detector for Classroom Environments

Detecting students’ classroom behaviors from instructional videos is important for instructional assessment, analyzing students’ learning status, and improving teaching quality. To achieve effective detection of student classroom behavior based on videos, this paper proposes a classroom behavior detection model based on the improved SlowFast. First, a Multi-scale Spatial-Temporal Attention (MSTA) module is added to SlowFast to improve the ability of the model to extract multi-scale spatial and temporal information in the feature maps. Second, Efficient Temporal Attention (ETA) is introduced to make the model more focused on the salient features of the behavior in the temporal domain. Finally, a spatio-temporal-oriented student classroom behavior dataset is constructed. The experimental results show that, compared with SlowFast, our proposed MSTA-SlowFast has a better detection performance with mean average precision (mAP) improvement of 5.63% on the self-made classroom behavior detection dataset.


Introduction
Intelligent education has become one of the inevitable trends in the future development of education [1]. The classroom is an important part of building intelligent schools. When evaluating the quality of classroom teaching, students' classroom behavior can be used as important reference content. Students' classroom behaviors can reflect students' learning state well [2]. At the same time, the behaviors in the recorded teaching videos can be analyzed accordingly after class, which can help teachers to adjust teaching methods and progress in time to achieve better teaching results.
In traditional classrooms, teachers need to observe students' classroom behavior manually. However, this approach cannot attend to all students at the same time, makes it difficult to provide timely and effective feedback, and places a certain burden on teachers' teaching work. With the increasing sophistication of artificial intelligence, the detection of student behavior in the classroom through deep learning and computer vision techniques is gaining attention [3]. The use of computer-assisted instruction and the automated detection and analysis of student behavior in the classroom has also become a research hotspot in smart education [4][5][6].
Classroom behavior detection is generally divided into approaches based on object detection [7], pose recognition [8], and video behavior recognition or detection [9]. With growing advances in video behavior detection technology, classroom behavior detection based on instructional videos has become possible. In the field of video behavior recognition, the ongoing development of deep learning has produced some excellent outcomes. Among them, SlowFast [10] achieves good detection results on the Kinetics [11] and Charades [12] behavior recognition datasets and on the AVA (Atomic Visual Actions) [13] spatio-temporal behavior detection dataset. SlowFast also has great application prospects in real-world scenarios.

The main contributions of this paper are as follows:
• Classroom instructional videos were collected to mark common student behaviors in the classroom, and a classroom behavior dataset was constructed as a basis for detecting student behaviors.
• A student behavior detection model based on an improved SlowFast network was proposed. With the introduction of the Multi-Scale Spatial-Temporal Attention (MSTA) and Efficient Temporal Attention (ETA) modules, the model's ability to acquire spatial, channel, and temporal features was improved, and the detection accuracy was increased.
• Finally, to verify the effectiveness of the revised approach, experiments were carried out. The findings showed a significant improvement in the improved model's mean Average Precision (mAP), which can be utilized to detect classroom behavior.

Video Behavior Detection
Mainstream behavior detection algorithms can be generally classified into behavior recognition, temporal behavior detection, and spatio-temporal behavior detection. Among them, behavior recognition mainly identifies the category of behavior. Temporal behavior detection identifies the time period in which the behavior in the video occurs and determines the category of the behavior in the video. Spatio-temporal behavior detection focuses on identifying the coordinate position of the person in the video and identifying the duration of the person's behavior with the category of the behavior. In this paper, classroom behavior detection focuses on the location and category of classroom behavior occurrence, so spatio-temporal behavior detection is used for this purpose.
For the problem of spatio-temporal feature extraction in the field of video behavior understanding, researchers have proposed many effective backbone network structures. For example, the 3D convolutional neural network (C3D) [17] uses three-dimensional convolution to extract the spatio-temporal features of actions, which allows actions to be identified more accurately. Karen et al. [18] proposed a dual-stream network, where one pathway extracts spatial features from RGB images while the other extracts temporal features from optical flow images.
With the release of the AVA (Atomic Visual Actions) dataset [13], the focus of the spatio-temporal behavior detection task has gradually shifted toward behavioral interactions, and many behavior detection algorithms for this dataset have emerged. Christoph et al. [10] proposed the SlowFast network, based on 3D convolution, to obtain behavioral features. The network has performed well in both behavior recognition and behavior detection tasks. It consists of two pathways with distinct temporal rates that are responsible for acquiring spatial and temporal information, respectively. In place of the double-branch approach, Christoph [19] presented an expanded 3D convolutional network (X3D), which gradually modifies the model's width parameter to require less computational work while producing superior results. Li et al. [20] analyzed the effect of time dependence on behavior detection by placing behavior detection in a Long-Short Term Context (LSTC).
Meanwhile, a number of researchers have suggested new spatiotemporal detection methods. Okan et al. [21] proposed a new spatio-temporal behavior detection framework, named YOWO (You Only Watch Once), which is well suited for real-time spatio-temporal behavior detection in videos because it integrates temporal and spatial information into the framework and uses only one network to directly extract both. Fan et al. [22] proposed an MViT (Multiscale Vision Transformers) model for video and image recognition by combining multi-scale feature pyramid structures to achieve the extraction of video features at different levels, and encoding the features using Transformer to enable the model to better understand the visual content. Bertasius et al. [23] proposed a new detection network, TimeSformer, implemented by a convolution-free approach, which employs a self-attentive module instead of convolution.

Behavior Detection in Classroom Scenarios
Classroom scenarios with severe occlusion and numerous student targets pose a great challenge for classroom behavior detection. Recently, computer vision, target detection, and image classification techniques have also been applied to classroom behavior detection tasks.
By employing object detection to identify classroom behavior, the behavior that needs to be identified is treated directly as a target object, and the network is then utilized to extract spatial features to identify the behavior. Liu et al. [24] used the YOLOv3 algorithm for student anomalous behavior recognition with the addition of RFB and SE-Res2net modules to improve the model for small target and crowd occlusion problems in the classroom environment. Tang et al. [25] performed classroom behavior detection based on pictures, adding a feature pyramid structure and an attention mechanism to the YOLOv5 classroom behavior detection model to address the problem of high occlusion in the classroom environment.
Pose recognition is usually used to identify human behavior via localized human key point detection. Lin et al. [26] used the OpenPose framework to collect skeletal information from students and classify the extracted skeletal information into behaviors by means of a neural network. Yu et al. [27] collected classroom data using the Microsoft Kinect device for face recognition and then collected human skeleton information to extract features for behavior classification.
Recently, some researchers have implemented classroom behavior detection through video behavior detection techniques. Huang et al. [28] proposed a deep spatio-temporal residual convolutional neural network, combined with target detection and target tracking algorithms, to detect the classroom behaviors of multiple students in teaching videos in real time, and achieved good operational results. To realize real-time recognition of classroom behaviors for multiple student targets, Xiao et al. [29] used the YOLOX algorithm to extract the student behavior at a given moment in the instructional video and used a CNN (Convolutional Neural Network) to learn the spatio-temporal information.
The object-detection-based approach ignores the temporal characteristics of the behavior and cannot incorporate contextual semantic information. Human-keypoint-based behavior detection is more computationally intensive and has stricter scene requirements, resulting in poor stability across different scenes. Video-based behavior detection can capture the action information of a behavior more comprehensively and achieve more accurate detection, but the computational effort also increases. Meanwhile, the above studies show that classroom behavior detection has certain shortcomings [25]. First, there are relatively few publicly available classroom scenario datasets. Second, some of the algorithms are only capable of detecting a single behavior target at a time, so they cannot be used in classroom scenarios where behavior recognition must be performed on multiple students simultaneously.

Methods
The SlowFast algorithm has made notable progress in behavior detection, but its detection accuracy is still lacking in the classroom environment, especially for actions with small sample sizes that are more difficult to identify. Therefore, on the basis of the SlowFast network, an MSTA (Multi-scale Spatial-Temporal Attention) module is first introduced into the Slow pathway to effectively extract multi-scale spatial information, establish long-range channel dependence, and add temporal attention. Second, the ETA (Efficient Temporal Attention) module for the temporal dimension is introduced into the Fast pathway to efficiently compute temporal attention and strengthen the ability to perceive the temporal features of actions. Figure 1 shows the structure of the modified MSTA-SlowFast model.

SlowFast Network
The SlowFast network is a dual-stream network based on the 3D CNN model, which includes two pathways. The Slow pathway mainly acquires spatial semantic information by using a 3D CNN with a low frame rate. The Fast pathway mainly acquires action information using a high-frame-rate 3D CNN, but with a smaller convolution width and fewer channels. The different spatio-temporal features are fused by lateral connections. Both pathways use a 3D ResNet [30] network structure.
The SlowFast network settings include the τ, α, and β parameters, which represent the video sampling stride, the frame-rate ratio of the two pathways, and their channel-number ratio, respectively. Specifically, the Slow pathway to Fast pathway frame-rate ratio is 1 : α (α > 1) and the channel-number ratio is 1 : β (β < 1). The Fast pathway weakens its ability to process spatial information by using smaller convolutions and fewer channels, thus reducing the computational effort and improving its expressiveness in the time domain.
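The sampling relationship described above can be sketched in a few lines of Python (a minimal illustration; the function name and the concrete values of num_frames, τ, and α are ours, not taken from the SlowFast implementation):

```python
# Illustrative sketch of SlowFast's two-pathway frame sampling (function
# name and concrete values are ours, not the official code).

def slowfast_sampling(num_frames, tau=16, alpha=8):
    """Slow pathway: one frame every tau frames.
    Fast pathway: alpha times the Slow frame rate (stride tau // alpha)."""
    slow_idx = list(range(0, num_frames, tau))
    fast_idx = list(range(0, num_frames, tau // alpha))
    return slow_idx, fast_idx

slow, fast = slowfast_sampling(num_frames=64)
print(slow)       # [0, 16, 32, 48]
print(len(fast))  # 32 -- alpha x as many frames as the Slow pathway
```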

The network fuses the features extracted from the Fast pathway into the Slow pathway through multiple lateral connections. Generally, the feature maps output by the Fast pathway are converted from {αT, S², βC} to {T, S², αβC} using a time-strided convolution, and then fused with the feature maps of size {T, S², C} from the Slow pathway.
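This shape transformation can be checked with a small bookkeeping sketch (illustrative only, assuming feature maps laid out as (time, height, width, channels); the real model performs the conversion with a time-strided 3D convolution):

```python
# Shape bookkeeping for a lateral connection (a sketch, not the reference
# implementation): a time-strided convolution maps the Fast features
# {alpha*T, S^2, beta*C} to {T, S^2, alpha*beta*C}, matching the Slow
# features {T, S^2, C} in the time dimension.

def lateral_connection_shapes(T, S, C, alpha=8, beta=1/8):
    fast_in = (alpha * T, S, S, int(beta * C))
    fused = (T, S, S, int(alpha * beta * C))  # temporal stride alpha
    slow = (T, S, S, C)
    return fast_in, fused, slow

fast_in, fused, slow = lateral_connection_shapes(T=4, S=56, C=64)
print(fast_in, fused, slow)  # (32, 56, 56, 8) (4, 56, 56, 64) (4, 56, 56, 64)
```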
During detection, the model needs to detect the student positions in the key frame with a detector and pass the detection results into the network; Faster R-CNN [31] is used as the human detector in this paper. The network finally computes the RoI (region-of-interest) features through the RoIAlign algorithm and sends them to a Sigmoid-based multi-label classification head for prediction.

MSTA Module
The attention mechanism is typically used to pick out more crucial details and focus on important regions of the image. SENet (Squeeze-and-Excitation Network) [32] uses a channel attention mechanism: each channel is compressed to a single value using GAP (Global Average Pooling), and each channel's weight is then adaptively calculated using fully connected layers. However, it ignores the importance of spatial information. CBAM (Convolutional Block Attention Module) [33] enriches the attention map by effectively combining spatial and channel attention, and uses GAP and global max pooling to enhance feature diversity. However, SlowFast, as a 3D CNN, not only needs to acquire channel and spatial information but, more importantly, must perform behavior recognition using temporal information. Therefore, inspired by [34,35], we construct a Multi-scale Spatial-Temporal Attention (MSTA) module that can capture and utilize channel, temporal, and differently sized spatial information more effectively, while establishing channel and spatial long-range dependencies. Figure 2 depicts the structure of MSTA, which consists of multi-scale spatial feature extraction, channel attention, and temporal attention.

The MSTA module first extracts the multi-scale spatial features, dividing the feature map X into N parts. Each part contains C′ feature channels, where C′ = C/N. For each divided feature map, multi-scale spatial information is extracted using 3D convolutions of different kernel sizes. The calculation process is shown in Equation (1), where X_i denotes the segmented feature map, K_i denotes the convolutional kernel size, G_i denotes the group size, and G_i = 2^((K_i − 1)/2).
After that, the channel attention weights need to be extracted. The channel weight Z_i is calculated by SEWeight for the feature maps S_i of different scales. Then, Z_i is rescaled using the Softmax function and multiplied with the feature map S_i of the corresponding scale. The calculation process is shown in Formulas (2) and (3).
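The channel split, group sizes, and Softmax rescaling described above can be sketched as follows (a minimal illustration with hypothetical helper names; the actual module applies grouped 3D convolutions and SEWeight to feature maps):

```python
import math

# Sketch of the MSTA bookkeeping (hypothetical helper names; the real
# module applies grouped 3D convolutions and SEWeight to feature maps).

def msta_groups(C, kernel_sizes):
    """Split C channels into N = len(kernel_sizes) parts (C' = C / N each)
    and compute each branch's group count G_i = 2^((K_i - 1) / 2)."""
    N = len(kernel_sizes)
    c_part = C // N
    groups = [2 ** ((k - 1) // 2) for k in kernel_sizes]
    return c_part, groups

def softmax(z):
    """Rescale the per-branch channel weights Z_i so they sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

c_part, groups = msta_groups(C=256, kernel_sizes=[3, 5, 7, 9])
print(c_part, groups)  # 64 [2, 4, 8, 16]
```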
Then, the temporal attention weights are calculated and applied to the feature map Y. Specifically, the overall features in each time dimension are encoded into a global feature t using global pooling. On this basis, the overall feature map is subjected to the excitation operation; that is, the correlation between the temporal dimensions is constructed through two fully connected layers, and weights g of the same dimension are output. The calculation process is shown in Formula (4):

g = F_ex(t, W) = σ(W_2 δ(W_1 t))    (4)

where σ is the Sigmoid function and δ is the ReLU function.
Finally, the feature maps are multiplied by the temporal-dimension weights to produce feature maps with richer multi-scale information. Since the Fast pathway extracts less spatial information, the improvement in this paper introduces the MSTA module only in the Slow pathway, replacing the 1 × 3 × 3 convolution in the middle layer of the res5 residual module.
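A minimal pure-Python sketch of this excitation step, assuming the SE-style form g = σ(W2 · δ(W1 · t)) with illustrative random weights (not the trained model's):

```python
import math
import random

# Minimal sketch of the temporal excitation, assuming the SE-style form
# g = sigmoid(W2 * relu(W1 * t)). Shapes and weights are illustrative.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def temporal_excitation(t, W1, W2):
    """t: length-T pooled vector; W1: (T//r) x T; W2: T x (T//r)."""
    h = [max(0.0, sum(w * v for w, v in zip(row, t))) for row in W1]    # FC + ReLU
    return [sigmoid(sum(w * v for w, v in zip(row, h))) for row in W2]  # FC + Sigmoid

random.seed(0)
T, r = 4, 2
W1 = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(T // r)]
W2 = [[random.uniform(-1, 1) for _ in range(T // r)] for _ in range(T)]
g = temporal_excitation([0.5, 1.0, 0.2, 0.8], W1, W2)
print(len(g), all(0.0 < v < 1.0 for v in g))  # 4 True
```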

ETA Module
The Fast pathway mainly obtains the temporal features of the action and relatively little spatial information. The Efficient Temporal Attention (ETA) module is added to the Fast pathway to enhance model detection performance and help the model better capture action information. The ETA module is built with reference to ECA (Efficient Channel Attention) [36] and uses one-dimensional convolution to efficiently implement local cross-temporal-dimension interactions, avoid dimensionality reduction, and extract temporal channel correlations. Figure 3 shows the structure of the ETA module.


The ETA is calculated as follows: first, GAP is performed on the feature map X ∈ R^(W×H×T×C) to obtain a 1 × 1 × 1 × T vector. Afterward, the weight of each time dimension is obtained by interacting information across time dimensions. This is mostly achieved by a fast one-dimensional convolution with a kernel of size k, as shown in Formula (5):

ω_i = σ( Σ_{j=1}^{k} w^j y_i^j ),  y_i^j ∈ Ω_i^k    (5)

where σ is the Sigmoid function, y_i^j denotes the feature of the jth adjacent channel of the ith time dimension, and Ω_i^k denotes the set of k adjacent channels. The convolution kernel size k is derived adaptively by Formula (6):

k = ψ(T) = | log_2(T)/γ + b/γ |_odd    (6)

where |t|_odd denotes the nearest odd number to t, and γ and b are hyperparameters as in ECA [36].
The global final objective features are created by multiplying the original feature maps by the weight of the temporal domain. The ETA module avoids dimensionality reduction while taking into consideration the impact of cross-temporal context interactions. In the network, ETA is added to the res5 module of the Fast pathway to enhance the model's ability to perceive temporal features.
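The adaptive kernel-size rule and the one-dimensional attention computation can be sketched as follows (γ = 2 and b = 1 are assumed from ECA's defaults, and the edge clamping in the 1D convolution is our simplification; neither detail is specified in the text above):

```python
import math

# Sketch of ETA's adaptive 1D temporal attention (after ECA [36]).
# gamma = 2 and b = 1 are assumed from ECA's defaults.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_kernel(T, gamma=2, b=1):
    """Round t = log2(T)/gamma + b/gamma to an odd integer (|t|_odd)."""
    t = math.log2(T) / gamma + b / gamma
    k = int(round(t))
    return k if k % 2 == 1 else k + 1

def eta_weights(pooled, k, w):
    """pooled: length-T vector from GAP; w: k shared 1D-conv weights."""
    T, half = len(pooled), k // 2
    out = []
    for i in range(T):
        # 1D convolution over the k temporal neighbors, clamped at the edges
        s = sum(w[j] * pooled[min(max(i - half + j, 0), T - 1)]
                for j in range(k))
        out.append(sigmoid(s))  # one attention weight per time step
    return out

k = adaptive_kernel(T=32)  # log2(32)/2 + 1/2 = 3.0 -> k = 3
weights = eta_weights([0.1, 0.4, 0.9, 0.3, 0.2], k, [0.2, 0.5, 0.3])
print(k, len(weights))  # 3 5
```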

Dataset
We created a spatiotemporal-oriented classroom student behavior (SCSB) dataset because there are not any publicly accessible classroom datasets that can be used to deal with the issue of video-based classroom behavior detection. Spatiotemporal-oriented behavior detection aims to find the time and space in which the behavior of interest is located from the video and requires multiple frames to be correlated in order to determine the continuous behavior. The dataset is mainly annotated with reference to the publicly available AVA dataset for spatio-temporal behavior detection [13]. The AVA dataset is taken from 437 movies, annotated for 80 categories, and provides temporal labels for one frame per second for bounding boxes and actions.
Approximately 250 min of classroom instructional videos were filmed in classroom scenes, primarily from the front side of the classroom. The videos were cut and filtered, and more than 600 of them were labeled, each containing 7-20 students, and all were 10 s in length. Seven common classroom behaviors were selected for labeling: looking at the board, looking down, turning head/turning around, talking, standing up, raising hands, and lying on the table. Figure 4 depicts the dataset's creation process [37].

Step 1: Video frame extraction. As shown in Figure 5, the videos were first filtered and cut into videos of 10 s in length for easy labeling, and then the cut videos were divided into frames according to the frame rate of 30 frames per second [37].
Step 2: Extract keyframes. One frame out of every 30 frames per second was extracted as a key frame, which was used to label student position and student classroom behaviors.
Step 3: Annotate student locations. The extracted keyframes were input into the detector, and Faster R-CNN was employed to detect the students in the keyframes; the detected student location information was stored in a txt file.
Step 4: Annotate student actions. Due to the characteristics of the time-oriented student classroom behavior dataset, the VIA annotation tool was selected for the multilabel annotation of student behaviors. The txt file results obtained from the detector were converted into JSON data format, and the VIA annotation tool was used to fine-tune the student detection boxes and annotate the classroom behaviors. Finally, an annotation file in AVA format was generated.
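Since the annotation file follows the AVA format, each labeled box reduces to one CSV row of the form (video_id, keyframe timestamp in seconds, normalized box x1, y1, x2, y2, action_id, person_id). A sketch with made-up values (the clip name, coordinates, and IDs below are illustrative, not taken from the SCSB dataset):

```python
import csv
import io

# Illustrative AVA-format rows (all values made up): video_id,
# keyframe timestamp (s), normalized x1, y1, x2, y2, action_id, person_id.
rows = [
    ["clip_0001", "3", "0.321", "0.410", "0.480", "0.835", "5", "0"],
    ["clip_0001", "3", "0.550", "0.395", "0.702", "0.810", "2", "1"],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue().strip())
```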
The total annotation of the final dataset is 51,387. The dataset contains seven kinds of common student actions in the classroom environment, which can reflect the students' behavior in classroom scenarios. Figure 6 displays the number of labeled categories. Figure 7 depicts the dataset's head-turning, hand-raising, and head-lowering behavioral processes.


Evaluation Indicators
In this study, the evaluation measures for classroom behavior detection tasks include Precision, Recall, and mAP. The formulae are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where TP indicates that both the true behavioral class and the predicted behavioral class are positive. FP indicates that the true behavioral class is negative, but the predicted behavioral class is positive. FN indicates that the true behavioral class is positive, but the predicted behavioral class is negative. The mAP is the mean of the average precision (AP) over all behavior categories, where AP is the area under the Precision-Recall curve.
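The formulas above translate directly into code; the AP values fed to mAP below are illustrative placeholders (in practice each class's AP is computed as the area under its Precision-Recall curve):

```python
# Precision/Recall from TP/FP/FN counts, and mAP as the mean of per-class
# average precision. The AP list is illustrative, not measured results.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def mean_ap(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)

p = precision(tp=80, fp=20)  # 0.8
r = recall(tp=80, fn=40)     # 0.666...
m = mean_ap([0.91, 0.74, 0.58, 0.66, 0.88, 0.79, 0.62])
print(round(p, 2), round(r, 2), round(m, 3))  # 0.8 0.67 0.74
```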

Ablation Experiments and Analysis
Ablation experiments were conducted on each improved module to verify that the MSTA and ETA modules introduced in this paper enhance the algorithm's ability to detect behavior. A pre-trained model was used, and the MSTA and ETA modules were added sequentially to the original SlowFast under the same experimental settings to assess each module's contribution. Table 1 displays the results. The SlowFast backbone is 3D ResNet-50, with α taken as 8 and β as 1/8. According to the results, adding MSTA alone improved mAP by 5.03% over the original SlowFast, showing that replacing the res5 module in the Slow pathway with the MSTA module allows the model to better capture spatial, channel, and temporal information. Adding the ETA module to the Fast pathway increased mAP by 3.44%, indicating that the temporal attention mechanism strengthens the model's focus on temporal features and its ability to recognize changes in an action, increasing accuracy. With both MSTA and ETA, the model achieved the best detection results: a 2.35% improvement in Precision, 3.12% in Recall, and 5.63% in mAP. This indicates that better classroom behavior detection is achieved by adding MSTA to the Slow pathway and ETA temporal attention to the Fast pathway.
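The underlying idea of the ETA module, reweighting frames by their temporal salience, can be sketched in toy form: score each frame, take a softmax over the temporal dimension, and rescale the frame features. The per-frame mean-activation score and the list shapes here are our own simplification for illustration, not the actual ETA implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(features):
    """features: T frame-feature vectors (each a list of C floats).
    Scores each frame by its mean activation, then reweights frames by
    the softmax of those scores -- a crude stand-in for temporal attention."""
    scores = [sum(f) / len(f) for f in features]
    weights = softmax(scores)
    return [[w * x for x in f] for w, f in zip(weights, features)]

feats = [[0.1, 0.2], [2.0, 3.0], [0.0, 0.1]]  # 3 frames, 2 channels
out = temporal_attention(feats)
# the salient middle frame retains the largest share of its magnitude
```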
Table 2 shows the recognition of each behavior type before and after the model modification. It demonstrates that the original algorithm already recognizes well the behaviors with a large sample count (such as looking at the blackboard or lowering the head) and the behaviors with more distinctive characteristics (such as standing or lying on a table). However, detection accuracy was low for behaviors with a small sample size that are difficult to distinguish, such as head-turning and talking. The improved model, while maintaining the high accuracy on the already well-detected behaviors, greatly improved the detection accuracy of the three behaviors of head-turning, talking, and hand-raising. Figure 8 displays the results of the comparison.

Comparison Experiments and Analysis
A comparative experiment was performed to test the effect of different values of α on model detection. As shown in Table 3, the SlowFast backbone was 3D ResNet-50, and α was set to 8 and 4. The experimental results show that MSTA and ETA brought significant improvements over the SlowFast network at both values of α. When α was 4, the detection effect was better, but the computation was larger because the Slow pathway samples more frames than when α is 8: compared with α = 8, the FLOPs increased by 33.87 G and 32.39 G before and after the model improvement, respectively. The improvement itself adds little computational cost, because MSTA uses grouped convolution for multi-scale spatial feature extraction, which reduces computation, and ETA does not cause a dramatic increase in computation. To confirm that the improved SlowFast has a better detection effect, it was compared with the LSTC and SlowOnly networks on the same dataset under the same configuration. The experimental results were mainly evaluated by mAP, and Table 4 displays the detailed results. The algorithm proposed in this paper reached an mAP of 91.10% when detecting student behavior in the classroom, achieving better detection results than SlowOnly and LSTC. This indicates that the improved model is accurate in spatio-temporal classroom behavior detection and can meet the task of detecting students' classroom behavior in the classroom setting. Figure 9 shows examples of the classroom behavior detection results.
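The frame-count argument can be checked with a one-line sketch, assuming the common SlowFast convention that the Slow pathway samples 1/α of the Fast pathway's frames (the 32-frame clip length below is illustrative):

```python
def slow_frames(fast_frames, alpha):
    """In SlowFast, the Slow pathway samples fast_frames / alpha frames,
    where alpha is the speed ratio between the two pathways."""
    return fast_frames // alpha

# with a 32-frame Fast pathway:
frames_a8 = slow_frames(32, 8)  # 4 Slow-pathway frames
frames_a4 = slow_frames(32, 4)  # 8 Slow-pathway frames -> roughly double the Slow-path FLOPs
```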


Conclusions
In this paper, we proposed a video-based classroom behavior detection method built on an improved SlowFast network. To improve detection accuracy, attention mechanisms were used to improve the network structure. First, MSTA blocks were introduced into the Slow pathway to effectively extract multi-scale spatial and temporal information and establish long-range channel dependencies. Second, ETA blocks were introduced into the Fast pathway to efficiently compute temporal attention. Experiments demonstrated that with the two modules, the improved model achieved an mAP of 91.10% on the self-made student classroom behavior detection dataset, 5.63% higher than the original model. This shows that the enhancements proposed in this paper significantly improve the detection effect, and that MSTA-SlowFast can satisfy the requirements of video-based classroom behavior detection in a classroom environment.

Discussion
The MSTA-SlowFast model proposed in this paper detects classroom behaviors in instructional videos and has practical applications. Analysis of the detected behaviors can be used to assess students' classroom concentration. Meanwhile, our study can help teachers and school administrators understand students' behaviors in time for intervention and management.
Compared with existing studies on classroom behavior detection, our work implements video-based classroom behavior detection and creates a spatio-temporal-oriented classroom behavior detection dataset. However, our study still has shortcomings. Since SlowFast is built on 3D convolutions, its detection speed needs to be improved. Moreover, detection is not satisfactory for students in the rear rows of the video who are heavily occluded. As our next step, we will make improvements in these two directions.

Author Contributions: Writing-original draft, S.Z.; validation, P.W. and X.W.; writing-review and editing, C.S. and F.Y.; investigation, J.Z. and H.L. All authors have read and agreed to the published version of the manuscript.