Abstract
To assess students' in-class listening state objectively and accurately, we infer students' emotions from their facial expressions in class and cognitive feedback from their in-class behaviors, and then fuse the two to obtain a comprehensive assessment of classroom state. The major difficulty in capturing students' classroom expressions, however, is how to extract expression features accurately and efficiently from both the temporal and spatial dimensions of classroom videos. To solve this problem, we propose a classroom expression recognition model based on a spatio-temporal residual attention network (STRAN), which extracts facial expression features through convolution in both the temporal and spatial dimensions under limited resources while minimizing time consumption. Specifically, STRAN first uses a residual network with three-dimensional convolutions to counter network degradation as the depth of the convolutional neural network increases, which also accelerates convergence at the same number of layers. Second, a spatio-temporal attention mechanism is introduced so that the network can focus on the important video frames and the key regions within each frame. To improve the comprehensiveness and correctness of the final classroom evaluation, we use a deep convolutional neural network to capture students' behaviors while obtaining their classroom expressions. We then propose an intelligent classroom state assessment method (Weight_classAssess) that combines students' expressions and behaviors to evaluate the classroom state.
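The paper's full architecture is not reproduced in this abstract, but the spatio-temporal attention idea — weighting whole frames along the temporal axis and regions within each frame along the spatial axes — can be illustrated with a minimal NumPy sketch. All shapes and scores below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(features, frame_scores, spatial_scores):
    """Re-weight a clip of frame features (illustrative sketch).

    features:       (T, H, W, C) feature maps for T frames
    frame_scores:   (T,)         temporal importance logits
    spatial_scores: (T, H, W)    per-frame spatial importance logits
    Returns a (C,) clip descriptor emphasising important frames/regions.
    """
    t_w = softmax(frame_scores)                                    # temporal weights (T,)
    s_w = softmax(spatial_scores.reshape(len(frame_scores), -1))   # spatial weights per frame
    s_w = s_w.reshape(spatial_scores.shape)                        # back to (T, H, W)
    # Spatially pooled frame descriptors, then temporal pooling.
    frame_desc = (features * s_w[..., None]).sum(axis=(1, 2))      # (T, C)
    return (frame_desc * t_w[:, None]).sum(axis=0)                 # (C,)

rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 4, 4, 16))   # 8 frames, 4x4 feature maps, 16 channels
desc = spatio_temporal_attention(clip,
                                 rng.standard_normal(8),
                                 rng.standard_normal((8, 4, 4)))
print(desc.shape)  # (16,)
```

In the actual model the attention logits would be produced by learned layers on top of 3D-convolutional residual features; here they are random placeholders to show only the weighting mechanics.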
Finally, building on the public datasets CK+ and FER2013, we construct two more comprehensive synthetic datasets, CK+_Class and FER2013_Class, which better match the classroom-teaching scenario, by adding collected video sequences of students in class and images of students' in-class expressions. Compared with existing methods, STRAN achieves facial expression recognition rates of 93.84% and 80.45% on the CK+ and CK+_Class datasets, respectively, and the accuracy of Weight_classAssess-based intelligent classroom assessment reaches 78.19%, which demonstrates the effectiveness of the proposed method.
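The abstract does not give the weighting scheme or score scales used by Weight_classAssess. As a hedged illustration only, a fused classroom-state score could combine per-student expression-based and behavior-based evidence with a tunable weight; the 0.6/0.4 split and the example scores below are assumptions, not the paper's values:

```python
def weighted_class_assess(expr_score, behav_score, w_expr=0.6):
    """Fuse an expression-based score and a behavior-based score
    (both assumed in [0, 1]) into a single classroom-state score.
    w_expr is the expression weight; the behavior weight is its complement.
    """
    assert 0.0 <= w_expr <= 1.0
    return w_expr * expr_score + (1.0 - w_expr) * behav_score

# e.g. a student rated 0.9 "engaged" by expression but 0.5 by behavior:
print(round(weighted_class_assess(0.9, 0.5), 2))  # 0.74
```

A convex combination like this keeps the fused score on the same scale as its inputs, so a single threshold can still classify the overall classroom state.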
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61877006, No. 62272058), and CAAI-Huawei MindSpore Open Fund (No.CAAIXSJLJJ-2021-007B).
About this article
Cite this article
Chen, Z., Liang, M., Xue, Z. et al. STRAN: Student expression recognition based on spatio-temporal residual attention network in classroom teaching videos. Appl Intell 53, 25310–25329 (2023). https://doi.org/10.1007/s10489-023-04858-0