DMCNet: Diversified Model Combination Network for Understanding Engagement from Video Screengrabs

Engagement is an essential indicator of the Quality-of-Learning Experience (QoLE) and plays a major role in developing intelligent educational interfaces. The number of people learning through Massively Open Online Courses (MOOCs) and other online resources has been increasing rapidly because they provide the flexibility to learn from anywhere at any time, which makes for a good learning experience for students. However, such a learning interface requires the ability to recognize the level of engagement of the students for a holistic learning experience, which is useful for students and educators alike. Understanding engagement is a challenging task because of its subjectivity and the difficulty of collecting data. In this paper, we propose a variety of models trained on an open-source dataset of video screengrabs. Our non-deep-learning models are based on combinations of popular algorithms such as the Histogram of Oriented Gradients (HOG), Support Vector Machine (SVM), Scale Invariant Feature Transform (SIFT), and Speeded Up Robust Features (SURF). The deep learning methods include Densely Connected Convolutional Networks (DenseNet-121), Residual Network (ResNet-18), and MobileNetV1. We show the performance of each model using a variety of metrics such as the Gini Index, Adjusted F-Measure (AGF), and Area Under the receiver operating characteristic Curve (AUC). We use dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) to understand the distribution of data in the feature sub-space. Our work will thereby assist educators and students in obtaining a fruitful and efficient online learning experience.


Introduction
Engagement is an important part of human-technology interactions and is defined differently for a variety of applications such as search engines, online gaming platforms, and mobile health applications. Most definitions describe engagement as attention and emotional involvement in a task. This paper deals with engagement during learning via technology. Investigating engagement is vital for designing intelligent educational interfaces in different learning platforms including educational games, MOOCs, and Intelligent Tutoring Systems (ITSs). For instance, if students feel frustrated and become disengaged (see disengaged samples in Fig. 1), the system should intervene in order to bring them back to the learning process. However, if students are engaged and enjoying their tasks (see engaged samples in Fig. 1), they should not be interrupted even if they are making some mistakes. For a learning system to adapt the learning setting and provide proper responses to students, we first need to measure engagement automatically. To do so, we train several models, including a CNN, an SVM, and deeper learning methods [1], to make correct predictions about the involvement of the student. In this paper, we describe the various problems that we faced during our project and how we overcame them. We also show different techniques for extracting features from images and how we trained our machine-learning-based models.
Engagement analysis has become even more important during the difficult times of the COVID-19 pandemic. With college lectures being delivered online in most countries, it has become important for teachers to get feedback about the engagement of students during those lectures. This greatly assists teachers in figuring out why and where students become disengaged, and it helps them give more attention to students who are disengaged most of the time. Current research on engagement analysis exhibits the following two major gaps: • Lack of diversification of models and lack of efficient algorithms to detect engagement, which makes it difficult for users to find the most suitable model for the chosen platform.
• There exist few techniques and objective measures to evaluate the performance of machine-learning models, making it difficult to benchmark results.
To solve the aforementioned issues, we propose various non-deep-learning and deep-learning models. The non-deep-learning models are based on combinations of popular algorithms, viz. HOG, SVM, SIFT, and SURF. The deep learning methods include DenseNet-121, ResNet-18, and MobileNetV1. In terms of the experiments, we use several metrics, viz. the Gini Index, Adjusted F-score (AGF), and Area Under Curve (AUC), to show each model's performance. We also use dimensionality reduction techniques to illustrate the distribution of the experimental data.
The main contributions of this paper include:
• We propose various models that have been trained to identify user engagement on a given dataset and explain in detail how each model was built. In the spirit of reproducible research, all the code related to this manuscript is available via: https://github.com/WangHewei16/Video-Engagement-Analysis.
This paper is organized as follows. Section 2 analyzes and discusses the relevant research in this field. Section 3 introduces the WACV dataset which we used in our research. Section 4 describes the several models that we propose. Section 5 conducts experiments and discusses the results. Section 6 concludes this paper with future work.

Research background
In the past, research focused on using machine learning methods in the development of personalized curricula, adaptive evaluation, and recommendation systems based on learner preferences and browsing behavior, such as in [2,3,4,5]. In recent times, there has been exponential growth in the use of MOOCs [6]. They have laid the path for modern education, where students can learn whatever they want from wherever they want. Measuring students' engagement in an online class is a core issue of this study, as it is important to keep students focused in the era of online education.
After obtaining the engagement of students, educators can better supervise students' learning efficiency, and teachers can improve their teaching methods according to students' concentration and engagement data. Our research thus contributes positively to this modernization of education.

Development of facial expression recognition technology
Facial expression recognition technology plays an essential role in engagement recognition. In [7], the authors built a facial expression recognition technique based on machine learning algorithms, namely Artificial Neural Networks (ANNs) and Web Usage Mining (WUM), to create an adaptive e-learning environment. One important aspect currently missing from these online education platforms is that teachers do not get feedback on how attentive a student was, based on the student's reactions, expressions, and position.
To tackle this problem, researchers have adopted various machine learning and computer vision techniques. One study proposed a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences. PHRNN models facial morphological variations and the dynamical evolution of expressions, which is effective for extracting "temporal features" based on facial landmarks. The Computer Expression Recognition Toolbox (CERT) was used in [8] to track fine-grained facial movements, consisting of eyebrow raising (inner and outer), brow lowering, eyelid tightening, and mouth dimpling, within a naturalistic video corpus of tutorial dialogue (N = 65). Within that dataset, upper face movements were found to be predictive of engagement, frustration, and learning, while mouth dimpling was a positive predictor of learning and self-reported performance. These results highlight how both the intensity and frequency of facial expressions predict tutoring outcomes. The use of HOG and SVM to detect humans was implemented in [9]. In [10], Hernandez et al. modeled the problem of determining the engagement of a TV viewer as a binary classification problem, using multiple geometric features extracted from the face and head, and used SVMs for the classification. In [11], the authors used deep-learning-based techniques to classify facial images into different social relation traits; facial expressions are used similarly in [12]. In [13], the authors used computer vision and machine-learning techniques to detect students' affect from facial expressions (primary channel) and gross body movements (secondary channel) during interactions with an educational physics game. We use one of the publicly available datasets, provided by [14]. In [15], the authors achieve facial expression detection using a CNN-based approach. In [16], the authors design a user engagement prediction framework and specific video quality metrics to help content providers predict how long viewers tend to remain in video sessions. In [17], the authors introduce a way to detect the emotion of people in videos, achieving accuracy greater than 60% with high reliability through industry-level face recognition networks. In [18], the authors introduce a preliminary model that can help students make course choices faster and provide better-personalized services. However, how to provide a better personalized and adaptive e-learning environment is still a challenge that has attracted many researchers. In [19], a coupled integration between a CNN and a fuzzy system is adopted: the CNN is used to detect a learner's facial expressions, and the fuzzy system determines the next learning level based on the facial expression states extracted by the CNN and several response factors from the learner.

Development of engagement recognition technology and research motivation
In [20], features like heart rate, Animation Units (from the Microsoft Kinect Face Tracker), and Local Binary Patterns in Three Orthogonal Planes (LBP-TOP) were used in supervised learning for the detection of concurrent and retrospective self-reported engagement. Theoretical analyses of engagement in [21,22] discuss intrinsic motivation, flow theory, focused attention, positive psychology, and social enterprise; these studies identify factors and theories that influence human engagement. Facial features are used in [23] to automatically detect student engagement, and the results demonstrated initial success in this field. An important factor in solving challenging problems like determining engagement is a well-annotated dataset. In [24], the authors discussed, for affective state annotation, first, how the socio-cultural background of human expert labelers, compared to that of the subjects, impacts the degree of consensus and the distribution of affective states obtained, and secondly, how differences in labeler background impact the performance of detection models trained using these labels. In [25], the authors present a deep learning model to improve engagement recognition from images that overcomes the data sparsity challenge by pre-training on readily available basic facial expression data before training on specialized engagement data. In the first step, a facial expression recognition model is trained to provide a rich face representation using deep learning. In the second step, the model's weights are used to initialize a deep-learning-based model to recognize engagement. The most commonly used facial expressions are happiness, sadness, anger, disgust, fear, and surprise.
There is some interesting research such as [26], where the authors introduced a recognition model with high accuracy that uses t-SNE to improve the performance of the model; the emotion of people can also be analyzed. Nezami et al. have performed relevant research in [27,28], using deep learning methods based on CNN models to recognize facial expressions and engagement. Current engagement recognition research, on the other hand, reveals two major flaws. The first gap is a lack of model diversity as well as of efficient model creation procedures; as a result, users will have a tough time finding the best model for their platform. The second flaw is that there are only a few types of assessment criteria for analyzing model performance, making it difficult to come up with compelling findings. We are motivated to propose alternative models, including non-deep-learning and deep learning models, to cover these gaps. The non-deep-learning approaches rely on a mix of several computer vision features. The deep learning methods include DenseNet-121, ResNet-18, and MobileNetV1. The reason for using these algorithms on the engagement detection datasets is that they are well-known and widely used techniques in the face recognition field. As a result, we are interested in systematically combining these techniques so that we can clearly examine the influence and performance of each model, and then choose the optimal model through several tests and experiments. In terms of the experiments, we employ numerous measures to illustrate the performance of each model, including the Gini Index, AGF, and AUC. We also employ dimensionality reduction techniques such as PCA and t-SNE to depict the distribution of the experimental data in a more thorough and varied manner.
In terms of managerial insights, our research team, the online education platform, and its users need to reach an agreement on privacy and profitability, and then reasonably apply the model on the platform to measure students' engagement.

Engagement Analysis Datasets
This section introduces the related datasets in this field and describes the WACV dataset which we used in our research. The public datasets available include:
• DAiSEE, which consists of four classes (engaged, frustration, boredom, confusion), with images generated from 9068 videos and 112 subjects (80 male and 32 female).
• HBCU, which also consists of four classes (not engaged, nominally engaged, engaged, very engaged), with images generated from 120 videos and 34 subjects (9 male and 25 female).
• In-the-wild, which has four classes (disengaged, barely engaged, normally engaged, highly engaged), with images generated from 195 videos and 78 subjects (53 male and 25 female).
• SD-MATH, which has been generated from 20 videos and 20 subjects (10 male and 10 female).
Now we discuss how we analyzed the engagement of students. We used the WACV dataset, a public dataset, for our work. The dataset has three different classes: disengaged, partially engaged, and engaged. It contains a total of 4424 3-channel images of varying sizes; we reshape all the images to the same shape of (100, 100, 3). The dataset is not balanced: it has 412 images belonging to the class 'disengaged', 2247 images belonging to the class 'partially engaged', and 1765 images belonging to the class 'engaged'.
We first randomly selected 412 images each from class 2 (partially engaged) and class 3 (engaged) so that each class has 412 images. We split this data into training (80%) and testing (20%) sets. We used different models to fit this balanced data and finally chose the one which provided the best test accuracy. Figure 1 shows the three classes of the WACV dataset. The dataset we used is available at https://github.com/e-drishti/wacv2016.
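As an illustration, a minimal sketch of this balancing and splitting step is given below, assuming the images and labels have already been loaded into NumPy arrays; the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

def balance(images, labels, per_class=412):
    """Randomly undersample every class to `per_class` images."""
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        keep.extend(rng.choice(idx, size=per_class, replace=False))
    keep = np.array(keep)
    return images[keep], labels[keep]

# images: (4424, 100, 100, 3) array; labels: class per image (hypothetical names)
X, y = balance(images, labels)

# 80/20 train/test split, stratified so each class stays balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```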
In [29], the authors used the HBCU dataset for the automatic detection of learner engagement; related work used an SVM to detect the deictic tip, achieving a TPR of 85%. In [31], the authors used the DAiSEE dataset for engagement detection using three different Convolutional Neural Network (CNN) models: InceptionNet, C3D, and Long-Term Recurrent Convolutional Network (LRCN). The models were applied to detect boredom, engagement, confusion, and frustration, where InceptionNet achieved accuracies of 36.5%, 47.1%, 70.3%, and 78.3% on these engagement levels, respectively. C3D and LRCN achieved accuracies of 45.2%, 56.1%, 66.3%, 79.1% and 53.7%, 61.3%, 72.3%, 73.5%, respectively. In [32], the authors used the in-the-wild dataset for their performance evaluation. This method employed three-fold cross-validation with a multiple kernel learning (MKL) SVM, and the average and maximum accuracies obtained were 43.98% and 50.77%, respectively.

CNN-based approach
As the training data is quite small, it does not make sense to use a very deep neural network, as that can lead to overfitting. We therefore used a CNN model with three convolutional layers, which have 3, 6, and 9 kernels respectively, each using a ReLU non-linearity and followed by max-pooling with stride 2, and finally a fully connected layer feeding a softmax layer. We chose the MSE loss function as our criterion and used the Adam optimizer with learning rate $10^{-4}$. We ran 20 experiments; in each experiment we generated our balanced data using random sampling and trained our model for 10 epochs. After training, we tested the model on the test data and stored the accuracy in a list, and after all experiments were completed we calculated the average accuracy. Fig. 2 shows the flowchart of the CNN-based model.
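A minimal PyTorch sketch of this architecture follows; the kernel size and padding are assumptions, since the text only fixes the kernel counts, pooling, loss, and optimizer.

```python
import torch
import torch.nn as nn

class EngagementCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 100 -> 50
            nn.Conv2d(3, 6, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 50 -> 25
            nn.Conv2d(6, 9, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 25 -> 12
        )
        self.classifier = nn.Linear(9 * 12 * 12, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        # Softmax output so the MSE criterion compares against one-hot targets.
        return torch.softmax(self.classifier(x), dim=1)

model = EngagementCNN()
criterion = nn.MSELoss()   # MSE between softmax outputs and one-hot labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```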

HOG
Local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions. HOG features are calculated by taking orientation histograms of edge intensity in local regions.
In this paper, we extract HOG features from 8x8 local regions. First, edge gradients and orientations are calculated at each pixel in the local region.
Sobel filters are used to obtain the edge gradients and orientations. The gradient magnitude $\mathrm{mag}(x, y)$ and orientation $\phi(x, y)$ are calculated from the x- and y-directional gradients $dx(x, y)$ and $dy(x, y)$ computed by the Sobel filter as:

$$\mathrm{mag}(x, y) = \sqrt{dx(x, y)^2 + dy(x, y)^2}$$

$$\phi(x, y) = \begin{cases} \tan^{-1}\left(\dfrac{dy(x, y)}{dx(x, y)}\right) & \text{if } dx(x, y) > 0 \\[4pt] \tan^{-1}\left(\dfrac{dy(x, y)}{dx(x, y)}\right) - \pi & \text{if } dx(x, y) < 0 \text{ and } dy(x, y) < 0 \\[4pt] \tan^{-1}\left(\dfrac{dy(x, y)}{dx(x, y)}\right) + \pi & \text{if } dx(x, y) < 0 \text{ and } dy(x, y) > 0 \end{cases}$$

This local region is divided into small spatial areas called "cells".
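As an illustration of this gradient step, the following sketch uses OpenCV Sobel filters together with NumPy's quadrant-aware arctangent, which implements the piecewise definition above; the input filename is hypothetical.

```python
import cv2
import numpy as np

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
dx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # x-directional gradient
dy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # y-directional gradient
mag = np.sqrt(dx ** 2 + dy ** 2)                  # gradient magnitude
phi = np.arctan2(dy, dx)                          # orientation in (-pi, pi]
```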

HOG+SVM
As the name suggests, we use the HOG algorithm to extract a feature vector for each training image. This represents each image as a vector with 1236 elements. To perform the classification task, we then train an SVM model on these feature vectors. After training, we use the test data to calculate the performance metrics. Fig. 4 shows the flowchart of the HOG+SVM model.
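A minimal sketch of this pipeline, using scikit-image's HOG implementation and a scikit-learn SVM; the HOG parameters and the kernel choice are illustrative assumptions, not the exact configuration that yields the 1236-element vectors.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog(images):
    """Compute one HOG descriptor per image (parameters are illustrative)."""
    feats = [hog(rgb2gray(img), orientations=9,
                 pixels_per_cell=(8, 8), cells_per_block=(2, 2))
             for img in images]
    return np.array(feats)

# X_train, y_train, X_test, y_test: balanced split from the earlier sketch.
clf = SVC(kernel="rbf")                  # kernel choice is an assumption
clf.fit(extract_hog(X_train), y_train)
accuracy = clf.score(extract_hog(X_test), y_test)
```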

DenseNet-121
DenseNet-121 has 121 layers. The DenseNet model can be seen as a variation of the ResNet model because of the dense residual connections between the convolutional layers within a single dense block. The paper [33] illustrates that the dense connections between multiple layers preserve the feed-forward nature: each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers, and the model has fewer parameters than the one proposed in [34]. Fig. 8 shows the architecture of a 5-layer dense block.
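To make the dense connectivity concrete, here is a toy PyTorch sketch of a dense block in the spirit of Fig. 8; it illustrates the idea only and is not the DenseNet-121 implementation used in our experiments (the layer count and growth rate here are arbitrary).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        # Layer i sees the channels of the input plus all i earlier layers.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            # Concatenate new feature maps with every preceding input.
            x = torch.cat([x, layer(x)], dim=1)
        return x
```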

ResNet-18
ResNet-18 has 18 layers and is the smallest ResNet model.
The paper [34] points out that deep CNNs are hard to optimize, and introduces a network with residual shortcuts to mitigate this problem; the shortcuts deliver features from earlier layers directly to the deeper layers and thereby enable the training of deep CNNs. Fig. 9 shows the building block of ResNet. Table 1 shows the parameters chosen in our experiment. Fig. 10 shows the structure of the "Residual Block" under two conditions.
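As a sketch of this building block, the following PyTorch module implements a basic residual block covering the two conditions of Fig. 10 (an identity shortcut when shapes match, and a 1x1 projection otherwise); the hyper-parameters are illustrative, not those of Table 1.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Condition 1: identity shortcut. Condition 2: 1x1 projection
        # when the spatial size or channel count changes.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride,
                          bias=False),
                nn.BatchNorm2d(out_channels),
            )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```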

MobileNetV1
MobileNetV1 uses depthwise separable convolutions to reduce the parameter count and increase inference speed, which enables the deployment of image recognition algorithms on mobile devices. The depthwise convolution is a special case of group convolution in which the number of groups equals the number of input channels. The paper [35] points out that the computational cost of a standard convolution can be denoted as:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

while the computational cost of a depthwise separable convolution can be denoted as:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$

In Formulas 3 and 4, $D_K$ denotes the size of the kernel, $D_F$ denotes the size of the feature map, $M$ denotes the number of input channels, and $N$ denotes the number of output channels. These formulae show that the computational complexity of the depthwise separable convolution is lower than that of the standard convolution. In our implementation, we use BasicConv2d and DepthwiseConv2d as the basic modules of MobileNet. Fig. 12 shows the structure of BasicConv2d and DepthwiseConv2d, and the parameter configuration is given in Table 2.
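A hedged sketch of these two modules follows, assuming standard 3x3 depthwise and 1x1 pointwise convolutions; the actual channel and stride configuration is the one given in Table 2.

```python
import torch.nn as nn

class BasicConv2d(nn.Module):
    """Standard 3x3 convolution + BatchNorm + ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x):
        return self.block(x)

class DepthwiseConv2d(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 then pointwise 1x1."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # Depthwise: one 3x3 filter per input channel (groups = in_ch).
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(),
            # Pointwise: 1x1 convolution mixes channels.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x):
        return self.block(x)
```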

Subjective Evaluation
Table 3 shows the results of the subjective evaluation. We compute several objective metrics [36], namely accuracy, the Gini Index, the Adjusted F-score, and the Area Under Curve, for benchmarking purposes. The following is our explanation of each metric used in the experiment:

1. Accuracy (ACC): let the total number of images in the test data be $\tau$ and the total number of correct predictions be $\Omega$; accuracy can then be expressed as:

$$\mathrm{ACC} = \frac{\Omega}{\tau}$$

2. Gini Index:

$$G = 1 - \sum_i P_i^2$$

where $P_i$ denotes the probability of an element being classified into a distinct class.

3. Adjusted F-Score (AGF): computed by taking the geometric mean of $F_2$ and $\mathrm{invF}_{0.5}$:

$$\mathrm{AGF} = \sqrt{F_2 \cdot \mathrm{invF}_{0.5}}$$

In this experiment, we use sensitivity and precision to evaluate the performance of the model:

$$F_2 = \frac{5 \cdot \mathrm{precision} \cdot \mathrm{sensitivity}}{4 \cdot \mathrm{precision} + \mathrm{sensitivity}}, \qquad \mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}$$

where $\mathrm{invF}_{0.5}$ is the $F_{0.5}$ score computed after inverting the class labels.

4. Area Under Curve (AUC): the area under the receiver operating characteristic curve.
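As an illustration, a minimal binary-case sketch of these metrics is given below, assuming NumPy arrays of 0/1 labels; the exact multi-class treatment used in our experiments is not reproduced here.

```python
import numpy as np
from sklearn.metrics import fbeta_score

def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)          # Omega / tau

def gini_index(y_pred):
    """Gini impurity 1 - sum(P_i^2) of the predicted class distribution."""
    _, counts = np.unique(y_pred, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def agf(y_true, y_pred):
    """Geometric mean of F_2 and F_0.5 on label-inverted predictions."""
    f2 = fbeta_score(y_true, y_pred, beta=2)
    inv_f05 = fbeta_score(1 - y_true, 1 - y_pred, beta=0.5)
    return np.sqrt(f2 * inv_f05)
```

Next, we explain in detail the dimensionality reduction techniques that have been used to understand the reason behind the low accuracy.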

Principal Component Analysis
Amongst the several feature selection and dimensionality reduction techniques [37], Principal Component Analysis (PCA) is widely used. The principal components in PCA capture the most variation in a dataset; PCA deals with the curse of dimensionality by capturing the essence of the data in a few principal components. We now outline the PCA algorithm:

1. Given $x_1, x_2, \ldots, x_m$ in $\mathbb{R}^n$, we try to represent them in $\mathbb{R}^l$ where $l < n$. Each $x_i$ represents a vector that we have created from one of our training images. In our case, $n$ is 30000 and $l$ is 2; that is, we bring a vector down from $\mathbb{R}^{30000}$ to $\mathbb{R}^2$.

2. We aim to find an encoder function $f$ and a decoder function $g$ that take each $x_i$ from $\mathbb{R}^n$ to $\mathbb{R}^l$ and from $\mathbb{R}^l$ back to $\mathbb{R}^n$, respectively. Writing $f(x) = c$ for the code, one choice of decoding function is $g(c) = Dc$, where $D$ is a matrix with $l$ mutually orthogonal columns, chosen to minimize the distance between $x$ and the reconstruction $r(x) = g(f(x)) = DD^{T}x$. The optimal $c^*$ and $D^*$ are given as:

$$c^* = \arg\min_{c} \lVert x - g(c) \rVert_2^2 = D^{T}x$$

$$D^* = \arg\min_{D} \sum_{i} \lVert x_i - DD^{T}x_i \rVert_2^2 \quad \text{subject to } D^{T}D = I_l$$

Solving the above equations, we observe that $D$ is the matrix whose columns are the top $l$ eigenvectors of $X^{T}X$, where $X \in \mathbb{R}^{m \times n}$ is the design matrix.
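A short sketch of this projection, assuming X is the array of balanced images from the earlier sketches: flattened (100, 100, 3) images give n = 30000-dimensional vectors, reduced to l = 2 components for visualization.

```python
from sklearn.decomposition import PCA

# X: (m, 100, 100, 3) array of images (as in the earlier sketches).
X_flat = X.reshape(len(X), -1)        # (m, 30000) design matrix
pca = PCA(n_components=2)             # columns of D: top-2 eigenvectors
X_2d = pca.fit_transform(X_flat)      # encoder f(x) = D^T (x - mean)
```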

t-SNE
t-SNE is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or more dimensions suitable for human observation. The details of the t-SNE algorithm are as follows.

1. Given a set of $N$ high-dimensional objects $x_1, x_2, \ldots, x_N$, t-SNE first computes probabilities $p_{j|i}$ that are proportional to the similarity of objects $x_i$ and $x_j$. For $i \neq j$, define:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

and set $p_{i|i} = 0$. Note that $\sum_j p_{j|i} = 1$ for all $i$. Now define:

$$P_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.
2. t-SNE aims to learn a $d$-dimensional map $y_1, y_2, \ldots, y_N$ (with $y_i \in \mathbb{R}^d$) that reflects the similarities $P_{ij}$ as well as possible. To this end, it measures similarities $Q_{ij}$ between pairs of points $y_i$ and $y_j$ in the map using a similar approach. Specifically, for $i \neq j$, define:

$$Q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

and set $Q_{ii} = 0$.
The locations of the points $y_i$ in the map are determined by minimizing the (non-symmetric) Kullback-Leibler divergence of the distribution $Q$ from the distribution $P$, that is:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$

The minimization of the Kullback-Leibler divergence with respect to the points $y_i$ is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
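A matching sketch for the t-SNE map, under the same assumptions as the PCA sketch; the perplexity value, which controls the per-point bandwidth $\sigma_i$, is a common default rather than necessarily the one used in our experiments.

```python
from sklearn.manifold import TSNE

# X_flat: (m, 30000) flattened images, as in the PCA sketch. In practice a
# PCA pre-reduction is often applied first to speed up t-SNE.
tsne = TSNE(n_components=2, perplexity=30.0, random_state=0)
Y_2d = tsne.fit_transform(X_flat)   # minimizes KL(P || Q) by gradient descent
```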

Reasons for low performance
In Fig. 13 and Fig. 14, we can see that the distributions followed by the partially engaged and engaged classes are very similar, so our techniques cannot differentiate between an image that is partially engaged and one that is engaged. This tells us that there are no hard boundaries that we can define to clearly separate the two classes from each other. However, the distributions of disengaged and engaged are very different, so it is easy to classify between them. The accuracy that we get is not very high, and one main reason for this is the distribution of the data points: from the plots, we can see that the distributions of engaged and partially engaged are nearly identical, hence resulting in low accuracy.

Discussion
In Fig. 15, in terms of the non-deep-learning methods, we can see the range of accuracies obtained by the different methods. In the case of HOG+CNN, the accuracy for each experiment is nearly the same, at about 34%. We get the highest accuracy among these techniques with HOG+SVM, at 63%; the other techniques work more or less in the same manner, but HOG+SVM performs much better. In terms of the deep learning methods, both ResNet-18 and DenseNet-121 have good accuracy performance, at around 80% and 78% respectively. The performance of MobileNetV1 is 66%, which is a little better than HOG+SVM.

Conclusions and future work
In this paper, we explored different techniques for extracting features from images and trained our models on those features. We used different feature descriptors and keypoint detectors to generate training vectors. PCA and t-SNE tell us about the representation of our data, which helps us analyze why our techniques perform the way they do. Our results indicate that the deep-learning methods have better performance, with ResNet-18 possessing the best accuracy among the three deep-learning models. Amongst the non-deep-learning methods, HOG+SVM has the best performance. We recommend that online education platforms select deep- or non-deep-learning methods according to their own conditions and needs. The limitation of our research is that our techniques cannot differentiate between a human that is partially engaged and one that is fully engaged, because there are no clear, hard boundaries that can be defined to cleanly separate the two classes.
For future work, we intend to further study deep-learning methods for engagement detection and to work on various other datasets to further benchmark our proposed models. In particular, we will focus on combining advanced deep-learning building blocks such as the vision transformer and RepVGG. We also plan to embed the proposed models into online education platforms to observe their practical role.