A dataset for assessing real-time attention levels of the students during online classes

This dataset offers a comprehensive compilation of attention-related features captured during online classes. The dataset is generated through the integration of key components including face detection, hand tracking, head pose estimation, and mobile phone detection modules. The data collection process involves leveraging a web interface created using the Django web framework. Video frames of participating students are collected following institutional guidelines and informed consent through their webcams, subsequently decomposed into frames at a rate of 20 FPS, and transformed from BGR to RGB color model. The aforesaid modules subsequently process these video frames to extract raw data. The dataset consists of 16 features and one label column, encompassing numerical, categorical, and floating-point values. Inherent to its potential, the dataset enables researchers and practitioners to explore and examine attention-related patterns and characteristics exhibited by students during online classes. The composition and design of the dataset offer a unique opportunity to delve into the correlations and interactions among face presence, hand movements, head orientations, and phone interactions. Researchers can leverage this dataset to investigate and develop machine learning models aimed at automatic attention detection, thereby contributing to enhancing remote learning experiences and educational outcomes. The dataset in question also constitutes a highly valuable resource for the scientific community, enabling a thorough exploration of the multifaceted aspects pertaining to student attention levels during online classes. Its rich and diverse feature set, coupled with the underlying data collection methodology, provides ample opportunities for reuse and exploration across multiple domains including education, psychology, and computer vision research.


This, in turn, serves as a driving force for advancing the entire field of automated attention detection, leading to the development of more sophisticated and accurate models. Moreover, this innovative potential has a ripple effect, significantly enhancing the overall efficiency and efficacy of online educational platforms.

Background
The primary aim of this study is to present a comprehensive and multifaceted dataset assessing student attention levels in online classes, featuring face detection, hand tracking, phone usage, and head pose estimation. Another important objective of this paper is to offer a valuable resource for studying the impact of attention patterns on virtual learning outcomes, advancing online teaching methods, and fostering innovation in educational technology by providing benchmark data for attention-detection algorithms and real-time monitoring tools that enhance online education quality.

Data Description
In an era dominated by digital learning, online classes have become a ubiquitous mode of education, and ensuring students remain engaged and attentive during these virtual sessions presents a substantial challenge for educators. However, the scarcity of datasets capturing attention dynamics during online classes hinders research and innovation to address this challenge. A dedicated dataset addressing this gap is essential to unravel the nuances of attention patterns, enabling educators and researchers to optimize teaching methods, devise interventions, and develop technology-driven solutions. This dataset bridges the gap between attention and learning, facilitating a deeper understanding of student engagement and interaction in the online classroom environment.
As stated above, the developed dataset [1] is designed to facilitate the analysis of student attention and engagement behaviors through various features extracted from multi-modal data sources. The dataset consists of 17 columns: 16 columns hold features derived from the different detection and estimation modules, and one column serves as the label indicating attention status. Each row represents a session, capturing student interactions and behaviors during educational activities. Table 1 provides a concise description of the feature and label columns of the dataset along with their data types. Table 2 offers a succinct summary of the attention labels in the dataset, presenting a binary classification of attention states that provides valuable insights into student participation and engagement during the learning process.
Table 3 summarizes essential statistics of the dataset, including the mean, standard deviation, minimum, median, and maximum, for selected numerical columns, offering insight into the distribution of various important features.
The pie chart in Fig. 1 provides an overview of the class balance of the dataset, showing the proportion of attentive and inattentive instances.
The dataset includes instances with varying counts of recognized faces and hands during online class sessions. Specifically, there are 28 instances with two detected faces, 3901 instances with one detected face, and 71 instances with no detected faces. In terms of detected hands, two hands are detected in 975 instances, one hand in 763 instances, and no hands in 2262 instances.
The box plot in Fig. 2 visualizes the distribution of confidence scores for the face detection and phone detection modules, providing insight into the quality of the detections; note that the many outlying face detection scores are plotted as individual points.
Fig. 3 illustrates the correlation between detected face coordinates (face_x, face_y) and attention labels, uncovering potential patterns associated with face position and attention levels in the dataset.
Within the dataset, the 'pose' feature categorizes head orientation as 'forward,' 'down,' 'left,' or 'right' based on specific criteria applied to the 'pose_x' and 'pose_y' features (in degrees), offering valuable insight into how students' head positioning during online classes relates to their attention levels. Specifically, the dataset exhibits 2951 occurrences of 'forward,' 537 of 'down,' 334 of 'left,' and 178 of 'right.'
The dataset is stored within the data repository as a single file named 'attention_detection_dataset_v1.csv'. As previously highlighted, this file comprises 16 distinct features of integer, float, and categorical types, alongside a dedicated label column that plays a pivotal role in the classification process.
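As a quick illustration of working with the file, the sketch below parses a hypothetical two-row excerpt with Python's standard csv module; the column subset and the sample values are invented for demonstration and do not come from the actual file.

```python
import csv
import io

# Hypothetical excerpt using a subset of the 17 columns described above;
# in practice one would open 'attention_detection_dataset_v1.csv' instead.
sample = io.StringIO(
    "no_of_face,no_of_hands,pose,phone,label\n"
    "1,2,forward,0,0\n"
    "1,0,down,1,1\n"
)

rows = list(csv.DictReader(sample))
attentive = sum(1 for r in rows if r["label"] == "0")
print(len(rows), attentive)  # 2 1
```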

Data acquisition
A comprehensive overview of the dataset generation process is provided in Fig. 4, showcasing the integration of four fundamental components: the face detection, hand tracking, head pose estimation, and mobile phone detection modules. Our approach involved the development of a user-friendly web interface built on the Django framework [2], ensuring a seamless and efficient data collection process. This interface facilitated the collection of video frames from students actively participating in online classes of a computer science course. Maintaining uniformity, these video frames were extracted at a consistent rate of 20 frames per second (FPS) and transformed from the BGR to the RGB color model. As part of our commitment to ethical data collection, every step of the process was conducted in strict adherence to pertinent legal regulations and institutional protocols. Prior to data collection, comprehensive informed consent was diligently sought from all participating students, ensuring their rights and privacy were upheld throughout the study. The subsequent sub-sections offer succinct insights into the role of each component and their collaborative contribution to capturing attention-related attributes of the students.
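The BGR-to-RGB step above is a simple per-pixel channel reversal; in an OpenCV pipeline it would typically be done with cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), but the transformation itself can be sketched without any dependencies (the helper name and the nested-list frame representation are illustrative assumptions):

```python
def bgr_to_rgb(frame):
    """Reverse the channel order of every pixel in a frame,
    represented here as nested lists of (B, G, R) tuples."""
    return [[pixel[::-1] for pixel in row] for row in frame]

# A pure-blue pixel stored in BGR order (B=255, G=0, R=0) ...
frame_bgr = [[(255, 0, 0)]]
# ... comes out as (R, G, B) = (0, 0, 255) after the swap.
print(bgr_to_rgb(frame_bgr))  # [[(0, 0, 255)]]
```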

Face detection module
We designed a web interface to capture video frames of students engaged in online classes via their webcams. Subsequently, each captured frame undergoes analysis by our face detection module, which leverages the MediaPipe Face Detection solution [3], based on BlazeFace, to precisely localize faces and their corresponding facial attributes within the frame. The output of the face detector consists of bounding boxes encompassing the identified facial regions within the image frame.
The face detection module computes the 'no_of_face' feature value for the frame by tallying the number of bounding boxes generated. Additionally, crucial facial characteristics of the detected face, namely the 'face_x', 'face_y', 'face_w', and 'face_h' feature values, are extracted from the 'xmin', 'ymin', 'width', and 'height' attributes of the respective bounding box. To further enrich the feature set, the 'face_con' feature value is derived from the 'score' attribute returned by the face detector, indicating the confidence level that a face is present. These pivotal feature values collectively contribute to a comprehensive understanding of the detected faces within the frame, as visually depicted in Fig. 5.
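The mapping from the detector output to these feature columns can be sketched as follows. The function name and the dict layout are assumptions of this sketch; it also assumes the MediaPipe convention of relative (0..1) bounding-box coordinates that must be scaled by the frame size.

```python
def face_features(detections, frame_w, frame_h):
    """Map face detections to the dataset's face feature columns.
    `detections` is a list of dicts with relative-coordinate keys
    'xmin', 'ymin', 'width', 'height' (0..1) plus a 'score'."""
    features = {"no_of_face": len(detections)}
    if detections:
        d = detections[0]  # take the first detection for the sketch
        features.update({
            "face_x": d["xmin"] * frame_w,
            "face_y": d["ymin"] * frame_h,
            "face_w": d["width"] * frame_w,
            "face_h": d["height"] * frame_h,
            "face_con": d["score"],
        })
    return features

dets = [{"xmin": 0.25, "ymin": 0.10, "width": 0.30, "height": 0.40, "score": 0.97}]
print(face_features(dets, 640, 480))
# {'no_of_face': 1, 'face_x': 160.0, 'face_y': 48.0, 'face_w': 192.0, 'face_h': 192.0, 'face_con': 0.97}
```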

Hand tracking module
Following the processing conducted by the face detection module, the subsequent step involves subjecting each gathered frame to analysis by our hand tracking module. This module makes use of the MediaPipe Hands solution [4], a sophisticated tool enabling the precise localization of key points on the hands. Furthermore, it facilitates the application of visual enhancements directly onto the hands within the frame, enhancing their visibility.
The output of the hand landmark model encompasses three essential components: first, the landmark coordinates of the detected hands in image space; second, the landmark coordinates of the detected hands in a global reference frame (world coordinates); and last, the hand dominance (handedness) of each detected hand. However, in our specific application, we exclusively use the landmark coordinates of the detected hands in image coordinates.
By employing these landmarks, we are able to accurately ascertain the number of hands detected within the frame. This count is then used to populate the 'no_of_hands' feature value. A visual representation of this process can be found in Fig. 6, providing a clear illustration of the functioning of the hand tracking module and its contribution to the overall feature collection process. The figure displays the 3D coordinates of the WRIST and PINKY_TIP landmarks, two of the 21 3D landmarks detected on the hand.

Head pose estimation module
Subsequent to this, each frame undergoes analysis within our head pose estimation module. This module leverages the MediaPipe Face Mesh [3] to discern the facial landmarks and employs the solvePnP() function from OpenCV [5] to deduce the head pose. From the facial landmarks, we extract both 2-dimensional and 3-dimensional coordinates for key points such as the nose tip, chin, eye corners, and mouth corners. Using properties of the webcam image, such as the focal length, center, and distortion coefficients, we compute the camera matrix. The rotation vector is then computed by applying the solvePnP() function, which takes the 2D and 3D coordinates along with the camera matrix. This vector is transformed into a rotation matrix using the Rodrigues() function of OpenCV. Subsequently, the Euler angles of rotation are derived from this matrix through the RQDecomp3x3() function in OpenCV. These Euler angles encapsulate the pose of the head.
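The final step of the chain, extracting Euler angles from a 3x3 rotation matrix, is what RQDecomp3x3() performs. The self-contained sketch below shows that extraction for the common Z-Y-X (yaw-pitch-roll) convention; OpenCV's implementation may differ in sign and axis conventions, so this illustrates the underlying math rather than being a drop-in replacement.

```python
import math

def euler_from_rotation(R):
    """Extract (x, y, z) Euler angles in degrees from a rotation matrix
    assumed to factor as R = Rz(z) @ Ry(y) @ Rx(x)."""
    x = math.atan2(R[2][1], R[2][2])
    y = math.atan2(-R[2][0], math.hypot(R[2][1], R[2][2]))
    z = math.atan2(R[1][0], R[0][0])
    return tuple(math.degrees(a) for a in (x, y, z))

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Round trip: build R from known angles, then recover them.
x, y, z = math.radians(-12), math.radians(20), math.radians(5)
R = matmul(rot_z(z), matmul(rot_y(y), rot_x(x)))
print(euler_from_rotation(R))  # approximately (-12.0, 20.0, 5.0)
```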
Thus, the head pose estimation module contributes the 'pose', 'pose_x', and 'pose_y' feature values to the dataset, with 'pose_x' and 'pose_y' expressed in degrees. Notably, the 'pose' feature takes on the categorical values forward, down, left, and right, classifying the head orientation according to the following criteria:

• If 'pose_y' is greater than 15°, the pose is labeled as 'right.'
• If 'pose_y' is less than −10°, the pose is labeled as 'left.'
• If 'pose_x' is less than −10°, the pose is labeled as 'down.'
• Otherwise, the pose is labeled as 'forward.'

These stages are visually expounded in Fig. 7, encapsulating the essence of the head pose estimation process.
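The four labeling rules can be captured in a small function. The function name is illustrative; note that the order of the checks encodes the rules' priority, so 'right' and 'left' are tested before 'down'.

```python
def classify_pose(pose_x, pose_y):
    """Label head orientation from Euler angles in degrees,
    following the four threshold rules stated in the text."""
    if pose_y > 15:
        return "right"
    if pose_y < -10:
        return "left"
    if pose_x < -10:
        return "down"
    return "forward"

print(classify_pose(-2.0, 3.5))   # forward
print(classify_pose(-15.0, 0.0))  # down
print(classify_pose(0.0, 20.0))   # right
print(classify_pose(0.0, -12.0))  # left
```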

Mobile phone detection module
Following its processing by the head pose estimation module, each frame undergoes a concluding analysis within our mobile phone detection module. This module capitalizes on YOLOv7 [6], a real-time object detection algorithm of the YOLO (You Only Look Once) family, to discern the presence of a mobile phone within the frame. YOLOv7 is our chosen algorithm because it has been pretrained on the extensive MS COCO dataset [7], which comprises over 200,000 labeled images spanning 80 distinct classes, including the "cell phone" class.
In this process, the frame serves as input to the YOLOv7 model, which delineates a bounding box around any identified mobile phone and furnishes high-level attributes such as the x-coordinate, y-coordinate, width, and height of the phone, along with the confidence score of the detection. To ensure precision, we require a confidence score greater than 0.6 before an object is accepted as a mobile phone. Upon meeting this threshold, the 'phone' feature value is set to 1, signifying the presence of a mobile phone. Consequently, the 'phone_x', 'phone_y', 'phone_w', 'phone_h', and 'phone_con' values within the dataset align with the corresponding attributes returned by the YOLOv7 model. An operational snapshot of this module in the visual context of the students is portrayed in Fig. 8, encapsulating the essence of mobile phone detection within the overall framework.
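The thresholding logic can be sketched as a small helper. The function name and the tuple layout of the detections are assumptions of this sketch; the real pipeline reads these attributes from the YOLOv7 output.

```python
def phone_features(detections, threshold=0.6):
    """Map 'cell phone' detections to the dataset's phone columns,
    applying the confidence threshold described above. `detections`
    is a list of (x, y, w, h, confidence) tuples."""
    best = max((d for d in detections if d[4] > threshold),
               key=lambda d: d[4], default=None)
    if best is None:
        return {"phone": 0}
    x, y, w, h, con = best
    return {"phone": 1, "phone_x": x, "phone_y": y,
            "phone_w": w, "phone_h": h, "phone_con": con}

print(phone_features([(120, 300, 60, 110, 0.85)]))
# {'phone': 1, 'phone_x': 120, 'phone_y': 300, 'phone_w': 60, 'phone_h': 110, 'phone_con': 0.85}
print(phone_features([(120, 300, 60, 110, 0.40)]))  # below threshold
# {'phone': 0}
```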

Class labeling
Finally, we consider the following criteria as stated in Table 4 for assigning '0' (Attentive) and '1' (Inattentive) labels to the attention detection dataset based on the information provided by the face detection, hand tracking, head pose estimation, and phone detection modules.
We also use a combination of different factors such as head pose, phone detection, hand count, and more to assign labels. For example, if both phone detection and head pose indicate inattention, we label the instance as Inattentive (1).
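A simplified illustration of such a combination rule, treating any single module's inattention signal as decisive, might look as follows. The actual weighting used for the published labels may differ, so this is an assumption-laden sketch rather than the exact labeling logic.

```python
def label_instance(no_of_face, no_of_hands, pose, phone):
    """Illustrative combination of the module criteria from Table 4:
    return 1 (Inattentive) if any module signals inattention,
    otherwise 0 (Attentive)."""
    if no_of_face == 0:
        return 1  # no face detected in the frame
    if phone == 1:
        return 1  # phone detected and positioned to distract
    if pose in ("left", "right", "down"):
        return 1  # head turned away from the screen
    if no_of_hands == 0:
        return 1  # no hands detected near the face
    return 0

print(label_instance(1, 2, "forward", 0))  # 0 (Attentive)
print(label_instance(1, 0, "down", 1))     # 1 (Inattentive)
```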

Data preprocessing
The dataset is designed for the evaluation and training of various machine learning algorithms, including Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), AdaBoost, Gradient Boosting (GB), and Extreme Gradient Boosting (XGBoost) [8]. These algorithms aim to accomplish the task of attention detection, thereby supporting active engagement, participation, and overall improvement of students during online classes.
Categorical variables have been transformed into numerical equivalents using techniques like one-hot encoding or label encoding, making them compatible with machine learning algorithms. Some features have undergone maximum absolute scaling with the MaxAbsScaler() function of Scikit-learn [9], individually adjusting each feature to a maximum absolute value of 1.0 within the dataset. Additionally, the 'face_con' values have been normalized to the [0.0, 1.0] range to facilitate processing. The dataset has been partitioned into training and testing subsets with an 80:20 split ratio for effective model assessment.
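In practice these steps use Scikit-learn's MaxAbsScaler and train_test_split; the dependency-free sketch below shows what the two operations do on a single column (a deterministic split without shuffling, which Scikit-learn would normally add, is used here for clarity):

```python
def max_abs_scale(values):
    """Scale one feature column so its maximum absolute value is 1.0,
    mirroring what MaxAbsScaler does per column."""
    m = max(abs(v) for v in values)
    return list(values) if m == 0 else [v / m for v in values]

def split_80_20(rows):
    """Deterministic 80:20 partition into training and testing subsets."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]

scaled = max_abs_scale([7.0, 3.5, -14.0])
print(scaled)  # [0.5, 0.25, -1.0]

train, test = split_80_20(list(range(10)))
print(len(train), len(test))  # 8 2
```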

Limitations
The effectiveness and applicability of the student attention detection dataset could be constrained by certain limitations. Factors like lighting conditions, room arrangement, and device quality during data collection might introduce inconsistencies or biases in the detections. While the size of the dataset is expected to increase in future versions, its current scope may be restricted, potentially limiting its capacity to encompass varied scenarios and generalize to unseen instances. Furthermore, errors or inaccuracies in annotation or labeling might compromise the reliability of the dataset. The slight imbalance between attentive and inattentive instances could bias models towards the more prevalent class. These constraints warrant careful consideration when utilizing the dataset for attention detection purposes.

Fig. 2. Confidence scores distribution for face and phone detection.

Fig. 8. Outcome of mobile phone detection module: (a) Input frame, (b) Detected mobile phone image with confidence score, and (c) Output features [phone, phone_x, phone_y, phone_w, phone_h, phone_con].

Table 4
Criteria for labeling the attention detection dataset.

Module | Criteria | Label
Face detection | When a face is confidently detected and centered within the frame. | 0
Face detection | When no face is detected or the face is partially out of frame. | 1
Hand tracking | When one or both hands are confidently detected and are within a reasonable distance from the face. | 0
Hand tracking | When no hands are detected or the hands are too far from the face. | 1
Head pose estimation | When the head pose is estimated as looking forward or slightly tilted and aligned with the screen. | 0
Head pose estimation | When the head pose indicates looking down, left, or right. | 1
Phone detection | When no phone is detected or the phone is placed face down on the table. | 0
Phone detection | When a phone is detected and positioned in a way that suggests distraction (e.g., screen facing up). | 1
© 2023 The Author(s). Published by Elsevier Inc.
The data collection process involved integrating face detection, hand tracking, head pose estimation, and mobile phone detection modules into an online class environment. Video frames, obtained with informed consent from students, were captured from webcams at 20 FPS through a custom Django-based web interface between 1st and 30th June 2023. Subsequently, these frames were processed by the specified modules to extract raw data. Face detection was implemented using the MediaPipe Face Detection solution based on BlazeFace. Hand tracking utilized the MediaPipe Hands solution, while head pose estimation relied on the MediaPipe Face Mesh solution and the solvePnP() function of OpenCV. Mobile phone detection was accomplished through the YOLOv7 model with PyTorch. Data normalization techniques such as one-hot encoding and maximum absolute scaling were applied to specific variables during preprocessing.

Data identification number (DOI): 10.17632/smzggbnkd2.1
Direct URL to data: https://data.mendeley.com/datasets/smzggbnkd2/1

Value of the Data
• The significance of this dataset lies in its exclusive focus on a critical facet of education, namely the attention levels of students during online classes. By capturing real-time attention dynamics, it provides a unique opportunity to explore novel characteristics and cutting-edge algorithms associated with attention detection.

Table 1
Summary of features and label with data type.

Table 2
Brief details of attention labels.
Attentive: The 'Attentive' class refers to instances where students exhibit focused and attentive behavior during online classes. These instances are characterized by factors such as consistent face presence, appropriate head pose alignment, and minimal phone usage. This class signifies active participation and engagement with the learning material, contributing to a positive and productive learning environment.
Example: [1, 230.34, 296.95, 156.42, 156.42, 94.94, 0, 'forward', −0.37, 7.

Inattentive: The 'Inattentive' class refers to instances where students display a lack of engagement during online classes. These instances typically involve factors such as turned-away faces, frequent head movements indicating distraction, reduced or absent hand tracking, and potential phone interactions. These indicators collectively suggest diminished attention and participation, highlighting potential distractions or disinterest in the learning content.

Table 3
Summary statistics of numerical features.