Design and Implementation of a Gesture-Aided E-Learning Platform

In gesture-aided learning (GAL), learners perform specific body gestures while rehearsing the associated learning content. Although this form of embodiment has been shown to benefit learning outcomes, it has not yet been incorporated into e-learning. This work presents a generic system design for an online GAL platform. It comprises five modules for planning, administering, and monitoring remote GAL lessons. To validate the proposed design, a reference implementation for word learning was demonstrated in a field test. Nineteen participants independently took a predefined online GAL lesson and rated their experience on the System Usability Scale and a supplemental questionnaire. To monitor the correct gesture execution, the reference implementation recorded the participants’ webcam feeds and uploaded them to the instructor for review. The results from the field test show that the reference implementation is capable of delivering an e-learning experience with GAL elements. Designers of e-learning platforms may use the proposed design to include GAL in their applications. Beyond its original purpose in education, the platform is also useful for collecting and annotating gesture data.


Introduction
The COVID-19 pandemic demanded a variety of adaptations in our daily lives, including in matters of education. With physical locations for learning and teaching closed, remote electronic educational technology (e-learning) replaced in-person teaching in large parts of the world [1,2]. It is more important than ever to improve and innovate e-learning to provide better and more fruitful learning experiences.
E-learning and its platforms have already been broadly discussed in academia. In their literature review, Arkorful & Abaidoo [1] summarize e-learning's main advantages. Their results, which rely in large part on the work of Holmes & Gardner [2], underline that e-learning (i) provides more flexibility in time and space for learners, (ii) increases the accessibility of knowledge, and (iii) reduces communication barriers by facilitating discussion forums. In particular, learners' individual learning speeds are better accommodated as they can progress at their own pace [1]. For teachers, e-learning helps to overcome scarcities of teaching equipment [1]. Maatuk et al. [3] describe the challenges that come with implementing e-learning at a university level. They mention the technical and financial requirements for both providers and learners. Moreover, they find that the technological savviness of students influences the learning outcome. E-learning platforms also need to consider copyright protection and require professional development [3]. Overall, they find that students are positively disposed towards e-learning and think that it improves their learning experience [3]. The basic building blocks of any e-learning experience are learning objects, i.e., the digital files that generate e-learning activities [4,5]. Learning objects come in a variety of digital formats, including e-books, 2D and 3D animations, and cases. The sharing of learning objects is a common occurrence in e-learning communities [4].

Data Model
While the learning content itself remains in its original form (e.g., text, sound, video), it is communicated with the addition of a gesture. This gesture does not replace, but rather enhances the learning content. Thus, we call this combination of learning content and gesture "gesture-enhanced content" (GEC). Consequently, a lesson is an ordered list of GECs. The instructor is responsible for defining GECs and lessons, while the learner executes the lesson by performing all GECs within it, creating one GEC execution per performed GEC. The resulting data model is depicted in Figure 1.
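The entities named in this section (content item, gesture, GEC, lesson, GEC execution) can be illustrated with a minimal Python sketch. Figure 1 is not reproduced here, so all class names, field names, and file paths below are assumptions chosen for illustration, not the schema of the reference implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ContentItem:
    content_id: int
    text: str
    audio_ref: str = ""  # hypothetical reference to a sound file


@dataclass(frozen=True)
class Gesture:
    gesture_id: int
    label: str
    motion_ref: str  # hypothetical reference to recorded motion data


@dataclass(frozen=True)
class GEC:
    """Gesture-enhanced content: one content item paired with one gesture."""
    content: ContentItem
    gesture: Gesture


@dataclass
class Lesson:
    lesson_id: int
    gecs: List[GEC] = field(default_factory=list)  # order matters


@dataclass(frozen=True)
class GECExecution:
    execution_id: int
    lesson_id: int
    learner_id: str
    gec: GEC


def execute_lesson(lesson: Lesson, learner_id: str) -> List[GECExecution]:
    """Create one GEC execution per GEC, in lesson order."""
    return [
        GECExecution(execution_id=i, lesson_id=lesson.lesson_id,
                     learner_id=learner_id, gec=gec)
        for i, gec in enumerate(lesson.gecs, start=1)
    ]
```

In this sketch the instructor side builds `GEC` and `Lesson` objects, while the learner side only produces `GECExecution` records, mirroring the division of responsibilities described above.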

Modules
The e-GAL design is composed of five modules. Figures 2 and 3 illustrate how the modules are connected and how learners and instructors, respectively, are meant to interact with the system.

Content Catalog
The content catalog is a database that holds learning content items in one or multiple formats (e.g., text, audio, video). If the content items are large in size, the catalog should carry file references rather than the actual learning content data, or a database management system that supports large fields should be used to avoid performance issues.

Gesture Catalog
The second database is the gesture catalog. It holds pre-recorded reference gestures in one or more file formats, which may vary depending on what was used to record the gestures. However, the gesture data must be sufficient to animate a humanoid avatar (see lesson player module). Ideally, the gestures are recorded with a high-quality motion-capture system to produce the best possible reference.

Lesson Configurator
The lesson configurator is a web-based service with a graphical user interface (GUI) for instructors that allows them to combine learning content items and gestures into individual GECs. Multiple GECs can be organized into lessons, and additional lesson parameters (e.g., lesson speed) can be set.

Lesson Player
Learners interact with the platform via the web-based lesson player. It replays GECs by depicting a humanoid avatar alongside content items. The avatar is animated using the gesture reference data from the catalog. Alongside the gesturing avatar, one or more output ports (e.g., text display, speaker output) replay the content items.

Monitoring Module
As mentioned in the introduction, research indicates that gestures need to be performed correctly for GAL to provide benefits [9]. The monitoring module records motion data using some type of sensor (e.g., accelerometer, video) and transfers the data to the instructor for review. The choice of motion sensor depends on the gesture's range of motion. For instance, if gestures are only performed with the hands, a wrist-mounted inertial measurement unit might suffice to retrace the performed gesture [26]. Full-body gestures, on the other hand, may require a more complex measurement setup. The recorded motion data, along with metadata about the learner and the performed gesture, are uploaded to the log. The log holds all data about past GEC executions and provides an interface for the instructor to review the motion data and assess the correctness of the gestures.

Reference Implementation
We demonstrated and evaluated a reference implementation of the proposed e-GAL system design (see Section 2). The learning task of this reference implementation was to learn a series of German language words. The design's modules were deployed in a microservice pattern [27] and implemented as follows.
The content catalog consisted of 64 German words, a subset of the words used in Mathias et al. [28]. In addition to the textual representation, synthesized speech from Google's WaveNet-based text-to-speech engine [29] was added. Both text and speech were stored in a PostgreSQL 13 database [30].
For each word, a representative gesture (cf. [28]) was recorded using the full-body motion-capture system XSENS MTw Awinda [31]. After recording, each gesture was exported into an FBX file to be suitable for animating the avatar. The FBX file reference for each gesture was stored in the gesture catalog database.
In the implementation of the lesson configurator (Figure 4), instructors could combine a word with a gesture by drag-and-drop in their browser. An important feature was the ability to preview gestures on the fly since labels were rarely sufficient for describing what a gesture looked like. Available lesson parameters included the lesson speed, i.e., the time between two GECs, and a randomization seed with which the order of GECs was shuffled. Furthermore, instructors could generate individualized hyperlinks with which students could start the lesson.
The lesson player (Figure 5) of the reference implementation was a Unity 3D [32] application running in a WebGL environment. It featured a robot-like avatar ("Y-Bot" [33]) on a neutral background. When the learner started the lesson, the Unity application was loaded alongside the necessary lesson data, namely the learning content items and the gestures' FBX files. After loading, the lesson player played each GEC one after the other by simultaneously displaying the word and playing the sound clip (see Video S1). Then, after a small delay, the avatar performed the gesture. This was repeated for each GEC until the lesson was completed.
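The randomization seed among the lesson parameters makes the shuffled GEC order reproducible. The following Python sketch illustrates the idea; it is not the code of the Unity-based reference implementation, and the function name `shuffled_order` is an assumption.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_order(gecs: Sequence[T], seed: int) -> List[T]:
    """Return the GECs in a pseudo-random order determined only by the seed."""
    rng = random.Random(seed)  # private generator, leaves global state untouched
    order = list(gecs)         # copy, so the configured lesson is not mutated
    rng.shuffle(order)
    return order
```

Because the order depends only on the seed, handing every learner the same seed yields the same GEC sequence for all of them, while a different seed per learner would individualize the order.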
It was assumed that the learner sat at their desk in front of their screen during learning. Their computer's webcam was, therefore, most likely to capture at least the upper body. The participant was recorded during each GEC. After each GEC, the recorded video clip was annotated with the GEC execution ID and queued for upload to the monitoring module's log. The instructor could access and rate the videos in the monitoring module's web interface (Figure 6).
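The annotate-then-queue step described above can be sketched as follows. This is a simplified Python illustration, not the browser-side code of the reference implementation; `AnnotatedClip`, `drain_upload_queue`, and the `upload` callback are hypothetical names.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AnnotatedClip:
    execution_id: int  # GEC execution ID attached to the recording
    video: bytes       # raw clip data (placeholder for a real video blob)


def drain_upload_queue(pending: "queue.Queue[AnnotatedClip]",
                       upload: Callable[[AnnotatedClip], None]) -> int:
    """Upload all queued clips in recording order and return how many were sent."""
    sent = 0
    while True:
        try:
            clip = pending.get_nowait()
        except queue.Empty:
            return sent
        upload(clip)  # assumed to deliver the clip to the monitoring module's log
        sent += 1
```

Keeping the execution ID with each clip is what later allows the instructor's rating to be linked back to a specific GEC execution.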

Evaluation of Reference Implementation
A system test was conducted to assess the e-GAL reference implementation's capability to facilitate remote GAL. We do not claim to measure actual learning progress, as this would require more sophisticated methods from fields closer to neurology. Rather, this study aims to answer the research question of whether e-learning can deliver GAL and, in doing so, to validate the proposed e-GAL design.
Participant recruitment: 20 people were recruited by email for the system test. Each participant received an individualized link that allowed them to take the prepared lesson at any time and place during the 2-week trial period in July 2021. One person could not finish the experiment due to technical difficulties with the web application. Ultimately, we used data from 12 female and 7 male participants with a mean age of 36.6 years (σ = 9), ranging from 23 to 53 years. The majority of the participants worked in technology-oriented companies; therefore, a basic knowledge of using web applications was assumed. Each participant gave their informed consent to be recorded before starting the experiment.
Experiment design: An instance of the reference implementation (see Section 3.1) was made accessible online. The authors, acting as instructors, created a lesson containing six GEC items with the lesson configurator. Video S1 in the Supplementary Material contains a screen recording of the full lesson. The gestures were chosen based on whether they could be performed while sitting at a desk. The order of the GECs was randomized, with the same randomization seed for each participant.
Participants could access the lesson via their personal invitation link. After displaying the informed consent form and instructions, participants had the chance to preview their webcam feed to make sure they were comfortable with what was being recorded. After accepting, the lesson player started in their web browser. The lesson started shortly after, and one GEC after another was played. Between GECs was a break of three seconds, during which participants were supposed to imitate the avatar while reading the displayed word out loud. The webcam started recording when a new GEC was played and ended 2.5 s later. The videos were stored locally and queued up immediately for background upload to the monitoring module's log.
After all GECs of the lesson had been repeated four times, the player stopped and redirected the participant's browser to a German translation of the SUS, originally introduced by Brooke [34] and translated by Reinhardt [35]. The SUS questionnaire comprises the ten items in Appendix A Table A1, which are answered on a five-point Likert scale [36] ranging from "strongly disagree" to "strongly agree". To calculate the SUS score, each response is scored from one to five points, from one point for "strongly disagree" to five points for "strongly agree". The next step is to adjust the points per question. For all odd-numbered questions, we subtract one point from the score, and for all even-numbered questions, we subtract the score from five. Next, we add up the adjusted points of the ten questions and multiply this sum by 2.5. This yields a usability score for each respondent, ranging from 0 (worst) to 100 (best).
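The SUS calculation can be written as a short Python sketch. It follows the standard scoring by Brooke, in which odd-numbered items contribute the response minus one and even-numbered items contribute five minus the response; the function name and the input validation are assumptions added for illustration.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Compute the SUS score (0 to 100) from ten Likert responses coded 1 to 5."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            total += response - 1  # odd items are positively worded
        else:
            total += 5 - response  # even items are negatively worded
    return total * 2.5
```

For example, a respondent who answers "strongly agree" on every odd item and "strongly disagree" on every even item reaches the maximum of 100, while neutral answers throughout yield 50.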
Afterward, the participants filled in a questionnaire that asked them about their remote lesson experience (RLE). The questionnaire contained five items, which are presented in Appendix B Table A2 and were answered on a five-point Likert scale [36] ranging from "strongly disagree" to "strongly agree". These questions were supposed to identify any problems in the presentation of the learning content or gestures. For evaluation, we subtract a value of one from each response. Next, we take the mean over all participants per question and obtain a score ranging from 0 (strongly disagree) to 4 (strongly agree).
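The RLE evaluation amounts to a per-question mean of the shifted responses. A minimal Python sketch, assuming responses are coded 1 ("strongly disagree") to 5 ("strongly agree") and stored as one list of five answers per participant; the function name is hypothetical:

```python
from statistics import mean
from typing import List, Sequence


def rle_scores(responses: Sequence[Sequence[int]]) -> List[float]:
    """Return the mean score (0 to 4) per RLE question over all participants."""
    n_items = 5
    if not responses or any(len(r) != n_items for r in responses):
        raise ValueError("each participant must answer all five RLE items")
    return [
        float(mean(participant[q] - 1 for participant in responses))
        for q in range(n_items)
    ]
```

Note that the sketch does not invert the reverse-coded Question 5, so a low value on that item indicates that participants did not feel insecure.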
Additionally, an open question allowed participants to freely comment on their thoughts regarding the platform.
Interpretation of the results: To assess the learners' acceptance of the system, we follow Bangor et al. [37,38] and use three different rating scales for interpreting the SUS results.
Adjective rating: According to Bangor et al. [37,38], the SUS score can be converted into an adjective rating to interpret its results. They show that the results of a seven-point Likert scale correlate with SUS scores and can therefore be useful for interpretation. The findings of Bangor et al. [37] show that the SUS score has a mean of 12.5 when using the adjective "Worst Imaginable" to describe a system, 20.3 when using "Awful", 35.7 when using "Poor", 50.9 when using "Ok", 71.4 when using "Good", 85.5 when using "Excellent", and 90.9 when using "Best Imaginable". Except for "Worst Imaginable" and "Awful", all of these adjectives are significantly different and are therefore of interest for the interpretation of the results. For example, if the SUS score is 75, we would classify our platform as "Good".
Grade scale: Bangor et al. [37] introduce the so-called university grade analog, in which the SUS scores are related to school/university grades. According to this grading scale a SUS score between 90 and 100 is an A, 80 and below 90 is a B, 70 and below 80 is a C, 60 and below 70 is a D, and a score below 60 is an F.
Acceptability rating: Moreover, to decide whether the platform is usable or suitable to provide GAL, we follow Bangor et al. [37,38] and use the acceptance ranges they provide. The authors rate a system with a SUS score below 50 as "Not Acceptable" and above 70 as "Acceptable". Between a score of 50 and 70, Bangor et al. [37,38] state that the system should be improved and evaluated as "Marginal". This group can be further divided into "Low Marginal" (SUS score between 50 and 62.6) and "High Marginal" (SUS score between 62.6 and 70).
In sum, the adjective rating, grade scale, and the acceptability rating are suitable to answer the question of whether learners accept the e-GAL reference implementation.
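The grade scale and the acceptability rating described above can be expressed as simple threshold functions. The Python sketch below is an illustration only; the handling of scores that fall exactly on a boundary (e.g., treating 70 as "Acceptable") is an assumption, as the ranges in the text leave this open.

```python
def sus_grade(score: float) -> str:
    """Map a SUS score to the university grade analog (A to F)."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


def sus_acceptability(score: float) -> str:
    """Map a SUS score to the acceptability ranges."""
    if score < 50:
        return "Not Acceptable"
    if score < 62.6:
        return "Low Marginal"
    if score < 70:  # assumption: a score of exactly 70 counts as "Acceptable"
        return "High Marginal"
    return "Acceptable"
```

With these thresholds, a SUS score of 75 maps to grade C and to "Acceptable". Treating 70 as "Acceptable" keeps the lower bound of grade C aligned with the lower bound of the acceptable range.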
Regarding the RLE responses, we consider an average of 3.0 to be sufficient. At this level, there is general agreement that the respective lesson element was comprehensible. An exception is question 5 ("I felt insecure during the lesson."), which is reverse coded to check the consistency of the participants' answers. The optional free-text comments are mapped to concepts by means of a small-scale inductive content analysis [39].
The videos of the GEC executions are visually compared against the reference gesture by the authors. Based on the difference, the GEC executions are labeled "Correct" (no discernible difference), "Poor" (recognizable as the reference gesture, but with errors, e.g., not moving the head along with the waving hand), and "Wrong" (not recognizable as the reference gesture). Videos that failed to show the gesture clearly (e.g., because the participant was out of frame) were also labeled "Wrong".
Figure 7 shows boxplots for the SUS scores across all participants and for female (12) and male (7) participants, respectively. The median and mean SUS score was 75, with no differences between genders. Consequently, the reference implementation received a C on the grade scale and a "Good" according to the adjective rating scale. On the acceptability rating scale, the reference implementation was rated "Acceptable". Interestingly, the individual SUS scores varied considerably, with values between 42.5 and 97.5. Therefore, we show the results of the SUS score on the individual level to better understand the results.
Figure 8 represents each participant's SUS score located on each of the three scales: (a) shows that four out of the 19 participants rated the platform with the worst grade F (21.1%), one with a D (5.3%), six with a C (31.6%), three with a B (15.8%), and five with the best grade A (26.3%). When applying the adjective rating scale (b), we find that one participant rated the platform as "Poor" (5.3%), seven as "Ok" (36.8%), six as "Good" (31.6%), two as "Excellent" (10.5%), and three as "Best Imaginable" (15.8%). Finally, (c) illustrates the acceptability rating and shows that for one participant the reference implementation was "Not Acceptable" (5.3%), for three it was "Low Marginal" (15.8%), for one it was "High Marginal" (5.3%), and for fourteen it was "Acceptable" (73.7%).
Figure 9 illustrates the results for the items evaluating the RLE (Table A2). When the participants were asked if the word to learn was clearly readable and audible (Questions 3 and 4), they tended to strongly agree, with a score of 3.8 for both questions. When asked whether they were able to focus on the lesson's content (Question 1) or whether they were able to imitate the avatar's gestures (Question 2), the scores were somewhat lower at 3.5 and 3.1, respectively. Furthermore, with a mean value of 1.6, the participants answered that they generally did not feel insecure during the lesson (Question 5).

Webcam Videos
After the trial period ended, the log contained 491 GEC executions. This was more than the anticipated 456 videos (19 participants, each performing six GECs four times). During labeling, it became apparent that some participants stopped and restarted mid-lesson. Based on these videos, 340 (69.2%) GEC executions were rated "Correct", 95 (19.3%) were rated "Poor", and 56 (11.4%) were rated "Wrong". The majority (54.3%) of "Poor" and "Wrong" GEC executions occurred during two gestures: "Aufmerksamkeit" (eng.: "attention"; putting a hand behind an ear and leaning back) and "Papier" (eng.: "paper"; crumpling a piece of paper and throwing it away).
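The anticipated number of videos and the reported label shares can be retraced with a few lines of Python. The sketch only uses figures stated in the text (19 participants, six GECs, four repetitions, and the three label counts); the third label is named "Wrong" here, following the labeling scheme, and the function names are hypothetical.

```python
from typing import Dict


def expected_executions(participants: int, gecs_per_lesson: int,
                        repetitions: int) -> int:
    """Number of GEC executions if every participant completes the lesson once."""
    return participants * gecs_per_lesson * repetitions


def label_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage share of each label, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


counts = {"Correct": 340, "Poor": 95, "Wrong": 56}
anticipated = expected_executions(19, 6, 4)
surplus = sum(counts.values()) - anticipated  # extra clips from restarted lessons
shares = label_shares(counts)
```

The difference between logged and anticipated executions corresponds to the additional clips produced by participants who restarted the lesson.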

Free-Text Comments
Nine out of the nineteen participants opted to give a free-text comment about their thoughts on the lesson experience. Table 1 shows how often a concept was mentioned in the comments.

Discussion
This study set out to answer the research question: Can e-learning platforms facilitate gesture-aided learning remotely? In the case of the e-GAL reference implementation, we assume this to be confirmed if (a) instructors can effectively plan and monitor gesture-enhanced lessons, (b) learners are able to comprehend and imitate gestures and learning objects during lessons, and (c) learners accept the system.

Ad (a): The lesson configurator offered instructors access to 64 learning content items and 64 distinct gestures. With these materials, a lesson containing 6 GECs was successfully created. The gestures could be previewed and selected according to the assumed learning environment (i.e., the learner sitting behind a desk). Regarding lesson monitoring, the monitoring module's log successfully collected videos of all 491 GEC executions. Instructors were able to label all videos using the monitoring module's web interface.
Ad (b): Learners were able to access the lesson with the invitation link that was sent out by the instructor. In the responses to the RLE questionnaire, there was general agreement (3.8 out of 4) that the learning content was comprehensible in both text and speech. Interestingly, two participants noted that they did not use the text but rather listened exclusively to the audio. Agreement was slightly lower (3.1 out of 4) regarding the comprehensibility of gestures. One possible reason for the lower rating of gesture comprehensibility (Question 2) lies in the two least well-performed GECs. "Aufmerksamkeit" required the participant to lean backward, which was not easily discernible on the solid-grey background of the lesson player. A better-designed 3D environment may communicate changes in depth better. The second badly performed GEC, "Papier", involved both palms touching each other. The avatar's extremities lacked collision boxes, so its hands clipped into each other rather than touching. Participants interpreted this differently, with some touching their forearms or bumping their fists. Furthermore, the XSENS skeletal model captures the position of the hands only roughly. By adding collision boxes to the avatar and including better hand sensors, the communication of gestures that feature more intricate hand movements could be improved. Broader gestures, like waving a hand, were imitated more accurately. Furthermore, the system could have better indicated the right time to imitate the gesture, especially as most of the GEC executions rated "Wrong" seemed to stem from the participant not being aware that they should imitate at that moment. In the end, 69.2% of gesture executions were labeled "Correct".
Ad (c): The e-GAL reference implementation was rated "Acceptable" and "Good", and received the letter grade C on the System Usability Scale. The evaluation of the reference implementation is limited insofar as it only considers the perspective of the learner and lacks feedback from instructors. While they were functional enough to define and monitor the experiment lesson, the lesson configurator and monitoring modules were not demonstrated in the same way the lesson player was.
To summarize, we consider all of the three requirements stated and discussed above to be fulfilled, thus we conclude the system test as successful.
During labeling, a second potential use case for the reference implementation emerged. It can collect and label large amounts of gesture data remotely and with little effort. The main issue in the video clips from the system test was that the framing of the participant was inconsistent, and their webcam quality varied. This could be solved however by better instructing the participants and by consistently checking the framing before and during the lesson.

Limitations
This study is concerned with the technical viability of e-GAL, thus it does not say anything about the influence of this mode of learning on learning outcomes. Claims of this kind would require a different study design and neurological evidence. Moreover, the lesson used in the study lacked pedagogical considerations (see [40]) which made it unfit to produce and measure actual learning outcomes. Finally, the evaluation of the lesson configurator lacks the perspective of educators who are not in higher education.

Future Work
More research on the pedagogy of e-GAL applications is needed. This includes determining the overall effectiveness of e-GAL, which parameters (e.g., repetition and order of GECs, lesson tempo) need to be adjustable, and how the avatar's and the 3D environment's designs affect learning outcomes. It should be investigated which types of learning content work best with e-GAL.
Future platforms could incorporate machine learning models for pose estimation (e.g., [41,42]) and/or quality assessment [43] of the performed gesture. Such automated methods could be used for example to support or replace the instructor's subjective rating or to provide real-time feedback to the student. Furthermore, instead of the webcam as motion sensor, future systems could use wearable motion sensors to allow students more mobility. Feature requests such as avatar customization, the option to see oneself during the lesson, and immediate feedback during the lesson were mentioned by some participants. These features are realizable for the reference implementation.
Lastly, e-learning platforms usually involve a variety of stakeholders such as content creators, educational institutions, and designers [44]. The e-GAL design could be extended or embedded into existing e-learning platforms to accommodate these stakeholders (e.g., interfaces for content creators to add new gestures from other motion-capture systems). Interfaces to existing learning object repositories could produce interesting new GECs.

Conclusions
We proposed a system design for e-GAL platforms with three design goals. A reference implementation following the design was demonstrated and evaluated in a field test. After interpreting the results of the SUS and RLE, the user comments, and the number of video clips labeled "Correct", we determined that the e-GAL reference implementation met all three design goals, consequently demonstrating the ability of the proposed system design to facilitate an acceptable e-GAL experience. Additionally, the reference implementation showed itself to be useful for collecting and annotating video clips of gesture executions, which can be used, for instance, to generate large gesture datasets for machine learning. The e-GAL design can be used to implement e-GAL applications or as the basis for further research into the topic of gesture-aided e-learning, especially its pedagogical implications.

Table A2. RLE questionnaire items 3 to 5 (German original, English translation).
3. Die gesprochenen Wörter waren klar zu verstehen. (The spoken words were clearly audible.)
4. Ich konnte die angezeigten Wörter problemlos lesen. (I was able to read the displayed words clearly.)
5. Ich fühlte mich unsicher während der Lektion. (I felt insecure during the lesson.)