Human Skeleton Detection, Modeling and Gesture Imitation Learning for a Social Purpose

Gesture recognition is topical in computer science and aims at interpreting human gestures via mathematical algorithms. Among the numerous applications are physical rehabilitation and imitation games. In this work, we suggest performing human gesture recognition within the context of a serious imitation game, which would aim at improving social interactions with teenagers with autism spectrum disorders. We use an artificial intelligence algorithm to detect the skeleton of the participant, then model the human pose space and describe an imitation learning method using a Gaussian Mixture Model in the Riemannian manifold.


I. INTRODUCTION
Autism spectrum disorders (ASD) are linked with brain development [1]. The main symptoms of ASD are difficulties with communication and social interactions, repetitive behaviors and obsessive interests.
Autistic people also have talents [2], for instance an exceptionally good memory, great attention to detail, an excellent ability to respect schedules, or an exceptional level of honesty. Some of them are savants [3].
According to the World Health Organization [4], the global autism prevalence is around 1 in 160 children, and autism is generally more common in boys than in girls. Some autism prevalence studies have been conducted per geographical area [5].
Still today, the exact cause of ASD is not known [6].
Several scientific studies have targeted improving the way autistic children communicate or interact with others [7]. These two functions are crucial: other symptoms of autism can be seen mostly as consequences of impairments in social interactions or communication.
Additionally, the imitation process is known to be a pillar of learning, communication and social interactions. Imitation games can therefore prove useful in helping autistic people interact with others.
In the present work, we propose the structure of a gesture imitation game intended to improve social interactions with autistic teenagers and preteens. Furthermore, skeleton detection and imitation learning methods are described.
The following section of this paper presents existing work related to imitation learning as well as improvement of social interactions with autistic children.
Section III then describes the methodology: the main phases of the imitation game as well as the skeleton detection and human motion learning methods.
Simulation results for skeleton detection are presented in section IV.
Section V finally concludes this paper and suggests future work.

II.1. On imitation learning and gesture recognition
In computer science, imitation learning, also called programming by demonstration, is a technique for teaching a computer or a robot to perform new tasks, through generalization from observing multiple demonstrations [8].
Within the framework of gesture recognition, a gesture is performed several times by a human being, and a learning method is then applied so that the system can later recognize it.
Different spatial gesture models exist (figure 1). Some are 3D-model based and others are appearance-based. Image sequences and deformable 2D templates are part of the latter group.
The former comprises skeletal and volumetric models.
Subtypes of the volumetric model category are NURBS, primitives and super-quadrics. Since human movements are nonlinear, the Euclidean space is not well suited to represent them. Human postures and motion are therefore often represented in alternative spaces such as Riemannian manifolds [10], which have proved useful as shown in [11].
Once the human body is modeled, gestures must be learned by the system by observing several demonstrations. Probabilistic methods serve this purpose and can, for instance, be based on Hidden Markov Models or Gaussian Mixture Models (GMM). The skeletal model, the representation of the human body in the Riemannian space and GMM have been used within the framework of physical rehabilitation exercises [12] [13].
In the context of our imitation game for teenagers with ASD, skeleton detection and gesture recognition methods are redirected to serve social purposes, as the aim is to improve the participants' ability to interact with others.

II.2. On imitation and autism
In [7] the author indicates that autistic children are able to imitate, whereas the general opinion previously differed.
The imitation process is complex and consists of three subcomponents: induced imitation, spontaneous imitation and recognition of being imitated. This process is fundamental for learning, communicating and interacting socially.
In [8] an experimental work on imitation practice is presented in order to improve imitation abilities and reduce the autism level of 21 autistic children aged 4 to 10.
Nadel's imitation scale is used to evaluate the level of the three subcomponents of the imitation process.
These two studies are interesting on the psychological level because they show how imitation can positively impact autistic children, but they do not use modern techniques for imitation learning. The experimentation consists of human caregivers performing simple imitation games with the children.
In [14] Bernardini describes a multi-site intervention where 46 children with ASD aged 5 to 14 improved their social interactions by playing games, including imitation games, with an intelligent agent called Andy. In most cases, the probability of the child answering Andy's requests increased. However, social interactions initiated by the child were hardly impacted.
In [15], the authors review studies using technological tools with autistic children and show that very few:
• aim at therapeutic effectiveness as well as technology usability;
• focus on teenagers;
• have a robust methodology.

III. METHODOLOGY
When using artificial intelligence algorithms within our gesture imitation game for autistic teenagers, it is important to develop a robust methodology. Furthermore, the ease of use of the technology must be guaranteed while trying to improve the participants' initial condition in terms of social abilities.
Our gesture imitation game consists of two preliminary stages, three core stages and one final stage.
The two preliminary ones are the greetings and pairing stages.
Then come the three imitation modules: one based on induced imitation, another on spontaneous imitation and the third one on the recognition of being imitated.
This proposed structure follows multiple discussions with autism professionals.
In this paper we focus on potential methods for three core processes that will be useful throughout the game: skeleton detection, body representation, and finally recognition of previously learned gestures.
At game initiation, the skeleton of the participant is detected through the computer camera. Tensorflow is used to train and execute neural networks for classification tasks such as gesture recognition.
As for body representation, a human pose $y_t$ at time $t$ is represented by the position and orientation of all of the considered joints. With $N$ joints, the pose is written:
$$y_t = (P_1, O_1, \ldots, P_N, O_N) \tag{1}$$
where $P_n$ and $O_n$ are the position and orientation of joint $n$.
Joint positions $P_n$ are not absolute but normalized relative positions. They are computed from the absolute positions $p_n$ relative to the absolute position $p_{ss}$ of the spine shoulder, and normalized by the spine bone length $L_{spine}$:
$$P_n = \frac{p_n - p_{ss}}{L_{spine}} \tag{2}$$
Unlike joint positions, joint orientations cannot be viewed in the Euclidean space, but they can be represented in a 3D Riemannian manifold.
Therefore the human pose space is modeled as the Cartesian product of the position and orientation spaces of all of the human joints:
$$\mathcal{M} = \left(\mathbb{R}^3 \times \mathcal{O}\right)^N \tag{3}$$
where $\mathcal{O}$ denotes the 3D Riemannian manifold of joint orientations. To learn gestures in this space, among the various available methods, we chose Gaussian Mixture Models in Riemannian manifolds, as explained in [13].
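The normalization of joint positions described above can be illustrated with a minimal sketch. The function name, the array layout and the toy coordinates are assumptions made for illustration only; they are not taken from the implementation used in this work.

```python
import numpy as np

def normalize_joint_positions(abs_positions, spine_shoulder, spine_length):
    """Express joint positions relative to the spine shoulder,
    scaled by the spine bone length (hypothetical helper)."""
    abs_positions = np.asarray(abs_positions, dtype=float)    # shape (N, 3)
    spine_shoulder = np.asarray(spine_shoulder, dtype=float)  # shape (3,)
    if spine_length <= 0:
        raise ValueError("spine_length must be positive")
    # Subtract the reference joint, then divide by the spine length
    return (abs_positions - spine_shoulder) / spine_length

# Toy data (made-up values): two joints, spine shoulder at (0, 1, 0),
# spine bone length of 0.5
p = [[0.0, 1.0, 0.0], [0.5, 1.0, 0.0]]
P = normalize_joint_positions(p, [0.0, 1.0, 0.0], 0.5)
```

Dividing by the spine length makes the representation less dependent on the size of the participant and on the distance to the camera, which is the usual motivation for this kind of normalization.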
Since the Riemannian space is nonlinear, tangent spaces at reference points are considered in order to compute standard statistics such as the mean and the covariance. Paper [17] allows for the calculation of the mean $\mu$ of $N$ points $x_i$ on the human pose space:
$$\mu = \arg\min_{p} \sum_{i=1}^{N} d(p, x_i)^2 \tag{4}$$
where $d(\mu, p)$ is the geodesic distance on the manifold, which can be written using the logarithmic map as $d(\mu, p) = \lVert \mathrm{Log}_{\mu}(p) \rVert$.
$\mu$ is also called the Riemannian center of mass.
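As an illustration of the Riemannian center of mass, the sketch below computes it iteratively on the unit sphere, assuming that joint orientations are encoded as unit quaternions (points of the 3-sphere). This encoding, the function names and the toy points are assumptions for illustration; the text above only states that orientations lie on a 3D Riemannian manifold. The sign ambiguity of quaternions and antipodal points are not handled.

```python
import numpy as np

def log_map(mu, p):
    """Logarithmic map on the unit sphere: tangent vector at mu pointing
    towards p, whose norm is the geodesic distance d(mu, p)."""
    cos_theta = np.clip(np.dot(mu, p), -1.0, 1.0)
    v = p - cos_theta * mu                 # part of p orthogonal to mu
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                     # p equals mu (antipodal case undefined)
        return np.zeros_like(mu)
    return np.arccos(cos_theta) * v / norm_v

def exp_map(mu, v):
    """Exponential map on the unit sphere: follow the geodesic from mu
    in the direction of the tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return mu
    return np.cos(norm_v) * mu + np.sin(norm_v) * v / norm_v

def riemannian_mean(points, n_iter=100, tol=1e-10):
    """Iteratively minimize the sum of squared geodesic distances."""
    points = np.asarray(points, dtype=float)
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(n_iter):
        # Average the points in the tangent space at the current estimate
        step = np.mean([log_map(mu, p) for p in points], axis=0)
        mu = exp_map(mu, step)
        if np.linalg.norm(step) < tol:
            break
    return mu

# Two toy unit quaternions placed symmetrically around (1, 0, 0, 0)
a = np.array([np.cos(0.3), np.sin(0.3), 0.0, 0.0])
b = np.array([np.cos(0.3), -np.sin(0.3), 0.0, 0.0])
mu = riemannian_mean([a, b])
```

Each iteration averages the data in the tangent space at the current estimate and maps the result back to the manifold, which is why tangent spaces at reference points are needed to compute such statistics.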
The covariance matrix can then be computed, allowing for the learning of a Gaussian Mixture Model:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k) \tag{5}$$
where $x$ encodes both the human pose $y_t$ and the timestamp $t$, $K$ is the number of Gaussians, $\pi_k$ the weight of the $k$-th Gaussian, $\mu_k$ the Riemannian center of mass of the $k$-th Gaussian computed on the manifold and $\Sigma_k$ the covariance matrix of the $k$-th Gaussian. The parameters $\pi_k$, $\mu_k$ and $\Sigma_k$ are learned using Expectation-Maximization on the human pose space [18].
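For reference, one commonly used form of the Expectation-Maximization updates for a Gaussian mixture on a Riemannian manifold is sketched below, with $\pi_k$, $\mu_k$ and $\Sigma_k$ the weight, center of mass and covariance of the $k$-th Gaussian, $x_i$ the $N$ training points and $\gamma_{ik}$ the responsibility of Gaussian $k$ for point $x_i$. This is a generic formulation given under that assumption; the exact variant used in [18] may differ.

```latex
% E-step: responsibility of the k-th Gaussian for point x_i
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}

% M-step: weights, weighted Riemannian centers of mass, tangent-space covariances
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}
\qquad
\mu_k = \arg\min_{p} \sum_{i=1}^{N} \gamma_{ik} \, d(p, x_i)^2
\qquad
\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \,
                 \mathrm{Log}_{\mu_k}(x_i) \, \mathrm{Log}_{\mu_k}(x_i)^{\top}}
                {\sum_{i=1}^{N} \gamma_{ik}}
```

The only differences with the Euclidean case are that each mean is a weighted Riemannian center of mass and that each covariance is computed from the logarithmic maps of the points in the tangent space at $\mu_k$.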
We then moved to the main directory and installed all of the required modules. The swig command connects the C++ programs with the Python ones.
As mentioned earlier, the Tensorflow API is Python-based, but high-performance C++ is used for the execution of the applications.
We finally executed run_webcam.py with the model and format of our choice, and obtained the following results captured through the webcam (figure 3).
The body joints of the participant are represented by dots and connected through lines.
Different colors are used to distinguish the different body parts.
Elements represented by the dots are the ankles, knees, hips, wrists, elbows, shoulders and ears.
On the first capture (fig. 3a), we can see our participant from side-on, with the left arm behind him and the right arm raised in front of him, at head level. In spite of body occlusion (some body parts are superimposed), the body parts are correctly detected.
On a real-life background and even with low light, skeleton detection is functional using the Openpose algorithm.
Skeleton detection will be performed at the beginning of our gesture imitation game and will be useful throughout the different stages.
Each pose is represented by the position and orientation of the fourteen previously listed joints, as shown in formula (1). The joint positions are calculated relative to the spine shoulder, as seen in formula (2).
The human pose space is represented in the Riemannian manifold as expressed in formula (3) presented in the methodology section. Equations (4) and (5) are then used for gesture learning and recognition.

A Gaussian mixture model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The model then represents a probabilistic description of the target (ideal) movement against which imitation attempts are compared.
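To make the comparison of an imitation attempt with the learned model concrete, the following minimal sketch scores a sample by its log-likelihood under a mixture of Gaussians. It is a Euclidean simplification: on the manifold, the difference between a sample and a mean would be replaced by the logarithmic map at that mean. The function name, the parameters and the toy values are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, covariances):
    """Log of p(x) = sum_k pi_k N(x; mu_k, Sigma_k) for a single sample x
    (Euclidean simplification, hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    log_terms = []
    for pi_k, mu_k, sigma_k in zip(weights, means, covariances):
        diff = x - np.asarray(mu_k, dtype=float)
        sigma_k = np.asarray(sigma_k, dtype=float)
        _, logdet = np.linalg.slogdet(sigma_k)         # log-determinant of Sigma_k
        maha = diff @ np.linalg.solve(sigma_k, diff)   # squared Mahalanobis distance
        log_terms.append(
            np.log(pi_k) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)
        )
    # Log-sum-exp over the components for numerical stability
    m = max(log_terms)
    return m + np.log(sum(np.exp(t - m) for t in log_terms))

# Toy model (made-up values): two Gaussians describing a target movement in 2D
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 1.0]]
covs = [np.eye(2) * 0.1, np.eye(2) * 0.1]

close_attempt = gmm_log_likelihood([0.05, 0.0], weights, means, covs)
far_attempt = gmm_log_likelihood([3.0, -2.0], weights, means, covs)
```

In this toy example the attempt that stays close to the target movement receives the higher score, which is the behavior expected from a model used to assess imitation attempts.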
Skeleton detection is done using the Openpose algorithm with the open source Tensorflow library, which is used to develop machine learning and deep learning algorithms. Tensorflow makes it possible to solve highly complex mathematical problems using experimental learning architectures. It works as a programming system in which computations are represented by graphs, where nodes are mathematical operations and edges carry the multidimensional data arrays, called tensors, that flow between them. The Tensorflow application programming interface (API) is Python-based, but high-performance C++ is used for the execution of the applications.

Figure 2: Illustration of the human pose space with three Gaussians computed on tangent spaces at means $\mu_k$ [13]

Figure 3: Experimental results of the implementation of the Openpose algorithm

Then from the tf_pose/pafprocess directory, we executed the swig command with the appropriate arguments and launched the setup file.