High-Level Features for Recognizing Human Actions in Daily Living Environments Using Wearable Sensors †

Abstract: Action recognition is important for various applications, such as ambient intelligence, smart devices, and healthcare. Automatic recognition of human actions in daily living environments, mainly using wearable sensors, is still an open research problem in the field of pervasive computing. This research focuses on extracting a set of features related to human motion, in particular the motion of the upper and lower limbs, in order to recognize actions in daily living environments from time-series of joint orientation. Ten actions were performed by five test subjects in their homes: cooking, doing housework, eating, grooming, mouth care, ascending stairs, descending stairs, sitting, standing, and walking. The joint angles of the right upper limb and the left lower limb were estimated using information from five wearable inertial sensors placed on the back, right upper arm, right forearm, left thigh and left leg. The set of features was used to build classifiers using three inference algorithms: Naive Bayes, K-Nearest Neighbours, and AdaBoost. The average F-measure of the three classifiers built using the proposed set of features when classifying the ten actions was 0.806 (σ = 0.163).


Introduction
Action recognition is important for various applications, such as ambient intelligence, smart devices, and healthcare [1,2]. There is also a growing demand and sustained interest in developing technology able to tackle real-world application needs in such fields as ambient assisted living [3], security surveillance [4] and rehabilitation [5].
In effect, action recognition aims at providing information about the behavior and intentions of users, enabling computing systems to assist users proactively with their tasks [6]. Automatic recognition of human actions in daily living environments, mainly using wearable sensors, is still an open research problem in the field of pervasive computing [4]. There are a number of reasons why human action recognition is a very challenging problem. Firstly, the human body is non-rigid and has many degrees of freedom [7]; consequently, it can perform infinite variations of every basic movement. Secondly, the intra- and inter-subject variability in performing human actions is very high, i.e., the same action can be performed in many ways, even by the same person each time [8].
Most existing work on action recognition is built upon simplified, structured environments, normally focusing on single-user, single-action recognition. In real-world situations, human actions are often performed in complex ways: a person performs interleaved and concurrent actions and may interact with other people to perform joint actions, such as cooking [4].

Related Work
In recent years, several studies have been carried out on human action recognition in various contexts, mainly using video [12,13] or information extraction techniques applied to raw signals from inertial sensors [14,15], which are substantially different from the signals and techniques used in our research.
Regarding research in which action classification was performed using joint orientation estimated from inertial sensor data, in [16] three upper limb actions (eating, drinking and horizontal reaching) were classified using the elbow flexion/extension angle, the elbow position relative to the shoulder and the wrist position relative to the shoulder. In the training stage, features are clustered using the k-means algorithm, and a histogram generated from the clustering serves as a template for each action, used during the recognition stage by matching the templates. Two sensors were attached to the upper arm and forearm of four healthy subjects (aged from 23 to 40 years) and data were collected in a structured environment. This clustering-based classifier scored an F-measure of 0.774.
In [17], six actions (agility cuts, walking, sprinting, jogging, box jumps and football free kicks) performed in an outdoor training environment were classified using the Discrete Wavelet Transform (DWT) in conjunction with a Random Forest inference algorithm. Flexion/extension of the knees was calculated from wearable inertial sensors attached to the thigh and shank of nine healthy subjects and one injured subject. A classification accuracy of 98.3% was achieved in the cited work, and kicking was the action most often confused with other actions.
Recently, in [18] nine everyday and fitness actions (lying, sitting, standing, walking, Nordic walking, running, cycling, ascending stairs, and descending stairs) were classified based on five time-domain and five frequency-domain features extracted from orientation signals of the torso, shoulder joints and elbow joints, using a decision tree algorithm. Five sensors were placed on the upper body and one was attached to a shoe during indoor and outdoor recording sessions of a single person. The overall performance of the classifier was 87.18%. Difficulties were encountered in classifying the cycling and sitting actions.
In contrast to previous related work, our research focuses on classifying ten actions (cooking, doing housework, eating, grooming, mouth care, ascending stairs, descending stairs, sitting, standing, and walking) performed by five test subjects in their homes. The joint angles of the right upper limb and the left lower limb are estimated using information from five sensors placed on the back, right upper arm, right forearm, left thigh and left leg. A set of features related to human limb motions is extracted from the orientation signals and used to build classifiers using three inference algorithms: Naive Bayes, K-Nearest Neighbours, and AdaBoost.

Setup
The sensors used in this research were LPMS-B (LP-research, Tokyo, Japan) miniature wearable inertial measurement units. Each one comprises three different sensors: a 3-axis gyroscope, a 3-axis accelerometer and a 3-axis magnetometer. The Bluetooth communication range is 10 m; each unit has a 3.7 V, 800 mAh lithium battery and weighs 34 g.
The wearable inertial sensors were placed on five anatomical references on the body of the test subjects, as illustrated in Figure 1a. The first anatomical reference was located on the lower back, 40 cm from the first thoracic vertebra measured in a straight line, aligning the plane formed by the x-axis and y-axis of sensor S1 with the coronal plane of the subject. Sensing device S1 was secured over this anatomical reference using an orthopedic vest. The second anatomical reference was located 10 cm above the right elbow, on the lateral side of the right upper arm, aligning the plane formed by the x-axis and y-axis of S2 with the sagittal plane of the subject. The third anatomical reference was located 10 cm above the right wrist, on the posterior side of the forearm, aligning the plane formed by the x-axis and y-axis of S3 with the coronal plane of the subject. The fourth anatomical reference was located 20 cm above the left knee, on the lateral side of the left thigh, aligning the plane formed by the x-axis and y-axis of S4 with the sagittal plane of the subject. The fifth anatomical reference was located 20 cm above the left malleolus, on the lateral side of the shank, aligning the plane formed by the x-axis and y-axis of S5 with the sagittal plane of the subject. Sensors S2, S3, S4 and S5 were firmly attached to the anatomical references using elastic velcro straps. The configuration and coordinate systems of the sensors are shown in Figure 1a.
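The placement described above can be summarized as a small configuration table. The following sketch is only an illustrative summary of that description; the sensor IDs, field names and segment labels are our own, not identifiers used by the LPMS-B devices.

```python
# Illustrative summary of the sensor placement described in the text.
# All keys and labels are illustrative, not device identifiers.
SENSOR_PLACEMENT = {
    "S1": {"segment": "lower back",
           "reference": "40 cm below the first thoracic vertebra",
           "aligned_plane": "coronal", "fixation": "orthopedic vest"},
    "S2": {"segment": "right upper arm",
           "reference": "10 cm above the right elbow",
           "aligned_plane": "sagittal", "fixation": "elastic velcro strap"},
    "S3": {"segment": "right forearm",
           "reference": "10 cm above the right wrist",
           "aligned_plane": "coronal", "fixation": "elastic velcro strap"},
    "S4": {"segment": "left thigh",
           "reference": "20 cm above the knee",
           "aligned_plane": "sagittal", "fixation": "elastic velcro strap"},
    "S5": {"segment": "left shank",
           "reference": "20 cm above the malleolus",
           "aligned_plane": "sagittal", "fixation": "elastic velcro strap"},
}

# Quick sanity check over the configuration
sagittal = [s for s, cfg in SENSOR_PLACEMENT.items()
            if cfg["aligned_plane"] == "sagittal"]
print(sagittal)
```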
Two wearable RGB cameras were worn by the subjects to record video used for segmenting and labeling the performed actions and, afterwards, for validating the recognition. One of these devices was a GoPro Hero 4 camera, C1 (GoPro Inc., San Mateo, CA, USA), which was carried on the front of the vest used for sensor S1. The other was a Google Glass, C2 (Google Inc., Menlo Park, CA, USA), which was worn like ordinary glasses. The GoPro Hero 4 weighs 83 g and the Google Glass weighs 36 g. The first camera stores the video stream on an external memory card, whereas the camera of the Google Glass stores it in an internal memory of 2 GB. Both cameras recorded video at 720p and 30 Hz.
Five healthy young adults (mean age 26.2 ± 4.4 years) were asked to wear the sensors. All subjects declared themselves to be right-handed and to have no mobility impairment in their limbs. They were not asked to wear any particular clothing, only the Google Glass, the GoPro camera, and the five sensing devices. Additionally, the subjects signed an informed consent form in which they agreed that their data could be used for this research.

Environment
The test subjects were asked to perform the trials in their homes. Additionally, the subjects were asked not to take the sensing devices close to fire and not to wet them. The trials were performed in the mornings after the subject had taken a shower, a common action of all subjects at the beginning of the day, and before leaving home.
For each trial, sensors S 1 , S 2 , S 3 , S 4 and S 5 were placed on the body of each subject, according to the description given in Section 3.1. During each trial the subjects were asked to perform their daily activities at will, i.e., they did not perform any specific action in any established order.
To start and finish a trial the subjects had to remain in an anatomical position similar to that shown in Figure 1b. Each subject completed 3 trials on 3 different days. Additionally, to synchronize the data obtained by the sensing devices and the two cameras, the test subjects were asked to perform a control movement 3 s after the beginning of each test. This movement consisted of a fast movement of the right upper limb in front of the field of view of the cameras and was used in a later segmentation process. The wearable inertial sensors captured data at 100 Hz. Figure 1 shows the sensing devices and the cameras worn by a test subject in his daily living environment. The sensing devices S2 and S3 were attached directly to the skin of the test subject using two elastic bands, while sensing devices S1, S4 and S5 were firmly attached to the clothing of the subject using an orthopedic vest and elastic bands, so that the movement of the clothes did not modify the initial position of the sensors.
In Figure 2, an example of each action performed by a test subject during a trial at his home is shown. During a pilot test it was observed that the field of view of the Google Glass was too narrow and too short to record the grasping actions performed by the subjects over a table, so it was decided to place the GoPro camera at chest height to complement the information captured on video. As an example, in the top snapshots of Figure 2a-c the action performed by the subject cannot be distinguished. Conversely, the actions performed near the head of the subject cannot be properly distinguished using the camera placed on the subject's chest, see Figure 2d,e.
The duration of the trials in the daily living environments of the five test subjects ranged from 17 min to 36 min (mean of 23 min ± 6 min). The actions performed in each trial were segmented and labeled manually according to the recorded video. The number of correctly segmented/labeled actions was 90 (46% of the total captured actions), which were used as instances for the following analysis. The distribution of instances according to each action/class is presented in Table 1.
The action with the most instances is 'walking', since it is the intermediate action between the rest of the actions. Conversely, the actions with the fewest instances are 'ascending stairs' and 'descending stairs', because the house of one test subject has only one floor, while another subject did not use his stairs in any of the three trials.

Feature Extraction
A set of joint angles L is estimated based on kinematic models of the upper and lower limbs [19]. Each orientation l depends on the degrees of freedom of the joint, L = {SHl, ELl, HPl, KNl}, which correspond respectively to the shoulder, elbow, hip, and knee joints. The human body is composed of bones linked by joints, forming the skeleton, and covered by soft tissue, such as muscles [20]. If the bones are considered rigid segments, it is possible to assume that the body is divided into regions or anatomical segments, so the motion among these segments can be described by the same methods used for manipulator kinematics.
The degrees of freedom modeled for the right upper limb are SHl = {l_f/e, l_a/a, l_i/e} and ELl = {l_f/e, l_p/s}, where f/e, a/a, i/e and p/s stand for flexion/extension, abduction/adduction, internal/external rotation, and pronation/supination, respectively. Similarly, the degrees of freedom modeled for the opposite lower limb are HPl = {l_f/e, l_a/a, l_i/e} and KNl = {l_f/e, l_i/e}. The description of human motion can thus be given by the movements among the segments and the range of motion of each joint, as illustrated in Figure 3.
So far, orientation signals have been discussed as recordings of human movement in general. Such recordings, in the form of time-series, may be captured during movement tests in controlled environments, such as human gait tests for clinical purposes; they can also be captured during experiments in uncontrolled environments, such as monitoring activities in the homes of test subjects. This research is mainly concerned with the second type of experiment.
This work involves capturing data from the subjects over long periods of time, in which the subjects perform several actions. Therefore, it is necessary to segment the recorded data according to the actions of interest for the study. Each data segment w_i = (t_1, t_2) is defined by its start time t_1 and end time t_2 within the time series. The segmentation step yields a set of segments W = {w_1, ..., w_m}, in which each segment w_i contains an activity y_i.
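The segmentation output described above can be sketched as a simple data structure: each segment w_i is a (t_1, t_2) window paired with an action label y_i. The class and field names below are illustrative, not from the paper, and the durations are made up.

```python
from dataclasses import dataclass

# Minimal sketch of a labeled segment w_i = (t1, t2) with its action y_i.
@dataclass
class Segment:
    t_start: float  # t1, start time within the time-series (s)
    t_end: float    # t2, end time within the time-series (s)
    label: str      # action y_i, e.g. "walking"

    def duration(self) -> float:
        return self.t_end - self.t_start

# A trial's annotation W = {w_1, ..., w_m} (toy values)
W = [Segment(3.0, 15.5, "walking"), Segment(15.5, 410.0, "cooking")]
total = sum(w.duration() for w in W)
print(total)
```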
Segmenting a continuous sensor recording is a difficult task. People perform daily activities fluently, which means that actions can be paused, interleaved or even concurrent [21]. Thereby, in the absence of sharp boundaries, actions can easily be confused. Moreover, the exact boundaries of an action are often difficult to define. An action of mouth care, for instance, might start when the subject reaches for the toothbrush or when he/she starts brushing the teeth; in the same way, the action might end when the subject finishes rinsing his/her mouth or when the toothbrush is put down. For this reason, a protocol was defined for segmenting and labeling the time-series L. This protocol relies on the video recordings captured during the tests in the daily living environments of the subjects.
Feature extraction reduces the segments W into features that must be discriminative for the actions at hand. Feature vectors X_i^hlf are extracted from the segments W by a feature extraction function F, as expressed in Formula (1):
X_i^hlf = F(L, w_i)  (1)
The features corresponding to the same action should cluster in the feature space, while features corresponding to different actions should be separated. At the same time, the selected type of features needs to be robust across different people as well as to the intraclass variability of an action.
The selection of features requires prior knowledge about the type of actions to recognize; e.g., to discriminate the walking action from a resting action, it may be sufficient to select the energy of the acceleration signals as a feature, whereas that very same feature would not be enough to discriminate walking from ascending stairs. In the field of activity recognition, several features have been used to discriminate different numbers of actions [22][23][24].
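The energy argument above can be made concrete with a toy example: the energy of a signal separates a movement-rich action from rest, but by itself cannot tell two energetic actions apart. The signals below are synthetic and purely illustrative.

```python
import math

def signal_energy(signal):
    """Sum of squared samples: a common time-domain feature."""
    return sum(x * x for x in signal)

n = 100
# Near-still signal (resting) vs. a periodic, high-amplitude signal (walking)
resting = [0.02 * math.sin(0.1 * i) for i in range(n)]
walking = [1.0 * math.sin(2 * math.pi * 2 * i / n) for i in range(n)]

print(signal_energy(resting) < signal_energy(walking))
```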
Even though signal-based (time-domain and frequency-domain) features have been widely used in the activity recognition field, these features lack a description of the type of movement that people perform. One of the advantages of using joint angles is that the time-series can be characterized in terms of movements. This representation, based on anatomical terms of movement, can be used not only for classification purposes but also for describing how people perform actions. From now on, features extracted from joint angles are referred to as high-level features, or HLF.
The anatomical terms of movement vary according to the degrees of freedom modeled for each joint, as summarized below:
• Shoulder joint: flexion/extension, abduction/adduction, and internal/external rotation.
• Elbow joint: flexion/extension and pronation/supination.
• Hip joint: flexion/extension, abduction/adduction, and internal/external rotation.
• Knee joint: flexion/extension and internal/external rotation.
The first stage in extracting high-level features X_i^hlf from the orientation time-series (L, w_i) is searching for tendencies in the signals that are related to the anatomical terms of movement. Three tendencies have been defined and used as templates to find matches along the time-series, as shown in Figure 4. The first one is ascending (Figure 4a), lasting a time given by temp_asc, and is used for searching movements of flexion, abduction, internal rotation and pronation. The second one is descending (Figure 4b), lasting a time given by temp_desc, and is used for searching movements of extension, adduction, external rotation and supination. The third one (Figure 4c), lasting a time given by temp_neu, is related to resting lapses or to readings with a very small arc of movement.
The three templates depend on the values used to construct the signals temp_asc, temp_desc, and temp_neu. Two approaches for extracting such values are explored: static and dynamic. The static approach consists of using an a priori value based on prior knowledge of the performed actions, whereas the dynamic approach searches for tendencies as a function of the magnitude of the movements made during the tests.
The complete description of the high-level feature extraction is detailed in Algorithm 1. The dynamic time warping algorithm, DTW, is used to match the templates to the time-series [25]. This algorithm is able to compare samples of different lengths in order to find the optimal match between each of them and each template. Finally, the HLF extracted from each signal of (L, w_i) are combined to form the X_i^hlf vector.
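The template-matching step can be sketched as follows, assuming a textbook DTW distance. The template shapes and values below are illustrative, standing in for the paper's temp_asc, temp_desc and temp_neu, whose actual values come from the static or dynamic approach described above.

```python
# Sketch of the tendency-matching step using a basic dynamic time
# warping (DTW) distance. Templates and windows here are toy values.

def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences of possibly different length."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def match_tendency(window, templates):
    """Return +1 (ascending), -1 (descending) or 0 (neutral) according to
    which template is closest to the window under DTW."""
    labels = {"asc": 1, "desc": -1, "neu": 0}
    best = min(templates, key=lambda k: dtw_distance(window, templates[k]))
    return labels[best]

templates = {"asc": [0, 1, 2, 3], "desc": [3, 2, 1, 0], "neu": [1, 1, 1, 1]}
print(match_tendency([0.1, 0.9, 2.1, 2.8, 3.2], templates))  # ascending -> 1
```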

Algorithm 1: Summary of the high-level features extraction
Inputs: (L, w_i): segmented time-series comprising {SHl, ELl, HPl, KNl}.
Output: X_i^hlf: high-level feature vector.

Notation:
SH, EL, HP, and KN refer to the human joints: shoulder, elbow, hip and knee. fle, ext, abd, add, i_rot, e_rot, pro and sup are the terms of movement flexion, extension, abduction, adduction, internal rotation, external rotation, pronation and supination, respectively. dist is the distance calculated by the DTW algorithm. num is the set of high-level features of a joint.

Extracting HLF according to the joints' DoF:
for each j ∈ {SH, EL, HP, KN}:
    j_num_fle ← count(j_hlf, 1); j_num_ext ← count(j_hlf, −1)
for each j ∈ {SH, HP}:
    j_num_abd ← count(j_hlf, 1); j_num_add ← count(j_hlf, −1)
for each j ∈ {SH, HP, KN}:
    j_num_i_rot ← count(j_hlf, 1); j_num_e_rot ← count(j_hlf, −1)
EL_num_pro ← count(EL_hlf, 1); EL_num_sup ← count(EL_hlf, −1)
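The counting step of Algorithm 1 can be sketched in a few lines: after template matching, each joint signal is reduced to a sequence of tendencies (+1, −1, 0), and the high-level features count occurrences per anatomical term of movement. The joint names follow the paper's notation; the tendency sequences below are made up for illustration.

```python
# Sketch of the HLF counting step of Algorithm 1 (illustrative data).

def count(tendencies, value):
    """Number of matched tendencies equal to `value` (+1, -1 or 0)."""
    return sum(1 for t in tendencies if t == value)

# Matched tendencies of the flexion/extension signals per joint (toy values)
hlf_signals = {
    "SH_fe": [1, 1, -1, 0, 1],
    "EL_fe": [1, -1, -1, 0, 0],
}

features = {}
for joint in ("SH_fe", "EL_fe"):
    features[f"{joint}_num_fle"] = count(hlf_signals[joint], 1)   # flexions
    features[f"{joint}_num_ext"] = count(hlf_signals[joint], -1)  # extensions

print(features)
```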

Action Classification
The classification is divided into two stages: training and testing. In supervised learning, classification is the process of identifying to which class, from a set of classes, a new instance belongs (testing), on the basis of a training set containing instances whose classes are known (training).
Training is performed using training data T = {(X_i, y_i)}_{i=1}^{n}, with n pairs of feature vectors X_i and corresponding ground-truth labels y_i. A model is built based on patterns found in the training data T using a supervised inference method, I, before being used in the testing stage. If a parametric algorithm is used to build the model, its parameters λ must be learned so as to minimize the classification error on T. In contrast, nonparametric algorithms take the labeled training data as their parameters, λ = T, without further training.
Testing is performed using a trained model with parameters λ, mapping each new feature vector X_i to a set of class labels Y = {y^1, ..., y^c} with corresponding scores P_i = {p_i^1, ..., p_i^c}, as defined in Formula (2):
P_i = I(X_i, λ)  (2)
with I the inference method. Then, the calculated scores P_i are used to obtain the maximum score and to select the corresponding class label as the classification output, as expressed by Formula (3):
ŷ_i = argmax_{j ∈ {1,...,c}} p_i^j  (3)
Three inference algorithms were selected because they are appropriate for dealing with problems involving unbalanced data [26,27].

Naive Bayes
The Naive Bayes (NB) classifier is a classification method founded on Bayes' theorem and based on estimated conditional probabilities [28]. The input attributes are assumed to be independent of each other given the class. This assumption is called conditional independence. The NB method involves a learning step in which the probability of each class and the probabilities of the attributes given the class are estimated based on their frequencies over the training data. The set of these estimates corresponds to the learned model. New instances are then classified by maximizing the function of the learned model [29].

K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a non-parametric method that does not need any modeling or explicit training phase before the classification process [30]. To classify a new instance, the KNN algorithm calculates the distances between the new instance and all training points. The new instance is assigned to the most common class according to a majority vote among its k nearest neighbors. The KNN algorithm is sensitive to the local structure of the data. The selection of the parameter k, the number of considered neighbors, is a very important issue that can affect the decision made by the KNN classifier.

AdaBoost
Boosting produces an accurate prediction rule by combining rough and moderately inaccurate rules [31]. The purpose of boosting is to sequentially apply a weak classification algorithm to repeatedly modified versions of the data, with the predictions from all of them combined through a weighted majority vote to produce the final prediction [32]. AdaBoost (AB) is an adaptive boosting algorithm that incrementally trains a base classifier by suitably increasing the pattern weights to favour the misclassified data [33]. Initially all of the weights are set equally, so that the first step trains the classifier on the data in the usual manner. For each successive iteration, the instance weights are individually modified and the classification algorithm is reapplied: instances that were misclassified at the previous step have their weights increased, whereas the weights are decreased for those that were classified correctly. Thus, as iterations proceed, each successive classifier is forced to concentrate on the training instances that were missed by previous ones in the sequence.
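The nearest-neighbour voting scheme described in this section can be sketched in a few lines, assuming Euclidean distance and a simple majority vote. The feature vectors and labels are toy values, not the paper's HLF data.

```python
from collections import Counter
import math

# Minimal KNN sketch: majority vote among the k nearest training points.
def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; returns the majority
    label among the k nearest neighbours of `query` (Euclidean distance)."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "sitting"), ((0, 1), "sitting"),
         ((5, 5), "walking"), ((6, 5), "walking"), ((5, 6), "walking")]
print(knn_predict(train, (5.5, 5.2), k=3))  # -> "walking"
```

The choice of k trades noise tolerance against locality: a larger k smooths out mislabeled neighbours but can swallow small classes, which matters for unbalanced data such as the stair actions here.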

Results
With the aim of recognizing the actions listed in Table 1, the high-level features were extracted from the orientation signals (L, w_i) according to the method detailed in Section 3.3. Then, classifiers were built using (1) features based on the static approach, (2) features based on the dynamic approach, and (3) features based on both static and dynamic approaches, each combined with the inference algorithms described in Section 3.4. To evaluate the performance of each classifier, three standard metrics were calculated: sensitivity, specificity and F-measure.
Sensitivity is the True Positive Rate TPR, also called Recall, and measures the proportion of positive instances which are correctly identified as defined in Formula (4). Specificity is the True Negative Rate TNR and measures the proportion of negative instances which are correctly identified, see Formula (5). F-measure or F 1 is the harmonic average of the precision and recall as indicated in Formula (6).

TPR = True positives / (True positives + False negatives)  (4)

TNR = True negatives / (True negatives + False positives)  (5)

F1 = 2 · (Precision · Recall) / (Precision + Recall)  (6)

where True positives are the instances correctly identified, False negatives are the instances incorrectly rejected, True negatives are the instances correctly rejected, False positives are the instances incorrectly identified, and Precision is the positive predictive value, as indicated in Formula (7):

Precision = True positives / (True positives + False positives)  (7)

k-fold cross validation was used to partition the datasets, with k = 5. The feature datasets were randomly partitioned into k equally sized subsamples. Of the k subsamples, one was used for testing the classifier and the remaining k − 1 subsamples were used for training. The cross-validation process was then repeated k times, with each of the k subsamples used once as the testing subsample. The k results from the folds were averaged to produce a single estimate.
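The metrics of Formulas (4)-(7) can be computed directly from the per-class counts; the sketch below uses illustrative counts, not the paper's results.

```python
# Per-class evaluation metrics from raw counts (Formulas (4)-(7)).
def metrics(tp, fn, tn, fp):
    tpr = tp / (tp + fn)                           # sensitivity/recall, (4)
    tnr = tn / (tn + fp)                           # specificity, (5)
    precision = tp / (tp + fp)                     # positive predictive value, (7)
    f1 = 2 * precision * tpr / (precision + tpr)   # F-measure, (6)
    return tpr, tnr, precision, f1

# Toy counts for one class out of 90 instances
tpr, tnr, precision, f1 = metrics(tp=8, fn=2, tn=75, fp=5)
print(round(tpr, 3), round(tnr, 3), round(f1, 3))
```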
In Tables 2-4, the classification results of classifiers built using static approach, dynamic approach, and both static and dynamic approaches, are shown, respectively. In general, the KNN classifiers scored the best results and AB classifiers scored the worst results.
From Table 2, the actions with the highest rate of instances correctly classified were Standing and Eating, while Ascending stairs was the action with the worst rate of true positives. For the classification using dynamic HLF summarized in Table 3, the actions with the best rate of instances correctly classified were Eating and Cooking, while Sitting was the action with the worst rate of true positives. Combining the static and dynamic HLF, three actions scored a TPR average greater than 0.9: Standing, Walking and Cooking, while Ascending stairs was again the action with the worst rate of true positives. These values are consistent for TNR and F-measure metrics too.
To analyze in detail the actions classified by the combined approach summarized in Table 4, the confusion matrices for each classifier were obtained and are shown in Tables 5-7. In general, the instances misclassified in the three confusion matrices remain within the group of actions involving mainly movements of the upper limbs (Cooking, Eating, Doing housework, Grooming and Mouth Care) or within the group of actions involving mainly movements of the lower limbs (Ascending and Descending stairs, Sitting, Standing and Walking), with the exception of three instances of the Doing housework and Eating actions, which were misclassified as Walking by the classifier built using the AdaBoost algorithm. In particular, the AdaBoost classifier correctly classified all instances of Walking and the largest number of instances of actions that mostly involve the lower limbs, confusing only three instances between pairs of similar actions: Ascending stairs-Descending stairs, and Sitting-Standing. However, six of eight instances of the Mouth Care action were incorrectly classified.
The classifiers built using Naive Bayes and K-Nearest Neighbors correctly classified most of the instances of the Cooking, Doing housework and Eating actions. However, they incorrectly classified half of the instances of Ascending and Descending stairs as Walking. Additionally, the KNN classifier confused most instances of Sitting with Standing. Table 5. Confusion matrix of the classification results of Table 4 using the Naive Bayes algorithm. Table 6. Confusion matrix of the classification results of Table 4 using the K-Nearest Neighbours algorithm.

Discussion
The set of high-level features proposed in the present study allows the classification of the actions of interest with a sensitivity close to 0.800. The most important aspect to highlight is that, although some instances were misclassified by the three classifiers, most of these instances were confused with similar actions, which reflects the consistency between the features and the joint signals used.
In order to compare with the proposed set of features, two other types of signal-based features were extracted from the orientation signals of the performed experiment. These signal-based features are subdivided into time-domain and frequency-domain features. The time-domain features are: arithmetic mean, standard deviation, range of motion, zero-crossing rate, and root mean square. The frequency-domain features are extracted from the power spectral density of the signal in the frequency bands 0-2 Hz, 2-4 Hz, and 4-6 Hz [34]. All features were extracted from the segmented time-series W, as expressed by Formula (1).
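The time-domain features listed above can be sketched for a single orientation signal as follows; the frequency-domain band energies would additionally require a power spectral density estimate (e.g. via an FFT), which is omitted here. The input signal is a toy value.

```python
import math

# Sketch of the time-domain features used for comparison: mean, standard
# deviation, range of motion, zero-crossing rate, root mean square.
def time_domain_features(x):
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    rom = max(x) - min(x)  # range of motion
    zcr = sum(1 for a, b in zip(x, x[1:]) if a * b < 0) / (n - 1)
    rms = math.sqrt(sum(v * v for v in x) / n)
    return {"mean": mean, "std": std, "rom": rom, "zcr": zcr, "rms": rms}

feats = time_domain_features([1.0, -1.0, 1.0, -1.0])
print(feats["zcr"], feats["rms"], feats["rom"])
```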
The classification results comparing the three types of features are summarized in Table 8. The results scored by the classifiers built using the three types of features are close, although the results using frequency-domain features are the worst. Conversely, the best result was scored by the classifier built using time-domain features and the K-Nearest Neighbours algorithm. As can be noticed, there is no significant difference between the values of the calculated classification metric. Even though the classification average is higher using the time-domain features, in general, the dispersion of the scores using the high-level features is smaller than the dispersion of the scores using the time-domain features.
Finally, with respect to the related work, the F-measure of 0.806 obtained in this study using data captured in daily living environments is greater than the F-measure of 0.774 reported in [16] using data captured in a structured environment, which is the closest score to our results reported in the literature, to the best of our knowledge. Regarding the number of test subjects, in [18] the authors obtained 87.18% correct classification in a study in which only one person participated. Regarding the types of actions, in [17] 98.3% correct classification was reported in a study involving six actions with high variability among them; the studied actions were training exercises, in contrast to our study, in which the actions were daily living actions. Finally, in contrast to the related work, in which the data used for classification were based on time-domain or frequency-domain features, the proposed set of features also describes how the movement of the limbs was performed by the test subjects.

Conclusions
In the present research, a set of so-called high-level features for analysing human motion data captured by wearable inertial sensors is proposed. HLF extract motion tendencies from windows of variable size and can be easily calculated. This set of high-level features was used for the recognition of a set of actions performed by five test subjects in their daily environments, and its discriminant capability under different conditions was analysed and contrasted. The average F-measure of the three classifiers built using the proposed set of features when classifying the ten actions was 0.806 (σ = 0.163).
This study enabled us to expand our knowledge about wearable technologies operating under real situations during realistic periods of time in daily living environments. Wearable technologies can also complement the monitoring of human actions in smart environments or domestic settings, in which information is collected from multiple environmental sensors as well as video camera recordings [35,36]. In the near future, we plan to analyze user acceptance issues in more detail, namely wearability and comfort limitations.
In order to carry out an in-depth analysis of the feasibility of using high-level features for the classification of human actions, other databases must be used to evaluate the set of features proposed in this research, including databases obtained from video systems, so that the performance of classifiers using this set can be evaluated. Similarly, new high-level metrics can be added to those currently considered, e.g., the duration of each movement term or the precedence between them. Also, the full set of the three types of features presented in this study (high-level features, frequency-domain features and time-domain features) will be evaluated in following classification studies.
The studied actions might also be extended with new actions that were observed during the present study, with longer durations than those considered in this research, as well as outdoor actions. In the same way, future work will address the real-time detection of the set of actions performed in daily living environments, including 'resting' and 'unrecognized' actions for those body movements that are distinct from the actions of interest.