Journal Pre-proofs

An Ensemble of Autonomous Auto-Encoders for Human Activity Recognition

Abstract

Human Activity Recognition is focused on the use of sensing technology to classify human activities and to infer human behavior. While traditional machine learning approaches use hand-crafted features to train their models, recent advancements in neural networks allow for automatic feature extraction. Auto-encoders are a type of neural network that can learn complex representations of the data and are commonly used for anomaly detection. In this work, we propose a novel multi-class algorithm consisting of an ensemble of auto-encoders, where each auto-encoder is associated with a unique class. We compared the proposed approach with other state-of-the-art approaches in the context of human activity recognition. Experimental results show that ensembles of auto-encoders can be efficient, robust and competitive. Moreover, this modular classifier structure allows for more flexible models: for example, the number of classes can be extended by including new auto-encoders, without the need to retrain the whole model.


Introduction
Human Activity Recognition (HAR) is a research field focused on the use of sensing technology to classify human activities and to infer human behavior [1].
A HAR system can use data from different sources, such as wearables, sensors embedded in objects and cameras. These systems have been successfully applied to health and well-being [2], tracking and mobile security [3] and elderly care [4].
Most HAR machine learning approaches found in the literature, such as decision trees [5], support vector machines [6] and k-Nearest Neighbor [4], rely on heuristic hand-crafted feature extraction to train their models. This includes, for example, time-domain calculations, such as the mean and standard deviation of each sensor signal, and the Pearson correlation between axes of 3D sensors.
In our previous work [7] we studied a semi-supervised ensemble, EkVN, which combined 3 different algorithms (k-Nearest Neighbour, Very Fast Decision Tree and Naive Bayes). This method relies on heuristic hand-crafted feature extraction for HAR. The features were extracted from the raw data of different types of sensors: accelerometer, gyroscope and magnetometer. We investigated the impact of some hyperparameters on the accuracy of EkVN and found that it is most sensitive to data from different users, to the window size and to the overlapping factor. We also found that the feature extraction process has relatively high energy and time costs. This can have implications, for example, in mobile applications, where the use of resources must be carefully managed to keep the application working efficiently for long periods of time.
An alternative to the manual extraction of features is automatic feature extraction with neural networks [8]. One type of neural network commonly used as a powerful tool for feature discovery is the Auto-Encoder (AE). This type of neural network learns two functions: an encoder, which maps the input to the hidden layers (the bottleneck), and a decoder, which maps the hidden layers to the output layer. In other words, an AE can learn compact representations of the input data in an unsupervised manner [9]. The output of an auto-encoder is therefore the reconstruction of its input.
In this work, an extension of [7], we propose a classification approach which is an Ensemble of AEs (EAE). In this EAE, each AE is trained with data from one class. Thus, in the context of HAR, each AE is associated with a label/activity. As new data arrives for classification, the reconstruction loss is calculated for each AE. The data is then classified with the label of the AE which obtained the lowest reconstruction loss. When used in online learning, the ensemble model can be updated with the user's data when the reconstruction loss drops below a given threshold. To the best of our knowledge there are no other approaches that use AEs as an ensemble classifier.
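The decision rule just described can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the toy "auto-encoders" and the class prototypes below are invented for the example.

```python
def reconstruction_error(reconstruction, original):
    """Mean squared error between an AE's output and its input."""
    return sum((r - o) ** 2 for r, o in zip(reconstruction, original)) / len(original)

def eae_classify(sample, autoencoders):
    """autoencoders: dict mapping a label to a callable that reconstructs a sample.
    The sample gets the label of the AE with the lowest reconstruction loss."""
    errors = {label: reconstruction_error(ae(sample), sample)
              for label, ae in autoencoders.items()}
    return min(errors, key=errors.get)

# Toy stand-ins for trained AEs: each one "reconstructs" towards its class
# prototype, so it reconstructs samples of its own class best.
prototypes = {"Sitting": [0.0, 0.0, 9.8], "Jogging": [3.0, 1.0, 12.0]}
aes = {label: (lambda s, p=proto: [(v + pv) / 2 for v, pv in zip(s, p)])
       for label, proto in prototypes.items()}

label = eae_classify([0.1, 0.0, 9.7], aes)  # a near-rest sample -> "Sitting"
```

A real EAE would replace the toy callables with trained auto-encoders, but the arg-min over per-class reconstruction errors is the same.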
We tested two variants of EAE on HAR data: an online and an offline one. Both variants learn from the same training data; however, the first also learns incrementally when the reconstruction loss falls below a user-defined threshold. Experimental results show that the EAEs are efficient, robust and competitive with state-of-the-art approaches.

This paper is structured as follows. Section 2 presents the related work on machine learning for HAR. Section 3 describes the method proposed in this study. The results obtained are presented and discussed in Section 4. Finally, Section 5 summarizes the main conclusions and points out future work directions.

Related Work
The main goal of HAR is to recognize human physical activities from sensing data. Many approaches have been presented in this research area over the last decade [11,12,13,14]. These approaches vary depending on the sensor technologies used to collect the data, the machine learning algorithm and the features created to train the model. With respect to the extraction and selection of features, the models can be trained using hand-crafted feature extraction or automatic feature extraction.
The conventional approaches in HAR use hand-crafted feature extraction, which means that they rely on human domain knowledge. Those features often include statistical information, such as the mean, variance, standard deviation, frequency and Pearson correlation [15]. These approaches use traditional machine learning methods such as SVM classifiers, k-Nearest Neighbour, decision trees, Naive Bayes classifiers and Random Forests [6,16,5,1]. Others focus on combining these machine learning approaches as ensembles in order to improve accuracy [7,1]. It is generally known that ensembles with bagging and boosting techniques can increase the performance of classifiers [17]. In most of these studies the proposed improvements focus on the tuning of hyperparameters that are common in HAR (e.g. window size and overlapping factor) [18] and on feature construction [1].
In contrast, neural network methods (a.k.a. deep learning) have the capacity to automatically learn relevant features from raw data without human domain knowledge [9]. Many different deep learning architectures have been proposed, such as Convolutional Neural Networks (CNNs) [12,14,19], recurrent neural networks [13] and AEs [20,21,22].
Mostly used for computer vision, CNN models have also been shown to be effective in natural language processing [23], speech recognition [24] and text analysis [25]. In terms of HAR, CNNs have also been used to extract features from sensing data and for classification tasks [9]. Approaches for HAR based on CNNs can learn the correlation between nearby signals and be scale-invariant for different frequencies [9,19]. Some of these approaches process each dimension of a signal (e.g. a 3D accelerometer signal) as a channel, applying a 1D convolution to each channel; the outputs from all channels are then flattened into unified layers. Weight sharing is a technique used to incorporate invariance, to reduce complexity and to speed up the training process of CNNs [9]. To use 2D convolutions, some approaches reshape the input signals as virtual 2D images [28]; to learn the dependencies between signals they apply a CNN with a 2D convolution kernel and a 2D pooling kernel. Following this idea, Jiang and Yin [29] designed a more complex process to transform the signals into a 2D image description and applied 2D convolution to extract features.
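The per-channel 1D convolution described above can be sketched as follows (a minimal illustration; real CNNs learn the kernel weights and add pooling and non-linearities):

```python
def conv1d(channel, kernel):
    """Valid-mode 1D convolution (cross-correlation) of one signal channel."""
    k = len(kernel)
    return [sum(channel[i + j] * kernel[j] for j in range(k))
            for i in range(len(channel) - k + 1)]

def per_channel_conv(channels, kernel):
    """Apply the same 1D kernel to every channel (weight sharing),
    then flatten all channel outputs into one unified layer."""
    return [v for ch in channels for v in conv1d(ch, kernel)]

# Three axes of a 3D accelerometer, each treated as a separate channel.
x, y, z = [1, 2, 3, 4], [0, 0, 1, 0], [4, 3, 2, 1]
flat = per_channel_conv([x, y, z], kernel=[0.5, 0.5])  # moving average per axis
```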
AEs are a family of neural networks which can learn a compact representation of the input signals. Stacked Auto-Encoders (SAEs), for example, stack the learned features, which can later be used to build a classification model [9]. Wang et al. [20] proposed a Continuous AE that converts high-dimensional continuous data to low-dimensional data in the encoding process; the features are extracted by AEs with multiple hidden layers. Gao et al. [21] proposed the combination of a Stacking Denoising AE for feature extraction with LightGBM as the classifier. Ensembles of AEs can also be used for unsupervised outlier detection.

For example, Chen et al. [30] proposed an ensemble of randomly connected AEs with different structures and connection densities, which reduces computational costs. The outliers are detected by computing the median of the AEs' reconstruction errors. In HAR, the features learned by Denoising Stacked AEs can be used by a random forest algorithm to build an ensemble classifier [31].

Ensemble of kVN
The EkVN is an ensemble model composed of three classifiers: kNN, Very Fast Decision Tree (VFDT) and Naive Bayes. The ensemble classifier is implemented as a combination of Democratic Co-Learning and Tri-Training [32].

This method uses a vector of hand-crafted features as input, both in its training and test phases, as illustrated in Figure 1.

The top pipeline in Figure 1 shows the offline training using raw data extracted from different wearables and/or smartphone sensors. In the first step, window segmentation & overlapping, the raw data is stored in sliding windows and consecutive windows are overlapped. The window size (w) and overlap factor (ovl) are user-defined values.
Data from sensors is usually susceptible to noise, especially accelerometer data [1]. Thus, the preprocessing step is important for calibrating and filtering the input data in order to reduce noise. After that, in the feature extraction step, a new instance is created containing the features that will be used to train the model. These features include time-domain calculations, specifically the mean, the standard deviation and the Pearson correlation of each axis for the 3D sensors. These instances are then used to train (training step) one model for each of the algorithms kNN, VFDT and Naive Bayes, which are combined as an ensemble of models.
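The hand-crafted features named above (mean, standard deviation and Pearson correlation) can be computed as in this sketch; the per-axis window layout is an assumption for illustration:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return cov / den

def extract_features(window):
    """window: dict with the 'x', 'y', 'z' samples of one 3D-sensor window.
    Returns [mean, std per axis] + [Pearson correlation per axis pair]."""
    feats = []
    for axis in ("x", "y", "z"):
        feats += [mean(window[axis]), std(window[axis])]
    for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
        feats.append(pearson(window[a], window[b]))
    return feats

w = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
feats = extract_features(w)  # 9 features for this 3-axis window
```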
In the online phase, new data is collected from a specific user. This data is preprocessed as described in the steps from the training phase: window segmentation & overlapping, preprocessing and feature extraction. Each new instance is classified by the ensemble, which provides a confidence factor for the classification. The instances classified with high confidence, above 99%, are used to update the model. In this work, we propose to use a set of AEs, as an ensemble, for classification.
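A minimal sketch of the EkVN-style confidence-based update described above (the voting scheme is simplified for illustration; the actual combination of Democratic Co-Learning and Tri-Training [32] is more involved):

```python
def ensemble_predict(classifiers, instance):
    """Majority vote over the base classifiers; the confidence is the
    fraction of ensemble members that agree with the winning label."""
    votes = [clf(instance) for clf in classifiers]
    label = max(set(votes), key=votes.count)
    return label, votes.count(label) / len(votes)

def should_update(confidence, threshold=0.99):
    # Only instances classified with very high confidence (the paper uses
    # more than 99%) are fed back to update the model.
    return confidence > threshold

# Hypothetical base classifiers standing in for kNN, VFDT and Naive Bayes.
clfs = [lambda x: "Walking", lambda x: "Walking", lambda x: "Jogging"]
label, conf = ensemble_predict(clfs, instance=None)
```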

Ensemble of Auto-Encoders
The code is available on GitHub. Figure 3 illustrates the steps for training the Ensemble of Auto-Encoders (EAE) (offline phase) and how it can be used for classification (online phase). In the offline phase, a batch of data from multiple users is used to train one AE per class. Thus, each AE learns a different activity.
In the online phase, as new data arrives, each AE tries to reconstruct the original input. The AE with the smallest reconstruction error, minError, is then selected, and the data is classified with the label corresponding to that AE. During this online phase, each AE is updated whenever the reconstruction error falls below a user-defined threshold, T. By default, we define this threshold as X standard deviations of the training error. The threshold is a hyperparameter to set a high confidence factor, as in the method explained in Section 3.1. In both offline and online phases, the raw data is segmented according to a user-defined window size (w) and an overlapping factor (ovl).

To illustrate how the EAE works, we present a simple example in Figure 4.
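The online decision and update rule can be sketched as follows. The exact form of T is not fully specified in the text; here we assume T = mean training error + X standard deviations, so that interpretation is ours:

```python
def make_threshold(train_errors, x_stds=2.0):
    """Assumed form of T: mean training reconstruction error plus
    x_stds standard deviations of that error."""
    n = len(train_errors)
    m = sum(train_errors) / n
    sd = (sum((e - m) ** 2 for e in train_errors) / n) ** 0.5
    return m + x_stds * sd

def online_step(errors, thresholds):
    """errors: label -> reconstruction error of the new sample under that AE.
    Returns the predicted label and whether the winning AE should be updated."""
    label = min(errors, key=errors.get)
    return label, errors[label] < thresholds[label]

# Per-class thresholds from (made-up) training errors of each AE.
T = {"Walking": make_threshold([0.10, 0.12, 0.08]),
     "Sitting": make_threshold([0.05, 0.06, 0.04])}
label, update = online_step({"Walking": 0.09, "Sitting": 0.30}, T)
```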

The red line represents the real signal (used as the input data) and the blue line depicts the signal reconstructed by each AE. In this example, the model is composed of 6 AEs, each trained with data from one of the following activities: Walking Downstairs, Jogging, Sitting, Standing, Walking Upstairs and Walking. In Figure 4 one can see that the AE which best reconstructs the signal is the Sitting AE. Therefore, the model classifies this activity as Sitting. Finally, if its error is below the defined threshold T, the AE is updated with this new signal.

Experiments
We conducted several experiments to compare the predictive performance of the EAE with the EkVN and 5 other deep learning approaches. These methods are briefly described in Section 2 and, as in [33], will be referred to by the names of their authors: ChenXue [26], HaChoi [27], Haetal [28], JiangYin [29] and Panwaretal [19]. The performance of all the methods was tested on 3 datasets commonly used in the HAR literature.

Datasets
Table 1 presents some statistics and a brief description of each dataset. The WISDM dataset [34] contains sensor data from phone-based accelerometers, collected by an application installed on each user's phone.

We analyzed the performance of all the tested methods in terms of accuracy and computational cost. The latter was measured in seconds, both in training and in testing. For a fair comparison between the models, we only used the accelerometer data from each dataset. We trained the models with a fixed window size, w, of 160. Although other alternatives can be considered for the choice of the window size, for example a dynamic window size [36], they would require additional steps such as compression or concept drift detectors, which would increase the computational cost. Therefore, for simplicity, we used a fixed window of w = 160. In practical terms, this represents 8.0 seconds for WISDM (20 Hz), 3.2 seconds for MHealth (50 Hz) and 1.6 seconds for PAMAP2 (100 Hz).

Consecutive windows are overlapped with an overlap factor, ovl, of 20% [7]. For the evaluation we used the leave-one-user-out approach.
We note that in the case of the proposed EAE the input is a vector with 480 entries, consisting of the 3 components of the accelerometer sensor: x-acceleration, y-acceleration and z-acceleration.
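The segmentation and the 480-entry input can be sketched as below; the step between window starts is w * (1 - ovl), and the axis-major ordering of the flattened vector is an assumption (the paper only states its size):

```python
def segment(samples, w=160, ovl=0.2):
    """Split a stream of (x, y, z) samples into overlapping windows.
    With w=160 and ovl=0.2 the window start advances by 128 samples."""
    step = int(w * (1 - ovl))
    return [samples[i:i + w] for i in range(0, len(samples) - w + 1, step)]

def to_input_vector(window):
    """Concatenate the three axis streams into one 3 * w = 480-entry vector."""
    return [s[axis] for axis in range(3) for s in window]

stream = [(0.0, 0.1, 9.8)] * 416       # fake 20 Hz accelerometer stream (~20.8 s)
windows = segment(stream)              # windows start at samples 0, 128, 256
vec = to_input_vector(windows[0])      # EAE input: 480 entries
```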

For the EkVN, we created the features mean, standard deviation and Pearson correlation. We also used a confidence factor of 99% for updating the model.

Results and Discussion
In this section, we present and discuss the main results from the experiments.
We note that, in the experiments that include the deep learning models, we present the results of our method with incremental learning, called EAE, and of the model without incremental learning, called EAE Off. We present both versions so we can analyze the improvement brought by online model updates and also allow a fair comparison with the other deep learning models, which are not updated online.

WISDM Dataset
Considering the WISDM dataset, Figure 6 shows 8 violin box-plots representing the variation/dispersion of the accuracy per model. In Figure 6 we notice that EAE has less dispersion in accuracy than the other models. The median accuracy is around 87% for EAE, while for EkVN it is around 80%. As for the other models, the median accuracy is around 87%; however, their variance is larger than the variance of EAE. The lowest accuracy can, in some cases, reach less than 25%. Table 2 reports the average accuracy and time consumption of each model. In Figure 7 we see the accuracy per user for the models EAE and EkVN.
The EAE obtained a higher accuracy for 78% of the users as compared with EkVN. One of the most striking differences is in user 30, where the accuracy of the EAE model is 71% while the accuracy of the EkVN model was only 16.7%.
As mentioned before, we do not have demographic information about the users; however, we observe that the misclassification between Walking and Jogging was more evident in some users than in others. Since the difference between these activities lies in the intensity of the movement, it could have been useful to compare the physical characteristics of the users with the classification results.
When looking at the confusion matrix of the EAE (Table 3), we can observe that the classes with higher misclassification are Downstairs and Upstairs. They are often misclassified as each other or as Walking. One difference between the classes Downstairs and Upstairs is the orientation of the activity: one is descending stairs and the other is ascending stairs. This concept might be hard to learn from accelerometer data alone, since this sensor does not capture the orientation of the movement. On top of that, we also notice that the Downstairs and Upstairs AEs were trained with less data than the other classes (Figure 5), which makes it even more difficult for the models to learn them.

MHealth Dataset
In Figure 8, we can observe the dispersion of the accuracy obtained by the models on the MHealth dataset. The median accuracy of the EAE model is above 90%. We note that the EAE has less variance than EAE Off, meaning that the incremental learning reduces variance. The models ChenXue, HaChoi and JiangYin had higher variance than the other models. Although EkVN has the lowest variance, its median accuracy is only 75%.
For this experiment we consider only data from the hand body location. Table 4 shows the average accuracy and time consumption of each model. In terms of time consumption, once more, the models with simpler architectures (HaChoi and ChenXue) are faster to train. The EAE takes more time in the prediction phase, especially because this phase includes the incremental learning of the model. Since this is an ensemble, the number of models influences the time consumption of its testing phase.
When comparing the accuracy per user of the models EAE and EkVN (Figure 9), we can observe that the EAE was better for all individuals. This shows, once again, that the proposed method can learn meaningful representations of the activities.
Looking at the confusion matrix of the EAE model (Table 5), we see that the class Stairs has an average accuracy of 93%. This class combines data from Downstairs and Upstairs, which shows that the model learns better from classes that are independent of the orientation. The class Running was misclassified as Jogging more often than the other way around, which might be related to the pace at which each individual performs these activities.

PAMAP2 dataset
For the PAMAP2 dataset, we see in Figure 10 a high dispersion in the accuracy of the models, especially for ChenXue, HaChoi and Panwaretal. Although EkVN also has a high dispersion of the accuracy, it is the model with the highest median accuracy, around 70%. The model EAE has a median accuracy slightly above 60%, presenting a small improvement compared with the offline variant, EAE Off.

The maximum accuracy reached by EAE is 0.91, which is the highest among the deep learning methods. We also observed that the minimum accuracy of the EAE is always the highest in all the datasets tested; in this case, the lowest value is 0.44, which is the same as for EkVN.
In Table 6 we see that the average accuracy of the EkVN is the highest, meaning that traditional models can achieve better performance on some datasets.
Haetal is the deep learning model with the highest average accuracy. Since the EAE is an ensemble of several models, it is natural that it shows a higher time consumption.
Considering the results per user of the models EAE and EkVN (Figure 11), we see that the accuracy of EkVN was slightly higher for all individuals. However, for this dataset, the analysis per user is not an easy task because the users did not perform the activities in equal proportion. For example, user 9 only performed the activity Jumping. This is reflected in Figure 5, where we can see that there is less data for some classes. This smaller amount of data has obvious implications for the deep learning methods, which are known to require more data than classical approaches. This is more evident for the activities with less data, which the models did not learn as well as in the other datasets. Another possible reason is that this dataset was collected at a frequency of 100 Hz (see Table 1).

Accuracy per body location
We keep the same AEs; however, we consider as a true positive, when the classification is correct, any class belonging to the same super-class (Table 9). The average accuracy was 74.1%, which is higher than the 60.0% shown in Table 7. This shows that AEs of similar activities obtain smaller errors. In particular, the misclassification between the Dynamic Positions and Static Positions super-classes is quite low.

Conclusions

Experimental results show that the proposed EAE is competitive with existing methods found in the HAR literature. We observed that the minimum accuracy of the EAE is always the highest in all the datasets tested. From this we can conclude that the EAE is more robust to data from different users, which is also supported by the low variance in accuracy.
We note that the presented results were obtained from models trained with accelerometer data only, which is usually a more challenging classification task.
Moreover, a single, simple architecture was used for all AEs in all datasets, without hyperparameter tuning.
The modular structure of the EAE proposed in this work has the advantage of making the model easily adaptable. First of all, in the case of online learning, only the AEs corresponding to the most frequent activities are updated, which can save computation time. In this way it is not necessary to retrain the whole model, as would be required for most machine learning models. The EAE can therefore specialize in the most performed (or preferred) activities of each user. Moreover, this modular structure also facilitates the inclusion of new activities when needed: it is only necessary to add more AEs and train one for each new class. Likewise, the model can forget activities by simply removing the respective AEs from the ensemble. Finally, another advantage of the EAE is that each AE can have its own architecture and even use different types of layers, such as recurrent or convolutional ones.
In terms of time consumption, we see that models with more complex architectures are slower to train than simpler ones. In that sense the EAE, even though it comprises multiple models, has a time consumption similar to other deep learning models. We also note that the prediction times reported in the results consider all the instances used in each experiment, which means that the prediction of each instance took less than half a second.
As future work we intend to combine AEs with different architectures in the same ensemble.