1 Introduction

Human activity recognition and classification is a subject of recent interest on which many works have been presented, proposing different applications [1,2,3,4,5] intended to facilitate the daily life of human beings by promoting an automated interaction with their environment.

An important aspect to be considered in human activity recognition is the data source to be used. The use of different types of sensors has been proposed, as in the work presented by Arnon [6]. In recent years, on the topic of recognition and classification of children activities, different data sources have been proposed, such as video cameras, accelerometers, and radio-frequency devices, as in the work presented by Kurashima et al. [7]. Most works in this area collect information by embedding the sensor directly into a child’s garment to record activity data, as proposed by Nam et al. [8]. This form of data capture has the disadvantage that the devices placed in the garments can interfere directly with the natural actions of the children, preventing them from performing normally the activities to be analyzed.

One way to solve this problem is to change the data source to one that does not interfere with the activities performed by the study subjects. Under this idea, environmental sound has been used as a data source to recognize and classify human activities, as in the works presented by Leeuwen [9] and Galván-Tejada et al. [10], since data capture goes unnoticed by the study group and thus does not interfere with the activities to be analyzed.

Using environmental sound as the data source in child activity classification models is a major challenge, due both to the complexity of the audio signal analysis process and to the environmental factors that may interfere during data capture, which can leave the recorded samples without the features necessary for their analysis. Therefore, adequate processing of the audio samples and the choice of an appropriate model that optimizes the activity recognition process are of vital importance.

Correct processing of the audio signals requires extracting the features on which the classification model will be based. Given these features and a set of labeled training examples (samples whose class, i.e., the type of sound they belong to, is known), it is possible to construct and train a model that predicts the class of a new sample. Once the classification model is constructed, an activity can be recognized from an audio signal by passing the signal through the model, which predicts the kind of sound it belongs to based on the information with which the model was trained. In the present work, the accuracy of five classification algorithms, Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Random Forests (RF), Extra Trees (ET), and Gradient Boosting (GB), is compared in the generation of a model for the recognition and classification of children activities using environmental sound as a data source.

The activity classification models are constructed by executing the classification algorithms on the data obtained from the audio samples in the feature extraction stage. In the first phase of the proposed methodology, these models are built using the 34 extracted features present in the dataset. Nevertheless, in order to develop a more efficient classification model that can be used in mobile applications, a feature selection step is performed to reduce the number of features. Therefore, in this proposal the Akaike information criterion is applied to re-generate the models with a reduced set of features, and the results obtained are finally compared in terms of accuracy.

This paper is organized as follows. The present section introduces children activity recognition. Materials and methods are described in Sect. 2. Section 3 reports the results obtained with the methodology. The discussion and conclusions of this proposal are presented in Sect. 4. Finally, future work is outlined in Sect. 5.

2 Materials and Methods

To compare the efficiency of the classification algorithms SVM, kNN, ET, RF, and GB in the generation of a model for the recognition and classification of children activities using environmental sound data, together with a feature selection process based on the Akaike criterion, five main stages were performed: data acquisition, feature extraction, classification analysis based on the complete set of features, feature selection, and classification analysis based on the set of selected features.

The feature extraction was performed using the Python programming language [11], while the feature selection and the classification analysis were performed using the free software environment R [12].

2.1 Data Description

In the majority of the works presented on children activity recognition, it is common to analyze activities detectable through movement, such as walking or running, because these works use motion sensors such as accelerometers as the data source, as in the works presented by Boughorbel et al. [13] and Nam et al. [8]. In order to analyze different kinds of activities, in the present work the dataset is composed of recordings of four activities commonly performed by children from 12 to 36 months of age in a residential environment: crying, running, walking, and playing (manipulating plastic objects), two of which (crying and playing) are not detectable through motion sensors. For the conformation of the dataset, 10% of the sounds were generated by the authors and 90% were acquired from the Internet [14,15] through a search of audio clips about children activities carried out on October 3, 2018.

Table 1 shows the description of the activities analyzed in this work.

Table 1. General description of activities.

Recording Devices. To make the recordings of the audio clips corresponding to the part of the generated data, the devices used were a Lanix Ilium s620 (MediaTek MT6582 quad-core, Android 4.2.2) and a Motorola Moto G4 (Snapdragon 617, Android 6.0.1).

Metadata. From the process of recording the audio clips using different devices and configurations, as well as from the recordings taken from the Internet, the dataset of this work includes audio clips with sample rates between 44100 Hz and 96000 Hz, in stereo and mono channels. Table 2 shows the metadata of the audio clips in the dataset for each activity. The characteristics presented in Table 2 ensure an acceptable quality for the recorded audio files, and they define the parameters required for future recordings intended to expand the dataset.

Table 2. Audio clips metadata per activity.

2.2 Feature Extraction

Feature extraction is the process by which information is obtained from the audio clips. This information is used to differentiate the type of activity to which a recording belongs, since the audio clips of each type of activity yield different values for the extracted features.

Because the dataset contains audio files of different lengths, these were divided into 10-s clips, so that all analyzed samples have the same length. Each 10-s clip is transformed into an array in which each position holds the magnitude of the corresponding feature for that clip. Table 3 shows the set of 34 features extracted for each 10-s clip. To prevent problems arising from the difference in channels among the recordings (mono and stereo), all samples were converted to mono.
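As an illustration, the segmentation and downmix steps described above can be sketched in Python with NumPy (a minimal sketch under the stated assumptions; the function names are illustrative and are not the authors' original scripts):

```python
import numpy as np

def to_mono(signal):
    """Downmix a stereo signal (shape [n_samples, 2]) to mono by
    averaging the channels; mono input is returned unchanged."""
    return signal.mean(axis=1) if signal.ndim == 2 else signal

def split_into_clips(signal, sr, clip_seconds=10):
    """Split a mono signal into fixed-length clips, discarding the
    trailing remainder so every clip has identical length."""
    n = clip_seconds * sr
    n_clips = len(signal) // n
    return [signal[i * n:(i + 1) * n] for i in range(n_clips)]
```

Discarding the trailing remainder is one simple way to guarantee equal-length samples; padding the final fragment would be an alternative.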

Table 3. Features extracted.

It is important to mention that this set of features was chosen because they have been commonly used in related audio processing works [16,17,18,19], especially the mel-frequency cepstral coefficients (MFCCs), which are among the most robust features in the area of recognition and classification of activities using sound [20,21,22,23,24].
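Two features that commonly appear in such feature sets, the zero-crossing rate and the spectral centroid, can be computed per clip as in the NumPy-only sketch below (in practice, libraries such as librosa implement these along with the MFCCs; the function names here are illustrative):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign differs;
    roughly proportional to the dominant frequency of the frame."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of the frame's spectrum,
    i.e. the 'center of mass' of the signal's energy in Hz."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
```

For a pure 1000 Hz tone, for example, the centroid lands near 1000 Hz and the zero-crossing rate near 2 × 1000 / sr, which is why such features help separate tonal sounds (crying) from broadband ones (footsteps).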

2.3 Classification Analysis Based on the Features Extracted

For the classification analysis, the 34 extracted features were subjected to the five classification algorithms, SVM, kNN, RF, ET, and GB, generating five children activity classification models, one per algorithm.

The classification algorithms used in this work are supervised learning algorithms, so they must first be trained on known data, using a training dataset (70% of the total samples), before they can automatically classify new data in a blind test using a testing dataset (the remaining 30% of the samples).
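A minimal sketch of this 70/30 protocol, using scikit-learn defaults, is shown below (the paper's models were fit in R, so the hyperparameters and function names here are illustrative assumptions, not the authors' configuration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def compare_classifiers(X, y, seed=0):
    """Train the five classifiers on a stratified 70/30 split and
    return each model's accuracy on the held-out 30% (blind test)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y)
    models = {
        "SVM": SVC(),
        "kNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(random_state=seed),
        "ET": ExtraTreesClassifier(random_state=seed),
        "GB": GradientBoostingClassifier(random_state=seed),
    }
    return {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
            for name, m in models.items()}
```

Stratifying the split keeps the four activity classes in the same proportion in both partitions, which matters when the classes are unbalanced.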

Finally, the accuracy of each classification model was computed so that the models could be compared with each other.

2.4 Feature Selection

In this stage, a feature selection process based on the Akaike information criterion (AIC) [25,26] is performed to reduce the number of features, selecting those that carry the most significant information for differentiating the classes to which the audio samples belong.

The principle of this technique is to generate models from combinations of the 34 extracted features through stepwise regression, combining forward selection and backward elimination, and subsequently to calculate the AIC for each of these models. The models are then ranked according to their AIC, the best being the one with the lowest AIC [27].
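The paper performs this selection in R; purely as an illustration of the principle, a greedy forward-selection pass driven by the AIC of an ordinary least-squares fit can be sketched in NumPy as follows (a full stepwise search would also include backward elimination, and all names here are illustrative):

```python
import numpy as np

def aic_ols(X, y):
    """AIC of an ordinary least-squares fit, up to an additive
    constant: n * ln(RSS / n) + 2 * k, where k counts parameters."""
    n = len(y)
    Xi = np.column_stack([np.ones(n), X])        # add an intercept
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    rss = np.sum((y - Xi @ beta) ** 2)
    return n * np.log(rss / n) + 2 * Xi.shape[1]

def forward_select(X, y):
    """Greedily add the feature that lowers the AIC the most,
    stopping when no remaining feature improves it."""
    remaining = list(range(X.shape[1]))
    selected, best_aic = [], np.inf
    while remaining:
        aic, j = min((aic_ols(X[:, selected + [j]], y), j)
                     for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic
```

The 2k term penalizes every added feature, so a feature survives only if its improvement in fit outweighs the extra model complexity, which is exactly why the procedure can discard 7 of the 34 features here.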

2.5 Classification Analysis Based on the Features Selected

The classification analysis based on the features selected was carried out only with the set of features that belong to the combination of those with the lowest AIC, since they are the ones that best describe the difference between the analyzed classes.

Finally, as in the classification analysis based on the full set of extracted features, a validation is performed to compare the accuracy of each model, in order to evaluate which approach yields the most significant results: classification based on the total number of extracted features, or classification based on the selected features.

3 Results

From the data acquisition, a total of 146 recordings were obtained (considering both the own recordings and those taken from the Internet), which were divided into 2,716 10-s clips. Table 4 shows the number of recordings obtained for each activity as well as the number of 10-s clips generated.

Table 4. Audio clips per activity.

A total of 34 features were extracted for each 10-s clip, so the database for the comparison of the classification algorithms consisted of 2,716 records with 34 features each.

From the classification analysis based on the full set of extracted features, Table 5 shows the true positives obtained with each classification technique, Table 6 summarizes the accuracy by activity, and Table 7 shows the average accuracy of each technique over the whole set of analyzed activities.

All classifiers achieve an accuracy equal to or greater than 0.90.

Table 5. True positives for each classification technique based on the features extracted.
Table 6. Accuracy for each classification technique based on the features extracted.
Table 7. Average accuracy for the features extracted.
Table 8. Features selected.

In the feature selection stage, a set of 27 features was selected, shown in Table 8.

From the classification analysis based on the features selected, the true positives obtained are shown in Table 9, while Table 10 summarizes the accuracy by activity. Table 11 shows the average accuracies for each classification technique.

All classifiers achieve an accuracy between 0.89 and 0.97.

4 Discussion and Conclusions

The objective of this research is to compare the efficiency of five classification techniques in the generation of a recognition and classification model of children activities using environmental sound data, comparing the classification accuracy obtained with the full set of extracted features against that obtained with a reduced set of features selected through an AIC approach.

Table 9. True positives for each classification technique based on the features selected.
Table 10. Accuracy for each classification technique based on the features selected.
Table 11. Average accuracy for the features selected.

From the results presented in Sect. 3, it can be observed that, for the four analyzed activities, the best model on average is initially the one generated by the ET classification technique, followed by GB, RF, kNN, and SVM, respectively, all with an accuracy equal to or greater than 0.90. These five initial models were generated using the 34 features extracted from the audio samples, which represent 100% of the data.

In the next phase, the feature selection process was performed, selecting a set of 27 features according to the AIC, which represents a reduction of about 20% in the data used for the development of the classification models.

When the recognition and classification models were regenerated using the reduced dataset, the results show practically identical accuracy across the classification techniques, with the RF and ET techniques even improving their accuracy values.

According to these results, the set of 27 selected features classifies activities with a performance similar to that of the complete set of 34 features, reducing the amount of data needed by about 20% while practically maintaining, or even improving, the accuracy of the models.

The reduction in the number of features is important because, when classification techniques are subjected to large amounts of information, the response time tends to grow significantly, increasing the computational cost. Moreover, activity recognition and classification models are usually designed to be implemented in mobile applications, so it is important to optimize the amount of data to be handled and to reduce the processing cost.

5 Future Work

As part of the future work, it is proposed to add to the analysis more activities common among children in the established age range, as well as to perform a validation analysis of the dataset to establish whether the number of features and samples is optimal for the type of study being conducted.

Another important aspect is to improve the feature selection process by finding a mechanism that further reduces the set of features needed to describe the analyzed phenomena or activities, thus reducing the size of the database with which the algorithms work and the models are generated.