Smartphone‑sensor‑based human activities classification for forensics: a machine learning approach


Introduction
In the present era, mobile phones are among the most ubiquitous devices, used by almost everyone. The latest generations of mobile devices provide intelligent assistance to users rather than serving only for information transfer [1]. Smartphones have features such as cameras, GPS and web browsers, as well as embedded sensors such as accelerometers, gyroscopes and magnetometers, which are used for the detection, recognition and measurement of human motion activities [2].
Human activity recognition (HAR), illustrated conceptually in Fig. 1, is the application of data collected from sensors, usually embedded in smartphones, to accurately recognize and classify the states of a person's activities, independent of external noise [1]. Smartphone data are used to reveal contextual and situational information on different human activities. Most of the sensors are specific; hence, they are only applied in situations where they yield their best performance. For example, an accelerometer and a gravity sensor perform optimally in measuring acceleration and gravity, respectively. Prior to model training, data are collected by the smartphone sensors and stored temporarily in a database on a server. This is followed by a codebook-based feature-learning approach, which generates feature vectors by encoding the data coming from the smartphone's sensors, with parameter tuning [3], for the different groups of human activity recognition. Figure 1 is the concept diagram that illustrates the steps involved in the use of smartphones in HAR classification.
HAR can be grouped into four major groups, namely group activities, real-time activities, individual activities and daily living activities. These activities can be categorized as static or dynamic in nature. Static activities account for 18% of the total HAR while dynamic activities account for 24%. When the individual performing the activity is fixed with respect to an observer, the activity is said to be static; when the individual moves consistently, the activity is said to be dynamic [4].
The diversified nature of human activities makes them not readily recognizable, hence the necessity of applying machine learning algorithms to recognize and monitor them accurately [5]. Apart from sectors like business, healthcare, safety and transportation, where human activity detection and classification are becoming widespread, forensics is a potential field where smartphone data could be used. This is due to an increase of at least 20% in crimes after 2021 [6,7]. Forensics is the application of scientific approaches, procedures or expertise to investigate or probe into evidence from crime scenes that can be presented before a judge in a criminal procedure court [8]. Though some areas of criminalistics involve straightforward scientific procedures, many challenges remain, including the reliability and acceptability of the results obtained by a particular forensic method and how easily these results can be interpreted and understood by a judge or by a non-expert of the field.

Fig. 1 The concept diagram for HAR. The figure shows an example of a crime scene where a woman is being attacked by an armed robber. The smartphone containing the sensors is placed in her handbag. Data are collected during the crime, stored and later used to predict the type of activity that was involved

Nowadays, companies, healthcare, education, transport and many other sectors store information electronically. This causes the e-discovery challenge to become expensive and complicated [9]. It has caused tremendous problems and huge challenges for lawyers in large-scale litigation, which entails the management of a huge volume of data (documents). The review of these legal documents and the acceptance and interpretability of forensic results have been (and are) burdensome for legal experts. To mitigate these problems, computer-aided techniques have been introduced to process these documents faster, with machine learning being the backbone of this novelty in the legal community [9]. Interpretation of stored smartphone sensor data, collected during the occurrence of a crime, could provide additional evidence needed for a successful prosecution.
The focus of this work is the development of a predictive machine learning model that will accurately classify the patterns of ten simulated crime scene scenarios based on HAR, grouped into three classes recognizable by criminal law. The goal is to precisely classify these human activities and behaviors related to crime scenes. The predictive models will be evaluated in terms of reliability, acceptability and their quick and easy interpretability by judges and other non-experts. This is because the application of ML in the legal community has caught the attention of several researchers in the past decade, with significant state-of-the-art advances made in the area of human activity and behavior classification. This has eased some of the challenges faced by experts in the legal community.

Related works
Table 1 summarizes the machine learning works closest to this paper. Other related works include: (1) a comprehensive review study done by Muhammad et al. [4]; (2) the state-of-the-art results obtained by Saponas et al. [8] and Moco et al. [10], focusing on motion-related activities like standing, running and walking, which use data from a smartphone, emanating from a set of users within a particular time frame; (3) Khan et al. [11], who used human motion-based data from mobile sensors (accelerometer, gyroscope and magnetometer) to distinguish between "normal" and "brick" walking, where brick walking refers to lame or dangling walking patterns; (4) Mylonas et al. [12], who summarized the use of smartphone data in forensics and their legal implications; (5) Keeling et al. [9], who introduced the predictive coding or technology-assisted review (TAR) approach, which gives an insight into how machine learning could help ease the burden on lawyers, and who showed, by undertaking 34,000 experiments in search of the best machine learning algorithm, that no machine learning algorithm was superior to another in the legal field; and (6) Jahangiri et al. [13], who implemented KNN and SVM to identify transportation mode by using mobile sensor data. Most of these works do not focus on real-time smartphone sensor data during the pursuit of a crime, are based on instantaneous classification decision making and do not focus on the criminal law criteria of crime events, hence the necessity of this work to portray real HAR events in criminalistics.
Sukhada et al. [6] is the closest work to this paper related to HAR in digital forensics. That work focused on binary classification (suspicious or non-suspicious activities) and used a black-box technique in feature extraction. Time-domain features were used to train a KNN and a subspace discriminant model for the classification of activities in digital forensics. This paper takes into account the static and dynamic natures of multiple activities in classification, involving ten different types of activities.

Novelty
The contribution of this work is the accurate classification of real-time patterns of human activities on crime scenes, based on ten different crime scene activities grouped into three classes as per judiciary considerations, with the classification decision taken over the entire period of an activity and not at time instances of the activity. This is the first paper focusing on this procedure. The association of an ensemble of static and dynamic activities as well as their intensities is considered, in contrast to [7,8], which focus solely on motion-based activities (walking, climbing, sitting, etc.). A combination of data from five different sensors (accelerometer, gyroscope, gravity, light intensity and orientation) is used, in contrast to the three sensors used by the previous works. Accurate real-time recognition and classification of static and dynamic human activities, which helps to precisely materialize crime scene behavior during forensics, is achieved. These complement the works done in [1-8, 10-12], which are solely focused on static activities like slow movement, standing still and falling.

Protocol specification
In order to mature the novelty of this work as mentioned above, the following questions were to be answered:
• Question one (Q1): Can several crime scene activities be mimicked, for data emanating from them to be recorded by smartphone sensors, that would be used to train predictive models?
• Question two (Q2): Can data from static and dynamic crime scene activities be used simultaneously to train a predictive model?
• Question three (Q3): Can these crime scene activities be successfully classified using several sensor data?
• Question four (Q4): Does the application of time- and frequency-domain features improve the classification accuracy of these activities?
The attempts made to answer these questions are elaborated in the rest of this paper, following the work flow chart in Fig. 2. For the rest of the paper, (Q1), (Q2), (Q3) or (Q4) will be mentioned where the corresponding question is addressed, and a comprehensive pipeline of the work done will be described for its easy replication or use in the classification of activities with different datasets.

Paper organization
This paper is organized as follows: the "Introduction" section gives an insight into the recent trends in the classification of human activities using data from smartphone sensors and a brief description of forensic science. The "Materials and methods" section gives an overview of the proposed models and details the materials and the methods applied in this work. This is followed by the "Results and discussions" section, which presents the results obtained and a comprehensive discussion of them. The "Conclusion and perspectives" section rounds up this work with a conclusion that covers how this work could be improved in future research.

Data description
To obtain a genuine outcome in this project, a privately collected dataset from [14] was used. The data were recorded from five sensors in a smartphone (Q3). These sensors were used to separately record five different parameters related to the behavior of a subject in a mimicked crime scene (Q1). The parameters are acceleration, gravity, orientation, gyroscope data and light intensity. These parameters were collected for ten activities, with each activity corresponding to data from a child or an adult, performing different activities during the day or recreating the scenario of a particular crime scene.

Data recording
Two or three independent repetitions were conducted for each activity. This was aimed at varying the data sets, to prevent any eventual model overfitting and to make the trained model robust. We later grouped them into three classes. These classes were classified in accordance with the criminal law in [15] to help judges take quick decisions. Class 1 can be viewed as an escape from a "felony" attack, class 2 as a normal activity and class 3 as a "felony" attack (the most serious crimes that can be lethal).

Fig. 2 The flow chart of the work done. Stages shown: raw data of 10 mimicked crime scene activities [14]; filtering, removal of outliers and NaN values, filling missing data, handling imbalanced data; time- and frequency-domain features, hierarchy and selection by chi-square, MRMR and ReliefF tests; DT and SVM implemented and hyperparameters tuned; classified activity

The data were recorded through the "Sensor Logger" app using a VIVO Y50 (VIVO 1935) phone, version 10, with a 2 GHz processor and 8 GB of RAM. The sampling rate for all the sensors is 100 Hz, except for the light sensor, whose data were collected without a sampling rate. Accelerations were recorded at ± 6 g and the gyroscope was calibrated at ± 2000 deg/s. Segmentation was done with an optimized window size of 40 ms. This window size did not reduce the size of the dataset and ensured the capturing of signals resulting from micro activities like ECG. To avoid data loss at the edges, a 50% overlap was implemented as in [15]. The data for the five signals were recorded for approximately 10.5 s per activity. Each activity was realized by mimicking the behavior of a real crime scene with the smartphone firmly attached to the arm of the subject, and the classification decision was made at the end of the 10.5 s corresponding to the entire duration of the activity considered in this work. Table 2 shows the different activities and the classes constituted from them, the number of repetitions made during recording and their classification as static or dynamic (Q2). The table also shows how the different repetitions of the activities were distributed between the subsets used for training and testing of the proposed models. Two of these subsets are used for training and the other is left out and used for testing. The subsets were permuted and cross-validation was performed. An "x" on a subset means that there is no data entry in that set for the activity concerned. This distribution strategy was adopted to reduce the effect of data imbalance between the sets.
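The segmentation described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used for this work; the function name `sliding_windows` and the placeholder signal are hypothetical, and only the 100 Hz sampling rate, the 40 ms window and the 50% overlap come from the text.

```python
def sliding_windows(samples, fs_hz=100.0, window_s=0.040, overlap=0.5):
    """Split a 1-D sequence into fixed-length windows with fractional overlap."""
    size = max(1, round(window_s * fs_hz))        # 40 ms at 100 Hz -> 4 samples
    step = max(1, round(size * (1.0 - overlap)))  # 50% overlap -> hop of 2 samples
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


# Placeholder for one sensor axis recorded for 10.5 s at 100 Hz (1050 samples).
signal = list(range(1050))
windows = sliding_windows(signal)
print(len(windows), windows[0], windows[1])
```

With these settings each window holds four samples and consecutive windows share two of them, so a sample near the edge of one window also appears in the interior of a neighbouring window.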

Data preprocessing
The data used in this work were raw in nature and hence might be corrupted by artifacts. These artifacts could be noise, degradation of the signal (leading to missing values), imperfect measurements due to the degradation of the smartphone sensors, etc. These artifacts are not part of the patterns of the recorded signals and, if not removed, will cause the overfitting of the model. This is because the algorithm will find a target function that fits all the data points (including the noise). Hence the resulting function will work well with the training data but will fail with the test data. Therefore, the conditioning of the data was essential prior to the training of the machine learning algorithm proposed in this work to classify simulated human activities. MATLAB R2022a was used for data preprocessing, feature computation and the classification task. To ensure the high performance of the proposed algorithm, the artifacts that corrupted the data were handled by: (1) linear interpolation, to replace missing data, and (2) filling the imbalance of the observations in activities 2 and 5 with data from previous observations as in [16] and distributing the activities between the classes as shown in Table 2. This was to make the number of observations the same for all the activities. The data were initially plotted to pinpoint the outliers, which constituted less than 0.1% of the data and were simply deleted from each class of data. Low-frequency artifacts from usual body motions and possible noise emanating from the accelerometer were attenuated by a second-order Butterworth band-pass filter, with cutoff frequencies of 0.05 Hz and 0.2 kHz [17]. This frequency range was chosen in order not to miss any contribution of ECG or EEG signals that may be generated during the recording of the human activity data. A notch filter centered at 60 Hz was implemented to filter out any noise coming from the power source and processing devices.
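As an illustration of step (1) above, the sketch below fills missing samples by linear interpolation. It is written in Python rather than MATLAB, the helper name `fill_missing_linear` is hypothetical, and the handling of gaps at the start or end of a recording (copying the nearest valid sample) is an assumption, since the text does not say how such gaps were treated. The filtering steps are not reproduced here.

```python
import math


def fill_missing_linear(values):
    """Replace None/NaN entries by linear interpolation between valid neighbours.

    Gaps at the start or end are filled with the nearest valid value
    (an assumption, not a detail given in the text).
    """
    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    valid = [i for i, v in enumerate(values) if not missing(v)]
    if not valid:
        raise ValueError("no valid samples to interpolate from")

    out = list(values)
    for i, v in enumerate(values):
        if not missing(v):
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            t = (i - left) / (right - left)  # position of the gap between neighbours
            out[i] = values[left] + t * (values[right] - values[left])
    return out


filled = fill_missing_linear([1.0, None, 3.0, float("nan"), float("nan"), 6.0])
print(filled)
```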

Features extraction
The features considered in this research are time- and frequency-domain features (Q4), as presented in Table 3. Prior to the computation of the features, the data were segmented into smaller segments to reduce computation time [18]. A sliding window technique was used, with an optimized window size of 40 ms selected for the time-domain feature extraction. For each 40 ms window, the following procedure was adopted: (1) ten time-domain features were extracted, and (2) eight frequency-domain features were extracted based on the FFT. This procedure was implemented to prevent foreign frequency components in another window from appearing in the window being considered. A 50% window overlap was implemented for both the time and frequency domains in computing the features. This was to prevent the signal from being lost at the edges of the windows. The mean, standard deviation, variance, harmonic mean, zero-crossing rate, kurtosis, skewness, RMS, sum of the absolute values and mean absolute value were computed for each axis of the sensors in the time domain, while the median, sum of absolute values, energy spectral density, total energy, harmonic mean and RMS values [19-21] were calculated for the frequency domain. The minimum and maximum values of the spectral coefficients in the frequency domain were also analyzed. The window size and overlap chosen for feature computation did not lead to a significant reduction in the row sizes of the features, so the whole length of 10.5 s was used for the data in both training and testing. This can be understood in the sense that an activity must be initiated and completed (or run for at least a considerable duration, as per this work) for it to be accurately classified. So, using instantaneous or point data to classify an activity will not reflect reality.
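A minimal sketch of the per-window feature computation is given below, in Python rather than the MATLAB used for this work. The function names are hypothetical, a direct discrete Fourier transform stands in for the FFT (the two give the same coefficients), population variance is assumed, and the harmonic mean and energy spectral density are left out because the text does not state how they were defined for signed samples.

```python
import cmath
import math
import statistics


def time_features(w):
    """Time-domain statistics of one window (list of floats, length >= 2)."""
    n = len(w)
    mean = sum(w) / n
    var = sum((x - mean) ** 2 for x in w) / n  # population variance (assumed)
    std = math.sqrt(var)
    crossings = sum(1 for a, b in zip(w, w[1:]) if a * b < 0)
    return {
        "mean": mean,
        "std": std,
        "variance": var,
        "zero_crossing_rate": crossings / (n - 1),
        "skewness": sum((x - mean) ** 3 for x in w) / (n * std ** 3) if std else 0.0,
        "kurtosis": sum((x - mean) ** 4 for x in w) / (n * std ** 4) if std else 0.0,
        "rms": math.sqrt(sum(x * x for x in w) / n),
        "sum_abs": sum(abs(x) for x in w),
        "mean_abs": sum(abs(x) for x in w) / n,
    }


def dft_magnitudes(w):
    """Magnitudes of the discrete Fourier transform, computed directly (O(n^2))."""
    n = len(w)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(w)))
        for k in range(n)
    ]


def frequency_features(w):
    """Statistics of the spectral magnitudes of one window."""
    mags = dft_magnitudes(w)
    n = len(mags)
    energy = sum(m * m for m in mags) / n  # equals sum(x^2) by Parseval's theorem
    return {
        "median": statistics.median(mags),
        "sum_abs": sum(mags),
        "total_energy": energy,
        "rms": math.sqrt(energy),
        "min": min(mags),
        "max": max(mags),
    }


window = [1.0, -1.0, 1.0, -1.0]  # one illustrative 4-sample window
tf = time_features(window)
ff = frequency_features(window)
print(tf["rms"], tf["zero_crossing_rate"], ff["total_energy"])
```

In a full pipeline these two functions would be applied to every window of every sensor axis, and the resulting values concatenated into one feature row per window.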
Properly normalizing (standardizing) and selecting features prior to the training of any model helps the training to converge faster and prevents the overfitting of the trained model. Based on this, the features were normalized by the Z-norm technique: Z = (v − m)/std, where Z is the normalized data point, v is the raw data point, and m and std are the mean and standard deviation of the feature data, respectively. This step placed all features on a common scale with zero mean and unit standard deviation (which reduced the computation time and made the algorithm converge sooner), solved the problem of biasing of the data and helped prevent the overfitting of the trained model (which may result if the hypothesis function is more complex than the target function). Due to the complexity and the highly uncorrelated nature of the features, statistical approaches using the chi-square, MRMR and ReliefF techniques were used for feature hierarchy. The best twenty features were selected to train the SVM and DT when the time and frequency domains were considered separately, while the best forty features were used when the combination of the time and frequency domains was considered. These features were the subsets of features intersecting in all three statistical tests.
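The normalization and the intersection of the three feature rankings can be sketched as follows. This is an illustrative Python sketch, not the MATLAB implementation; the function names, the feature names and the example rankings are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def z_normalize(column):
    """Apply Z = (v - m) / std to one feature column."""
    m = statistics.fmean(column)
    s = statistics.pstdev(column)  # population standard deviation (assumed)
    if s == 0:
        return [0.0] * len(column)  # constant feature: nothing to scale
    return [(v - m) / s for v in column]


def intersect_top_k(rankings, k):
    """Keep the features that appear in the top k of every ranking.

    rankings maps a test name to a list of feature names, best first.
    The result follows the order of the first ranking.
    """
    tops = [set(r[:k]) for r in rankings.values()]
    common = set.intersection(*tops)
    first = next(iter(rankings.values()))
    return [f for f in first if f in common]


z = z_normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Made-up rankings, for illustration only.
rankings = {
    "chi_square": ["rms", "mean", "zcr", "variance"],
    "mrmr": ["mean", "rms", "kurtosis", "zcr"],
    "relieff": ["rms", "zcr", "mean", "skewness"],
}
selected = intersect_top_k(rankings, k=3)
print(z, selected)
```

After standardization the values are centered on zero and can be negative, which is why the intersection step works on ranks rather than on raw feature magnitudes.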

Implementation of the algorithm
Classification algorithms have been found to yield high accuracies in human activity detection [19,21]. Relying on this, a multi-sensor classification of human activities is performed in this work (Q3). To accomplish this task, the decision tree and support vector machine classification algorithms are implemented. Classification accuracy and the N-way ANOVA test (Table 5) were used as evaluation techniques in this work. The former is a type of evaluation technique associating both precision and recall and is very appropriate in evaluating human activity classifications [22]. Two of the three repetitions from each class were used as training sets and the other one as the testing set. For validation between the repetitions, the sets were permuted between training and testing as shown in Table 2 and the average accuracy was computed as shown in Table 4. Classification decisions were made at the end of 10.5 s (the entire duration of the activity). The accurate performance of the proposed models and the reduction of overfitting were ensured by applying a tenfold cross-validation on each of the repetitions used for training [23]. This has been found to be an effective method of preventing the overfitting of the model.
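The validation scheme described above, in which each subset is held out once and a tenfold cross-validation is run on the training data, can be sketched as follows. This is a Python illustration with hypothetical function and subset names; it only generates the splits and does not train a model. The interleaved assignment of samples to folds is an assumption.

```python
def rotate_subsets(subset_names):
    """Yield (train, test) pairs in which each subset is held out exactly once."""
    for test in subset_names:
        train = [s for s in subset_names if s != test]
        yield train, test


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Samples are assigned to folds in an interleaved fashion (an assumption).
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held in range(k):
        val = folds[held]
        train = [i for j, f in enumerate(folds) if j != held for i in f]
        yield train, val


splits = list(rotate_subsets(["set1", "set2", "set3"]))
folds = list(k_fold_indices(25, k=10))
print(splits)
print(len(folds), folds[0][1])
```

The accuracy reported for a model would then be the mean of the test accuracies over the three rotations.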

Results
Having addressed the four questions that we set as the basis of this work, the following subsection elaborates the outcome that was achieved. Initially, the performances of the sensors (acceleration, gyroscope, light, orientation, gravity) were analyzed separately. The models were trained with the 20 best feature columns out of the 306 original feature columns. The average training accuracy obtained with the SVM on the time-frequency features is 97.8% (the share of accurately classified activities in this work), with a testing accuracy of 89.1%. Table 5 shows the ANOVA test results. It can clearly be seen that the probability value in the table is very small (close to zero). This justifies the rejection of the null hypothesis, which states that the results obtained may be due to chance. The F value indicates clearly that the between-group variation of the final features is very high compared to the within-group variation, hence ensuring the effective separation of the classes. To ensure that the accuracy obtained by the SVM is unbiased, the decision tree algorithm was also implemented, with training and testing accuracies of 100% and 89.1% obtained, respectively. The confusion matrices of these results are shown in Fig. 3.
Two misclassifications are observed, with class 3 being classified as class 2 (positive predictive value (PPV) of 90% and false discovery rate (FDR) of 10%), and all the other classifications are correctly predicted (PPV of 100% and FDR of 0%). Separate results obtained with the time-domain features, the frequency-domain features and the combination of features of both domains are presented in Table 4a-c, respectively. Statistical tests as proposed in [24-27] were considered for the twenty best intersecting features from the chi-square, MRMR and ReliefF tests for all the domains, except for the combination of both domains, where the forty best features were considered. For each feature domain, two sets of data from each class were used for training and one for testing, with a tenfold cross-validation and optimization by grid search. The decision tree shows consistent training and testing accuracies across all the feature domains, in contrast to the SVM, as can be seen in Table 4, showing that the decision tree is a better algorithm for this work. The training and testing accuracies are slightly higher with the combination of the time- and frequency-domain features (Q4) than when the domains are considered separately. The confusion matrices in Fig. 3 present the classification results for both classifiers.
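The positive predictive value and false discovery rate quoted above are column-wise quantities of a confusion matrix, as the Python sketch below shows. The function name is hypothetical and the counts in `cm` are made up for illustration; they are not the values of Fig. 3, although they reproduce the pattern described in the text (two class 3 observations predicted as class 2).

```python
def ppv_fdr(confusion):
    """Return a (PPV, FDR) pair for each predicted class.

    confusion[i][j] is the number of observations of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    result = []
    for j in range(n):
        predicted = sum(confusion[i][j] for i in range(n))  # column total
        correct = confusion[j][j]
        ppv = correct / predicted if predicted else float("nan")
        result.append((ppv, 1.0 - ppv))
    return result


# Illustrative counts only: rows are true classes 1-3, columns are predictions.
cm = [
    [20, 0, 0],
    [0, 18, 0],
    [0, 2, 18],
]
for cls, (ppv, fdr) in enumerate(ppv_fdr(cm), start=1):
    print(f"class {cls}: PPV={ppv:.0%} FDR={fdr:.0%}")
```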

Analyses
The objectives of this work were anchored on examining the following:
1. Whether several crime scene activities could be mimicked, for data emanating from them to be collected by mobile sensors, that would be used to train predictive models.
2. Whether data from static and dynamic crime scene activities could be used simultaneously to train a predictive model.
3. Whether these crime scene activities could be successfully classified using data from more than three sensors without loss of accuracy.
4. Whether the application of time- and frequency-domain features would improve the classification accuracy of these activities.
These objectives were addressed, respectively, as follows:
1. Three classes constituted from ten activities have been successfully mimicked, and the smartphone sensor data collected from them have been successfully used to train two predictive models. This serves as a genuine argument for the veracity of this work, as the results are comparable to others obtained with real-time data in [4].
2. As shown in Table 2, two out of the ten activities were static. This had little biasing impact on the accuracy because the bias was largely mitigated by the distribution technique adopted between the three classes. This increased the variability of the data.
3. Contrary to most of the related works, data from five different sensors have been used in this work for the model training, with an acceptable accuracy.
4. Combining the time- and frequency-domain features slightly increases the average accuracies of the models, as shown in Table 4c.
A comparison of the models proposed in this work was made with related works, as shown in Table 1. Based on the results in the table, it can be argued that:
1. The accuracy obtained in this work is lower than that obtained by Sukhada et al. [6], since that work focuses solely on binary classification (suspicious or non-suspicious activities) while this work took into account the static and dynamic natures of multiple activities in classification, involving ten activities grouped into three classes.
2. The accuracy obtained is clearly higher than all the other accuracies, even though Köping et al. [3] used up to eight different sensor data to classify five activities.
3. Overall, we verified the four claims that were set for this work by addressing them simultaneously, which was not done by the works stated for comparison in Table 1.
The intention was to test the proposed model with the data used in the literature listed in the comparison table, especially that in [6]. However, the disparities in the number of classes, the number of sensor data and the types of activity classified did not provide a common framework for this to be done. A possible future work will be to extend the binary classification technique mentioned in [6] to multi-class classification.

Parameter tuning
In order to obtain the optimal accuracy results presented above, the following parameter tunings were done:

Hyperparameters
The hyperparameters of the DT and SVM models were optimized by the grid-search method with ten divisions at the window size of 40 ms. These hyperparameters are listed in Table 6. To analyze and confirm the effects of the regularization parameter (Cp) on the accuracy, as depicted in Fig. 4b, all the other parameters were kept constant and Cp was varied. Cp is a very important hyperparameter because it gives information on misclassification and model fitting (overfitting/underfitting). The Cp value was varied from 10 to 50 in steps of 5. The value of 20 was retained for the error (misclassification) term because the accuracy it yielded converged with the training accuracy and gave a good trade-off at the decision boundary, while the other values were rejected because they were prone to overfitting.
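The Cp sweep described above can be sketched as a one-dimensional grid search. The Python sketch below is illustrative only: `sweep_cp` and `toy_evaluate` are hypothetical names, the selection rule (smallest gap between training and validation accuracy, ties broken by higher validation accuracy) is one possible reading of "converged with the training accuracy", and the toy scoring function returns made-up numbers in place of training a real SVM.

```python
def sweep_cp(evaluate, start=10, stop=50, step=5):
    """Evaluate each candidate Cp and return (best_cp, table).

    evaluate(cp) must return (train_accuracy, validation_accuracy).
    The chosen Cp has the smallest train/validation gap, with ties
    broken by the higher validation accuracy (an assumed criterion).
    """
    table = {cp: evaluate(cp) for cp in range(start, stop + 1, step)}
    best = min(table, key=lambda c: (abs(table[c][0] - table[c][1]), -table[c][1]))
    return best, table


def toy_evaluate(cp):
    """Hypothetical stand-in for training and scoring an SVM at a given Cp."""
    train = 0.95 + 0.001 * cp          # training accuracy creeps up with Cp
    val = 0.95 - 0.0005 * (cp - 10)    # validation accuracy drifts down
    return train, val


best_cp, table = sweep_cp(toy_evaluate)
print(best_cp, list(table))
```

In practice `toy_evaluate` would be replaced by a function that fits the classifier with the given Cp, all other hyperparameters fixed, and returns the two accuracies.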

Evaluation of performance and robustness
We evaluated the performance and robustness of the proposed model by a fair comparison with the results from Sukhada et al. [6], because it is the closest work dealing with HAR in forensics. We used their dataset as independent data to evaluate and validate our results. We computed the frequency-domain features of their data. Only one experiment was done in each of the time domain, frequency domain and time-frequency domain because their data were not recorded in triplicate (as was the case for our data). Slight degradations in accuracy were observed. These could be because of the fundamental difference in nature between their dataset and ours. Table 7 presents the results obtained.

Discussion
For the first time, several static and dynamic real-life crime scenes involving human behaviors and activities have been mimicked and classified with good accuracy. At the time this work was done, and to the best of our knowledge, it was the only work known to have used this dataset. The data emanated from five different sensors and the features used for classification were both time- and frequency-domain features. This research focused on experiments based on the four questions set at the opening of this paper. These questions have been addressed with the use of a dataset on which no prior paper has been published to date. With this dataset, a better accuracy was obtained compared to the closest work to this one, done by Sukhada et al. [6]. That paper focused mainly on binary classification of suspicious against non-suspicious activities for digital forensics. Nevertheless, the accuracy obtained needs improvement. This accuracy would have been higher but for the following factors: (1) we did not collect the data ourselves, so we could not be fully sure of its genuineness; (2) using the feature selection methods in the time domain as analyzed in [28,29] would have enhanced the results obtained in the time domain; (3) the effect of the segmentation window size [30] on the accuracy was analyzed only in the time domain, while this effect was assumed for the frequency domain; (4) the smartphone was placed on the arm of the subject under test during data collection, which is not the optimal positioning according to the studies undertaken by Nweke et al. [31], who showed that the optimal positioning of the phone is on the chest; (5) the imbalance of the different classes of the dataset also contributed to the biased classification of the classes with more observations (repetitions) to the detriment of the ones with fewer observations; (6) no information on the adversity of the weather was mentioned, since the data were from mimicked activities; and (7) it was noticed that the data from all the sensors were not synchronized in time. A better performance would have been achieved if the synchronization had been done before the data were used for classification, as suggested by Jahangiri et al. [13]. These factors might have compromised the quality of the data in one way or another, so that the data do not properly reflect reality. Attempts to handle these problems constitute the basis of future work on this dataset.

Conclusion and perspectives
Ten recurrent human activities in suspicious or crime scenes have been classified in this work. SVM and decision tree algorithms have been implemented, with average training accuracies of 97.8% and 100% obtained with the SVM and DT, respectively, and a testing accuracy of 89.1% for both algorithms. A complete pipeline addressing our four objectives has been achieved (though not fully for objective Q2, due to imbalanced data), associated with a trained model encompassing complete model selection and regularization, feature hierarchy and selection. The accuracies obtained have complemented the works mentioned in the literature of this work, most of which are based on bagging and majority voting. The trained models could be used in forensic science to accurately classify the activities that could have been done by a suspect or done to a victim. This accuracy would be a convincing factor for machine-aided human activity classification to be accepted in legal procedures by judges. Pending its improvement, this work will serve as a starting point in the analysis of a complex dataset emanating from HAR and the training of more complex models for classification. An extension of this work would be to place smartphones at several locations on the body of the subjects during activity mimicking and to use data from all of them (and synchronize them) to have more precise and unbiased data. A multi-sensor stacking ensemble algorithm could then be used in model training to obtain better results.

Fig. 3 Confusion matrices of the decision tree and SVM classification results using the time-frequency features with sets 1 and 3 as training sets and set 2 as testing set: a, c classification of observations for the DT and SVM, respectively; b, d the positive predictive values and false discovery rates for the DT and SVM, respectively

Fig. 4 Optimization of the SVM: a plot of training and testing accuracies against window size and b plot of testing accuracy against the error term

Table 1
Related works

Table 2
Classes of activities and their descriptions

Table 3
Features extracted: ten time-domain features and eight frequency-domain features based on the FFT

Table 4
(a) Training and testing accuracies with time-domain features, (b) training and testing accuracies with frequency-domain features, (c) training and testing accuracies with time-frequency-domain features

Table 6
(a) Tuned hyperparameters for the decision tree, (b) tuned hyperparameters for the SVM

Table 7
(a) Evaluation training and testing accuracies with time-domain features, (b) evaluation training and testing accuracies with frequency-domain features, (c) evaluation training and testing accuracies with time-frequency-domain features