Smartphone Inertial Measurement Unit Data Features for Analyzing Driver Driving Behavior

Driving behavior is an important aspect of maintaining and sustaining safe transport on the roads. It also directly affects fuel consumption, traffic flow, public health, and air pollution, along with psychology and personal mental health. For advanced driver assistance systems (ADASs) and autonomous vehicles, predicting driver behavior helps to facilitate interaction between the ADAS and the human driver. Consequently, driver behavior prediction has emerged as an important research topic and has been widely investigated during the past few years. Often, the investigations are based on simulators and controlled environments. Driving behavior can be inferred using control actions, visual monitoring, and inertial measurement unit (IMU) data. This study leverages the IMU data recorded using a smartphone placed inside the vehicle. The dataset contains the accelerometer and gyroscope data recorded in a real traffic environment. Extensive experiments are performed regarding the use of different sets of features, the combination of original and derived features, and binary versus multiclass classification problems; a total of six scenarios are considered. Results reveal that “timestamp” is the most important feature and that using it with accelerometer and gyroscope features can lead to 100% accuracy for driver behavior prediction. Without the “timestamp” feature, the number of wrong predictions for the “slow” and “normal” classes is high due to the feature space overlap. Although derived features can help elevate the performance of the models, the resulting models remain inferior to those that use the “timestamp” feature. Deep learning models tend to show poorer performance than machine learning models, where random forest and extreme gradient boosting machines show 100% accuracy for multiclass classification.

3) Intensity Difference: Drivers respond with different intensity to positive and negative relative speeds even under otherwise identical conditions, such as the same speed, similar magnitudes of relative speed, and the same gap between the following and leading vehicles. In addition, analysis of the next-generation simulation (NGSIM) data shows that driving behavior differs between deceleration and acceleration. The experimental results reported in [9] indicate that the reaction time in acceleration is different from that in deceleration. Moreover, various driving profiles have been acknowledged in existing research regarding road traffic safety. In particular, aggressive driving has been comprehensively studied in many research articles that focus on identifying braking events and harsh acceleration [10].
Road transportation is commonly used to travel from one location to another. The emergence of technologies, such as the Internet of Things (IoT), computer vision, wireless communication, and artificial intelligence, enables smart transportation with advanced capabilities for safe traveling. The wide adoption of advanced technology-enabled vehicles for transportation is still in progress. Meanwhile, road accidents are unlikely to stop soon. The World Health Organization (WHO) reported that more than one million people are killed and around 50 million people are injured by road accidents every year [11]. The road accident trend is predicted to increase over the next few years, and road accident deaths are expected to become the fifth leading cause of death by 2030 [12]. The majority of road accidents are caused by human driving behavior. Although autonomous driving and advanced safety monitoring capabilities are incorporated into vehicles, there is no guarantee that the driver is safe unless the driving behavior is normal. Driving behavior may have a direct impact on public health, traffic flow, air pollution, and environmental conditions. Thus, there is a need to analyze driving behavioral patterns and understand individual driving habits so that safe driving recommendations can be provided to the users.
This study leverages the data from the inertial measurement unit (IMU) of a smartphone that is placed in a car and predicts driving behavior into several categories. In this regard, this study makes the following key contributions.
1) The importance of feature selection from the IMU data is investigated for driving behavior prediction. The influence of using different original and derived features and the impact of binary versus multiclass classification are investigated in this study. For driver behavior prediction, six cases are considered, including binary classification, accelerometer features alone, gyroscope features alone, accelerometer and gyroscope features combined, all features combined, and accelerometer and gyroscope features plus derived features without using the "timestamp" feature.
2) Extensive analysis of prediction performance regarding driver behavior is carried out using the data recorded in a real traffic environment. The dataset is recorded using a smartphone placed in the vehicle in a fixed position, and readings from the accelerometer and gyroscope are recorded.
3) Experiments involve five well-known machine learning (ML) models and two deep learning (DL) models. These models are random forest (RF), extreme gradient boosting (XGBoost) machine, support vector classifier (SVC), extra tree classifier (ETC), logistic regression (LR), long short-term memory (LSTM), and convolutional neural network (CNN). Performance is analyzed with several metrics, such as accuracy and precision, in addition to the standard deviation and the number of correct and wrong predictions.
The rest of this article is organized as follows. Section II discusses the state-of-the-art ML-based approaches to address the driving behavior prediction problem. Section III presents the proposed methodology to accurately predict driving behavior and provides a description of the dataset. Section IV includes the results from an extensive set of experiments for driving behavior, as well as a discussion of the results. Section V concludes this article.

II. LITERATURE REVIEW
Recent developments in driver assistance and autonomous vehicles (AVs) have led to a great deal of research and development. Consequently, a large body of literature can be found on different aspects related to driving. For example, the role of trajectory data and its critical applications for microscopic modeling has been discussed in detail in [13].
In the last few years, experiments have been performed on openly accessible trajectory datasets, and reports have been published on several traffic flow phenomena. Comprehensive empirical analyses have been reported, including traffic oscillations [14], traffic hysteresis [15], and heterogeneity [16]. In addition, various models have been presented for a better approximation of lane-changing behavior [17] and car-following behavior.
Conventionally, the trajectory data are collected using an image processing technique that is based on videos recorded from either drones or fixed cameras. Currently, driving datasets are getting attention due to the demand for AV technology. The main purpose is to comprehend the challenges of computer vision systems in a self-driving context. In addition, vehicle-based techniques detect the vehicle operating parameters, including changes in steering, speed of the vehicle, acceleration, lane tracking, and braking. On the other hand, driver-based techniques are based on devices that directly monitor the condition of the driver. They rely on physical movement parameters, such as blink ratios and eye closure ratios, and on facial expression tracking with video imaging methods. The most famous trajectory dataset is possibly the NGSIM database [18], which has a total duration of 150 min from fixed cameras at four different sites. Another famous dataset is the highD dataset, which contains videos from camera-equipped drones and has a total duration of 16.5 h at six locations on the highways of Germany [19]. The driving situations presented in highD and NGSIM are quite limited. The NGSIM dataset includes signalized intersection and highway driving scenarios. However, interactions controlled by traffic lights are rare.
Currently, many new datasets concerning vehicles with high-level automation are available [20]. For example, Argo [21], KITTI [22], BDD100K [23], Lyft Level 5 AV [24], Waymo open [25], and nuScenes [22] contain data for AVs and similar driver assistance systems. Among these, the Lyft Level 5 AV, nuScenes, and Waymo open datasets combine trajectories of AVs and human-driven vehicles from real-world traffic. Moreover, these datasets are referred to as AV-oriented empirical datasets. Therefore, these datasets are mainly helpful for driving behavior research. In addition, these AV-oriented empirical databases are sophisticated, which helps in understanding complicated driving behaviors but also makes them harder for traffic flow researchers to understand and use. First, these datasets are collected using an array of sensors; for example, light detection and ranging (LiDAR), a novel sensor to record traffic flow, is used. Second, each dataset combines several sensors and is more sophisticated than a conventional dataset. It records not only comprehensive information on the movement of AVs but also a vast amount of information on all objects in the vicinity of the vehicle. Finally, the format and structure of these datasets are not user-friendly. Moreover, BDD100K contains ten tasks, namely, lane detection, image tagging, drivable area segmentation, semantic segmentation, road object detection, instance segmentation, multiobject segmentation tracking, multiobject detection tracking, imitation learning, and domain adaptation.
Ferreira et al. [26] performed a survey on driving behavior improvement using ML and DL models. The study revealed that the combination of sensors and intelligent methods improves the performance of driving behavior classification. The study [27] designed a driving behavior detection method for identifying rash drivers. The contributions in this article mainly include the architectural aspects of a system to build the driving behavior identification, including the monitoring system. However, the authors did not evaluate the driving behavior using the ML and DL models.
Osman et al. [28] presented a two-level hierarchical classification of driver activity while driving. Five input features, speed, longitudinal acceleration, lateral acceleration, pedal position, and yaw rate, are considered for testing the driving behavior classification. The driver's secondary task while driving is detected in the first level. Then, the different types of secondary tasks are categorized in the second level. The ML-based decision tree achieved the best results with an accuracy of 99.8% in classifying the driver's secondary tasks. The study [29] proposed a LightGBM model to detect abnormal driving behavior. The accelerometer and gyroscope sensor data are the input features to predict driving behavior. The authors reported that LightGBM achieved 82% accuracy on the test dataset. The classification accuracy still needs to improve for better driving behavior detection.
The study [30] proposed a 2-D CNN technique to analyze the driving behavior. The sensor data, such as acceleration, gravity, revolutions per minute, speed, and throttle, are used as features to construct an input image. The output is classified by the 2-D CNN into five types, namely, normal, aggressive, distracted, drowsy, and drunk driving. The authors reported that the proposed method obtained good results in predicting driving behavior.
The study [31] explored multiclass gait classification with ML approaches, including k-nearest neighbors (KNN), extreme learning machines (ELMs), SVM, and multilayer perceptron (MLP), and evaluated the performance for multiclass gait classification. The presented approach achieved the best results. The ELM is introduced to analyze the neuromuscular mechanics that is associated with the brain of patients suffering from multiple strokes and sclerosis. In addition, an artificial neural network (ANN) is applied to classify the human gait and its performance is compared with the ELM. A DL ensemble technique is used for human lower activities recognition to capture the learning process of bipedal robot locomotion in [32]. The LSTM and CNN models are used to classify these activities. In [33], a multibranch CNN-bidirectional LSTM (BiLSTM) network is applied for automatic feature extraction from raw sensor data with minimum data preprocessing.
Predominantly, existing literature on driver behavior prediction is based on ML algorithms; however, the use of non-ML architectures is also observed. For example, Oliver and Pentland [34] used the hidden Markov model (HMM) and coupled HMM (CHMM) for driver behavior prediction. Combined with car and traffic data, promising results are obtained regarding different driver actions. It is believed that ML models are black boxes, and it is not clear how predictions are made from such trained models. Consequently, several studies prefer non-ML models. The study [35] leverages rule-based models for driver behavior prediction. These models maintain long-term coherence and are easy to interpret.
The study [36] utilizes an autoregressive input-output HMM (AIO-HMM) for driver behavior prediction. The focus is especially placed on driver behavior at intersections, and driver gaze and traffic light recognition are used for that purpose. Similarly, Xu et al. [37] determined aggressive driver behavior by using multivariate-temporal features and driver's intention using HMM.
The above-discussed research studies have several shortcomings. First, the datasets containing smartphone sensor data are not very well studied for analyzing driver driving behavior. Second, although several studies utilize these datasets for driver behavior prediction, the impact of feature combination and ML techniques is not well covered in the literature. Third, the context of the dependency between the features and the prediction output is not explored very well. Last but most important, the driving behavior prediction accuracy can be improved for existing works. Keeping in view these research gaps, this study proposes a highly accurate, ML-based driving behavior solution with extensive performance analysis for driver behavior.

III. PROPOSED METHODOLOGY
In this section, we discuss the proposed methodology for driver behavior prediction. We used several ML models to predict driver behavior as "slow," "normal," or "aggressive." Fig. 1 shows the flow of the proposed methodology. This study also leverages DL models for driving behavior prediction. The selection of DL models is based on the results reported in the existing literature. For example, LSTM and CNN models are commonly used on similar kinds of datasets as in [38] for human behavior prediction and in [39] for human activity detection. Similarly, Zhang et al. [40] used variants of CNN and LSTM for driver behavior detection.
First, we acquire the dataset from the Kaggle repository. The dataset consists of samples related to three target classes: "slow," "aggressive," and "normal." After acquiring the dataset, we find that the dataset features are not strongly correlated with the target classes, which does not help the ML models to achieve a significant accuracy. Feature engineering steps are therefore included in our proposed methodology to improve the performance of the ML models. In feature engineering, we generate new (derived) features from the original features to train the learning techniques. The data are split into training and testing subsets for training several ML models. We split the dataset with an 80:20 ratio, where 80% of the samples are used for training the models and 20% of the samples are used for testing them. In the end, testing and validation are performed. We evaluate all models in terms of accuracy, precision, recall, and F1 score.
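The split-and-evaluate step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the random data, the stratified split, the seed values, and the macro averaging of precision, recall, and F1 score are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the IMU data: 6 sensor columns and 3 classes
# (0 = slow, 1 = normal, 2 = aggressive). This is NOT the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.integers(0, 3, size=300)

# 80:20 split; stratify and random_state are assumptions, not taken from the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# The four metrics named in the text; macro averaging is an assumption.
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Because the labels here are random, the printed scores are near chance level; the sketch only shows the mechanics of the 80:20 protocol.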

A. Dataset Description
The mobile-sensor-generated driving behavior dataset (DBD) is obtained from Kaggle [41]. The "Sensorrecords" mobile application was used to capture the sensor data observations. This dataset is used by many recent studies [42], [43], [44]. The three dimensions of the accelerometer and gyroscope sensor observations are the main dataset features. The combination of accelerometer and gyroscope sensors helps to effectively track movement behavior. The accelerometer captures the linear acceleration along each axis, whereas the gyroscope captures the rate of rotation about each axis. The timestamp is also included as a feature in the dataset. The DBD was collected using mobile sensing technology with the accelerometer and gyroscope sensors of the mobile enabled while the user is driving the vehicle. Table I shows the input and output feature set along with measurement metrics and the dataset count. The driving behavior output is designated as "normal," "slow," and "aggressive" driving. Normal driving behavior denotes that the driver maintains a constant speed and is aware of the surroundings. Slow driving may include low-risk driving behavior, essentially fearful or overcautious driving. The aggressive driving category includes unusual driving behavior with sudden braking and acceleration of the vehicle, unexpected lane-changing behavior, and unfocused driving due to eating, texting, and so on. The dataset consists of 3084 samples, with 1273 samples for the slow class and 997 and 814 samples for the normal and aggressive classes, respectively. A sample of the dataset is shown in Table II.
The original dataset consists of three accelerometer features, three gyroscope features, and a timestamp. These features are not strongly correlated with the target classes, so to improve the accuracy of the models, we generate additional features that are more correlated with the target classes. Fig. 2 shows the sample values of the accelerometer and gyroscope data for each of the three classes. We find that several values from the "normal" and "slow" target classes are similar, which makes it difficult for the learning models to distinguish these targets based on sample values.
Along with the x-, y-, and z-axis values for both the accelerometer and gyroscope, the dataset also contains a timestamp attribute. The histogram distribution of all these attributes is presented in Fig. 3.
To analyze how these features correlate with the target classes, RF is used, and the results are shown in Fig. 4. It can be observed that the features have different levels of correlation.
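One common way to obtain such a per-feature ranking from RF is its impurity-based feature importance. The sketch below assumes that this is the kind of analysis meant here; the column names are illustrative, and the synthetic labels are deliberately built from the last column so that one feature dominates, which is a property of the toy data and not evidence about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Column names are assumed for illustration.
feature_names = ["AccX", "AccY", "AccZ", "GyroX", "GyroY", "GyroZ", "Timestamp"]

rng = np.random.default_rng(1)
X = rng.normal(size=(400, len(feature_names)))
# Synthetic labels depend only on the last column, so it should rank first.
y = np.digitize(X[:, -1], bins=[-0.5, 0.5])  # three classes: 0, 1, 2

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name:10s} {score:.3f}")
```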

B. Feature Selection
We have seven features in the dataset used for driver behavior prediction. Not all features are important for the ML models. Thus, we design several scenarios (cases) with feature selection to assess the feasibility of using the features to obtain higher accuracy.
1) Case 1: Experiment with the gyroscope features only. In this case, the accelerometer features are excluded (see Section IV-A).
2) Case 2: Experiment with the accelerometer features only. In this case, the gyroscope features are excluded (see Section IV-B).
3) Case 3: Experiment without the timestamp feature and with two target classes. We find that several values for the normal and slow classes are similar, which creates complexity for the models. Thus, we combined both target classes as one (SLOW + NORMAL = SLOW). In this way, we convert the multiclass problem into a binary class problem (SLOW and AGGRESSIVE).
4) Case 4: Experiment without the timestamp feature and with three target classes. In this case, we used both the gyroscope and accelerometer features and excluded the timestamp feature. We used three target classes in this case (SLOW, NORMAL, and AGGRESSIVE).
5) Case 5: Experiment with the timestamp feature and three target classes. In this case, we used all features (three features from the gyroscope, three features from the accelerometer, and the timestamp feature) for model training with three target classes (SLOW, NORMAL, and AGGRESSIVE).
6) Case 6: Experiment with new (derived) features and without the "timestamp" feature for three classes. In this case, we used three gyroscope features, three accelerometer features, and four newly generated features, including mean, median, ProbRF, and ProbXGBoost. The mean feature is obtained by taking the mean of the gyroscope and accelerometer features. Similarly, we take the median of the gyroscope and accelerometer features. Two additional features, ProbRF and ProbXGBoost, are generated using the tree-based ensemble models, RF and XGBoost. We train three models on the whole dataset and then pass the whole dataset to them to obtain prediction probabilities. These prediction probabilities are used as features.
a) ProbRF: To derive the new features, we use ML models. The derived features are closer to the target, which guides the learning models toward more accurate predictions. We trained RF on the original features and obtained the prediction probability for the target classes for each sample. This prediction probability is included in the feature set. Formally, ProbRF is computed for each of the M samples of the dataset D, where M is the size of the dataset.
b) ProbXGBoost: We also used XGBoost to derive features. Similar to RF, we pass the original features to the model and obtain the prediction probability for the target classes for each sample, computed over the same M samples of D.
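The derived features of Case 6 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, scikit-learn's GradientBoostingClassifier stands in for XGBoost so that the example needs no extra package, and one probability column per class is kept, although the text's count of four new features suggests that each probability may have been collapsed into a single column.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))      # 3 accelerometer + 3 gyroscope columns (synthetic)
y = rng.integers(0, 3, size=200)   # 0 = slow, 1 = normal, 2 = aggressive

# Row-wise statistics over the six sensor readings of each sample.
mean_feat = X.mean(axis=1, keepdims=True)
median_feat = np.median(X, axis=1, keepdims=True)

# Probability features: fit on the whole dataset, then score the same rows,
# as the text describes. The boosting model is a stand-in for XGBoost.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
prob_rf = rf.predict_proba(X)      # shape (n_samples, n_classes)
prob_gb = gb.predict_proba(X)

# Original features followed by the derived ones.
X_derived = np.hstack([X, mean_feat, median_feat, prob_rf, prob_gb])
print(X_derived.shape)
```

Because the models score the same rows they were fitted on, the probability columns carry label information. A more conservative variant, not described in the text, would use out-of-fold probabilities, for example via scikit-learn's `cross_val_predict` with `method="predict_proba"`.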

C. ML Models
We used several ML algorithms for driver behavior prediction. We used RF, XGBoost, SVC, ETC, and LR with their best hyperparameter settings. We find the best hyperparameters by tuning each model within a specific range.
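Tuning a model within a range of values is often done with a grid search. The sketch below shows one such search for RF on synthetic data; the search ranges, the three-fold cross-validation, and the use of `GridSearchCV` are assumptions for illustration, since the text does not state which ranges or search procedure were used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 6))      # synthetic stand-in for the sensor features
y = rng.integers(0, 3, size=150)   # three synthetic classes

# Hypothetical search ranges; the ranges explored in the paper are not listed.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=27),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# Best combination found on this toy data and its mean cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```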
1) Random Forest: It is applied to both regression and classification problems. It is an ensemble model that uses the decision tree concept for classification. The bagging technique is applied to train a large number of decision trees on several bootstrap samples [45]. In addition, RF reduces overfitting problems by using a bootstrap technique for sampling. A bootstrap sample is drawn from the training dataset with replacement so that the sample and the training dataset are similar in size [46]. All classifiers that use decision trees for prediction apply the same methods to construct the trees. For this, attribute selection for the root node at every level is challenging during tree construction in RF [47]. In ensemble classification, different classifiers are trained, and all classifier results are integrated through a voting process. Many contributors have described various ensemble approaches; boosting and bagging are very well-known ensemble techniques [48]. In the bagging method, several classifiers are trained on bootstrapped samples, which reduces the variance of the classification. As shown in Table IV, we choose m_estimtr = 300 to obtain the best accuracy when using the voting method for combining the individual predictions. The maximum depth, mx_dpth, is set to 300 to limit tree complexity and reduce the probability of overfitting. The RF class prediction is obtained by a majority vote over the B decision trees, where B represents the number of decision trees.
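For reference, the majority-vote rule that the text describes is commonly written as below. The symbols $\hat{y}(x)$ for the predicted class, $T_b(x)$ for the class predicted by the $b$th tree, and $k$ for a class label are notation assumed here; only $B$ comes from the text.

```latex
\hat{y}(x) = \operatorname*{arg\,max}_{k} \sum_{b=1}^{B} \mathbb{1}\left[ T_b(x) = k \right]
```

Each tree casts one vote for a class, and the class with the most votes among the $B$ trees is returned.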
2) Extra Tree Classifier: It uses randomization as the base concept to construct trees [49]. For every node of an extra tree, the candidate split conditions are drawn randomly, and the best-performing rule, based on a score calculation, is selected for that node. This significantly reduces the complexity of the induction process and increases the training speed. In doing so, the correlation among the decision trees is reduced. The process of node splitting is easy, and the computational load of the algorithm drops because the ETC does not search for locally optimal cut points. The bagging process is not used, as the whole available learning set is provided to every decision tree [50]. As described in Table IV, the three parameters, rndm_state, mx_dpth, and m_estimtr, are chosen to be 27, 300, and 300, respectively.
3) Logistic Regression: It is a statistical technique that is applied for data analysis and uses one or more variables for outcome prediction. LR is applied to estimate the class membership probability and is well suited to a categorical target variable. To estimate the probabilities, a logistic function (LF) is used to model the relationship between the dependent and independent variables [45]. The "slvr" parameter is set to "newton-cg" because it can solve the multiclass classification problem. In addition, the multiclass parameter is set to "multinomial" because of the multiclass classification. "D" is set to 1. The "D" value is inversely proportional to the regularization strength and helps to reduce the overfitting probability [51]. The probability of predicting the class k, given the input sample $X_i$, is given by the multinomial logistic (softmax) function.
4) Support Vector Classifier: It is a linear model that is used for regression and classification and has many applications. SVC divides the sample data into different classes with a hyperplane or a set of hyperplanes in an m-dimensional space, where m is the number of features [52], [53]. SVC performs classification by finding the "best fit" hyperplane that separates the classes. This research uses a "linear" kernel for the support vector machine, which is frequently used when the dataset has many features. Training with the linear kernel is faster because only the D regularization parameter needs to be optimized. In Table IV, the D regularization parameter value is set to 3, and the rndm_state value is 500. The hyperplane is defined by a weight vector $w$ and a bias $b$, and the objective function is minimized such that $y_i (w \cdot x_i + b) \geq 1$ is satisfied for every training sample.
5) Extreme Gradient Boosting: The XGBoost model works in a way similar to the gradient boosting model. However, an additional feature is needed for assigning weights to every sample, as in the AdaBoost model [54], [55]. XGBoost is a tree-based classifier that has received much attention recently. XGBoost adds the decision trees in sequence but parallelizes the construction of each tree, which provides a speed boost. XGBoost has standardized methods to control overfitting, such as L1 and L2 regularization, and these methods are not available in the AdaBoost and GBoost models. The L1 and L2 regularization terms are denoted by α and λ, respectively. In addition, an extra key feature of gradient boosting is scalability. It helps to perform better on distributed systems and to process large-scale datasets. Moreover, it uses a log-loss function, which is very helpful for loss minimization and increasing accuracy. The log-loss function penalizes false classifications according to the predicted probabilities. In Table IV, the values of four parameters are set for XGBoost. The parameter m_estimtr = 300 implies that XGBoost uses 300 decision trees as base learners, which take part in the prediction process. The parameter mx_dpth = 300 restricts the growth of the trees to a maximum depth of 300. The lerning_ratio = 0.2 is used to control overfitting [55]. The rndm_state = 27 fixes the random seed given to every tree estimator at every boosting iteration. In addition, it controls the random permutations of features at every split.
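The hyperparameter values quoted above can be mapped to scikit-learn and XGBoost constructors as sketched below. The mapping of the abbreviated names in the text to library arguments is an assumption: m_estimtr is read as `n_estimators`, mx_dpth as `max_depth`, rndm_state as `random_state`, slvr as `solver`, "D" as the inverse regularization strength `C`, and lerning_ratio as `learning_rate`. The data are synthetic, and the XGBoost model is created only if the optional `xgboost` package is installed.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

try:  # xgboost is an optional third-party package
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None

models = {
    "RF": RandomForestClassifier(n_estimators=300, max_depth=300),
    "ETC": ExtraTreesClassifier(n_estimators=300, max_depth=300, random_state=27),
    # "D" in the text is read as C. With the newton-cg solver, recent
    # scikit-learn versions fit a multinomial model for multiclass targets.
    "LR": LogisticRegression(solver="newton-cg", C=1.0),
    "SVC": SVC(kernel="linear", C=3.0, random_state=500),
}
if XGBClassifier is not None:
    models["XGBoost"] = XGBClassifier(
        n_estimators=300, max_depth=300, learning_rate=0.2, random_state=27
    )

# Synthetic stand-in data with integer labels 0, 1, 2.
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 6))
y = rng.integers(0, 3, size=120)

for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))  # training accuracy on toy data
```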

IV. RESULTS AND DISCUSSION
In this section, a detailed description of the experimental results obtained using ML techniques and their analysis is presented. The experiments were run on a standalone Linux machine with 8-GB RAM and an eight-core processor. A notebook web application runs locally on the machine to perform the experiments. The scikit-learn software package (https://scikit-learn.org/stable/) was installed, and the Python programming language was used to write the code. The performance metrics, accuracy, precision, recall, and F1 score, are used to compare the experimental results. Accuracy is defined as the sum of the true positives (TPs) and true negatives (TNs) divided by the total number of predictions.

A. Driving Behavior Prediction Using Gyroscope Features (Case 1)
In this section, the driving behavior performance results when the accelerometer features are ignored are discussed. Table V shows the performance metrics for all five models when the accelerometer features are excluded.
The prediction accuracy of the selected ML models varied between 35% and 38%. It is evident that driving behavior prediction using only the gyroscope data performed poorly. The SVC and LR models performed slightly better than the decision tree-based models. In particular, the "slow" target class prediction shows promising results with 99% recall for both the LR and SVC models. These results also show that the "normal" class target is misclassified as a "slow" class target by both the SVC and LR models. Separating the normal and slow targets is challenging for these models. The precision and recall metrics follow a trend similar to that of the accuracy in the decision tree-based models. Overall, the performance metrics indicate that the accelerometer features are valuable for driving behavior classification and should be included for multiclass evaluation, as the performance of the models using the gyroscope data alone is not satisfactory. Table VI shows the consistency analysis of the accuracy, measured by the standard deviation. The standard deviation of the accuracy for all the models varies between 0.02 and 0.04. These results indicate that we can rely on the obtained accuracy values and do not expect significant variations when the experiments are repeated. Here, CP is the number of correct predictions and WP is the number of wrong predictions. Table VI also shows the correct and wrong prediction sample counts of the driving behavior target classes for each model. The LR model correctly classifies the most test samples, with 275 correct classifications. On the other hand, the ETC model correctly predicts the fewest test samples, with 247 correct predictions. We can conclude that the decision tree-based models perform worse than the LR and SVC models in correctly classifying the driving behavior.
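The CP and WP counts and the standard deviation of the accuracy can be computed as in the following sketch. It uses synthetic data, and the ten-fold cross-validation used to obtain the accuracy spread is an assumption, since the text does not state how the repeated runs were organized.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))      # synthetic stand-in for three gyroscope columns
y = rng.integers(0, 3, size=300)   # three synthetic classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

cp = int((y_pred == y_test).sum())  # CP: number of correct predictions
wp = int((y_pred != y_test).sum())  # WP: number of wrong predictions

# Accuracy spread over 10 folds; the fold count is an assumption.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy"
)
print(f"CP={cp} WP={wp} mean={scores.mean():.2f} std={scores.std():.2f}")
```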

B. Driving Behavior Prediction Using Acceleration Features (Case 2)
To evaluate the performance of driving behavior prediction under different feature combinations, we start with a feature set of 4 by excluding the gyroscope attributes and using only the accelerometer features. Table VII describes the performance of driving behavior prediction for the accelerometer features case. The five ML models, RF, LR, ETC, SVC, and XGBoost, are considered for our evaluation. Table VII clearly shows that none of the ML techniques performed well when the gyroscope features are excluded from the training dataset. All five models achieved similar prediction accuracy on the test dataset. The SVC and LR obtained 40% accuracy, whereas RF, ETC, and XGBoost achieved approximately 39% accuracy. A similar trend appears in the precision and recall metrics for all five models, except the recall for the "slow" class when LR and SVC are used. The LR and SVC report 84% recall for the "slow" target class. The recall of the "normal" target class is greatly reduced when the "slow" target class is predicted using LR and SVC. The macro average obtained for both the precision and recall metrics is almost the same for all five ML models. Overall, these results show that the gyroscope features are essential for predicting driving behavior and should not be ignored. The accuracy of the ML models is verified using the standard deviation measurement. Table VIII shows that the standard deviation of the accuracy for all five models is minimal and near zero. Thus, the accuracy results of the ML models are consistent.
The correct and wrong predictions on the test dataset samples for all five ML models are shown in Table VIII. The SVC model correctly predicts the highest number of test samples, with 291 correct predictions. On the other hand, the XGBoost and LR models show the fewest correctly predicted test samples, each with 278 correct predictions.

C. Driving Behavior Prediction Without Timestamp Features and With Two Target Classes (Case 3)
In the above test case, we have seen that it is difficult to discriminate between the "slow" and "normal" target classes, which reduces the prediction accuracy of the models. We therefore evaluate the models after combining the "normal" and "slow" target classes into one class and excluding the timestamp feature from the input dataset. Thus, the number of input features is 6, and the number of output classes is 2 in this scenario. Table IX presents the performance metrics of the models when two output classes are considered and the timestamp is excluded from the input dataset. The results indicate that all five models achieve 100% accuracy, precision, recall, and F1 score. Thus, for binary classification, the decision tree-based models as well as SVC and LR are able to classify the target classes even if the timestamp is not present in the input dataset. When the gyroscope and accelerometer features are used to train the models, the models classify "aggressive" and "normal" driving with 100% accuracy. However, separating "normal" from "slow" driving requires additional features to capture the driving behavior. Table X confirms that the accuracy is 100% for this dataset when the output is reduced to two classes, and no standard deviation is observed for this case. Table X also shows the number of correctly classified samples for all five models in the two-class setting. All 446 testing samples are correctly classified as either "normal" or "aggressive" driving.
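The preprocessing for this case can be sketched as follows, assuming each sample is stored as a record with six sensor readings, a timestamp, and a label. The column names and the sample values are hypothetical; the dataset's actual headers may differ.

```python
# Hypothetical column names; the dataset's real headers may differ.
SENSOR_COLUMNS = ["acc_x", "acc_y", "acc_z", "gyro_x", "gyro_y", "gyro_z"]
MERGE_INTO_NORMAL = {"slow": "normal"}


def to_binary_case(records):
    """Build (features, labels) for the two-class, no-timestamp setup.

    Each record is a dict holding the six sensor columns, a "timestamp",
    and a "label" in {"slow", "normal", "aggressive"}.
    """
    features, labels = [], []
    for rec in records:
        # Keep only the six sensor readings; the timestamp is dropped.
        features.append([rec[col] for col in SENSOR_COLUMNS])
        # Fold "slow" into "normal"; "aggressive" is left unchanged.
        labels.append(MERGE_INTO_NORMAL.get(rec["label"], rec["label"]))
    return features, labels


# Made-up readings, for illustration only.
records = [
    {"acc_x": 0.1, "acc_y": -0.2, "acc_z": 9.8, "gyro_x": 0.01,
     "gyro_y": 0.00, "gyro_z": 0.02, "timestamp": 1001, "label": "slow"},
    {"acc_x": 0.3, "acc_y": 0.1, "acc_z": 9.7, "gyro_x": 0.02,
     "gyro_y": 0.01, "gyro_z": 0.00, "timestamp": 1002, "label": "normal"},
    {"acc_x": 2.4, "acc_y": -1.9, "acc_z": 8.9, "gyro_x": 0.40,
     "gyro_y": -0.35, "gyro_z": 0.22, "timestamp": 1003, "label": "aggressive"},
]

X, y = to_binary_case(records)
print(len(X[0]), sorted(set(y)))
```

The resulting feature vectors have six entries each and the label set contains only "normal" and "aggressive", matching the input and output sizes stated for this case.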

D. Driving Behavior Prediction Without Timestamp Feature and Three Target Classes (Case 4)
In general, a timestamp feature might be expected to add little value to detection or classification with ML models. To test this, we excluded the timestamp from the input features and trained the models with six features, three each from the accelerometer and gyroscope. Table XI shows the performance metric values for the five selected models when the timestamp feature is excluded from the input features. The prediction accuracy for all five models is slightly better than in the previous two cases. However, the overall performance follows a similar trend to the last two cases. Except for the "slow" target classification using LR and SVC, the performance is nominal. The "aggressive" target classification precision for RF, LR, ETC, and SVC is also slightly improved compared to using the gyroscope or accelerometer features alone. Overall, based on the performance metrics obtained, a feature set suffers a performance loss when one of the features is excluded from the input. This may be because the classification is multiclass and the input features are not sufficient to distinguish the classes, in particular, the "normal" versus "slow" target classes. Table XII shows that the accuracy is consistent for all the models when the ML training and testing experiments are repeated, with a slight standard deviation between 0.03 and 0.04. Thus, we can confirm that excluding the timestamp has a consistent performance loss impact on the classification results.
Interestingly, RF performs slightly better than the other models when the timestamp feature is excluded from the input feature dataset. Table XII shows the counts of correctly and wrongly classified test samples for the models. RF correctly classifies 314 samples, whereas SVC correctly classifies the fewest samples, i.e., 278.
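The consistency check based on the standard deviation of the accuracy can be illustrated with a short sketch. The text does not state how the repeated runs are generated or whether the sample or population standard deviation is reported, so the population form is assumed here, and the accuracy values are hypothetical.

```python
from statistics import mean, pstdev


def summarize_runs(accuracies):
    """Return (mean, population standard deviation) of repeated-run accuracies."""
    return mean(accuracies), pstdev(accuracies)


# Hypothetical accuracies from five repeated runs of a single model.
runs = [0.41, 0.43, 0.40, 0.44, 0.42]
avg, spread = summarize_runs(runs)
print(f"accuracy = {avg:.2f} +/- {spread:.3f}")
```

A small spread relative to the mean indicates that the reported accuracy does not depend strongly on a particular train/test split.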

E. Driving Behavior With Timestamp Feature and Three Target Classes (Case 5)
Although we obtained 100% accuracy for driving behavior using binary classification, the best accuracy still needs to be achieved for multiclass classification. Thus, we use all the dataset input features and keep three target classes (normal, slow, and aggressive) for performance evaluation. Table XIII shows the performance metrics of the models when all the input features are included to train and test the models. We can see that the decision tree-based models obtain 100% accuracy for multiclass target classification. The precision and recall are also 100% for the decision tree-based models RF, ETC, and XGBoost. However, the LR and SVC models do not perform well for multiclass classification and obtain only 37% accuracy. These models cannot distinguish between the "normal" and "aggressive" target classes. Overall, the RF, ETC, and XGBoost techniques are well suited to driving behavior classification, in which the classes are not easily separated with mathematical computations. Table XIV reveals that the accuracy of the decision tree-based models is consistent when the experiments are performed multiple times. On the other hand, LR and SVC obtain low accuracy when the experiments are repeated.
As we can see in Table XIV, RF, ETC, and XGBoost correctly classify all 729 driving behavior samples into three classes. On the other hand, LR and SVC perform poorly, each with 458 wrong predictions.
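The reported counts and accuracies for this case can be cross-checked with a few lines of arithmetic. The snippet below uses only the numbers stated in the text (729 test samples, 458 wrong predictions for LR and for SVC, none for the tree-based models); the helper name is our own.

```python
def accuracy_from_counts(total, wrong):
    """Accuracy as the fraction of correctly predicted test samples."""
    if total <= 0 or not 0 <= wrong <= total:
        raise ValueError("need total > 0 and 0 <= wrong <= total")
    return (total - wrong) / total


TOTAL_TEST_SAMPLES = 729  # Case 5 test set size, from the text
LR_SVC_WRONG = 458        # wrong predictions for LR and for SVC, from the text

tree_accuracy = accuracy_from_counts(TOTAL_TEST_SAMPLES, wrong=0)
lr_svc_accuracy = accuracy_from_counts(TOTAL_TEST_SAMPLES, LR_SVC_WRONG)
print(f"RF/ETC/XGBoost: {tree_accuracy:.0%}")
print(f"LR/SVC: {lr_svc_accuracy:.1%}")
```

With 271 of 729 samples correct, LR and SVC reach roughly 37% accuracy, which agrees with the figure reported for Table XIII.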

F. Driving Behavior Prediction Without Timestamp Feature and New (Derived) Features (Case 6)
As the number of features in the dataset is small, additional features are included to test the performance of the ML models. The mean of the accelerometer values in the x-, y-, and z-axes is included as another feature. Similarly, another feature is the mean of the gyroscope values in the x-, y-, and z-axes. Overall, nine features are used to train the models. Table XV presents the performance evaluation metric values when testing the dataset with the models. The prediction accuracy is improved in all five models, and the accuracy range is between 65% and 67%. The "aggressive" target class obtains 100% precision and recall for all five models. This shows that the difficulty of distinguishing the "slow" and "normal" classes hampers the overall accuracy of multiclass driving behavior classification. The "normal" target class characteristics should be captured in training to accurately classify all the classes in the DBD. As in the previous cases, the SVC and LR models classify the "slow" target class with a recall of 86%. Overall, the mean of the sensor values is essential and can yield significant performance gains in the multiclass category. Table XVI shows the consistency of model accuracy values by measuring the standard deviation. The results indicate that the accuracy for all the models lies between 0.62 and 0.69.
The ETC model correctly predicts more driving behavior samples (485) than the other models. XGBoost performs worst, with 474 correctly classified samples. The results in Table XVI show that most of those correct predictions belong to the "aggressive" target class.
Fig. 5 shows the comparison between all cases. According to the results, the models show superb performance when the "timestamp" feature is included in the dataset for training and testing. Although not as successful as the "timestamp" case, the models perform better with the additional features than with the original features without "timestamp."
Similarly, when the problem is transformed into a binary problem (slow versus aggressive), models show superior performance.
[...] selection and output class selection and the corresponding model performances, we propose that the XGBoost technique achieves the best performance with minimum computation time for multiclass classification of driving behavior samples. This variation is because of the number of target classes and the number of features used in the experiments. Case 6 consists of the original features as well as new features. Thus, the increase in feature set size also increases the computational time.
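The two derived features introduced for Case 6 can be sketched as below, assuming each sample is ordered as three accelerometer readings followed by three gyroscope readings. The ordering, the function name, and the sample values are assumptions. The sketch appends only the two means described in the text and does not attempt to reproduce the stated total of nine features.

```python
def add_axis_means(sample):
    """Append the accelerometer mean and the gyroscope mean to one sample.

    `sample` is assumed to be [acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z].
    """
    if len(sample) != 6:
        raise ValueError("expected six sensor readings")
    acc, gyro = sample[:3], sample[3:]
    acc_mean = sum(acc) / 3    # mean over the x-, y-, and z-axes
    gyro_mean = sum(gyro) / 3  # mean over the x-, y-, and z-axes
    return list(sample) + [acc_mean, gyro_mean]


# Made-up reading, for illustration only.
reading = [0.3, -0.6, 9.9, 0.03, 0.0, -0.03]
enriched = add_axis_means(reading)
print(len(enriched), enriched[-2:])
```

Because the means are computed per sample, the derived features add no information from neighboring samples; they only re-express the six readings of the same instant.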

H. Results Using Additional Datasets
To demonstrate the significance of the proposed approach, we utilized two additional datasets for case 6, where additional features are generated. The datasets used are the "DBD" [56] and the "Carla driver behavior dataset" (CDBD) [57]. DBD consists of 60 features combining accelerometer and gyroscope features, and four classes. The data were collected using a Ford Fiesta 1.25, a Ford Fiesta 1.4, and a Hyundai i20, with three different drivers aged 27, 28, and 37. The collection involves an MPU6050 sensor and a Raspberry Pi 3 Model B. The CDBD dataset consists of six features, three from the gyroscope and three from the accelerometer. Seven drivers contribute to this dataset, and each instance is labeled with the driver name: mehdi, selin, onder, apo, berk, hurcan, or gonca. For the experiments, the best-performing models, RF and GBM, are used, and the results are shown in Table XVIII. RF and GBM both show better results for the DBD dataset, achieving a 1.00 accuracy score, while on the CDBD they do not perform well, as only RF achieves a 0.72 accuracy score. The models' poor performance on the CDBD is because of the poor relationship between the target classes and the feature set. Fig. 6 shows the confusion matrices for RF and GBM for both datasets. For the CDBD dataset, confusion matrix values 1-7 indicate the apo, berk, gonca, hurcan, mehdi, onder, and selin classes, respectively. For the DBD dataset,

I. Performance of DL Models
This study also performs experiments using the DL approach. We deployed two state-of-the-art models, LSTM and CNN, for driver behavior prediction, as these are commonly used for similar kinds of datasets. For example, the study [38] used CNN for human behavior prediction, and the study [39] used LSTM and CNN for human activity detection. The authors utilized variants of CNN and LSTM in [40] for driver behavior detection. The wide use of these models motivated us to choose them in the current study for driver behavior prediction. We used these models with their state-of-the-art parameter settings, as shown in Table XIX. The embedding layer is an input layer that defines the vocabulary size, output dimension, and length of the feature set. We trained these models for 100 epochs with the categorical cross-entropy loss function because the data are multiclass. We also used the "Adam" optimizer to compile these models. Table XX shows the results of the DL models, which indicate that both models perform well when the "timestamp" feature is used. LSTM performs best, with a 0.84 accuracy score for case 5, where the timestamp is used. Overall, the DL models fall short of the ML models in accuracy because of the small feature set; DL models require a large feature set for a good fit.
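For reference, the categorical cross-entropy loss used for the three-class problem can be written out in a few lines. This is a plain-Python sketch of the standard definition, not the training code of the study; the logits, the class order, and the clipping constant `eps` are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities that sum to one."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


CLASSES = ["slow", "normal", "aggressive"]  # assumed class order
logits = [2.0, 1.0, 0.1]                    # hypothetical network outputs
probs = softmax(logits)
target = [1, 0, 0]                          # one-hot label for "slow"
loss = categorical_cross_entropy(target, probs)
print(dict(zip(CLASSES, (round(p, 3) for p in probs))), round(loss, 3))
```

A model that assigns equal probability to all three classes incurs a loss of ln 3 per sample, which gives a useful baseline when reading training curves.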

J. Comparison With Other Approaches
To show the significance of the proposed approach, we also performed a comparative analysis with other studies. For a fair comparison, we applied the approaches from these studies to the dataset used in the current study. We selected recent studies that work on similar types of datasets. The study [58] worked on human activity detection using an MLP model. The authors utilized gyroscope and accelerometer features for human activity detection. Similarly, studies [59] and [60] used smartphone accelerometer and gyroscope features with SVM and KNN models, respectively. The study [61] worked on sign classification using an ML approach. The authors deployed SVM on a dataset of accelerometer and gyroscope features. Similarly, smartphone IMU data are used in [62] with an ensemble model for the same purpose. We applied all these approaches to our dataset for all cases to carry out a performance comparison. Table XXI shows the comparison with other approaches, which indicates that the proposed approach outperforms the existing state-of-the-art approaches.

K. Discussion
This study performs experiments using a total of six different cases, in which the influence of using different sets of features on driver behavior prediction is extensively investigated. Similarly, the impact of multiclass and binary classification is also analyzed. It is found that "timestamp" is the most important feature regarding the performance of ML models. Adding this feature to the training dataset dramatically increased the accuracy for multiclass classification. Although the additional (derived) features, even without the "timestamp" feature, perform better than the individual accelerometer and gyroscope features alone, this performance is inferior to that obtained with the "timestamp" feature. Primarily, the "slow" and "normal" classes seem to have a similar feature space, as shown in Fig. 7(f), which leads to a higher number of wrong predictions for these classes when the "timestamp" feature is not used. Using accelerometer features or gyroscope features alone is not sufficient to produce high performance, as shown in Fig. 7(a) and (b). However, when the "timestamp" feature is combined with either accelerometer or gyroscope features, the performance of the models is enhanced, as shown in Fig. 7(c) and (d). The deliverable of this research is a software-based approach for driver behavior prediction that is more accurate and efficient. Linked with a data source, this approach provides driver behavior prediction.

V. CONCLUSION
Driver behavior prediction is an important part of designing the interaction between advanced driver assistance systems (ADASs) and human drivers for future transportation systems. Consequently, driver behavior prediction has emerged as an important research topic and has been investigated largely during the past few years. Often, the investigations are based on simulators and controlled environments. This study investigates the use of different sets of features, feature combinations, and original plus derived features for driver behavior prediction using a dataset recorded in a real traffic environment. The data recorded using the smartphone accelerometer and gyroscope are used for experiments with several ML and DL models. This study designs six cases to investigate the impact of feature selection and binary versus multiclass classification problems. Results indicate that using accelerometer or gyroscope data alone is not sufficient to obtain high performance. Combining the features increases the performance, yet the accuracy is still low. Primarily, the "slow" and "normal" class feature spaces tend to overlap, which reduces the performance of the models. Adding derived features further improves the performance; however, the best performance of 100% accuracy is achieved by the RF and XGBoost models when accelerometer and gyroscope features are combined with the "timestamp" feature. DL models tend to show lower accuracy than ML models. This study uses a small dataset, which can be seen as a limitation. The small size of the dataset may not be enough for training the models, especially DL models. The second limitation is the small feature set, because DL models require a large feature set to get a good fit. In future work, we intend to collect our own dataset and perform experiments. Besides, the use of non-ML architectures, including probabilistic methods or statistical directed acyclic graphs, would be a good direction to explore.