Driver Behavior Recognition using Recurrent Neural Network in Multiple Depth Cameras Environment

To improve driving safety through driver behavior recognition in an in-car environment, we propose to use depth cameras mounted in a car, together with a deep learning algorithm, to generate behavior models for driver behavior classification. The contribution of this paper is threefold: 1) the proposed multi-view driver behavior recognition system can handle occlusion occurring in one of the cameras; 2) the recurrent neural network can effectively recognize continuous-time behavior; 3) the average recognition accuracy of the proposed systems reaches 83% for the single-view skeleton-based approach and 88% for the multiple-view point-cloud-based approach.
Figure 1. In-vehicle environment: (a) In-vehicle environment is a narrow space, (b) Driver in a sitting position and whole body was occluded by other


Introduction
A driver's behavior plays an important role in traffic safety. For example, answering a phone, watching a video, or turning the head to chat with other people often leads to car accidents. To increase driving safety, a driver's behavior is analyzed, understood, and recognized [1,2] so that the driver can be assisted to behave properly in a car. For example, Jain et al. [2] proposed to utilize cameras to understand a driver's behavior in an in-vehicle environment. However, it is challenging to use an in-car camera for behavior recognition due to lighting changes, occlusion, and clutter. Furthermore, in the limited in-car space, as shown in Fig. 1 (a), the positions where a camera can be mounted to capture a driver's behavior are also very limited. With such limited mounting positions, the captured frames suffer from severe self-occlusion, as shown in Fig. 1 (b). To recognize a driver's behaviors, a Kinect depth camera is mounted in a car, with skeletons and the 3D point cloud obtained from the official SDK [3]. We propose two approaches for driver behavior recognition based on a deep learning algorithm. The rest of this paper is organized as follows. The related work and the framework of the proposed driver behavior recognition system using a recurrent neural network are presented. The experimental results are then reported. Finally, the conclusions and future work are given.

Related Work
To recognize human behavior using a Kinect depth camera, Hussein et al. [4] proposed a 3D joint covariance descriptor, which employs the angular relationships among joint vectors, with a linear support vector machine (SVM) classifier for recognizing actions. By adopting MoCap and a Kinect depth camera, Wang et al. [5] extracted 3D joint features and used local occupancy patterns to generate spatial histograms for behavior recognition, using SVMs to train the classifiers. Yang and Tian [6] proposed the EigenJoints method based on principal component analysis for behavior recognition. In addition to depth-camera approaches, human action recognition approaches [7,8] adopt deep learning algorithms with a color camera.
On the other hand, researchers have turned to multiple cameras to compensate for the occluded areas and the out-of-observation-range issues of a single-camera environment. Azis et al. [9] proposed a weighted averaging fusion algorithm for generating a multiview skeleton, with extracted 3D joint features used to train the behavior classifiers. Kuo et al. [11] proposed a time-variant skeleton vector projection scheme using multiple infrared-based depth cameras. Their occlusion-based weighting element generation can be employed to train SVM classifiers to recognize behaviors in a multiple-view environment.
In the in-vehicle environment, recognizing a driver's behavior is a challenging research issue in human action recognition due to the limited viewing angle and the cluttered environment in a car. To name a few, Xing et al. [12] used a Kinect camera with a feedforward neural network (FFNN) to identify driving and non-driving actions. On the other hand, Chuang et al. [13] used the relative spatial positions of the skeleton joints to recognize the driving behavior.
Therefore, in this paper, we focus on driver behavior recognition in both a single-depth-camera environment and a multiple-depth-camera environment. In addition, deep learning algorithms are adopted for training a proper model for classification. Furthermore, the computational complexity of the different deep learning architectures is compared and discussed.

Proposed Driver Behavior Recognition System
In this paper, a driver behavior recognition system is proposed, as shown in Fig. 2. The proposed system is divided into a training stage and a testing stage. In the training stage, the training data are pre-processed and used for training the recurrent neural network (RNN) model. In the testing stage, the testing data are pre-processed in the same way, and the RNN model generated in the training stage is used to classify them, which yields the recognized driver behavior. The details are described in the following subsections.
Figure 2. The flowchart of the proposed multiple views driver behavior recognition system.

Recurrent Neural Network
To generate a model for behavior recognition, a conventional recurrent neural network (RNN) is adopted in this paper. As shown in the left part of Fig. 3, an RNN uses sequential input data (the green circle) together with a memory state (the orange circle) and the generated hidden layers (blue circles) to decide the output (yellow circle). The right part of Fig. 3 unrolls this conceptual architecture into a step-by-step flow over time. For example, at time instance t_2, the input vector is x_2 (the green circle in the bottom-middle part), and the hidden layer s_2 (the central blue circle) is influenced by the memory state value c_1 (the orange circle, a copy of s_1) from the previous time instance t_1. Meanwhile, the current hidden layer s_2 is copied to c_2, which becomes an input of s_3 at the next time instance t_3. Directly applying an RNN model brings the advantages of an artificial neural network, but it also brings the vanishing gradient problem encountered when training a deep neural network. To alleviate the vanishing gradient problem, Hochreiter and Schmidhuber proposed the long short-term memory (LSTM) [14] approach. Instead of the simple chain structure of a hidden layer with a tanh activation function in a conventional RNN, LSTM uses multiple sigmoid activation functions to control its memory adaptively. In this paper, we adopt LSTM to generate the RNN model for behavior recognition.
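The recurrence described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the weight names `W_in`, `W_rec`, and `b`, the zero initial memory state, and the toy dimensions are assumptions made only for illustration.

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, b):
    """Run a simple recurrent layer over a sequence xs of shape (T, input_dim)."""
    c = np.zeros(W_rec.shape[0])  # memory state before t_1 (assumed to start at zero)
    states = []
    for x_t in xs:
        # Hidden layer s_t depends on the current input x_t and the previous memory c_{t-1}.
        s_t = np.tanh(W_in @ x_t + W_rec @ c + b)
        # The memory state is a copy of the current hidden layer, used at the next step.
        c = s_t.copy()
        states.append(s_t)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 3, 4, 5  # toy sizes, chosen only for this example
xs = rng.normal(size=(T, input_dim))
W_in = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_rec = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(xs, W_in, W_rec, b)
print(states.shape)
```

Because the same `W_rec` is applied at every time step, gradients propagated back through many steps are multiplied repeatedly by closely related factors, which is one way to see why plain RNNs are prone to vanishing gradients.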

Skeleton Based Driver Behavior Recognition System
In the proposed driver behavior recognition system, as shown by the scenario in Fig. 1, Kinect cameras are mounted at the left and right of a driver to capture the skeleton data using the official Microsoft Kinect SDK 2.0 [3], which provides 25 skeletal joints, as shown in Fig. 4. As shown in Fig. 1 (b), because the lower body of the driver is occluded by the instrument panel, only the skeletal joints of the upper body of the driver are used as the input for generating the RNN model for behavior recognition. The pre-processing step in the proposed skeleton-based approach therefore removes the skeletal joints that do not belong to the upper body, and the remaining joints form the input X of the RNN. As shown in Fig. 5, a single-layer LSTM neural network (the blue rectangles in the middle) is adopted for generating the RNN model for driver behavior recognition. After the pre-processing step, 19 joints are retained for each frame. The 3D position values along the x, y, and z axes are obtained for each joint from the Kinect SDK. Therefore, 19 × 3 = 57 values are used as the input nodes of the LSTM neural network at each time instance. For example, at time instance t_1, 57 values are collected from frame t_1 of a Kinect camera, as shown in the left part of Fig. 5. In addition, a total of 60 frames are used to observe one behavior. Furthermore, in the single-layer LSTM neural network for generating the RNN model, the number of nodes in the hidden layer is set to 10.
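The skeleton pre-processing can be sketched as follows. The text states only that 19 of the 25 joints are retained; which six joints are removed is not specified, so the index set `LOWER_BODY` below (knees, ankles, and feet in the Kinect v2 joint ordering) is an assumption, and the random clip stands in for real skeleton data.

```python
import numpy as np

NUM_JOINTS = 25
# Assumed lower-body joints to drop (knees, ankles, feet); not specified in the paper.
LOWER_BODY = [13, 14, 15, 17, 18, 19]
UPPER_BODY = [j for j in range(NUM_JOINTS) if j not in LOWER_BODY]

def skeleton_clip_to_input(clip):
    """Convert a clip of shape (frames, 25, 3) into an RNN input of shape (frames, 57)."""
    clip = np.asarray(clip, dtype=np.float32)
    if clip.shape[1:] != (NUM_JOINTS, 3):
        raise ValueError("expected a clip of shape (frames, 25, 3)")
    upper = clip[:, UPPER_BODY, :]            # keep the 19 retained joints
    return upper.reshape(clip.shape[0], -1)   # flatten x, y, z of each joint per frame

clip = np.random.rand(60, NUM_JOINTS, 3)      # placeholder for 60 frames of joint positions
X = skeleton_clip_to_input(clip)
print(X.shape)
```

Each row of `X` then corresponds to one time instance of the LSTM input, with 19 × 3 = 57 values per frame.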

Multiple Views Point Cloud Based Driver Behavior Recognition System
In the proposed driver behavior recognition system, instead of using a single camera, a second Kinect camera can be mounted in the car to compensate for occlusion and for regions that fall outside the observation range (field of view) of one camera. The pre-processing steps are illustrated in Fig. 6. Using a homography matrix [16,17], as shown in Fig. 6 (b), the point clouds from multiple views are fused into a more complete 3D point cloud. Next, as shown in Fig. 6 (c), the background points around the driver are removed by setting a 3D region-of-interest. Finally, to keep the input size of the RNN model reasonable, the 3D points are uniformly downsampled to 10,000 points, as shown in Fig. 6 (d). After the pre-processing step, 10,000 points of the point cloud are retained for each frame. The 3D position values along the x, y, and z axes are obtained for each point from the Kinect SDK. Therefore, 10,000 × 3 = 30,000 values are used as the input nodes of the LSTM neural network at each time instance. For example, at time instance t_1, 30,000 values are collected from frame t_1, as shown in the left part of Fig. 7. In addition, a total of 30 frames are used to observe one behavior. Furthermore, in the three-layer LSTM neural network for generating the RNN model, the numbers of nodes in the hidden layers are set to 2048, 512, and 128, respectively.
056-2 IS&T International Symposium on Electronic Imaging 2019 Autonomous Vehicles and Machines Conference 2019
Finally, for both the skeleton-based approach and the multiple-view point-cloud-based approach, the generated RNN models are used for driver behavior recognition with conventional LSTM [14] classification.
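The point cloud pre-processing described in this section (fusion of two views, region-of-interest cropping, and uniform downsampling) can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions: the calibration is represented as a 4 × 4 homogeneous matrix `T_b_to_a`, the region-of-interest is an axis-aligned box, "uniform" downsampling is taken to mean evenly spaced indices, and the random clouds and box limits are placeholders rather than real data.

```python
import numpy as np

def fuse_views(cloud_a, cloud_b, T_b_to_a):
    """Map cloud_b (N, 3) into the frame of cloud_a with a 4x4 matrix, then merge."""
    homog = np.hstack([cloud_b, np.ones((len(cloud_b), 1))])  # (N, 4) homogeneous points
    mapped = homog @ T_b_to_a.T
    mapped = mapped[:, :3] / mapped[:, 3:4]                   # back to 3D coordinates
    return np.vstack([cloud_a, mapped])

def crop_roi(cloud, lower, upper):
    """Keep only the points inside an axis-aligned 3D box (assumed ROI shape)."""
    keep = np.all((cloud >= lower) & (cloud <= upper), axis=1)
    return cloud[keep]

def downsample_uniform(cloud, num_points=10_000):
    """Pick num_points evenly spaced points (indices repeat if the cloud is smaller)."""
    if len(cloud) == 0:
        raise ValueError("empty point cloud after ROI cropping")
    idx = np.linspace(0, len(cloud) - 1, num_points).astype(int)
    return cloud[idx]

def frame_to_input(cloud_a, cloud_b, T_b_to_a, lower, upper):
    fused = fuse_views(cloud_a, cloud_b, T_b_to_a)
    driver = crop_roi(fused, lower, upper)
    return downsample_uniform(driver).reshape(-1)             # 10,000 x 3 -> 30,000 values

rng = np.random.default_rng(0)
left_view = rng.uniform(-1.0, 1.0, size=(50_000, 3))          # placeholder clouds
right_view = rng.uniform(-1.0, 1.0, size=(50_000, 3))
T = np.eye(4)                                                 # placeholder calibration
x_t = frame_to_input(left_view, right_view, T,
                     lower=np.array([-0.5, -0.5, -0.5]),
                     upper=np.array([0.5, 0.5, 0.5]))
print(x_t.shape)
```

Stacking 30 such vectors, one per frame, gives the input sequence for one behavior clip.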

Experimental Results
In the experiments, two Kinect v2 depth cameras with a resolution of 512 × 424 are used to obtain the 3D point cloud and capture a driver's behavior with the official Kinect SDK [3], and the depth data serve as the raw data. To simulate the in-car environment, as shown in Fig. 8, the right Kinect camera was positioned 1.1 m away from the driver to capture the right view, and the other Kinect camera was mounted 0.8 m away to capture the left view. The pre-processing tasks include skeleton extraction, multi-camera point cloud calibration, background removal, and downsampling. In addition, TensorFlow 1.8.0 [18] is used to build the RNN model.

VAP Multiple Views Driver Behavior Dataset
To evaluate the proposed method, we built the "VAP Multiple Views Driver Behavior Dataset". As shown in Fig. 9, ten volunteers were invited to perform ten different behaviors three times each. As a result, 10 × 10 × 3 = 300 video clips were generated, with the two views time-synchronized manually. For example, Fig. 10 shows the consecutive skeletons and point clouds of a waving behavior after the pre-processing steps. In the evaluation, leave-one-out cross-validation (LOOCV) is adopted. In our test, the data from nine of the ten drivers are used for model training/validation, and the data of the remaining driver are used for testing the classification performance. The average classification accuracy is reported in the following subsections.
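The evaluation protocol can be sketched as a leave-one-driver-out split. The sketch assumes that the held-out driver rotates over all ten volunteers and that clips are identified by hypothetical (driver, behavior, repetition) index triples; it shows only how the folds are formed, not how the models are trained.

```python
from itertools import product

NUM_DRIVERS, NUM_BEHAVIORS, NUM_REPEATS = 10, 10, 3

# Each clip is identified by a (driver, behavior, repetition) triple.
clips = list(product(range(NUM_DRIVERS), range(NUM_BEHAVIORS), range(NUM_REPEATS)))

def leave_one_driver_out(clips, num_drivers):
    """Yield (held_out_driver, train_clips, test_clips) for every driver in turn."""
    for held_out in range(num_drivers):
        train = [c for c in clips if c[0] != held_out]
        test = [c for c in clips if c[0] == held_out]
        yield held_out, train, test

folds = list(leave_one_driver_out(clips, NUM_DRIVERS))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Splitting by driver rather than by clip keeps all recordings of the test driver out of training, so the reported accuracy reflects behavior recognition for an unseen person.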

Single View Skeleton Based Driver Behavior Recognition
First, the joints of the skeleton data captured from a single Kinect camera are used for evaluation. As shown by the green circle at the bottom left of Fig. 3, the RNN input X has 57 nodes, and x_1, x_2, x_3, ..., x_60 are observed, meaning that a user's behavior is observed over 60 frames. The learning rate and the dropout rate are set to 0.0001 and 0.5, respectively.
After 10,000 training iterations for obtaining the RNN model, the average accuracy reaches 0.83, ranging from 0.67 to 0.90, as shown in Table 1. The "right side" behavior recognition results in Table 1 are clearly more accurate than the "left side" results. The reason is that, in the left-driving setting, the right view has fewer self-occlusion issues and a proper Kinect camera observation distance: about 1.0 m, which falls within the valid depth observation range of 0.5 m to 3.5 m of the infrared-based depth sensor. In other words, the skeletons observed from the "left side" camera are relatively noisy because parts of the driver are too close, at less than 0.5 m, which is almost outside the valid observation range of the depth sensor.
Furthermore, as shown by the confusion matrix for the different behaviors in Fig. 11, the behaviors "Turning right" and "Adjusting mirror" are classified with an accuracy of 0.97, but the behavior "Watching video" is classified with a relatively lower accuracy of 0.70. The misclassification of "Watching video" is caused by its skeleton geometry being similar to that of "Look up" and "Waving left" when seen from a single camera, due to self-occlusion and out-of-observation-range issues in a single-camera environment.
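The single-layer configuration above (57 input nodes, 60 frames, 10 hidden nodes) can be illustrated with a plain NumPy forward pass through a standard LSTM cell. This is a sketch and not the TensorFlow model used in the experiments: the gate ordering, the random weights, and the softmax output layer over the ten behavior classes are assumptions, and training with the stated learning rate and dropout is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_classify(frames, W, U, b, W_out, b_out):
    """Run one LSTM layer over frames (T, input_dim) and return class probabilities."""
    hidden = U.shape[1]
    h = np.zeros(hidden)   # hidden state
    c = np.zeros(hidden)   # cell (memory) state
    for x_t in frames:
        z = W @ x_t + U @ h + b                  # pre-activations of all four gates
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
        g = np.tanh(g)                           # candidate memory content
        c = f * c + i * g                        # adaptive memory update
        h = o * np.tanh(c)
    logits = W_out @ h + b_out                   # assumed linear output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                           # softmax over behavior classes

INPUT_DIM, HIDDEN, FRAMES, CLASSES = 57, 10, 60, 10
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * HIDDEN, INPUT_DIM))
U = rng.normal(scale=0.1, size=(4 * HIDDEN, HIDDEN))
b = np.zeros(4 * HIDDEN)
W_out = rng.normal(scale=0.1, size=(CLASSES, HIDDEN))
b_out = np.zeros(CLASSES)

frames = rng.normal(size=(FRAMES, INPUT_DIM))    # placeholder for one skeleton clip
probs = lstm_classify(frames, W, U, b, W_out, b_out)
print(probs.shape, int(probs.argmax()))
```

With untrained random weights the predicted class is meaningless; the sketch only shows how a 60 × 57 clip is reduced to one probability per behavior.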

Multiple Views Point Cloud Based Driver Behavior Recognition
To compensate for the limitations of a single-view camera environment, according to our proposed method, the 3D point cloud captured from multiple views with Kinect cameras is used for performance evaluation. After the pre-processing stage, the RNN input X has 30,000 nodes, and x_1, x_2, x_3, ..., x_30 are observed, meaning that a user's behavior is observed over 30 frames. The learning rate and the dropout rate are set to 0.00001 and 0.5, respectively. As shown in Table 2, the average behavior recognition accuracy increases with the number of epochs. For example, when the number of epochs is set to 2,000, the average accuracy is 0.88. Moreover, the confusion matrix of the point-cloud-based multiple-view setting is shown in Fig. 12. Compared with the single-view skeleton-based approach in Fig. 11, the accuracy for most of the behaviors is close to 1.00, but the "Horn" behavior was incorrectly classified as the "Look up" behavior, because the hand motion of honking the horn is occluded by the steering wheel. In addition, the accuracy of "Turning right" is reduced from 0.97 to 0.83, because the distance from the Kinect camera to the driver is too short (less than 0.5 m). Overall, by combining the depth information from the two Kinect camera views, some of the missing or occluded parts in one view can be compensated by the other view, and the behavior recognition accuracy can be improved.


Complexity Comparison
The proposed methods were executed on a computer with an Intel 3.20-GHz CPU (Core i7), a GTX 1080 Ti GPU, and 64 GB of RAM. The total computational times for the skeleton-based and the point-cloud-based approaches are shown in Table 3. Comparing the first row with the second row, it is apparent that the skeleton-based approach takes much less time than the point-cloud-based approach. The main reason is that the RNN input X of the point-cloud-based approach needs 30,000 nodes, whereas that of the skeleton-based approach needs only 57 nodes. In terms of the total training time and the training time per 100 epochs, the point-cloud-based approach costs about 100 times as much as the skeleton-based approach. However, its testing time is only about 3 times as long, and its GPU memory usage is about 35 times as large. Therefore, the higher accuracy obtained through the compensating property of the fused point cloud comes at the price of increased computational cost and memory usage.
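The difference in model size between the two approaches can be estimated by counting LSTM weights. The sketch below assumes a standard LSTM layer (four gate blocks, each with input weights, recurrent weights, and a bias, without peephole connections) and ignores the output layer. It is a back-of-the-envelope estimate, not a measurement from the experiments, and it is not meant to reproduce the figures in Table 3.

```python
def lstm_params(input_dim, hidden_dim):
    """Weights in one standard LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def stacked_lstm_params(input_dim, hidden_dims):
    """Total weights of stacked LSTM layers, each layer feeding the next."""
    total = 0
    for hidden_dim in hidden_dims:
        total += lstm_params(input_dim, hidden_dim)
        input_dim = hidden_dim
    return total

skeleton = stacked_lstm_params(57, [10])                      # single layer, 10 hidden nodes
point_cloud = stacked_lstm_params(30_000, [2048, 512, 128])   # three layers
print(skeleton, point_cloud)
```

Under these assumptions, almost all of the point-cloud model's weights sit in the first layer, which connects the 30,000 input nodes to the 2048 hidden nodes. This is consistent with the input size being the main driver of the training cost.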

Conclusion
In conclusion, we proposed two approaches for driver behavior recognition based on Kinect depth cameras: a skeleton-based approach and a multiple-view point-cloud-based approach. Recurrent neural network models based on the LSTM algorithm are adopted for training the behavior models in the proposed approaches. In the experimental results, the driver behavior recognition accuracy reaches 83% and 88%, respectively. In the future, the proposed driver behavior recognition scheme can be applied in an in-vehicle environment and may be adopted in the development of advanced driver assistance systems (ADAS). Furthermore, wearable sensors on a driver and sensors mounted on cars can also be utilized for driver behavior recognition.