Monitoring Driver’s Vigilance Level Using Real-Time Facial Expression and Deep Learning Techniques

Road accidents caused by human error are among the leading causes of death worldwide. Specifically, drowsiness and loss of consciousness while driving are responsible for many fatal accidents on highways. Accuracy and performance are the key metrics for the many techniques that have been researched for detecting driver drowsiness. To improve these metrics, this paper proposes a new method based on image processing and deep learning. The method detects the facial region with the Haar-cascade method and estimates the probability of drowsiness with a convolutional neural network. Evaluation of the proposed method on the UTA-RLDD dataset with stratified 5-fold cross-validation showed a high accuracy of 96.8% at a speed of 10 frames per second, which is higher than the values previously reported in the literature. For further investigation, a custom dataset of 10 participants recorded in different light conditions was collected. The results of all experiments show the great potential of the proposed method for practical applications in intelligent transportation systems.


I. INTRODUCTION
WORLDWIDE, there are many fatalities due to car accidents [1], [2], [3]. According to the AAA Traffic Safety Foundation and NHTSA report, 37% of accidents in developed countries such as France and Japan are caused by drowsiness, resulting in an average of 70,000 injuries and 1,550 deaths annually [4], [5], [6], [7], [8]. The situation seems to be even worse in some countries due to a lack of legislation and lower technological development [9]. Because of the impact of drowsiness on road crashes and injuries, numerous methods have been proposed to detect driver drowsiness [10], [11], [12]. One of the solutions considered in car safety is the development of intelligent driver surveillance systems [13], [14]. To this end, smart systems that detect drowsiness using measures such as the percentage of eye closure (PERCLOS), steering wheel movement (SWM), and standard deviation of lane position (SDLP) have received increasing attention [15], [16], [17]. Depending on whether they rely on the driver's physical condition, vehicle movement patterns, and/or environmental conditions, smart systems are categorized into three main groups [18]: i) physiological-based, ii) vehicle-based, and iii) behavior-based methods. These systems not only improve the quality of driving by automating the motors, steering, gears, brakes, and more in ad-hoc systems [19], [20], but also assist and alert the driver through smart detection of accident situations, which in turn leads to a decrease in traffic accidents [21], [22].
To date, essential built-in automotive safety modules, including emergency brake assist systems [22], front crash alert [23], lane departure alert [24], blind spot detection [25], smart headlights [26], and driver drowsiness detection systems [27], play a critical role in preventing road accidents and fatal crashes. Since changes in the eyes, head, and face of drowsy drivers (e.g., prolonged blinking time, slow eyelid movement, eyelids approaching each other or even closing, nodding, yawning, gazing, and sluggish eye movement) are clearly observable, behavior-based methods that use image processing techniques are a promising way to detect drowsiness [28], [29]. For example, in [11], a new method for assessing drivers' drowsiness using yawning features was proposed. The system first extracts the facial area using a support vector machine (SVM) to reduce the cost of subsequent processing. Then the facial edges are found and analyzed using a new edge detection technique. In [12], the authors proposed a fast sleepiness detection method to prevent road accidents. The method detects the face by applying the HSI colour model to the input images. To find the position of the eyes, the Sobel edge detector is applied, and the dynamic image of the eye is used for eye tracking. Afterward, the images are analyzed to assess whether the eyes are closed or open. Moreover, in [30], the authors designed a system that detects drowsiness using eye closure and head posture features. They first recorded videos using a webcam, and the participants' faces and eyes were then recognized via the Viola-Jones method. Afterward, the eye region was detected with a Haar classifier. Finally, to determine the state of the eyes, a wavelet network based on a neural network was applied. The system warns the driver when the period of eye closure exceeds a predefined threshold. In [10], the authors presented a real-time two-step approach using deep learning. In this approach, Multi-task Cascaded Convolutional Networks (MTCNN) are first applied to find the face location as well as five landmarks, including the eyes, nose, and lips. These landmarks are then fed into a deep neural network similar to the AlexNet architecture to classify the samples into three classes: conscious, yawning, and sleepy. Finally, in [31], the authors proposed a real-time deep learning-based approach, compatible with the Android OS, to identify drowsiness. In their study, the videos of 22 subjects from the NTHU database were converted into images, and the Dlib library (Viola-Jones method) was then applied to extract the facial landmarks. These landmarks are fed as input to a multilayer perceptron with three hidden layers to detect the driver's status (vigilant or sleepy).
There are three important factors in driver drowsiness detection: feasibility (in terms of cost, comfort, and availability), accuracy (true detection rate, reliability), and speed. Proposing a method that satisfies all three factors is challenging and of high importance. Since some of the previous methods have low accuracy and speed in detecting drowsiness, the goal of this paper is to propose a new method for driver drowsiness detection that is affordable, works in real time, and leads to reliable results. The proposed method is based on videos of the driver captured by a camera and uses image processing to detect drowsiness. It is therefore simple and has low overhead costs, since the system requires only one camera installed inside the car. Moreover, the reliability and accuracy of the method can be evaluated using real data. Finally, the proposed method uses high-speed image processing techniques, which enable it to detect drowsiness in real time.

A. Dataset Descriptions
The model was trained and evaluated on the UTA-RLDD dataset [32], which is the largest publicly available dataset. UTA-RLDD stands for the University of Texas at Arlington Real-Life Drowsiness Dataset [39]. The dataset has a higher number of samples than other available datasets. It consists of 180 videos taken from 60 participants, grouped into three classes, namely Alert, Low Vigilance, and Drowsy. Furthermore, these videos are socially and technically diverse:
• Nationality: 30 Indo-Aryan and Dravidian, 5 non-white Hispanic, 8 Middle Eastern, 7 East Asian, and 10 Caucasian,
• Age: 60 healthy participants in the 20-59 age range, of whom 50 are male and 10 are female,
• Special conditions: wearing glasses (21 videos), with beard or mustache (72 videos),
• Capturing condition: self-recorded videos from different angles, taken in such a way that both eyes are visible,
• Video duration: each video is almost 10 minutes long, with a frame rate below 30 frames per second.
Our proposed method solves a binary classification problem for drowsy and alert samples. As a result, we used only the videos from this dataset that represent alert and drowsy behaviors.
Furthermore, for additional analysis, we collected a custom dataset with the help of 10 participants, who used their laptops' webcams. It consists of 53 alert and 69 drowsy videos. Videos were captured in various light conditions (normal light, lack of light, very bright). To obtain a better variety of sleepy faces, the participants were asked to record the sleepy videos in three different ways:
• Closed eyes with the head leaning downward,
• Closed eyes with the head straight,
• Yawning.
Fig. 1: The overall framework of the proposed method.

B. Drowsiness Detection Method
The proposed method for detecting drowsiness consists of five stages:
• Converting videos to images,
• Converting RGB images to grayscale,
• Cropping face images using the Haar-cascade face detection method,
• Resizing face images,
• Feature extraction and image classification using a CNN.
We start by taking the video of a driver and converting it into a sequence of images. The images are then converted to grayscale for further processing. In the next step, the face of the driver is detected using the Haar-cascade method. Finally, a CNN is used as a feature extractor and classifier to predict the probability of the driver's drowsiness. The overall framework is shown in Figure 1.
1) Face Detection and Preprocessing: In the first step, the frames of the video are converted to grayscale. Then Haar-cascade, one of the accurate, simple, and fast methods for face detection [33], is used to crop the face from the images. Haar-cascade uses Haar-like features to detect facial landmarks such as the eyes, eyebrows, and mouth. Some of the Haar-like features are shown in Figure 1 (Haar-Cascade section). These features act as filters that slide across the image in order to detect specified patterns such as lines and edges. Each Haar-like feature resembles some part of the face. For example, the eyebrow is detected using an edge detection feature because the eyebrow pixels are darker than the pixels above the eyebrow. Each feature has a black region and a white region, and the feature value is calculated by:
$$\text{feature value} = \frac{\sum \text{white pixels}}{\#\,\text{white pixels}} - \frac{\sum \text{black pixels}}{\#\,\text{black pixels}} \tag{1}$$
Fig. 2: Calculating a Haar-like feature by subtracting the average of the pixels in the black region from the average of the pixels in the white region.
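As an illustration of Eq. (1), the following minimal sketch computes a two-rectangle Haar-like feature on a toy grayscale patch. The function names, the patch values, the choice of the upper half as the white region, and the threshold of 100 are all assumptions made for this example; they are not taken from the paper or from OpenCV.

```python
def region_mean(image, top, left, height, width):
    """Average intensity of a rectangular region of a 2-D grayscale image."""
    total = 0
    for row in image[top:top + height]:
        total += sum(row[left:left + width])
    return total / (height * width)


def haar_edge_feature(image, top, left, height, width):
    """Two-rectangle (horizontal edge) feature, following Eq. (1).

    The upper half of the window is treated as the white region and the
    lower half as the black region (an assumed layout for this example).
    """
    half = height // 2
    white_mean = region_mean(image, top, left, half, width)
    black_mean = region_mean(image, top + half, left, half, width)
    return white_mean - black_mean


# Toy 4x4 patch: bright skin on top (200), dark eyebrow below (50).
patch = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]
feature_value = haar_edge_feature(patch, top=0, left=0, height=4, width=4)
has_feature = feature_value > 100  # hypothetical threshold
print(feature_value, has_feature)
```

On this patch the white region averages 200 and the black region 50, so the feature responds strongly to the bright-above-dark edge that the text associates with an eyebrow.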
An example of applying a Haar-like feature is illustrated in Figure 2. If the feature value is greater than a threshold, the image is considered to contain the feature. The features can be used at various scales and sizes. Among the various Haar-like features, with different scales and different threshold values, a weighted combination is chosen using AdaBoost as follows:
$$F(x) = \sum_{i} \alpha_i f_i(x) \tag{2}$$
where $f_i(x)$ are the selected features and $\alpha_i$ are the corresponding weights. AdaBoost selects the most relevant features and discards the others; thus, it reduces the computation. To increase efficiency, the Haar-cascade method focuses on the face region and discards the surrounding area. For this purpose, cascading is used. In the cascade, the first feature is checked; if the frame contains it, the frame is passed to the next stage, otherwise it is discarded. The cascade then checks the second feature, and this procedure is repeated until all the features describing the face have been found. The cascading technique can dramatically increase the speed of face detection.
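The weighted combination and the early-rejection behaviour of the cascade described above can be sketched as follows. This is a toy illustration only: the weak classifiers, their weights, the stage thresholds, and the dictionary of precomputed feature responses are hypothetical and do not reproduce the trained cascade shipped with OpenCV.

```python
def strong_classifier(window, weak_classifiers, threshold):
    """Weighted vote of weak classifiers: sum_i alpha_i * f_i(window)."""
    score = sum(alpha * f(window) for alpha, f in weak_classifiers)
    return score >= threshold


def cascade_accepts(window, stages):
    """Return True only if every stage accepts the window.

    Evaluation stops at the first rejecting stage, so most non-face
    windows are discarded after very little computation.
    """
    for weak_classifiers, threshold in stages:
        if not strong_classifier(window, weak_classifiers, threshold):
            return False
    return True


# Hypothetical weak classifiers over precomputed feature responses.
def eye_band_feature(window):
    return 1 if window["eye_band"] > 40 else 0


def nose_bridge_feature(window):
    return 1 if window["nose_bridge"] > 25 else 0


def mouth_edge_feature(window):
    return 1 if window["mouth_edge"] > 15 else 0


stages = [
    ([(1.0, eye_band_feature)], 1.0),                             # cheap first stage
    ([(0.6, nose_bridge_feature), (0.4, mouth_edge_feature)], 0.5),
]

face_like = {"eye_band": 80, "nose_bridge": 30, "mouth_edge": 5}
background = {"eye_band": 10, "nose_bridge": 90, "mouth_edge": 60}
print(cascade_accepts(face_like, stages))   # passes both stages
print(cascade_accepts(background, stages))  # rejected by the first stage
```

The background window never reaches the second stage, which is the source of the speed-up attributed to cascading.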
We used the OpenCV library, which includes a Python implementation of Haar-cascade, to extract faces from the video frames. To unify the size of the input images fed to the CNN and to reduce the computational cost, the face images produced by Haar-cascade are resized to 28 × 28.
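A minimal sketch of this preprocessing step with OpenCV's Python bindings is given below. It assumes the stock `haarcascade_frontalface_default.xml` model bundled with the `opencv-python` package; the `scaleFactor` and `minNeighbors` values and the rule of keeping only the largest detected face are assumptions, since the paper does not state them.

```python
def largest_box(boxes):
    """Pick the detection with the largest area from (x, y, w, h) boxes."""
    return max(boxes, key=lambda box: box[2] * box[3]) if len(boxes) else None


def extract_face(frame, detector, size=(28, 28)):
    """Grayscale a frame, crop the largest detected face, and resize it."""
    import cv2  # imported lazily so largest_box works without OpenCV

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # OpenCV frames are BGR
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    box = largest_box(boxes)
    if box is None:
        return None  # no face found in this frame
    x, y, w, h = box
    return cv2.resize(gray[y:y + h, x:x + w], size)


def faces_from_video(video_path):
    """Yield one 28x28 grayscale face crop per frame in which a face is found."""
    import cv2

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    capture = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break  # end of video or read error
            face = extract_face(frame, detector)
            if face is not None:
                yield face
    finally:
        capture.release()
```

Calling `list(faces_from_video("driver.mp4"))` (a hypothetical file name) would return the face crops for a whole video.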
2) The Proposed CNN: The proposed network is inspired by the LeNet architecture [34], which consists of five layers. The network has two sets of convolution (CONV) and pooling (POOL) layers, followed by two fully connected (FC) layers. In the first layer, the resized 28 × 28 face image is fed into a CONV layer with 20 different filters of size 5 × 5, with stride 1 and zero padding. A ReLU function is then applied to the outputs to eliminate the negative values:
$$\mathrm{ReLU}(x) = \max(0, x) \tag{3}$$
In the second stage, a max-pooling layer is applied to subsample the feature map and reduce its dimensions. The max-pooling layer has a filter of size 2 × 2 with stride 2. This layer halves the width and height of the feature map but keeps its depth unchanged, so the resulting feature map has a size of 14 × 14 × 20. After that, a second convolutional layer with 50 different filters of size 5 × 5, with stride 1 and zero padding, is used, and a ReLU function is again applied to the outputs. The next max-pooling layer then subsamples with a filter of size 2 × 2 and stride 2. It also halves the width and height of the feature map while keeping its depth unchanged, so the resulting feature map has a size of 7 × 7 × 50. At the end of the network, two fully connected layers are used. The first fully connected layer has 500 neurons. All 2450 (7 × 7 × 50) outputs of POOL2 are therefore vectorized and connected to the 500 neurons of the FC1 layer, whose activation function is ReLU. Finally, the second fully connected layer, which consists of two neurons, is used to classify the output into the two classes, drowsy or vigilant.
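The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "zero padding" means size-preserving ("same") padding, which is the reading consistent with the 14 × 14 × 20 and 7 × 7 × 50 sizes given in the text; the helper names are illustrative.

```python
def conv_same(shape, filters):
    """Stride-1 convolution with size-preserving zero padding."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, size=2, stride=2):
    """Output shape of max-pooling with a `size` window moved by `stride`."""
    height, width, depth = shape
    return ((height - size) // stride + 1, (width - size) // stride + 1, depth)


shape = (28, 28, 1)            # resized grayscale face image
shape = conv_same(shape, 20)   # CONV1 -> (28, 28, 20)
shape = max_pool(shape)        # POOL1 -> (14, 14, 20)
shape = conv_same(shape, 50)   # CONV2 -> (14, 14, 50)
shape = max_pool(shape)        # POOL2 -> (7, 7, 50)
flattened = shape[0] * shape[1] * shape[2]  # inputs to FC1
print(shape, flattened)
```

Without padding, a 5 × 5 convolution would shrink a 28 × 28 input to 24 × 24, which would not match the sizes reported in the text.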
The activation function of the second fully connected layer is softmax, $\sigma : \mathbb{R}^k \to \mathbb{R}^k$, where
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \quad i = 1, \dots, k \tag{4}$$
Using the softmax function, the network assesses the probability that a sample belongs to each of the two classes (drowsy/vigilant). A summary of the proposed CNN architecture is given in Table I.
The proposed method was implemented in Python 3.6. It used the OpenCV library for several tasks, such as converting videos to images, converting images to grayscale, and executing Haar-cascade for face detection. Moreover, it used the Keras library to implement the CNN classifier, and the Scikit-learn library to evaluate the model.
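The architecture and the softmax output described above can be sketched as follows. The softmax helper is plain Python; the model builder is one possible Keras rendering of the described layers, using the `tensorflow.keras` API, and is not the authors' code. The `padding="same"` setting follows from the reported feature-map sizes, while the loss function is an assumption because the paper does not name it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_model():
    """LeNet-style network as described in the text (requires TensorFlow)."""
    from tensorflow.keras import layers, models  # lazy: heavy dependency

    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(20, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Conv2D(50, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        layers.Dense(2, activation="softmax"),  # drowsy vs. vigilant
    ])
    # The loss is an assumption; the text names only the optimizers tried.
    model.compile(optimizer="sgd", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Example raw scores for (drowsy, vigilant); the values are made up.
probabilities = softmax([2.0, 0.0])
print(probabilities)
```

The two softmax outputs sum to one, so the first can be read directly as a drowsiness score for the frame.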

III. RESULTS AND EXPERIMENTS

A. Evaluation criteria
In classification problems, the predicted results are compared to the real labels. In this paper, we classify videos as drowsy or vigilant. The common criteria for evaluating classification models are accuracy, precision, recall, and F1-score, which are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
In the above equations, True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN) stand for the positive (drowsy) samples that the model predicted as positive, the positive samples that the model wrongly predicted as negative (conscious), the negative samples that the model predicted as positive, and the negative samples that the model predicted as negative, respectively. The accuracy of a model is the percentage of correctly predicted samples. The precision is the fraction of the samples predicted as positive that are truly positive, and the recall is the fraction of the truly positive samples that the model discovers. Since there is a trade-off between precision and recall, their harmonic mean, called the F1-score, is considered a more meaningful criterion than either precision or recall alone.
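The four criteria can be computed directly from the confusion-matrix counts, as in the sketch below. The function name and the example counts are invented for illustration and are not results from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0  # defined as zero when nothing is predicted or found
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts: 50 drowsy (positive) and 50 vigilant (negative) samples.
metrics = classification_metrics(tp=45, fn=5, fp=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Scikit-learn offers equivalent functions such as `accuracy_score` and `precision_recall_fscore_support` in `sklearn.metrics`.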
To evaluate and compare the proposed method, we used stratified k-fold cross validation (STKF), a modification of k-fold cross validation, with k = 5. The output images of the preprocessing phase are divided into 5 folds in such a way that each fold contains almost equal numbers of positive (drowsy) and negative (vigilant) samples. In the first iteration, the model is trained on 4 folds and tested on the remaining fold. In the second iteration, one of the folds previously used for training is chosen as the test set and the other 4 folds form the training set. The procedure is repeated until every fold has been used once as the test set. In each iteration, 20% of the training data was held out as validation data.
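The splitting procedure described above can be sketched in plain Python as follows. This is a simplified stand-in written for illustration (scikit-learn provides `StratifiedKFold` for the same purpose); the function names, the fixed seed, and the unstratified way the validation subset is drawn are assumptions.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal out round-robin
    return folds


def train_val_test_splits(labels, k=5, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index lists, one triple per fold."""
    folds = stratified_folds(labels, k, seed)
    rng = random.Random(seed)
    for test_fold in range(k):
        test = folds[test_fold]
        rest = [i for f, fold in enumerate(folds) if f != test_fold
                for i in fold]
        rng.shuffle(rest)
        n_val = int(len(rest) * val_fraction)  # 20% of the training data
        yield rest[n_val:], rest[:n_val], test


# Toy label list: 50 drowsy and 50 vigilant samples.
labels = ["drowsy"] * 50 + ["vigilant"] * 50
splits = list(train_val_test_splits(labels))
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

With 100 balanced samples, each iteration uses 64 training, 16 validation, and 20 test samples, and every sample appears in exactly one test fold.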

B. Tuning Hyper-parameters
To find the best hyper-parameters for the proposed method, the following settings were tried:
• Batch size: 64, 200, 300
• Epoch number: 15, 50, 100, 500
• Optimizer: stochastic gradient descent (sgd), Adaptive Moment Estimation (adam)
The performance of the model was evaluated for each setting. The results are shown in Figure 3. According to the results, the last setting, with the sgd optimizer, a batch size of 200, and 500 epochs, shows significantly better accuracy than the other settings.
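One way to enumerate such settings is a simple grid over the listed values, as sketched below. Whether the authors evaluated the full grid is not stated, so this is an assumption; `evaluate` is a hypothetical callable that would train the model with a given setting and return its validation accuracy.

```python
from itertools import product

batch_sizes = [64, 200, 300]
epoch_counts = [15, 50, 100, 500]
optimizers = ["sgd", "adam"]

# Every combination of optimizer, batch size, and epoch number.
settings = [
    {"optimizer": optimizer, "batch_size": batch_size, "epochs": epochs}
    for optimizer, batch_size, epochs in product(
        optimizers, batch_sizes, epoch_counts)
]


def select_best(candidates, evaluate):
    """Return the candidate setting with the highest score from `evaluate`."""
    return max(candidates, key=evaluate)


print(len(settings))  # 2 optimizers x 3 batch sizes x 4 epoch counts
```

A call such as `select_best(settings, evaluate)` would then pick the setting to use for the remaining experiments.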

C. Evaluation of the proposed method on the UTA-RLDD dataset with 5-fold STKF
In this experiment, since executing the code on all frames of each sample in the database is very time consuming, a random frame from each video is used as the input of the model. Also, to evaluate the model performance with balanced data, the accuracy of the model is assessed using stratified

D. Comparison of the Proposed Method with a Previously Proposed Method
To further evaluate the proposed model, its performance is compared to that of the method proposed by Reddy et al. [10]. For a fair comparison, all conditions must be the same. Thus, the method proposed in [10] was implemented and executed on the UTA-RLDD dataset. Since executing the code on all frames of each sample in the database is very time consuming, a random frame from each video is used as the input of the model. The performance of the method in [10] is evaluated using stratified 5-fold cross validation. Its results are markedly lower than those of our proposed method: its precision is in the range 0.62 to 0.65, its recall in the range 0.62 to 0.64, its F1-score in the range 0.61 to 0.63, and its accuracy in the range 0.62 to 0.64. To compare the proposed method with the method in [10], the average of each evaluation criterion is calculated over all folds; the averages are presented in Table II. Table II shows that the proposed method significantly outperforms the method in [10]. Our proposed method improves accuracy, F1-score, recall, and precision by up to 0.288, 0.295, 0.288, and 0.293, respectively. Moreover, the proposed method processes each video at a speed of 8.4 frames per second, which is faster than the method in [10]. Both methods were executed on a system with an Intel Core i5 CPU (2.60 GHz × 4), 8 GB of memory, and the Ubuntu 16.04 LTS operating system.

E. Evaluation on Custom Dataset
To further investigate the proposed method, it was also executed on the custom dataset. Applying the preprocessing phase to the custom dataset videos yielded 200 frames of alert participants and 200 frames of sleepy participants. These 400 frames were split into 80% for training and 20% for testing, and 20% of the training data was used as validation data. The accuracy on the training and validation sets in each epoch is shown in Figure 5. The accuracy has an increasing trend on both sets. After the learning procedure finished at 500 epochs, the final accuracy was 86% on the training set, 70% on the validation set, and 76% on the test set. The accuracy of the model on the custom dataset suggests that the model is effective on data collected with a webcam and can handle various light conditions.
6b, and 6c, respectively. It is obvious that the drowsiness score in the sleepy samples is higher than in the alert sample.

IV. CONCLUSION
In this paper, we proposed a real-time method based on image processing and deep learning techniques that can detect driver drowsiness quickly and accurately. The proposed method consists of five stages. It takes the video of a driver and converts it to images. The RGB images are then converted to grayscale for further processing. The face of the driver is detected using the Haar-cascade method and cropped from the images. Afterwards, the size of the face image is reduced. Finally, a CNN is used as a feature extractor and classifier to predict the probability of the driver's drowsiness. LeNet is a simple convolutional neural network that, to the best of our knowledge, has not been used for drowsiness detection so far. The architecture of the CNN is inspired by the LeNet architecture and implemented using the Keras library in Python. The evaluation of the model using stratified 5-fold cross validation on UTA-RLDD shows that the model achieves high values of the evaluation criteria, with average accuracy, precision, recall, and F1-score of 0.918, 0.928, 0.920, and 0.920, respectively. Moreover, we presented a comparison with a previous method and showed that the proposed method outperforms it. In addition, on a custom dataset our method achieved 86% accuracy on the training set, 70% on the validation set, and 76% on the test set. In conclusion, we believe that our method has great potential in practical applications and, if implemented, can contribute to reducing road accidents and increasing safety in intelligent transportation systems. In future work, greater social and technical diversity in the data could further increase the performance of this method.

Fig. 3: Accuracy for every setting of hyper-parameters.The first row shows the type of optimizer, the second row shows the batch size, and the last row shows the epoch number.

Fig. 5: Train and validation accuracy of proposed method on custom dataset.

Fig. 6: Performance of the model for various facial expressions.
Figure 6 shows some samples of the model's output for various facial expressions. The rectangular boxes mark the face position, which is the output of Haar-cascade. The drowsiness score, calculated by the CNN, is placed at the bottom of the frame. An alert person is shown in Figure 6a, which leads to a low drowsiness score. By revealing drowsiness

TABLE I: Summary of the proposed CNN architecture.

TABLE II: Comparison of the proposed method with a previous method. The best result for each criterion is shown in bold.