Adjusting eye aspect ratio for strong eye blink detection based on facial landmarks

Blink detection is an important technique in a variety of settings, including facial movement analysis and signal processing. However, automatic blink detection is very challenging because of the blink rate. This research work proposed a real-time method for detecting eye blinks in a video series. Automatic facial landmarks detectors are trained on a real-world dataset and demonstrate exceptional resilience to a wide range of environmental factors, including lighting conditions, face emotions, and head position. For each video frame, the proposed method calculates the facial landmark locations and extracts the vertical distance between the eyelids using the facial landmark positions. Our results show that the recognizable landmarks are sufficiently accurate to determine the degree of eye-opening and closing consistently. The proposed algorithm estimates the facial landmark positions, extracts a single scalar quantity by using Modified Eye Aspect Ratio (Modified EAR) and characterizing the eye closeness in each frame. Finally, blinks are detected by the Modified EAR threshold value and detecting eye blinks as a pattern of EAR values in a short temporal window. According to the results from a typical data set, it is seen that the suggested approach is more efficient than the state-of-the-art technique.


INTRODUCTION
Eye blinks detection technology is essential and has been applied in different fields such as the intercommunication between disabled people and computers (Królak & Strumiłło, 2012), drowsiness detection (Rahman, Sirshar & Khan, 2016), the computer vision syndromes (Al Tawil et al., 2020;Drutarovsky & Fogelton, 2015), anti-spoofing protection in face recognition systems (Pan et al., 2007), and cognitive load (Wilson, descriptors (Malik & Smolka, 2014). Using the effective Eye Aspect Ratio (Maior et al., 2020) and face landmarks (Mehta et al., 2019) methods, we developed our own algorithm to perfect it. Another method for blink detection is based on template matching (Awais, Badruddin & Drieberg, 2013). The templates with open and/or closed eyes are learned and normalized cross-correlation.
Eye blinks can also be detected by measuring ocular parameters, for example by fitting ellipses to eye pupils (Bergasa et al., 2006) using the modification of the algebraic distance algorithm for conic approximation. The frequency, amplitude, and duration of mouth and eye opening and closing play an important role in identifying a driver's drowsiness, according to Bergasa et al. (2006). Adopting EAR as a metric to detect blink in Rakshita (2018) yields interesting results in terms of robustness. Based on the previous research result, the blinks rate is previously determined using the EAR threshold value 0.2. Due to the large number of individuals involved and the variation and features between subjects, such as natural eye openness, this approach was considered impractical for this study.
The most significant contributions made by this paper are as follows: (1) Blink types are automatically classified using a method that defines a new threshold based on the Eye Aspect Ratio value as a new parameter called Modified EAR. A detailed description of this method is provided in the paper. (2) Adjusted Eye Aspect Ratio for Strong Blink Detection based on facial landmarks was performed in this experiment. Then, we analyze and discuss in detail the experimental results with public datasets including the Talking Face and Eyeblink8 datasets. (3) We proposed a new Eye Video S1 dataset, and our dataset has the unique characteristics of people with small eyes and glasses. (4) Our experimental results show that using the proposed Modified EAR as a new threshold can improve blink detection results in the experiment.
Furthermore, this research work is organized as follows. Related work and the approach we intend to use in this study describes in the Materials and Methods section. Section 4 describes the experiment and results. A detailed description of the findings of our study is provided in Section 4. Finally, conclusions are drawn, and future work is proposed in Section 5.

Eye blink detection with facial landmarks
Eye blinking is a suppressed process that involves the rapid closure and reopening of the eyelid. Multiple muscles are involved in the blinking of the eyes. The orbicularis oculi and levator palpebrae superioris are the two primary muscles that regulate eye closure and opening. Blinking serves some important purposes, one of which is to moisten the corner of an individual's eye. Additionally, it cleans the cornea of the eye when the eyelashes are unable to catch all of the dust and debris that enter the eye. Everyone must blink to spread tears over the entire surface of the eyeball, and especially over the surface of the cornea. Blinking also performs as a reflex to prevent foreign objects from entering the eye. The goal of facial landmark identification is to identify and track significant landmarks on the face. Face tracking becomes strong for rigid facial deformation and not stiff due to head movements and facial expressions. Furthermore, facial landmarks were successfully applied to face alignment, head pose estimation, face swapping (Chen et al., 2019), and blink detection (Cao et al., 2021). Kim et al. (2020) implemented semantic segmentation to accurately extract facial landmarks. Semantic segmentation architecture and datasets containing facial images and ground truth pairs are introduced first. Further, they propose that the number of pixels should be more evenly distributed according to the face landmark in order to improve classification performance. Utaminingrum et al. (2021) suggested a segmentation and probability calculation for a white pixel analysis based on facial landmarks as one way to detect the initial position of an eye movement. Calculating the difference between the horizontal and vertical lines in the eye area can be used to detect blinking eyes. Another research study (Navastara, Putra & Fatichah, 2020) reported that the features of eyes we extracted using a Uniform Local Binary Pattern (ULBP) and the Eyes Aspect Ratio (EAR).
In (1) Face detection: Face detection is the first method that locates a human face and returns a value in x, y, w, h which is a rectangle.
(2) Face landmark: After getting the location of a face in an image, we have to through points inside the rectangle. This annotation is part of the 68-point iBUG 300-W dataset on which the Dlib face landmark predictor is trained. Whichever data set is chosen, the Dlib framework can be used to train form predictors on the input training data.

Eye Aspect Ratio (EAR)
Eye Aspect Ratio (EAR) is a scalar value that responds, especially for opening and closing eyes (Sugawara & Nikaido, 2014). A drowsy detection and accident avoidance system based on the blink duration was developed by Pandey & Muppalaneni (2021) and their work system has shown the good accuracy on yawning dataset (YawDD). To distinguish between the open and closed states of the eye, they used an EAR threshold of 0.3. Figure 2 depicts the progression of time it takes to calculate a typical EAR value for one blink. During the flashing process, we can observe that the EAR value increases or decreases rapidly. According to the results of previous studies, we used threshold values to identify the rapid increase or decrease in EAR values caused by blinking. As per previous research, we know that setting the threshold at 0.2 is beneficial for the work at hand. In addition to this approach, many additional approaches to blink detection using image processing techniques have been suggested in the literature. However, they have certain drawbacks, such as strict restrictions on image and text quality, which are difficult to overcome. Based on the previous research result, we selected EAR threshold of 0.2 and 0.3 in our experiment. EAR formula is insensitive to the direction and distance of the face, thus providing the benefit of identifying faces from a distance. EAR value calculates by substituting six coordinates around the eyes shown in Fig. 3 into Eqs.
Equation (1) describes the EAR equations, where P1 to P6 represent the 2D landmark positions on the retina. As illustrated in Fig. 3, P2, P3, P5, and P6 were used to measure eye height, while P1 and P4 were used to measure eye width. When the eyes are opened, the EAR of the eyes remains constant, but when the eyes are closed, the EAR value rapidly decreases to almost zero, as shown in Fig. 3B.

Modified Eye Aspect Ratio (Modified EAR)
Based on the fact that people have different eye sizes, in this study, we recalculate the EAR (Huda, Tolle & Utaminingrum, 2020) value used as a threshold. In this research, we proposed the modified eye aspect ratio (Modified EAR) for closed eyes with Eq. (3) and open eyes with Eq. (4).
From Eqs. (3) and (4) we calculate our Modified EAR in Eq. (5) Equation (6) depicts the EAR output range while the eyes are open and closed. When the eyes are closed, the EAR value will be close to 0, but the EAR value may be any integer larger than 0 when the eyes are open. Following the detection of the face, the eye area is identified using the facial landmarks dataset. We can identify 68 landmarks (Yin et al., 2020) on the face using this dataset. A corresponding index accompanies each landmark. The targeted area of the face is identified via the application of the index criteria. Point index for two eyes as follows: (1). Left eye: (37,38,39,40,41,42), (2). Right eye: (43, 44, 45, 46, 47, 48) (Ling et al., 2021;Tang et al., 2018). After extracting the eye region, it is processed for detecting eye blinks. The eye region discovery is made at the beginning stage of the system.

Eye blink detection flowchart
Our research detects the blinks with the help of two lines. Lines are drawn horizontally, and vertically splitting the eye. The act of temporarily closing the eyes and moving the eyelids is referred to as blinking. Blinking eyes is a rapid natural process. We can assume that the eye is closed/blinked when: (1) Eyeball is not visible, (2) eyelid is closed, (3) the upper and lower eyelids are connected. For an opened eye, both vertical and horizontal lines are almost identical, while for a closed eye, the vertical line becomes smaller or almost vanished. This research study sets a threshold value based on Modified EAR equations. If the EAR is smaller than the Modified EAR Threshold for 3 s, we can consider eyes blink. In our experiment, we implement three different threshold value 0.2, 0.3, and modified the EAR threshold for each video dataset.
TalkingFace dataset consists of one video recording of one subject talking in front of the camera and making different facial expressions. This video clip is captured with 25 fps with a resolution of 720 × 576 and contains 61 annotated blinks (Drutarovsky & Fogelton, 2015;Fogelton & Benesova, 2016).
Modified EAR threshold equation implemented for each dataset. After calculation, the EAR threshold for the Talking face dataset is 0.2468, Eyeblink Video S4 0.2923, Eyeblink Video S8 0.2105, and 0.2103 for Eye Video S1. The dataset information explains in Table 1.
Eye Video S1 dataset labeling procedure using Eyeblink annotator 3.0 by Fogelton & Benesova (2016). The annotation tool uses OpenCV 2.4.6. Eye Video S1 were captured at 29.97 fps and has a length of 1829 frames with 29.6 MB. Our dataset has the unique characteristics of people with small eyes and glasses. The environment is the people who drive the car. This dataset can be used for further research. It is difficult to find a dataset of people with small eyes, wearing glasses, and driving cars based on our knowledge. We collect the video from the car dashboard camera in Wufeng District, Taichung, Taiwan.

Tables 2 and 3 explains the statistics on the prediction and test set for each video dataset.
Talking Face dataset capture with 30 fps, 5,000 frames, and the duration is 1,667.67 s. Statistics on the prediction set show that the number of closed frames processed is 292, and the number of blinks is 42 for EAR threshold 0.2. However, statistics on the test set describe the number of closed frames as 153, and the number of blinks is 61. This experiment exhibits an accuracy of 96.85% and an AUC of 94.68%. The highest AUC score for the Talking Face dataset was achieved while using Modified EAR threshold 0.2468; it obtains 96.85%. Furthermore, Eyeblink8 dataset Video S4 processed 5,454 frames with 30 fps and duration of 181.8 s. The maximum AUC obtains while implementing our Modified EAR threshold was 0.2923; it achieved 91.17%. Moreover, Eyeblink8 dataset Video S8 contains 10,712 frames with 30 fps and durations 357.07 s. This dataset also got the best AUC when employed Modified EAR threshold 0.2105, it achieves 96.60%. Eyeblink8 dataset Video S8 exhibits the minimum result of 21.1% accuracy and 60.2% AUC while employed EAR threshold 0.2. In this study, we prioritize AUC because of some reasons. First, the AUC is scale-invariant and assesses how well the predictions are ordered rather than how well they are ordered in real numbers. Second, AUC is not affected by categorization limits. It assesses the prediction accuracy of the models, regardless of the categorization criteria used to assess them.
The talking Face video dataset exhibits 94% accuracy and 96.85% AUC. Followed by Eyeblink8, Video S8 achieves 95% accuracy and 96% AUC. Further Eyeblink8 Video S4 In Table 3, Eye Video S1 dataset processed 1829 frames with 29.97 fps and duration of 61.03 s. The maximum AUC 0.4931% and accuracy 93.88% were obtains while implementing our Modified EAR threshold 0.2103. Figures 5A-5C shows the confusion matrix for the Talking Face dataset. Figures 5D-5F explains the confusion matrix for the Eyeblink8 dataset Video S4. The confusion matrix for Eyeblink8 dataset Video S8 is drawn in Figs. 5G-5I. Figure 5I describes the false positive (FP) 2 out of 107 positive labels (0.0187%) and false negative (FN) 523 out of 10556 negative labels (0.0495%).
Tables 4 and 5 represents the evaluation performance of Precision, Recall, and F1 Score for each dataset in detail. The experimental results show that our proposed method, Modified EAR has the best performance compared to others. Furthermore, researchers have only used 0.2 or 0.3 as the EAR threshold, even though not all people's eye sizes are the same. Therefore, it is better to recalculate the EAR threshold to determine whether the eye is closed or open to identify the blink more precisely.
Based on Tables 4 and 5 we can conclude that, when we recalculate the EAR threshold value the experiment achieves the best performance for all datasets. The TalkingFace dataset applies an EAR threshold of 0.2468, Eyeblink8 Video S4 uses an EAR threshold of 0.2923, Eyeblink8 Video S8 uses 0.2105 as an EAR threshold, and Eye Video S1 uses 0.2103 as an EAR threshold value. Our research experiment processes video frame by frame and detects eye blink for every three frames shown in Fig. 6. The experiment results just show the beginning, middle, and finish frames of blinks. Figure 6 illustrates the Eye Video S1

DISCUSSION
In our research work, the AUC score is more important than accuracy because of several reasons as follow Lobo, Jiménez-valverde & Real (2008): (1) Our experiment is concerned with ranking predictions, not with producing well-calibrated probabilities.
(2) The video dataset is heavily imbalanced. It was discussed extensively in the research paper by Saito & Rehmsmeier (2015). The intuition is the following: the false-positive rate for highly imbalanced datasets is pulled down due to many true negatives.
(3) We concentrated on classes that were both positive and negative. If we are as concerned with true negatives as we are with true positives, it makes sense to utilize AUC. Figure 7 describes the EAR and Error analysis of the Talking Face video dataset. Consider that the optimal slope of the linear regression is m >= 0. Our experiment plots the whole data and obtains m = 0. Blinking is infrequent and has a negligible effect on the overall EAR measurement trend. Moreover, Cumulative error is meaningless for blinks due to its delayed impact. Additionally, errors perform more like properly distributed data than the EAR values explain in Fig. 7. In this paper, the performance of the proposed eye blink detection technique is evaluated by comparing detected blinks with ground-truth blinks using the two standard datasets described above. The output examples can be classified into three classes. True positive F 1 = 2 × Precision × Recall Precision + Recall (9) Figure 8 represents the first blink and second blink analysis for the Talking Face video Dataset. The lower bound is better for estimating blinks. The lower bound of calibration show 0.1723 for ear value. Detects 6 frames in details as follows: 4 frames for first blink and 2 frames for second blink. Furthermore, the lower bound of errors obtains −0.0817 error value. Detects 12 frames with 8 frames for the first blink and 4 frames for second blink. Also, analyzing with z_limit =2 on calibration does better than running on errors. The lower bound of calibration obtains a 0.1723 EAR value. It detects 12 frames with 8 frames for the first blink, 4 frames for the second blink. The lower bound of errors exhibits −0.0817 error value. It detects 12 frames with 8 frames for the first blink, 4 frames for the second blink. Using the mentioned dataset, the proposed method outperforms methods in previous research. The statistics are listed in Table 6. We obtain the highest Precision, 99%, for all datasets. Moreover, our proposed method achieves 97% Precision on Eye Video S1 dataset.

CONCLUSIONS
This paper proposes a method to automatically classify blink types by determining the new threshold based on the Eye Aspect Ratio value as a new parameter called Modified EAR. Adjusted Eye Aspect Ratio for strong eye blink detection based on facial landmarks. We analyzed and discussed in detail the experiment result with the public dataset and our dataset Eye Video S1. Our work proves that using Modified EAR as a new threshold can improve blink detection results.
In the future, we will focus on the dataset that has facial actions, including smiling and yawning. Both basic and adaptive models lack facial emotions such as smiling and yawning. Machine learning methods may be a viable option, and we will implement SVM in our future research.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This paper was supported by the Ministry of Science and Technology, Taiwan (MOST-110-2927-I-324-50, MOST-110-2221-E-324 -010, MOST-109-2622-E-324 -004) and the EU Horizon 2020 program RISE Project ULTRACEPT under Grant 778062. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.