Blink-To-Live eye-based communication system for users with speech impairments

Eye-based communication languages such as Blink-To-Speak play a key role in expressing the needs and emotions of patients with motor neuron disorders. Most invented eye-based tracking systems are complex and not affordable in low-income countries. Blink-To-Live is an eye-tracking system based on a modified Blink-To-Speak language and computer vision for patients with speech impairments. A mobile phone camera tracks the patient’s eyes by sending real-time video frames to computer vision modules for facial landmarks detection, eye identification and tracking. There are four defined key alphabets in the Blink-To-Live eye-based communication language: Left, Right, Up, and Blink. These eye gestures encode more than 60 daily life commands expressed by a sequence of three eye movement states. Once the eye gestures encoded sentences are generated, the translation module will display the phrases in the patient’s native speech on the phone screen, and the synthesized voice can be heard. A prototype of the Blink-To-Live system is evaluated using normal cases with different demographic characteristics. Unlike the other sensor-based eye-tracking systems, Blink-To-Live is simple, flexible, and cost-efficient, with no dependency on specific software or hardware requirements. The software and its source are available from the GitHub repository (https://github.com/ZW01f/Blink-To-Live).


Related work
Eye tracking technologies are utilized in many sciences such as cognition, psychology, computer, and medicine to digitize how people interact with their living environment 27,28 . Tracking human eyes, which implies recording their different movements, is essential for many pervasive applications such as eye-based communications and computer environment interactions 29,30 . Three types of eye-tracking approaches were introduced previously (see Fig. 1): one relies on devices attached to the eyes directly, such as special lenses or glasses with infrared sensors that track and record eye positions. The second approach targets the eyes as a source of electrical field that can be detected in the darkness even when the eye is closed and measures the electrical potentials by positioning some electrodes around human eyes, the example of this approach is the electrooculogram (EOG) technique 22,27 . Since the previous two mentioned approaches rely on some sensors such as infrared or electrodes, they can be grouped under the term sensor-based eye tracking technologies. The other eye-tracking methods rely on computer-vision techniques for detecting and tracking the human eyes in the captured video frames by a camera in real-time without direct contact with human eyes or using extra hardware sensors [31][32][33] .
One proposed human-computer interaction method positions nearly five electrodes around the patient's eyes to detect blinks and eye movements, which are then used to control the cursor and different desktop applications 34,35. Some applications have a user interface with a keyboard that lets patients look at the intended letters and select them using eye blinks and movements held for a predefined time slot (i.e., one second). Also, electromyography systems can detect facial muscle signals and control the mouse cursor and computer applications accordingly 36. Eye gestures and signals from facial muscles have been combined to increase the reliability of applications that support people with speech impairments in communication and cursor control. While combining these two techniques achieves high accuracy, it still suffers from slow communication speed and high cost 22. The direct-eye contact approach targets hardware devices such as eyeglasses with an infrared sensor that translate eye blinks into a sequence of alphabets using Morse code, with variations in which the alphabets are displayed on an LCD screen 23 or the synthesized phrases are displayed on a mobile phone screen 22. EyeLive 37 is another eye-tracking system based on measuring eye reflections with an infrared sensor. The system has a user interface with a keyboard that lets the patient select the intended letters using eye gestures. While the infrared sensor facilitates eye gesture detection by increasing the light reflected from the eyes, it has some limitations: sunlight interferes with it in outdoor use, and the position of the eye relative to the infrared sensor affects the tracking performance. The direct-eye contact approaches incur the extra cost of external hardware devices for eye tracking and are less reliable and robust because of inaccurate sensor measurements.
Some machine learning techniques are introduced to detect and track the human eyes, such as principal component analysis, which detects the first eye's six principal components in the captured images. The eye's pupil position is classified using the artificial neural network model 38 . The system required a special hardware configuration such as a head-mounted camera and wearing glasses that complicate the communication process. Some machine learning models are used to predict the eye's directions, gaze angles, and movements, such as Adaptive Linear Regression, ALR 39 , K Nearest Neighbors, KNN 40 , and Random Forest, RF 41 , which suffer from inaccurate prediction accuracy due to insufficient training samples, parameters setting, noisy images, or complex feature extraction process 31 . Also, Support Vector Machine, SVM, is used to classify the five eye directions after the eye region is detected by modifying the characteristics of the active appearance model 42 . A deep learning Convolution Neural Network, CNN, with multi-layer architecture, is used to predict different eye gestures based on training the CNN with a large number of samples with known eye states as predicted labels, which can achieve good performance but require a large number of training samples, long training time, and tune the model's parameters accordingly 43 . The eye-tracking technology is applied in non-communication tasks such as understanding the factors that influence comprehension by investigating how developers comprehend query syntax and method syntax in Language-Integrated Query (LINQ), which is a popular technology for querying data in .NET programming languages such as C# and VB.NET 44 . Also, the eye-tracking technology is integrated with Virtual Reality (VR) head-mounted displays for rendering the VR senses, assisting the user interaction with the environment 45 or supporting the vision screening 46 .

Materials and methods
Blink-To-Live is a mobile application based on computer vision techniques for patients with motor neuron disorders such as ALS and PLS. These patients gradually develop speech impairment symptoms. In the final paralyzed stage, eye gestures are the only methods to initiate communication. The Blink-To-Live communication system relies on the four eye gestures: Left, Right, Up, and Blink (Table 1), defining the language's key alphabets for encoding more than 60 daily life communicating sentences, such as those presented in Tables 2 and 3.
As depicted in Fig. 2, Blink-To-Live has two basic system components: a mobile application developed with the Google Flutter 47 framework and a backend Python module for video frame analysis and processing. The patient's interaction with the Blink-To-Live system starts with a caregiver opening the phone camera to track the patient's eye gestures. Once the video frames are captured correctly in real-time, they are automatically sent to the computer vision modules to detect and track the patient's eye movement states.
The first module is facial landmarks detection, which has two sub-modules: face detection and facial landmarks prediction. This module aims to detect the patient's face and extract the locations of 68 facial coordinates representing different facial structures such as eyes, mouth, nose, etc. The next module detects the patients' eyes and tracks their movements according to the Blink-To-Live four key alphabets: Left, Right, Up, and Blink. The patient's communicated speech is generated according to the predefined dictionary of sentences encoded using a combination of three eye movement states. Once the sentences are synthesized, a translation module displays the phrases in the patient's native language, such as Arabic, German, etc. Also, the text-to-speech module generates a corresponding lifelike speech that helps the patients to communicate easily and efficiently. The detailed implementations of each module will be discussed in the following sections.

Blink-To-Live eye-based communication language.
Blink-To-Live eye-based communication language has four eye alphabets (i.e., Blink (B), Left (L), Right (R), and Up (U)). We modified the previously proposed Blink-To-Speak eye language according to the four eye movements recognized by our system to simplify the patient communication process by expressing more statements with shorter sequences of eye gestures and fewer transitions. In the Blink-To-Live eye-based communication language, each communicated sentence is expressed as a sequence of three eye states, and each state can be one of the four states (B, L, R, and U). The four defined states generate exactly 64 daily life-communicated phrases for ALS/PLS patients (Tables 2 and 3). This configuration can be defined mathematically as follows: given $n$ recognized eye states and a group of communicated sentences in which each sentence is expressed by $k$ words (a sequence of $k$ states), the total number of uniquely communicated sentences is $n^k$. Increasing the number of recognized eye states $n$ or the number of expressed words $k$ adds more statements to the modified Blink-To-Speak language. We found that when $n = 4$ and $k = 3$, the total number of defined statements equals $4^3 = 64$, which is sufficient to express most of the required daily life communication speech for ALS/PLS patients with less training time. Sequences of the same eye state (i.e., all three states are left, right, up, or blink) with no transition in between are devoted to basic patient needs such as food, water, toilet, I am okay, etc. Also, the sequences with few transitions are dedicated to the most important commands, such as changing a diaper, calling a relative, taking medication, etc. Usually, a sequence of the same eye state is expressed faster by patients and takes less training time than a sequence that transitions from one eye state to another.
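The counting argument can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `all_codes` and `transitions` are hypothetical and are not taken from the Blink-To-Live source code. It enumerates every sequence of three states over the four alphabets and groups the sequences by how many state transitions they contain.

```python
from itertools import product

EYE_STATES = ("B", "L", "R", "U")  # Blink, Left, Right, Up
SEQUENCE_LENGTH = 3                # eye states per sentence


def all_codes(states=EYE_STATES, k=SEQUENCE_LENGTH):
    """Return every length-k sequence over the given eye states (n**k of them)."""
    return ["".join(seq) for seq in product(states, repeat=k)]


def transitions(code):
    """Count how many times the eye state changes inside one code."""
    return sum(1 for a, b in zip(code, code[1:]) if a != b)


codes = all_codes()
# Codes with no transition are the ones the text reserves for basic needs.
no_transition = [c for c in codes if transitions(c) == 0]
```

With four states and three positions, the enumeration yields 64 codes: 4 with no transition, 24 with one transition, and 36 with two transitions.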
The proposed Blink-To-Live eye-based communication language is simpler, more flexible, and more usable, since the patient's eyes can be exhausted by the long sequences of eye gestures and transitions defined in the original Blink-To-Speak book. Table 4 shows the key differences between the Blink-To-Speak 24 and Blink-To-Live systems.
Facial landmarks detection module. Facial landmarks mark important parts of the face, such as the nose, eyes, eyebrows, mouth, etc. For our system, the most important facial structure is the patient's eyes. The facial landmarks module in our proposed framework has two basic steps: detecting the face in the images collected from video frames and then localizing the important facial structures within the face region of interest.
Face detection module. Face detection is accomplished with a pre-trained model called Histogram of Oriented Gradients with Linear SVM (HOG + SVM). HOG descriptor of a human face can be built by dividing the face image into small blocks, and for each block, the gradients (i.e., small changes in the pixel values in terms of x and y directions) are computed. Then, the histogram is generated for each block separately. Finally, the gradient vectors are normalized and combined into a single HOG feature descriptor fed into a linear SVM for face/nonface object classification 48 .
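To make the descriptor construction concrete, the following pure-Python sketch computes an unsigned gradient-orientation histogram per block, normalizes it, and concatenates the results. It is a simplified illustration under stated assumptions, not the dlib detector: the function names, the 9-bin layout, the 8-pixel block size, and the per-block L2 normalization are all choices made here, and the linear SVM that would consume the descriptor is omitted.

```python
import math


def cell_histogram(cell, bins=9):
    """Unsigned gradient-orientation histogram for one grayscale block.

    `cell` is a list of rows of pixel intensities; border pixels are skipped
    because central differences need both neighbours.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # change along x
            gy = cell[y + 1][x] - cell[y - 1][x]   # change along y
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vec) + eps ** 2)
    return [v / norm for v in vec]


def hog_descriptor(image, cell=8):
    """Concatenate normalized histograms over non-overlapping blocks."""
    features = []
    for top in range(0, len(image) - cell + 1, cell):
        for left in range(0, len(image[0]) - cell + 1, cell):
            block = [row[left:left + cell] for row in image[top:top + cell]]
            features.extend(l2_normalize(cell_histogram(block)))
    return features
```

Full HOG implementations usually normalize over overlapping groups of cells and interpolate votes between neighbouring bins; those refinements are left out here to keep the sketch short.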
Facial landmarks detection module. Given that the face region is detected, the next step is detecting the facial landmarks by localizing and labeling the mouth, left and right eyebrows, left and right eyes, nose, and jaw. A set of manually labeled key facial structures in terms of x and y coordinates, along with the pixels' intensity values and the prior probabilities of the distance of the pixel values corresponding to the facial landmarks pairs, are fed into an ensemble model of regression trees 49 to train a model for facial landmark detector. The model is implemented in the dlib library 50 and trained on iBUG 300-W dataset 51 for estimating the locations of 68 facial landmark coordinates in terms of x and y values. Our proposed system used the dlib pre-trained model to detect facial landmarks on real-time images extracted from video frames (see Fig. 3).
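Only 12 of the 68 predicted points are needed by the eye modules. As a small sketch, assuming the landmarks arrive as a list of 68 (x, y) tuples in the usual 0-based ordering of the 68-point annotation scheme, the two six-point eye contours can be sliced out as below. The function name `eye_points` is hypothetical, and "left" and "right" refer to the subject's own eyes.

```python
# 0-based index ranges of the eye contours in the 68-point annotation scheme.
RIGHT_EYE = range(36, 42)
LEFT_EYE = range(42, 48)


def eye_points(landmarks):
    """Split a list of 68 (x, y) tuples into (left_eye, right_eye) contours."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 facial landmarks")
    left = [landmarks[i] for i in LEFT_EYE]
    right = [landmarks[i] for i in RIGHT_EYE]
    return left, right
```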
Eye detection, tracking, and blinking module. Eye detection and tracking module. The patient's eyes are detected using the facial landmarks module, which locates the indexes of the left and right eyes among the 68 (x, y) coordinates. Each eye is outlined by 6 coordinates, which are drawn on a black mask created with the same dimensions as each frame read from the video of the patient. The only white region on the black mask surrounds the eye and is expanded using a dilate morphological operation. The eyes are segmented by applying a bitwise-and operation between the mask and the original image. All zero pixel values are then converted to 255 so that the eyeball is the only black region left. The resulting eye image is converted to grayscale to prepare it for segmenting the eyeball from the eye and locating its center. We used a fixed threshold to create a binary mask, found the largest contour containing the eyeball, and segmented it accordingly. After the eyeball is detected, its position (in terms of x and y coordinates) is computed and returned as one of three values mapped to three directions: left (1), right (2), and up (3). The eye-down state is not detected in this version of the Blink-To-Live application since it conflicts with the eye-blinking state.
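The last step, mapping the eyeball position to a direction code, can be sketched without any image library. The version below is a simplified stand-in, not the authors' implementation: it takes the centroid of all dark pixels instead of the largest contour, and the intensity threshold (70) and the two margins (0.35) are hypothetical values chosen only for illustration. Left and right are expressed in image coordinates, so any mirroring between the camera image and the patient's own left and right is left to the caller.

```python
LEFT, RIGHT, UP = 1, 2, 3  # direction codes used in the text


def eyeball_center(gray, threshold=70):
    """Centroid (x, y) of pixels darker than `threshold`, or None if there are none."""
    xs, ys = [], []
    for y, row in enumerate(gray):
        for x, value in enumerate(row):
            if value < threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return sum(xs) / len(xs), sum(ys) / len(ys)


def gaze_direction(gray, threshold=70, side_margin=0.35, up_margin=0.35):
    """Map the eyeball centroid to LEFT, RIGHT, UP, or None (roughly centered)."""
    center = eyeball_center(gray, threshold)
    if center is None:
        return None
    cx, cy = center
    rel_x = (cx + 0.5) / len(gray[0])  # 0.0 = left edge, 1.0 = right edge
    rel_y = (cy + 0.5) / len(gray)     # 0.0 = top edge, 1.0 = bottom edge
    if rel_y < up_margin:
        return UP
    if rel_x < side_margin:
        return LEFT
    if rel_x > 1.0 - side_margin:
        return RIGHT
    return None
```

Checking the vertical position first is a design choice made here so that an upward glance is not misread as left or right when the eyeball is also slightly off-center.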
Eye blinking module. Each eye is detected by 6 coordinates using the facial landmarks module, and the relation between the eye height and width can be encoded by a ratio called the Eye Aspect Ratio, EAR 52,53, which is computed by the following equation:
$$\mathrm{EAR} = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2\,\lVert p_1 - p_4 \rVert}$$
where $p_1$, $p_2$, $p_3$, etc., are the coordinates of the eye's landmarks, as depicted in Fig. 4. EAR is approximately constant while the eye is open and drops to approximately zero when the eye blinks. Consequently, the ratio can determine whether the patient is blinking. A specific threshold value, $t$, is chosen (i.e., 0.2 in our application). If the EAR value drops below 0.2 and then rises above 0.2 again, a single blink is detected, and blinks can be counted as many times as they occur. The following equation explains how the eye's opening and closing states are detected by comparing EAR with the threshold value $t$:
$$\text{eye state} = \begin{cases} \text{open}, & \mathrm{EAR} > t \\ \text{closed}, & \text{otherwise} \end{cases}$$
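A minimal sketch of the EAR computation and the threshold test follows. It assumes the six landmarks are ordered $p_1$ to $p_6$ with $p_1$ and $p_4$ at the eye corners, $p_2$ and $p_3$ on the upper lid, and $p_6$ and $p_5$ below them on the lower lid, so that EAR is the sum of the two lid-to-lid distances divided by twice the corner-to-corner distance. The function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

EAR_THRESHOLD = 0.2  # threshold t reported in the text


def eye_aspect_ratio(eye):
    """EAR for six (x, y) landmarks ordered p1..p6."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def is_closed(eye, threshold=EAR_THRESHOLD):
    """True when the EAR falls below the threshold."""
    return eye_aspect_ratio(eye) < threshold


def count_blinks(ear_values, threshold=EAR_THRESHOLD):
    """Count closed-then-open transitions in a sequence of per-frame EAR values."""
    blinks = 0
    closed = False
    for ear in ear_values:
        if ear < threshold:
            closed = True
        elif closed:
            blinks += 1      # the eye reopened, so one blink is complete
            closed = False
    return blinks
```

A blink is counted only once the eye reopens, so a sequence that ends with the eye still closed does not add to the count.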
A normal blink lasts from 100 to 400 ms, which is very fast compared to an intended blink (i.e., one that lasts 800 ms). We used the approach explained in 53 to distinguish a normal blink from the patient's intended blink, which is used as an alphabet in the Blink-To-Live eye-based language. The solution inspects the EAR value over a predefined number of video frames (i.e., 13 frames); if the EAR is still less than 0.2, the blink is the intended language blink. If the EAR stays below the threshold for fewer than 13 frames, the blink is very fast and might be a normal one, considering that the camera captures 25 frames per second; at this rate, 13 frames correspond to roughly 520 ms.
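The frame-counting rule can be sketched as a small classifier over per-frame EAR values. The 0.2 threshold and the 13-frame minimum come from the text; the function name, the labels, and the decision to label a blink only when the eye reopens are assumptions made for this illustration.

```python
EAR_THRESHOLD = 0.2
MIN_CLOSED_FRAMES = 13  # about 520 ms at 25 frames per second


def classify_blinks(ear_values, threshold=EAR_THRESHOLD, min_frames=MIN_CLOSED_FRAMES):
    """Return one 'intended' or 'normal' label per completed blink."""
    labels = []
    closed_frames = 0
    for ear in ear_values:
        if ear < threshold:
            closed_frames += 1  # eye still closed in this frame
        elif closed_frames:
            # Eye reopened: decide by how long it stayed closed.
            labels.append("intended" if closed_frames >= min_frames else "normal")
            closed_frames = 0
    return labels
```

A live system might instead report the intended blink as soon as the thirteenth closed frame arrives, which would shorten the response time at the cost of slightly more state handling.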
Eye speech generation and translation module. From the previously discussed modules, the four eye alphabets (i.e., Blink (B), Left (L), Right (R), and Up (U)) can be detected and registered in an eye tracking list. The eye tracking list stores different eye-recognized states and generates the corresponding encoded phrases according to Tables 2 and 3. A python library translate is used to translate these phrases into different native languages such as Arabic, German, etc., according to different patients' cultures and nationalities 54 . Further, the generated phrases text is converted to a lifelike synthesized speech using a Text-to-Speech module developed by Microsoft Azure 55 . The local registry keys of the Microsoft library for Text-to-Speech can also be used and imported directly into any python code.
Blink-To-Live system deployment. Blink-To-Live is a Flutter-based mobile application for helping patients with speech impairments to communicate with their families and caregivers. The mobile phone camera is opened, and a stream of video frames is captured and sent to a backend model via a web socket with FastAPI 56. The two-way communication channel between the client (mobile application) and the server (backend model) is established using a web socket with FastAPI that quickly sends images in real-time without going through all HTTP protocol layers. The web socket efficiently handles a backpressure problem, which results from receiving more video frames than the backend model can handle. This problem occurs when the model is busy processing existing frames (running the face/eye detection modules, predicting the eye movements, and translating the recognized states into communicated speech) while new frames keep arriving from the mobile application. A queue or buffer with a limited size is created to solve this problem. When the queue is full, some video frames are dropped without affecting the application's efficiency, since their details can be restored or created virtually from the previously stored ones. Two methods are implemented concurrently and run in parallel: receive and process. The receive method reads newly captured frames encoded as raw bytes. In contrast, the process method detects the face/eyes in the previously received frames, tracks the eye gestures, and sends the translated eye speech back to the mobile application screen.
Blink-To-Live system interaction with patients/caregivers. In this paper, we developed a mobile application called Blink-To-Live to help patients with speech impairments to communicate easily and efficiently.
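The bounded queue with concurrent receive and process methods can be illustrated with the standard `asyncio` module alone. This is a sketch under assumptions, not the deployed backend: the web socket is replaced by a plain iterable of frames, the queue size of 4 is arbitrary, the frame analysis is a short sleep, and dropping the oldest frame when the queue is full is one possible policy that the text does not specify.

```python
import asyncio

STOP = object()  # sentinel that tells the consumer no more frames will arrive


def offer_frame(queue, frame):
    """Enqueue a frame; if the queue is full, drop the oldest one and return it."""
    dropped = None
    if queue.full():
        dropped = queue.get_nowait()
    queue.put_nowait(frame)
    return dropped


async def receive(queue, frames):
    """Producer: stands in for reading raw frame bytes from the web socket."""
    for frame in frames:
        offer_frame(queue, frame)
        await asyncio.sleep(0)   # yield control so the consumer can run
    await queue.put(STOP)        # waits for room, so the sentinel is never dropped


async def process(queue, handled):
    """Consumer: stands in for face/eye detection on each received frame."""
    while True:
        frame = await queue.get()
        if frame is STOP:
            break
        await asyncio.sleep(0.001)  # placeholder for the slow vision pipeline
        handled.append(frame)


async def run_pipeline(frames, maxsize=4):
    """Run receive and process concurrently; return the frames that were handled."""
    queue = asyncio.Queue(maxsize=maxsize)
    handled = []
    await asyncio.gather(receive(queue, frames), process(queue, handled))
    return handled
```

Calling `asyncio.run(run_pipeline(range(20)))` returns only a subset of the 20 frames, in their original order, because the producer outpaces the consumer and older frames are discarded.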
The patient's caregiver only needs to open the mobile phone camera to capture and track the patient's eye movements according to the Blink-To-Live four key alphabets: Left, Right, Up, and Blink. The patient's communicated speech is generated according to the predefined dictionary of sentences encoded using a combination of three eye movement states. Once the sentences are synthesized, a translation module displays the phrases in the patient's native language and the text-to-speech module generates a corresponding lifelike speech accordingly. Figure 4 shows different application screens, from the user's registration to the screen that displays the eye movements encoded phrases. In Fig. 5, users can register with their phone numbers and emails. Depending on the registration process, customized information, such as the patient's nationality, culture, etc., can be inferred. The following application screens (i.e., Fig. 5) show that when the camera is opened, the application starts to track the eye movements and display them on the screen, and after all eye gestures are recognized, their corresponding synthesized phrase, according to Tables 2 and 3, is displayed on the application screen and its lifelike speech is heard.
Also, the application has a learning module with animated graphs, which helps and motivates patients and caregivers to learn more about the four Blink-To-Live eye movement alphabets and the corresponding eye-based generated speech. Further, the generated speech can be translated into different native languages according to the patient's registration information (see Figs. 6 and 7). Figure 8 shows the interactions between the Blink-To-Live mobile application and the patient as a sequence of exchanged messages. When a user profile is created for a patient by the caregiver, the login information can be used to start the Blink-To-Live application. Depending on the stage of paralysis, the patient or the caregiver can open the phone camera, and the application starts to track the patient's eyes and record each recognized movement in the eye tracking list. Every three recognized eye movements are translated into a corresponding phrase using the Blink-To-Live dictionary encoded by Tables 2 and 3. The encoded phrase is displayed on the application screen, and its corresponding voice is heard. The dictionary was encoded using three eye states to simplify communication without exhausting the patient's eyes with a long sequence of eye movements. After the patient's encoded speech is displayed on the screen, the eye tracking list is cleared to start recording a new sequence of three eye gestures. If the patient accidentally makes an erroneous eye movement that is not recognized by the Blink-To-Live system, the application's screen will not display the unrecognized state, and the patient is given a five-second interval to express the correct one.
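The record, decode, and clear cycle of the eye tracking list can be sketched as a small state machine. The class and constant names are hypothetical. The dictionary below holds only the five codes that the experimental results mention by name, read in the order in which the phrases and sequences are listed there; the full dictionary of Tables 2 and 3 is not reproduced, and the five-second retry interval is not modelled.

```python
# Partial, illustrative dictionary: only the codes named in the experimental results.
PHRASES = {
    ("B", "L", "B"): "I Want to Sleep",
    ("L", "R", "R"): "Nose Block",
    ("L", "R", "U"): "Palpitations",
    ("U", "B", "R"): "My Computer is not Working",
    ("B", "R", "U"): "I Want to Pray",
}
VALID_STATES = {"B", "L", "R", "U"}
UNKNOWN = "<code not in this partial dictionary>"


class EyeTrackingList:
    """Collect recognized eye states and decode every completed sequence."""

    def __init__(self, phrases=PHRASES, length=3):
        self.phrases = phrases
        self.length = length
        self.states = []

    def add(self, state):
        """Record one state; return a phrase once `length` states are collected."""
        if state not in VALID_STATES:
            return None                 # unrecognized movement is ignored
        self.states.append(state)
        if len(self.states) < self.length:
            return None                 # sequence not complete yet
        code = tuple(self.states)
        self.states.clear()             # start a new sequence
        return self.phrases.get(code, UNKNOWN)
```

Feeding the states B, L, B one at a time returns `None` twice and then "I Want to Sleep", after which the list is empty and ready for the next three gestures.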
Different eye-tracking approaches are compared according to criteria such as communication speed, cost, caregiver dependency or special hardware devices for initiating the communication. Blink-To-Live follows an Indirect-eye contact tracking approach called a computer-vision-based eye-tracking approach. The comparison considered the results reported from different studies that evaluate different eye-tracking approaches for ALS patient communication 15,19,35,57,58 . The Blink-To-Live system does not rely on special hardware devices or sensors to initiate the patient's communication. The patients/caregivers can use their mobiles with supported cameras to track the patient's eyes, generate the intended speech, and translate it into the patient's native language. No eyeglasses, electrodes, or gaze-sensing screens are needed; hence our application has the lowest cost compared to the other proposed ones. Blink-To-Live suffers from a small delay since extensive computer vision backend modules process the video frames in real-time and send the results back to the mobile application, so the communication speed will be slow compared to the direct-eye contact tracking systems (see Table 5).

Experimental results
A prototype of the proposed Blink-To-Live system was tested using 10 normal cases with different demographic characteristics such as age, gender, educational level, and technology awareness. Technology awareness was evaluated according to the age, previous experience, cultural background, and level of education of each participant included in the test experiment. Each case had one week of training to learn the different eye language alphabets and their related phrases in Tables 2 and 3. Then, each case was asked to speak 27 phrases by expressing their corresponding eye language alphabets (i.e., left, right, up, and blink). Each case had five trials to speak each of the tested phrases. Table 6 presents the recorded results, where each case has a coded symbol in the first column. The average communication speed, number of trials, and communication accuracy are reported in the following columns. The average communication speed is measured by the processing time required to recognize the eye gestures and display the corresponding phrase on the mobile screen. The communication accuracy is computed by counting the number of sentences spoken correctly using the eye alphabets out of the 27 tested sentences. The cases (P-01 to P-10) comprise 6 males and 4 females with ages ranging from 21 to 79. Cases P-01, P-02, P-04, P-06, P-08, and P-10 successfully expressed the different eye language alphabets and correctly spoke most of their related phrases in one trial (see Table 6). Cases P-01, P-02, P-04, and P-10 are younger than P-06 and P-08, whose ages are above 60. They all have a higher educational level and technology awareness than the other participants. Cases P-03, P-07, and P-09 have an intermediate educational level and low technology awareness, so they needed more trials to express the eye language alphabets.
Case P-05 has the worst performance, as this participant's educational level and technology awareness are very low compared to the others. The communication speed ranges from 15 to 25 s for expressing one sentence across all cases. This time varies according to how many transitions between eye states are in the encoded phrase, the person's ability to move their eyes correctly, without moving their head, according to the intended speech, and the internet connection speed between the mobile application and the backend system. In our experimental results, normal people with high educational levels, technology awareness, good eye language training, and good, stable internet connectivity achieve good communication results. With sufficient training time, people with a low educational level and technology awareness can also achieve good performance. Patients in similar experimental settings are expected to need more training (about 15 days) to achieve the same performance. The patient's dependency on the caregiver to open the phone camera depends on the patient's ability to control their muscles at different disease stages. Once the camera is opened to track the patient's eyes, there is no need for a caregiver to assist with the patient's communication. To further evaluate the Blink-To-Live system, we removed variable external factors such as the network speed/bandwidth, the total number of frames sent from the sender (i.e., mobile) to the server (backend model), and the frame resolution, and tested the system using its desktop version. Overall, we reached an average time of 3 s for most of the communicated sentences with the same eye movement states or at least one transition state. Sometimes, blinking states take more time for our system to recognize, and accordingly, their corresponding phrases have a longer communication time than phrases with no blinking states.
Phrases such as "I Want to Sleep", which is expressed by the sequence [B L B], failed to be spoken by P-01 and P-04 since the system was not able to recognize the third blinking state when a fast/slow transition was made between two different consecutive states (see Fig. 9). Also, the sentences "Nose Block", "Palpitations", "My Computer is not Working", and "I Want to Pray", which are expressed by the sequences [L R R], [L R U], [U B R], and [B R U], respectively, failed to be spoken by some participants because they made a fast transition between two consecutive states or moved their heads along with their eye gestures. In these cases, the Blink-To-Live system cannot recognize the sequence of intended eye alphabets to decode its corresponding phrase.
Blink-To-Live communication speed is affected by many external factors such as the network speed/bandwidth, the total number of frames sent from the mobile application to the server, and the frame resolution. The communication speed of the Blink-To-Live system, $CS_{BTL}$, can be characterized mathematically as follows:
$$CS_{BTL} = P_{BTL} + T_{BTL}$$
where $P_{BTL}$ is the processing time of the sequence of three eye states on the server side and $T_{BTL}$ is the transmission time between the mobile application and the server.
On average, $P_{BTL}$ takes from three to four seconds, while $T_{BTL}$ can be defined in terms of the following variables: $D_{BTL}$: duration of the inspected video clip of the patient's eye movements; $FR_{BTL}$: frame rate; $F_{BTL}$: total number of frames sent from the mobile application to the backend system; $R_{BTL}$: frame resolution, determined by the phone camera; $X_{BTL}$: number of pixels in the video frames; $B_{BTL}$: number of bits in the video frames; and $N$: network bandwidth in bps.
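One plausible way to combine these variables is sketched below. The chain used here (frames from duration times frame rate, pixels from frames times resolution, bits from pixels times a per-pixel bit depth, and transmission time from bits divided by bandwidth) is an assumption made for illustration, as is treating the total communication time as processing time plus transmission time. The `bits_per_pixel` parameter and both function names are hypothetical, and the sketch ignores any image compression.

```python
def transmission_time(duration_s, frame_rate, pixels_per_frame, bits_per_pixel, bandwidth_bps):
    """Estimate the transmission time in seconds for an uncompressed video clip."""
    frames = duration_s * frame_rate        # total frames sent
    pixels = frames * pixels_per_frame      # total pixels in those frames
    bits = pixels * bits_per_pixel          # total bits to transmit
    return bits / bandwidth_bps


def communication_speed(processing_s, transmission_s):
    """Assumed total time per sentence: server processing plus transmission."""
    return processing_s + transmission_s
```

For example, one second of 640 x 480 video at 25 frames per second and 24 bits per pixel amounts to 184,320,000 bits, so a link of exactly that many bits per second would need one second to carry it.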
On the technical side, we compared two feature-extractor approaches implemented in the dlib library. The first approach relies on a CNN and has an execution time of 3.33 s for the face detection module alone, while the HOG + SVM approach used by Blink-To-Live has an execution time of 0.21 s (see Table 7).
Eye-based tracking applications such as Blink-To-Live have been introduced as assistive communication technology for patients with speech impairments. Two elements play a key role in the success of eye-based tracking technologies: the hardware device used to track the patient's eye and the software application used to process the collected data. The devices that track the patient's eye range from expensive gaze-sensing keyboards such as Tobii Dynavox or Eye Transfer to special lenses, electrodes, or glasses with infrared sensors. The proposed Blink-To-Live system follows another approach, tracking the patient's eye using a simple camera without expensive specialized devices or hardware sensors positioned directly on the eyes.
The software applications that process the collected data rely on machine and deep learning techniques for facial landmarks detection, eye identification, and tracking. The Blink-To-Live system used HOG + SVM, the feature extractor approach implemented in the dlib library, for facial landmarks identification and other computer vision modules for tracking different eye movements and blinking states. The eye-down state is not detected in this version of the Blink-To-Live application since it conflicts with the eye-blinking state, which can be resolved in the future by using techniques such as reinforcement learning. Also, the Blink-To-Live system suffers from a small delay since extensive computer vision backend modules process the video frames in real-time and send the results back to the mobile application, so the communication speed is slow compared to the direct-eye contact tracking systems. Communication speed can be enhanced by developing faster backend processing services and web socket communication between the client and server applications.

Conclusion
Blink-To-Live is a simple and cost-efficient mobile application for speech-impairment patients who only have their eyes to initiate communication with their surrounding world. It relies on a set of computer vision modules and a modified version of the Blink-To-Speak language to translate different eye gestures into a set of daily life commands used by the patients to express their emotions and needs. The patients/caregivers will only use their phones with supported cameras to track different patients' eye movements. The synthesized eye-based speech will be displayed on the phone screen accordingly. The future improvements of our system are to increase communication speed by developing a fast processing backend system that utilizes modern real-time image analysis and processing approaches. Also, reinforcement learning algorithms can enhance eye detection, blinking, and tracking modules to achieve high-accuracy results and resolve conflicts among eye movement states. Further, each patient has a registered profile in our application. In the future, the application will learn his eye's attributes, movements, and blinking behavior, which can increase the communication speed, and customize the application according to different patient needs. The Blink-To-Live could be integrated easily with other systems relying on hardware devices to control the patient's living environment with eye gestures.

Data availability
Blink-To-Live is free, open-source software released under the GNU GPL license, and its source is available from the GitHub repository (https://github.com/ZW01f/Blink-To-Live).
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.