Non-Intrusive Real Time Eye Tracking Using Facial Alignment for Assistive Technologies

Most affordable eye tracking systems use either intrusive setup such as head-mounted cameras or use fixed cameras with infrared corneal reflections via illuminators. In the case of assistive technologies, using intrusive eye tracking systems can be a burden to wear for extended periods of time and infrared based solutions generally do not work in all environments, especially outside or inside if the sunlight reaches the space. Therefore, we propose an eye-tracking solution using state-of-the-art convolutional neural network face alignment algorithms that is both accurate and lightweight for assistive tasks such as selecting an object for use with assistive robotics arms. This solution uses a simple webcam for gaze and face position and pose estimation. We achieve a much faster computation time than the current state-of-the-art while maintaining comparable accuracy. This paves the way for accurate appearance-based gaze estimation even on mobile devices, giving an average error of around 4.5° on the MPIIGaze dataset (Zhang et al., 2019) and state-of-the-art average errors of 3.9° and 3.3° on the UTMultiview (Sugano et al., 2014) and GazeCapture (Krafka et al., 2016; Park et al., 2019) datasets respectively, while achieving a decrease in computation time of up to 91%.


C. Leblond-Menard and S. Achiche

I. INTRODUCTION
ASSISTIVE robotic arms (ARA) are known to be greatly beneficial to people with upper-limb disabilities [5], [6], [7], [8], [9]. Great effort is therefore put into the human-computer interfaces (HCI) used to control these robots. Most commercially available ARAs come with joystick-based control devices, but these require a significant amount of time to learn and to accomplish even simple tasks, even for typically developed users [10]. Even though control filters have been implemented to help reduce movements caused by involuntary jerks from the user [5], joystick-based HCIs are generally much more frustrating to use than more automated HCIs, depending on the user's impairment [6]. In some cases, potential users have specific disabilities that make operating a joystick by hand impossible. As pointed out by [11], the limits of the control interfaces currently available for commercial ARAs create a situation where the people who should benefit the most from ARAs are unable to control them due to their severe motor disability. The research work presented in [6] reports that a vision-based interface with autonomous path planning would lead to a significant improvement in user workload in grasping and pick-and-place tasks. Indeed, the user's gaze in grasping tasks has been shown to play a large part in predicting intent and point of contact [12]. As such, current trends include using a camera, either grayscale/color or combined with a depth sensor, to detect objects in the scene in front of the user, and an HCI to decide what to manipulate and how [13], [14], [15], [16], [17], [18]. Recent work has promoted the use of eye tracking as an HCI, as demonstrated by [12], [14], [19], [20], [21], [22], [23], [24], [25], [26], and [27], albeit with significant limitations that are discussed in the next section.

A. Eye Tracking for Assistive Robotic Arms Control
Several extensive literature reviews are available that describe some of the current and previous state-of-the-art methods of eye tracking and gaze estimation, with a recent example being [28]. An even newer literature review has recently been published with comparative results in terms of angular gaze accuracy for deep learning based methods [29]. An overview of all the available methods is out of the scope of this paper.
One of the most popular commercially available eye tracking techniques relies on infrared (IR) corneal reflection and is aptly named the Pupil Center Corneal Reflection (PCCR) method. This method relies on an IR illuminator producing a distinctive reflection on the cornea of the user's eye, whose position can be compared with that of the more easily identifiable pupil center (because the pupil usually appears as a deep black circle). While this method produces accurate results [28], its reliance on IR reflection makes it particularly prone to error when exposed to sunlight, as [14] experienced. Other accurate methods rely on the user wearing a head-mounted eye tracker or other intrusive systems when used to control ARAs.
Other methods have been recently proposed that make use of a single monocular camera to estimate the gaze direction [4], [30], [31], [32], [33], but those are, in our experience, either imprecise or ill-suited for embedded systems due to their computing performance requirements, especially in the case where no person-specific calibration frames are used. Therefore, in this paper we propose a novel non-intrusive real-time gaze estimation technique for use with ARAs that works reliably both indoors and outdoors while requiring as little power as possible when running on embedded systems and without requiring person-specific calibration. The performance of the developed method is then compared to openly available state-of-the-art algorithms.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

B. Objectives
Given the context of using ARAs, we have the following research objective: develop an open-source real-time gaze estimation model that can run on mobile and embedded devices in both indoor and outdoor environments using a single camera and without person-specific calibration. If this objective is met, the resulting model could be used in systems such as [13] and [14] to provide a more versatile solution at a low cost.

C. Contributions
The contributions of this paper are as follows:
• We demonstrate that, using state-of-the-art real-time face detection and iris alignment algorithms based on convolutional neural networks (CNN), a gaze estimation system can estimate gaze angles with an accuracy comparable to the current state-of-the-art while requiring no person-specific calibration and having computational requirements low enough to run in real time on mobile devices.
• We bring forth a new eye tracking framework tailored for robotics control that works reliably both indoors and outdoors while providing real-time performance, over 20 frames per second even on a single CPU core, without being intrusive to the user. This model also outputs facial landmarks that are useful for assistive tasks.

II. RELATED WORK
As introduced earlier, several methods that make use of a single camera aimed at the user's face have been recently described in the literature. As a comparison basis, we take particular interest in gaze estimation datasets that include full images of the user's face, as this allows for a more realistic scenario in the context of assistive devices: the face must be located with respect to the camera to give an accurate estimation of the gaze angles.
One of the most widely used datasets for gaze estimation proposing full facial images is MPIIFaceGaze [34]. MPIIFaceGaze is a modified subset of the widely used MPIIGaze dataset [35] that includes complete faces instead of the eye region only found in the original dataset. It consists of 37,667 images and their corresponding gaze data taken across 15 participants. A recent paper described a new dataset called RT-GENE [31], but this dataset corresponds to scenarios where the user is not the main object of the picture and thus the faces are generally far from the camera. As this does not represent a typical scenario for gaze estimation control of ARAs, this dataset was not used in the context of this paper.
On the MPIIGaze dataset, the most accurate method currently available in the literature is called FAZE [4] and uses a combination of deep learning models to estimate the gaze direction and head pose of the user. This method implements a meta-learned model to generate the weights of a gaze estimation network from only a few calibration frames (32 or fewer). As reported by the authors [4], the accuracy when using no calibration frames is 5.23°. The authors note that a real-time demo is available but provide no further description of computational performance.
Furthermore, the authors of the RT-GENE dataset proposed a gaze estimation model based on first extracting the facial landmarks using Multi-Task Cascaded Convolutional Networks [36], then correcting the face image for pose perspective against averaged face pose landmarks. The corrected eye patches are then extracted from the image and fed to VGG-16 networks, in an ensemble or not [37], one for each eye patch, to estimate the gaze direction. The reported gaze estimation error is 4.3° on MPIIGaze using an ensemble of 4 models.
Most other recent deep learning-based methods have achieved an angular accuracy ranging from 4.1° to 7.3°, as described in [29].
In 2019, a gaze estimation dataset named GazeCapture [3], generated from phone and tablet gaze estimation trials, was used for angular gaze estimation [4]. The main use of the dataset is to train models that estimate a gaze point on a mobile device's screen, so accuracy values are generally given in pixels. It is nevertheless possible to use the dataset for gaze angle estimation by converting the pixel points to 3D points and then to gaze angles, as is done in [4].
Moreover, another dataset that is widely used for gaze estimation tasks is the UTMultiview dataset [2]. This dataset offers a larger variety of head poses and gaze directions by using synthesized images from reconstructed faces using an array of cameras.

A. Workflow and Implementation Details
The overall workflow proposed here can be separated into four distinct sections, which are described below:
1) Face detection
2) Head pose correction
3) Eye patch extraction
4) Iris detection and gaze estimation
The input of the workflow is an image containing the face of the user. This image should be a color image and contain enough detail to distinguish the iris from the sclera and eye contour. The output is composed of the gaze angles (pitch for the up-down eye motion and yaw for the left-right motion) and the gaze origin, which is the center point between the eyes. To ensure the accessibility of this gaze estimation method, no step requires person-specific calibration; we rely instead on existing large datasets on which the model is trained once prior to use. Initially, a face detection algorithm is run on the image to detect the most prominent face in it. From that detection, we must then extract the location of the eyes, which is generally done through facial landmark extraction (generally named facial alignment). Given the position of the eyes, we can then extract an image patch for each eye. These image patches are then used as the input of the gaze estimation algorithm, which outputs the gaze angles (direction). Since the position of the user's head must be known to infer the point of gaze from the gaze direction, the facial landmarks can be used to compute the origin of the gaze vector.
Moreover, to reduce the influence of the head pose on the variability of the input images, we removed the roll component of the head pose and used the roll-removed head pose yaw and pitch angles as input to our model. Indeed, the world coordinates gaze direction $\vec{g}$ is a function of the head pose matrix $H$ and the eye gaze direction with respect to the face $\vec{g}_e$:

$$\vec{g} = H \vec{g}_e$$

Note that the vectors and matrices given here are in homogeneous coordinates. Since $\vec{g}$ and $\vec{g}_e$ are directions, the last component of the homogeneous vectors is 0. The method of obtaining the head pose $H$ and applying the perspective transform will be discussed in Section III-A.2.
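As an illustration of the homogeneous-coordinate convention described above, the following minimal sketch applies a head pose matrix to a direction (last component 0) and to a position (last component 1). It reads the relation as a plain matrix product, $\vec{g} = H \vec{g}_e$, which is an assumption here; the pose values and the helper names `rotation_y`, `with_translation` and `apply` are hypothetical and not taken from the paper's implementation.

```python
import math

def rotation_y(angle_rad):
    """4x4 homogeneous matrix rotating about the y axis (no translation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def with_translation(H, t):
    """Return a copy of H whose translation column is set to t."""
    out = [row[:] for row in H]
    for i in range(3):
        out[i][3] = t[i]
    return out

def apply(H, v):
    """Multiply a 4x4 matrix by a homogeneous 4-vector."""
    return [sum(H[i][j] * v[j] for j in range(4)) for i in range(4)]

# Hypothetical head pose: 90 degree yaw, head 600 mm in front of the camera.
H = with_translation(rotation_y(math.pi / 2), [0.0, 0.0, 600.0])

g_e = [0.0, 0.0, -1.0, 0.0]    # direction in the face frame: last component 0
g = apply(H, g_e)              # rotated only, the translation has no effect

origin_face = [0.0, 0.0, 0.0, 1.0]     # position: last component 1
origin_cam = apply(H, origin_face)     # rotated and translated
```

With a last component of 0, the translation column of the pose matrix is multiplied by zero, so a direction is only rotated, whereas a position is also shifted.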
1) Face Detection: In general, the approximate face location must be known beforehand to crop the image around the face center and perform the facial feature alignment and iris detection. Once a general location of the face is known, it is possible to skip the face detection step and use the general region of the previously detected face to perform the facial alignment step, especially if a confidence score is available from the alignment step, which is true in our case. This saves processing time for each frame of a continuous video stream. A simple static threshold can be used such that if the confidence score falls below a fixed value, the face detection step is run again. For every face alignment step, we compute the center position of the face, check how far it moved since the last frame, and re-center the face-cropped image by moving its center by the same displacement.
It should be noted that this strategy is not used in the comparative results of this paper, since the datasets used for comparison are not videos but distinct images, for which this strategy cannot work.
One of the first successful real-time face tracking algorithms was described by Viola and Jones using Haar-like cascades [38] and is still used. More precise and stable real-time algorithms have been proposed afterward, including histogram of oriented gradients (HOG) with support vector machines (SVM) and local binary pattern (LBP) cascades, with HOG being generally the most accurate of these methods [39]. Recently, breakthroughs in small optimized CNNs have led to very accurate face detection with real-time performance on embedded computers and portable devices. One such state-of-the-art model, named BlazeFace, was developed and trained by Google Research with very high precision and fast inference time on mobile devices [40]. Based on MobileNetV1/V2 [41], [42], the architecture implements further optimizations, including increasing the receptive field size by using 5 × 5 kernels instead of 3 × 3 kernels, which is cheaper than adding layers with more computationally expensive pointwise convolutions.
Whereas our initial work was based on a modified HOG algorithm [43] parallelized and ported to CUDA, the recently released BlazeFace algorithm performs about as fast in our experience while providing increased accuracy, comparable to state-of-the-art real-time models according to [40]. This is thus the chosen algorithm for our workflow. Furthermore, this algorithm outputs the location of 6 facial landmarks (the two eye centers, the two ear centers, the nose center, and the mouth center), which we can use to compute and correct for the head roll by assuming the left and right landmarks should be on a horizontal line.
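The detection-skipping strategy for video streams described above can be sketched as follows. This is only an illustrative outline under stated assumptions: `detect_face` and `align_face` are hypothetical callables standing in for the detector and the aligner, the threshold value of 0.5 is arbitrary, and the choice to re-run detection on the next frame (rather than immediately) is a simplification.

```python
def track_faces(frames, detect_face, align_face, min_confidence=0.5):
    """Yield landmarks for each frame, running face detection only when needed.

    Hypothetical interfaces (assumptions, not the paper's API):
      detect_face(frame) -> (cx, cy, size) or None
      align_face(frame, box) -> (landmarks, confidence, (cx, cy))
    """
    box = None
    prev_center = None
    for frame in frames:
        if box is None:
            # No usable face region: fall back to the (slower) detector.
            box = detect_face(frame)
            prev_center = None
            if box is None:
                yield None
                continue
        landmarks, confidence, center = align_face(frame, box)
        if confidence < min_confidence:
            # Static threshold: drop the region so detection runs again.
            box = None
            yield None
            continue
        if prev_center is not None:
            # Shift the crop region by the displacement of the aligned face center.
            dx = center[0] - prev_center[0]
            dy = center[1] - prev_center[1]
            box = (box[0] + dx, box[1] + dy, box[2])
        prev_center = center
        yield landmarks
```

Because the detector only runs when the alignment confidence drops, most frames of a stable video stream pay only the cost of the alignment step.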
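The roll correction mentioned above, which assumes that a left/right landmark pair should lie on a horizontal line, can be illustrated with a small sketch. The function names and the use of the two eye centers as the landmark pair are assumptions made for this example; image coordinates are taken with x to the right and y downward.

```python
import math

def head_roll(eye_a, eye_b):
    """Roll angle (radians) of the line from eye_a (image-left) to eye_b (image-right)."""
    return math.atan2(eye_b[1] - eye_a[1], eye_b[0] - eye_a[0])

def rotate_point(point, center, angle):
    """Rotate a 2D point about center by angle (radians)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)

def deroll(points, eye_a, eye_b):
    """Rotate points about the eye midpoint so that the eye line becomes horizontal."""
    roll = head_roll(eye_a, eye_b)
    center = ((eye_a[0] + eye_b[0]) / 2.0, (eye_a[1] + eye_b[1]) / 2.0)
    return [rotate_point(p, center, -roll) for p in points]

# Hypothetical landmarks with the head tilted by 45 degrees.
eye_a, eye_b = (0.0, 0.0), (1.0, 1.0)
level_a, level_b = deroll([eye_a, eye_b], eye_a, eye_b)
```

After the correction, both eye centers share the same vertical coordinate, which is the condition the horizontal-line assumption encodes.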
2) Head Pose Correction: As stated previously, we can correct for the head pose by using the detected facial landmarks and applying a perspective transform to the eye image patches as a form of normalization, as was suggested in [1]. This allows the network to train on estimating the gaze direction without having to directly account for the head pose variation in the image. The pose normalization technique we used is based on the one described in [31].
This requires several facial landmark points to be known, from which we find the optimal series of rotations that aligns the actual head pose with a reference (sometimes called "canonical") head pose.
An effective real-time method of facial alignment recently published is called FaceMesh [44] and is part of the same augmented reality framework as BlazeFace developed by Google. It uses a straightforward residual neural network that outputs a set of heatmaps for each facial feature (in this case 468 points) on which subpixel maximum estimation is done to find the input image location of the landmarks. This aligner is trained to not only output the 2D pixel location of each landmark, but also a depth value associated with each landmark that corresponds to the difference with the average depth of the face, while keeping the same scale (or aspect ratio) as the horizontal coordinates.
The face alignment model is learned from deforming a reference model with dimensions given in centimeters but projected to the image according to the camera model, i.e., the camera matrix and distortion coefficients. Therefore, we can iteratively find the projected 3D transformation that best fits the reference model to the computed landmark points through a Procrustes analysis, also known as an orthogonal Procrustes problem, as used by the team at [45] and described in [46].
Using this method, we thus compute the head pose with respect to the camera, the previously described $H$. We can then find the gaze vector origin, which is assumed to be the center point of the eye landmarks, by using the canonical model's eye center $\vec{e}_c$ and the head pose matrix $H$:

$$\vec{e} = H \vec{e}_c$$

Here, $\vec{e}$ is the eye center position with respect to the camera. Since the vectors here are positions, their last component is 1.
Given the direction to $\vec{e}$ and the head pose $H$, we can find the perspective transformation matrix $W$ that derotates the image so as to make the head appear upright and that aligns the camera view axis with the eye position $\vec{e}$, while reprojecting the image as if the eye were at a new distance $\|e\|^*$ and seen with a new camera focal length $l^*$. This method was first proposed by [2] and then revisited by [47], the latter being the normalization technique used in this paper. By choosing the ratio between the new eye distance $\|e\|^*$, the new camera focal length $l^*$ and the width and height of the normalized image, the resulting normalized image will have the de-rolled eye centered in the image with a specific and constant scale. In our case, the values used for $\|e\|^*$ and $l^*$ are 600 millimeters and 650 pixels respectively. The distance $\|e\|^*$ of 600 millimeters is generally suggested by the literature [2], whereas the focal length $l^*$ is generally found by trial and error until the scaling ratio makes the eye appear at the size chosen for the model and input image resolution.
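To make the Procrustes step more concrete, the sketch below solves the closed-form rigid (rotation plus translation) alignment between two sets of 3D points with a singular value decomposition. It is a simplified stand-in and not the iterative, camera-projected fit described above: it assumes both point sets are already expressed in metric 3D coordinates, and the function name `rigid_procrustes` and the sample points are hypothetical.

```python
import numpy as np

def rigid_procrustes(reference, observed):
    """Return the 4x4 pose H minimizing sum ||R @ ref_i + t - obs_i||^2.

    reference, observed: (N, 3) arrays of corresponding 3D points.
    """
    ref_mean = reference.mean(axis=0)
    obs_mean = observed.mean(axis=0)
    # 3x3 cross-covariance between the centered point sets.
    A = (observed - obs_mean).T @ (reference - ref_mean)
    U, _, Vt = np.linalg.svd(A)
    # Force a proper rotation (determinant +1), excluding reflections.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = obs_mean - R @ ref_mean
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = t
    return H

# Hypothetical canonical landmarks and a known pose used to generate observations.
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([10.0, -5.0, 600.0])
reference = np.array([[0, 0, 0], [60, 0, 0], [0, 40, 0], [0, 0, 30], [20, 10, 5]],
                     dtype=float)
observed = reference @ R_true.T + t_true
H = rigid_procrustes(reference, observed)
```

On noise-free correspondences such as these, the recovered pose matches the pose used to generate the observations; with noisy landmarks it is the least-squares best fit.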
This removes parts of the perspective variance due to the head pose roll and camera parameters from the image and thus reduces the complexity of the problem to be learned by the gaze estimation model [1], [4], [31].
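The construction of the normalization warp can be sketched as below. This follows the commonly used formulation of this kind of data normalization (a rotation that points the virtual camera at the eye, a scaling that fixes the eye distance, and a new camera matrix), which is assumed here to match the cited technique; the patch size of 64 × 64 pixels, the function name and the sample camera values are hypothetical, while the 600 mm distance and 650 pixel focal length are the values quoted in the text.

```python
import numpy as np

def normalization_matrix(eye_center, head_rotation, camera_matrix,
                         norm_distance=600.0, norm_focal=650.0,
                         patch_size=(64, 64)):
    """Return (W, R): the 3x3 image warp and the rotation used to build it.

    eye_center: 3D eye position in camera coordinates (millimeters).
    head_rotation: 3x3 rotation part of the head pose.
    camera_matrix: 3x3 intrinsic matrix of the real camera.
    """
    eye_center = np.asarray(eye_center, dtype=float)
    distance = np.linalg.norm(eye_center)
    z_axis = eye_center / distance            # virtual camera looks at the eye
    head_x = head_rotation[:, 0]              # x axis of the head frame
    y_axis = np.cross(z_axis, head_x)
    y_axis /= np.linalg.norm(y_axis)
    x_axis = np.cross(y_axis, z_axis)         # removes the roll of the head
    x_axis /= np.linalg.norm(x_axis)
    R = np.stack([x_axis, y_axis, z_axis])    # rows are the new camera axes
    S = np.diag([1.0, 1.0, norm_distance / distance])
    width, height = patch_size
    C_n = np.array([[norm_focal, 0.0, width / 2.0],
                    [0.0, norm_focal, height / 2.0],
                    [0.0, 0.0, 1.0]])
    W = C_n @ S @ R @ np.linalg.inv(camera_matrix)
    return W, R

# Hypothetical camera and eye position.
C_r = np.array([[900.0, 0.0, 320.0], [0.0, 900.0, 240.0], [0.0, 0.0, 1.0]])
eye = np.array([30.0, -20.0, 500.0])
W, R = normalization_matrix(eye, np.eye(3), C_r)
```

The resulting matrix can be passed to a perspective warp such as OpenCV's `cv2.warpPerspective(image, W, patch_size)`. By construction, the pixel at which the eye center projects in the original image is mapped to the center of the normalized patch.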
3) Eye Patch Extraction: From the facial landmarks, eye image patches must be extracted for the iris detection and gaze estimation part of our workflow as seen on Fig. 2.
To do so, we use the average 3D position of the eye from the previous step to perform the perspective transformation normalization described in [31]. This method corrects for the head pose by warping the image so as to align the eye patch normal vector with the vector between the camera's position and the eye position in 3D, and by re-projecting the image at a constant distance. This constant distance is set to allow for a 25% margin on each side of the eye patch image, as is required by the iris detection and gaze estimation model discussed in the following section.

4) Iris Detection and Gaze Estimation:
It is intuitive to think that the eye region landmarks and the iris center contain enough information about the user's gaze to estimate its direction. Indeed, the literature seems to point to a clear consensus on the subject [30], [31], [33], [48].
As such, we start from a model made to obtain the position of the iris given a cropped image of the eye on the right side from the user's perspective. For this model, we used as basis the very lightweight (few learned parameters) method recently described in [49] to align the eye region and find the iris center and contour landmarks.
Our initial work was based on an explicit relation in position between the eye region features (namely the lower sclera contour) and the iris center, where the components of the position difference vector were used as the input of a basic polynomial regression or support vector regression (SVR) to keep the computational burden to a minimum. Unfortunately, in our experience this led to poor accuracy when compared to the state of the art, even when corrected for the head pose.
Considering the architecture of the aforementioned CNN model, we can infer that there is a lot of information encoding the position of the eye region landmarks and iris in the latent space between the backbone and the landmark heads (see Fig. 2). Therefore, we modified the iris detection model by removing the iris and eye landmark splits of the original model and adding a new split to estimate the gaze direction. By using the initial model's backbone weights and biases, we ensure that the model has a head start in learning eye-specific encodings. We then concatenate the computed features of the head with the eye poses (face pose at each eye) as an input to a final, fully connected set of layers. The final output of this set of layers, a multi-layer perceptron (MLP), is thus the estimated gaze angle.
The backbone and split are made of Irisblocks [49], which are based on Blazeblocks [40], themselves a modification of residual blocks [50]. In these blocks, a convolution and activation layer is followed by a depthwise convolution and then a normal convolution, while the residual path is an identity function, or a max pool layer when the stride of the convolution layers leads to a reduction in the width and height dimensions. The sum of the convolution output and the residual is then passed through an activation function, PReLU in this case. In the specific case where the number of output channels of the block is higher than the number of input channels, the residual is padded to ensure the same number of channels when it is added to the output of the convolutions. See Figure 3.
Furthermore, we can double the model to make use of images from both eyes using a horizontally flipped image of the left eye and concatenate the feature vectors of each model split as the input for the gaze estimation head to improve the accuracy at the cost of increased computational complexity, as seen in Fig. 2 and 4. By reusing the same backbone weights but horizontally flipping the left eye image, we found that our model often achieves better generalization, i.e. lower evaluation error.
Also, as suggested in [31], the head pose angles obtained from the Procrustes analysis and used for image normalization can be added as input to the model to account for some change in appearance of the eye due to the pose differences between frames. As such, we directly concatenate the head pose angles to the other features before the final fully connected layers.
5) Other Implementation Details: All the deep learning models used were implemented in PyTorch [51] and PyTorch Lightning [52], allowing them to run on both the CPU and GPU interchangeably. As such, our method is written in Python and PyTorch and makes use of OpenCV [53] only for some basic image processing functions.
Although we train our modified version of the iris landmarks localization model proposed in [49], our workflow uses both the BlazeFace [40] model and the FaceMesh [44] model as trained by and available in Mediapipe [45], albeit with the models converted to PyTorch while keeping the same parameter values. The original model was implemented for TensorFlow-Lite [54] inference.

B. Training and Validation
Given our gaze estimation method is based on the iris landmarks localization model from [49], we start by re-using the model's parameters that correspond to the unmodified model backbone as the backbone of our model. This allows us to re-use the learned feature extraction for the eye and iris landmarks to help convergence of the optimization.
Training is done on the MPIIGaze [1] dataset by following the splits given in [4] and [31]. As such, for the MPIIGaze dataset, we use a leave-one-out approach for the test set. We trained our model using the Ranger21 [55] optimizer, which implements recent progress on Rectified Adam [56] and other methods that rely less on fine-tuning hyperparameters. The batch size is 32, the learning rate is set to the default 0.001, and the $\beta_1$ and $\beta_2$ parameters are set to 0.8 and 0.7 respectively to slow down the training speed. We also use a weight decay penalty of 0.001. The loss function is the mean squared error between the output gaze angles and the ground truth gaze angles, as given by the direction between the eye center point (between the eyes) and the gaze targets of the dataset.
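The way ground truth angles and the loss can be derived from a gaze origin and a gaze target is sketched below. The axis convention (x to the right, y downward, z away from the camera, with zero pitch and yaw when looking straight at the camera) is an assumption chosen for this example and is not confirmed by the text; the function names are hypothetical.

```python
import math

def direction_to_angles(direction):
    """Convert a 3D gaze direction to (pitch, yaw) in radians.

    Assumed convention: x right, y down, z away from the camera.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    pitch = math.asin(-y)        # positive when looking up
    yaw = math.atan2(-x, -z)     # zero when looking toward the camera
    return pitch, yaw

def gaze_label(origin, target):
    """Ground truth angles from the gaze origin to the gaze target."""
    return direction_to_angles([t - o for o, t in zip(origin, target)])

def mean_squared_error(predicted, truth):
    """Mean of squared differences over all angle components."""
    pairs = list(zip(predicted, truth))
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

# Hypothetical sample: origin 600 mm in front of the camera, target at the camera.
label = gaze_label((0.0, 0.0, 600.0), (0.0, 0.0, 0.0))
loss = mean_squared_error((0.1, -0.1), label)
```

In a training loop the same quantity would typically be computed on batched tensors; the scalar version here only shows the arithmetic.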

C. Performance Comparison Methodology
In order to allow for a common-ground comparison between the different algorithms tested, the PyTorch version of the tested methods was preferred if other implementations were available. Furthermore, we did not make use of optimized runtimes such as TensorRT that would provide an increase in inference speed for the same reason. All algorithms are run from Python and the inference time is taken from the start of the computation, just after the image frame is loaded in memory, to the end of the gaze estimation after the gaze angles are obtained, without any user interface processing or drawing functions being run.
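A minimal timing harness in the spirit of this methodology is sketched below: the clock starts once the frame is already in memory and stops when the gaze estimate is returned. The `estimate_gaze` callable is a hypothetical placeholder for the full pipeline, and the warm-up passes are an added assumption that is not described in the text.

```python
import time

def mean_inference_time(estimate_gaze, frames, warmup=2):
    """Average wall-clock seconds per frame for estimate_gaze(frame).

    frames must already be loaded in memory so that I/O is excluded.
    """
    for frame in frames[:warmup]:
        estimate_gaze(frame)          # untimed warm-up passes (assumption)
    total = 0.0
    for frame in frames:
        start = time.perf_counter()
        estimate_gaze(frame)          # no drawing or interface code inside
        total += time.perf_counter() - start
    return total / len(frames)
```

Running the same harness for every compared method on the same machine keeps the comparison relative, which matters because absolute timings depend on the hardware.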
The error metric $e$ used is the arc cosine of the cosine similarity measure, given by computing the angle between the real gaze direction $\vec{g}$ given by the annotations of the dataset and the estimated gaze direction $\vec{g}_e$ given by the gaze angles, defined as:

$$e = \arccos\left(\frac{\vec{g} \cdot \vec{g}_e}{\|\vec{g}\| \, \|\vec{g}_e\|}\right)$$

This metric has interesting properties for comparison, as it is always positive and thus the average corresponds to the angular error between the directions, as opposed to the mean angular error in the pitch and yaw angles separately.
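The angular error metric can be computed as in the following sketch. The conversion from pitch and yaw to a 3D vector uses an assumed axis convention (x right, y down, z away from the camera), and the function names are hypothetical; the clamping of the cosine is a numerical safeguard added for this example.

```python
import math

def angles_to_direction(pitch, yaw):
    """Unit gaze vector from (pitch, yaw) in radians (assumed axis convention)."""
    return (-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))

def angular_error_deg(g_true, g_est):
    """Angle in degrees between two 3D directions (arc cosine of cosine similarity)."""
    dot = sum(a * b for a, b in zip(g_true, g_est))
    norm_true = math.sqrt(sum(a * a for a in g_true))
    norm_est = math.sqrt(sum(b * b for b in g_est))
    # Clamp to [-1, 1] to guard against rounding just outside the acos domain.
    cos_sim = max(-1.0, min(1.0, dot / (norm_true * norm_est)))
    return math.degrees(math.acos(cos_sim))

# Hypothetical estimate that is off by 10 degrees of yaw.
truth = angles_to_direction(0.0, 0.0)
estimate = angles_to_direction(0.0, math.radians(10.0))
error = angular_error_deg(truth, estimate)
```

Because the metric is the angle between the two vectors, it is never negative, so averaging it over a dataset does not let errors of opposite sign cancel out.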

IV. RESULTS
The accuracy results are presented in three distinct tables, and the timing results in a fourth. Table I presents the results for the MPIIGaze [1] dataset, either as given by the original authors or, in our case, using the head pose information supplied with the dataset. Table II presents the results for the GazeCapture dataset [3] using the annotations and evaluation split defined by [4]. Table III, on the other hand, contains the results for the UTMultiview dataset [2] using the 3-fold evaluation split as used in [2] and [57]. In our case, since our model uses images of both eyes as input, we rendered the same head poses for each sample (−36° to 36°), but rendered both eyes instead of one at a time. This is necessary because the left eye and right eye images of the pre-rendered images given with the dataset are not matched to one another. Table IV contains the results we obtained by running the implementations of the compared algorithms as distributed by the original authors, modified to remove all drawing and interface-related parts of the original code to allow for a better comparison in performance. All implementations are available on the authors' GitHub repositories [57], [58]. The trained weights used are those made available by the original authors, and we evaluate on a 7500-image subset of the MPIIFaceGaze dataset [34]. We decided not to include accuracy figures for this evaluation since it is not clear on which dataset the provided weights were trained for the FAZE model [4].

TABLE I. ACCURACY FOR THE MPIIGAZE DATASET
TABLE II. ACCURACY FOR THE GAZECAPTURE DATASET
TABLE III. ACCURACY FOR THE UTMULTIVIEW DATASET
TABLE IV. PERFORMANCE TIMINGS (FULL DETECTION AND GAZE ESTIMATION)

In Table IV, the time performance metric (inference time) is given as an average in three scenarios: using GPU acceleration; without GPU acceleration but using all CPU cores; and without GPU acceleration and using a single CPU core. This gives a better idea of the available performance in power-constrained scenarios such as mobile applications and embedded systems.

A. Interpretation
As the results show, our model reached accuracy levels comparable to the current state-of-the-art while being significantly faster, especially when running only on the CPU. Given the original goal of using this gaze estimation model for assistive technologies such as controlling an assistive robotic arm directly mounted on a motorized wheelchair, this performance improvement is significant. We thus achieve a reduction of computing time of 39% against the fastest method in the worst-case scenario and up to 91% against the slowest in the best-case scenario.
Moreover, as we can see from the frame times, our GazeIrisLandmarks model requires lower computation time due to the low number of parameters. In fact, as can be seen from the multi-core and single core frame times, the overhead of splitting the task on multiple cores makes the overall computation almost as slow as running it on a single core. This is confirmed by the model size used, with ours being 3.8 megabytes, whereas the RT-GENE 4 models ensemble is 1312 megabytes in total and the FAZE model is 27.5 megabytes.
As for the accuracy of our model, Tables I, II and III show that our model can reach accuracy figures comparable to the current state-of-the-art, being better in certain datasets and slightly worse in others.

B. Limitations and Further Work
Throughout our research, we have found that performance and computational requirements are rarely discussed extensively, let alone analyzed quantitatively, especially when comparing gaze estimation techniques. There is currently no benchmark or standardized way of comparing computational requirements of gaze estimation methods. This thus requires making the original model code available online for comparisons to be made and thus limits the number of models that can be compared.
Indeed, because performance numbers such as inference time depend on the hardware on which they are produced, comparisons must often be relative: the performance numbers for all compared methods must be produced on the same computer, as is the case for this paper.
It should also be noted that advances in technology and new dedicated hardware, such as the inference hardware now found in some smartphones, could favor some algorithms over others, which is a limitation of the comparison method used in this paper.
Moreover, as has been demonstrated by [4], deep learning models can be adapted to leverage metalearning frameworks to improve accuracy of models by using calibration with very few samples. Given the context of assistive technologies, very long calibration procedures might be off-putting or just not possible but acquiring just a few calibration samples might be a good way to improve person specific accuracy and is therefore considered as a possible path of improvement. The drawback of such an approach is the added hyperparameter tuning required on top of the very high training computational requirements since higher order gradients are required [4].
Finally, the work presented in this paper focuses on appearance-based gaze estimation methods using a single camera. Other methods exist, such as infrared reflection-based PCCR methods [28], which are often found in commercial eye trackers, or calibration-based methods [33].
Some examples of the limitations of appearance-based gaze estimation methods without calibration include lower accuracy when compared to PCCR methods and training dataset biases, where for example the ethnicity of the subjects in the training dataset limits the attainable accuracy with some users [59].
Furthermore, the domain of gaze and head pose angles for which an appearance-based model achieves accurate results is in our experience limited by the domain on which it is trained. As such, use cases for appearance-based models should take into consideration the training dataset's gaze and head pose angles domain and the typical use case angles to ensure they remain within the dataset's domain.

VI. CONCLUSION
We have shown that leveraging existing very small pretrained models for eye region landmark recognition and modifying their structure to access the latent information within the model can lead to accuracy comparable to the current state-of-the-art while significantly reducing the computational requirements. While we achieve a 4.5° average error, which is similar to the current state-of-the-art for appearance-based gaze angle estimation on the MPIIGaze dataset [29], we see a decrease in computation time ranging from 39% up to 91% against the current publicly available appearance-based gaze estimation methods. We also achieve a state-of-the-art 3.9° average error on the UTMultiview dataset [2] and a 3.3° error on the GazeCapture dataset [3], [4].
This also allows the model to produce face and eye landmarks that can be used for other vision-based assistive tasks without further computation, because these features are computed as part of the gaze estimation workflow. Furthermore, we show that a holistic approach describing a complete workflow, such as the one proposed in this paper, leads to improved accuracy when translated to real-world scenarios, due to the high dependency of certain models on accurate and manually obtained annotations.
An implementation of our method, including a trained model, training description and real-time demo, is available at: https://github.com/cedriclmenard/fastgaze

ACKNOWLEDGMENT
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).