Learning a 3D Gaze Estimator With Adaptive Weighted Strategy

As a method of predicting the target's attention distribution, gaze estimation plays an important role in human-computer interaction. In this paper, we learn a 3D gaze estimator with an adaptive weighted strategy to obtain the mapping from complete images to the gaze vector. We select both eyes, the complete face and their fused features as the input of the regression model of the gaze estimator. Considering that different areas of the face make different contributions to the results of gaze estimation under free head movement, we design a new learning strategy for the regression net. To improve the efficiency of the regression model, we propose a weighted network that can adjust the learning strategy of the regression net adaptively. Experimental results on the MPIIGaze and EyeDiap datasets demonstrate that our method achieves superior performance compared with other state-of-the-art 3D gaze estimation methods.


I. INTRODUCTION
The gaze vector, which can be inferred from the pupil, indicates the target of a person's attention. It has become increasingly important as a non-verbal cue in many fields, including marketing and consumer research [1], [2], human-computer interaction [3]-[5], medical care [6]-[8], aviation and vehicle driving [9], and criminal investigation [10]-[12]. However, existing gaze estimation systems often suffer from the following defects: redundant calibration procedures, low tolerance to head movement, sensitivity to lighting conditions and complex system settings, which limit the commercial adoption of gaze estimation.
In order to reduce the influence of the above-mentioned defects, there have been an increasing number of methods proposed for gaze estimation, which can be roughly classified into two major categories: model-based and appearance-based methods.
The model-based gaze estimation methods [13], [14] use a fitted eye model to estimate the gaze direction, relying on invariant facial features such as the pupil center [15], iris outline [16] and corneal infrared reflection [17]. Guestrin [18] established an eye model based on the pupil center, fixation point and eye center, with which a single camera and two infrared light sources are sufficient for gaze estimation. Hennessey et al. [19] considered the influence of various head postures, relying on complex and detailed calibration steps to achieve gaze estimation under free head movement. To simplify the calibration procedure, Shih and Liu [20] proposed an improved Le Grand model combined with a head-pose compensation model. By solving a linear equation to estimate the optical axis, this method uses two cameras to achieve single-point calibration and updates the mapping function dynamically. Zhou et al. [21] developed a binocular model-based gaze tracking method, proposed an improved iris center localization method based on gradient characteristics, and simplified the individual calibration process to a single calibration point.
The appearance-based gaze estimation methods extract input features from images of the human eye appearance and realize gaze estimation by establishing a mapping relationship between the input features and the gaze direction. Different from model-based methods, appearance-based methods usually need only a single camera to capture the user's eye images. Common input features include the complete face image, the eye image, color opponency and histograms extracted from the eyes. Many kinds of mapping models have been used, including k-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machine (SVM), Gaussian Process (GP) and Artificial Neural Network (ANN) models. Zhang et al. [22] first extracted three low-dimensional features from eye images, namely color opponency, gray-scale intensities and direction information, and then used a KNN classifier with k = 13 to learn the mapping from image features to gaze direction. Wang et al. [23] added a depth feature to traditional gaze estimation and applied RF regression based on cluster-to-classify node splitting rules. Kacete et al. [24] used RF regression to estimate the gaze vector from high-dimensional data containing face information; RF regression supports parallel processing and trains relatively quickly. Wu et al. [25] located the eye region by modifying the characteristics of an active appearance model and used an SVM to classify five gaze directions.
Recently, with the development of machine learning and the support of massive data, numerous learning-based gaze estimation methods have been presented. These methods, such as Convolutional Neural Network (CNN)-based methods, have great potential to handle the challenges faced by traditional methods, including redundant calibration procedures, complex head postures, and sensitivity to lighting conditions. Zhang et al. [26] built a novel in-the-wild dataset and employed a CNN to learn the mapping from the head pose and eye images to gaze angles. Krafka et al. [27] introduced an eye tracking method for mobile devices, which used the face image, individual eyes and a face grid as input. Zhang et al. [28] used a spatial-weights CNN to encode face images and flexibly suppress or enhance the information of different face regions. Cheng et al. [29] proposed a concept of two-eye asymmetry to predict the 3D gaze direction, and designed an evaluation network to adaptively adjust the regression network according to the performance of the two eyes. Palmero et al. [30] used the face, eye region and face landmarks as separate information flows in a CNN to estimate gaze in static images; the learned features of all frames were input into a many-to-one recurrent module sequentially, and the 3D gaze vector of the last frame was predicted. Fischer et al. [31] recorded a new dataset with diverse head postures to improve the robustness of gaze estimation, applied semantic image inpainting to the area covered by glasses to eliminate their obtrusiveness, and thereby built a bridge between training and test images. Yu et al. [32] introduced a constrained landmark-gaze model to relate eye landmark locations and gaze directions. Park et al. [33] used a single eye image as input and simplified the task of 3D gaze estimation by mapping the appearance of the eye to an intermediate pictorial representation, which makes the end-to-end model easier to learn.
In [34], [35], the authors introduced hybrid models that use a CNN to map images to eye landmarks and then map the landmarks to eye gaze. Wang et al. [36] proposed combining adversarial learning and Bayesian inference into a unified framework. They also added an adversarial component to traditional CNN-based gaze estimators so that they could learn features that respond to the gaze.
In order to further exploit the power of CNNs and improve the accuracy of gaze vector prediction, we propose an adaptive weighted 3D gaze estimation method. The main contributions of this paper are as follows.
(1) We improve the iTracker model [27] to predict single-frame gaze. The face grid in the conventional iTracker model locates the face position to supply location information for 2D gaze estimation; this branch is removed in the improved model because it is not needed for our 3D gaze estimation. To further improve the performance of the model, we concatenate the facial stream, left-eye stream and right-eye stream to obtain their joint characteristics.
(2) During model training, the face, left-eye and right-eye images have different influences on the final result. We propose a new loss function for the improved regression model: based on the traditional regression loss, we add a weight function over the three regional images.
(3) We propose a weighted network to judge the contributions of the face, left-eye and right-eye images to the results of gaze estimation. According to the errors between the predicted and ground-truth values, the corresponding weights are obtained. Adaptive weighting is realized by adjusting the strategy of the regression model with these weight values.

II. PROPOSED GAZE ESTIMATION METHOD
In this section, we present an adaptive weighted gaze estimation method. Firstly, the regression function of 3D gaze estimation is introduced. Then, the steps of data preprocessing are stated. Finally, the network architecture and the steps of adaptive weighted implementation are detailed. The overall architecture is shown in Fig. 1.

A. 3D GAZE ESTIMATION
Based on the image of the eye appearance, a regression function f is constructed to establish the mapping between an image I and the 3D gaze vector g, where g = f(I). At present, various regression models have been used in gaze estimation methods, such as neural networks, RF regression, GP regression, and SVM regression. We use a CNN to solve the problem because the regression of gaze estimation is usually highly nonlinear. With the development of deep neural networks, designing an efficient network architecture with a large training dataset can solve this complex regression problem.

B. DATA PREPROCESSING
The results of gaze estimation are significantly affected by the head pose. Similar to [32], we normalize the image data to weaken the influence of this factor. The basic concept of the data preprocessing is shown in Fig. 2. The data normalization applies a perspective transformation to the original image, so that the model can be trained for gaze estimation in a fixed virtual space. The transformation between the original image and the normalized image must satisfy the following three conditions.
1) The face reference point is located at a fixed distance d from the center of the normalized image.
2) The horizontal direction of the head is parallel to the x-axis of the normalized image.
3) The face always has the same size in the normalized image.
We place the face reference point in the center of the image at a fixed distance from the camera. Assume that a = (a_x, a_y, a_z) is the face reference point in the camera space. The first condition is satisfied by setting the z-axis of the virtual space to v_z = a / ||a||. To satisfy the second condition, the y-axis of the virtual space is defined as v_y = (v_z × h_x) / ||v_z × h_x||, where h_x is the x-axis of the head-pose rotation matrix. The remaining x-axis of the virtual space can then be computed as v_x = v_y × v_z. Using these vectors, the rotation matrix is defined as R = [v_x, v_y, v_z]^T. The transformation matrix is then M = SR, where the scaling matrix S, which satisfies the third condition, is S = diag(1, 1, d/||a||).
We use the warp matrix W to transform the human face into the image plane of the virtual camera. Let W = C_v M C_a^{-1}, where C_a is the intrinsic parameter matrix of the original camera and C_v is the intrinsic parameter matrix of the virtual camera. In addition, the original gaze label also needs to be converted during the training stage by g_v = R g_a, where g_v and g_a represent the normalized gaze label and the initial gaze label respectively. In the test phase, g_a = R^{-1} g_v is used to convert each prediction result from the virtual camera space back to the original camera space.
The proposed data normalization copes with the influence that camera differences in the real world have on prediction accuracy. This operation does not otherwise affect the experimental process, but it should be noted that the accuracy of the camera's intrinsic parameters is closely related to the quality of the final gaze vector.
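The normalization above can be sketched as follows. This is a minimal sketch: the face reference point `a`, head-pose x-axis `h_x`, camera intrinsics `C_a` and `C_v`, and distance `d` are illustrative placeholders, not values from the paper.

```python
import numpy as np

def normalization_matrices(a, h_x, C_a, C_v, d):
    """Build the rotation R and warp W for data normalization (sketch)."""
    a = np.asarray(a, dtype=float)
    v_z = a / np.linalg.norm(a)                  # condition 1: z-axis toward face point
    v_y = np.cross(v_z, h_x)                     # condition 2: align with head x-axis
    v_y /= np.linalg.norm(v_y)
    v_x = np.cross(v_y, v_z)                     # remaining axis of the virtual space
    R = np.stack([v_x, v_y, v_z])                # rotation into the virtual space
    S = np.diag([1.0, 1.0, d / np.linalg.norm(a)])   # condition 3: fixed face size
    M = S @ R
    W = C_v @ M @ np.linalg.inv(C_a)             # image warp to the virtual camera
    return R, W
```

Gaze labels are rotated into the virtual space for training (g_v = R g_a) and rotated back at test time (g_a = R^T g_v, since R is orthonormal).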

C. REGRESSION NETWORK ARCHITECTURE
In this paper, we propose an adaptive weighted regression model for appearance-based gaze estimation. In practice, we observed that the left-eye, right-eye and face images make different contributions to the accuracy of the regression in different scenes; the different image areas cannot achieve the same accuracy. Therefore, when training a gaze regression model, it is better to rely on the high-quality images to train a more effective model. Our model is composed of a main network and a sub-network. The main network performs the regression from image to gaze vector, and the sub-network adjusts the loss function of the main network to achieve adaptive adjustment.
The proposed network learns a regression model to predict the gaze vector from the left-eye, right-eye and face images. The overall structure is shown in Fig. 3.
In [27], the authors fed the face, both eyes and the face grid separately into branches of the network, and finally mapped the merged features extracted from each branch to the 2D gaze point on the screen. Since the method in [27] predicts the gaze point on the screen, it not only needs to obtain the gaze vector, but also needs the face grid to provide the position of the head in the camera space. However, we mainly consider how to predict the gaze vector effectively. Therefore, we remove the face grid from our architecture. To realize the concept of adaptive weighting, the separate features and the joint features of the face and the two eyes should be extracted and utilized.
As shown in Fig. 4, the regression network is a six-stream convolutional neural network. We use a reduced version of the convolution layers of AlexNet as the basic network of each branch. When the eyeball rotates, many areas of the face change to a greater or lesser extent. In order to realize the self-adaptive adjustment of spatial weights, we add three fused features to the basic features of the face, left eye and right eye; the fused features are input as individual branches of the network. The first three streams extract 64-dimensional deep features from the face, left eye and right eye respectively, and the last three streams produce joint 64-dimensional deep features. These six streams are then combined through an FC layer, and a dropout layer is used to prevent over-fitting. Finally, the corresponding gaze vectors are obtained through a 6-dimensional FC layer.
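A sketch of this six-stream architecture in PyTorch is shown below. The exact trunk configuration and the way the three fusion streams are realized (here, small FC layers over pairwise-concatenated basic features) are our assumptions; the paper specifies only 64-dimensional features per stream, a dropout layer, and a final 6-dimensional FC layer.

```python
import torch
import torch.nn as nn

class StreamTrunk(nn.Module):
    """Reduced AlexNet-style trunk mapping an image to a 64-d feature (sketch)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 64)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class SixStreamRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.face, self.left, self.right = StreamTrunk(), StreamTrunk(), StreamTrunk()
        # Assumed realization of the three fusion streams: FCs over
        # pairwise-concatenated basic features, each yielding 64-d.
        self.fuse = nn.ModuleList([nn.Linear(128, 64) for _ in range(3)])
        self.head = nn.Sequential(
            nn.Linear(6 * 64, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 6),  # final 6-d FC layer
        )

    def forward(self, face, left, right):
        f, l, r = self.face(face), self.left(left), self.right(right)
        joint = [m(torch.cat(p, dim=1))
                 for m, p in zip(self.fuse, [(f, l), (f, r), (l, r)])]
        return self.head(torch.cat([f, l, r] + joint, dim=1))
```

All six 64-dimensional streams are concatenated before the FC head, so both the separate and the joint features contribute to the prediction.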
The face and the two eyes play different roles in training the network. If one of the areas is more likely to achieve a smaller error, then we should increase its weight in the optimization of the network. Following this idea, we propose a new strategy to optimize the network.
We first calculate the angular error of the currently predicted 3D gaze direction for the face and the two eyes:

e_x = arccos( f_x(I) · g / (||f_x(I)|| ||g||) ), x ∈ {f, l, r} (1)

where f_x(I) represents the predicted gaze vector (the gaze regression) of region x, and g represents the ground truth of the gaze vector. We then calculate the weighted average of the three errors:

e = λ_f · e_f + λ_l · e_l + λ_r · e_r (2)

where λ_f, λ_l, and λ_r weight the errors of the face, the left eye and the right eye, respectively. If the image of a region is more likely to produce smaller errors, its weight should be increased when optimizing the network. With this concept in mind, we set the weights inversely proportional to the errors:

λ_f = (1/e_f) / (1/e_f + 1/e_l + 1/e_r) (3)

λ_l = (1/e_l) / (1/e_f + 1/e_l + 1/e_r) (4)

λ_r = (1/e_r) / (1/e_f + 1/e_l + 1/e_r) (5)

Considering that the error between the predicted value and the target value differs among the images of the three regions, we also calculate the mean square error (MSE) between the predicted value and the target value (6). With the weights (3)-(5), the weighted average error in (2) reduces to the harmonic mean of the three errors, and the loss of the regression network becomes

L_R = MSE + 3 e_f · e_l · e_r / (e_f · e_l + e_l · e_r + e_f · e_r) (7)
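The weighting scheme above can be checked numerically. The sketch below computes the angular error between a predicted and a ground-truth gaze vector and the inverse-error weights; with these weights, the weighted average error equals the harmonic mean of the three regional errors, which is exactly the error term added to the MSE in the regression loss.

```python
import numpy as np

def angular_error(pred, gt):
    """Angle in degrees between a predicted and a ground-truth gaze vector."""
    cos = pred @ gt / (np.linalg.norm(pred) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def weighted_error(e_f, e_l, e_r):
    """Weighted average error with inverse-error weights."""
    inv = np.array([1.0 / e_f, 1.0 / e_l, 1.0 / e_r])
    lam = inv / inv.sum()                           # normalized weights
    return float(lam @ np.array([e_f, e_l, e_r]))   # weighted average
```

For example, with errors 2, 4 and 8 degrees, the weighted average is 24/7 ≈ 3.43 degrees, identical to the harmonic-mean expression 3·e_f·e_l·e_r / (e_f·e_l + e_l·e_r + e_f·e_r).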

D. WEIGHTED NETWORK
As mentioned above, the regression network can predict the gaze vector from high-quality face and eye images. We then design the weighted network to learn the selection behavior of the regression network and to capture its dependence on the different regional characteristics during optimization. As shown in Fig. 5, the network is a three-stream convolutional neural network. Each stream extracts 64-dimensional deep features from the face, left eye and right eye respectively. A simplified version of AlexNet [37] is the basic network of each branch, followed by a 3-dimensional fully connected layer. Finally, a Softmax regressor is used to obtain the probability vector [p_f, p_l, p_r]^T for the face and the two eyes.
In order to train the weighted network to predict the choice of the regression network, we use a loss function over the Softmax outputs and one-hot targets (a cross-entropy, L_W = -(p_tf · log p_f + p_tl · log p_l + p_tr · log p_r)), where p_f is the probability that the regression network depends on the face region in the prediction process, and p_l and p_r are the probabilities that it depends on the left eye and the right eye respectively. During training, the ground truth of p is determined by the gaze vector errors from the regression network. Taking p_tf as an example, p_tf is set to 1 if e_f < e_l and e_f < e_r, and to 0 otherwise. In other words, when the error of the face region is the smallest, we should maximize p_f so that the network learns this fact and the regression network can be adjusted accordingly. Similarly, p_tl is set to 1 when e_l is the minimum and to 0 otherwise, and p_tr is set to 1 when e_r is the minimum and to 0 otherwise.
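The one-hot construction of the targets p_tf, p_tl, p_tr described above can be sketched as follows. The tie-breaking choice (the first region wins when errors are equal) is our assumption, since the paper only defines the strict-inequality case.

```python
def target_probabilities(e_f, e_l, e_r):
    """One-hot target [p_tf, p_tl, p_tr] marking the region (face, left eye,
    right eye) with the smallest angular error; ties go to the first region
    in the list (assumption)."""
    errors = [e_f, e_l, e_r]
    t = [0, 0, 0]
    t[errors.index(min(errors))] = 1
    return t
```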
The aim of the weighted network is to adjust the regression network to improve the accuracy of gaze estimation. For this purpose, the loss function of the regression network is adjusted by re-weighting the error term 3 e_f · e_l · e_r / (e_f · e_l + e_l · e_r + e_f · e_r) according to the output (p_f, p_l, p_r) of the weighted network, where a factor w balances the learning of the left eye, the right eye and the face. The gaze vector depends on the input images of the regression network. If the per-region gaze predictions (g_f, g_l, g_r) are approximately the same, we should not increase the weight of any area in the learning of the regression network. When (g_f, g_l, g_r) differ greatly, we prefer to train the region with a small error in the regression network. The adaptive adjustment is realized through the output (p_f, p_l, p_r) of the weighted network. In the ideal case, p_f, p_l and p_r take extreme values of 0 or 1, allowing the network to select areas that generate small errors and have high image quality for training, improving the accuracy of the results. In the actual training process, p_f, p_l and p_r take values between 0 and 1. The calculation is stated as follows.
where a = 1 if e_f < e_l and e_f < e_r, and a = 0 otherwise; b = 1 if e_r < e_l and e_r < e_f, and b = 0 otherwise. In the experiments, w is a value between 0 and 1.

III. EXPERIMENTAL EVALUATION
To verify the effectiveness of the proposed 3D gaze estimation method, we evaluate it on two publicly available datasets: MPIIGaze [28] and EyeDiap [38]. Firstly, we cross-validate the method to demonstrate the performance of our algorithm. Then, we perform ablation experiments to evaluate the contributions of the different regional images to the network. Next, we conduct experiments at different resolutions to show the robustness of the proposed network. Finally, we evaluate the effectiveness of the weighted network on the gaze vector. In this paper, we use the angular difference between the predicted vector and the ground-truth vector to represent the accuracy of gaze estimation.

A. DATASETS
The MPIIGaze dataset consists of 213,659 images of 15 participants, covering various illumination conditions, eye appearances and head postures. It is worth noting that the images and data of MPIIGaze need to be normalized. We use the center of the six facial landmarks provided in the dataset as the starting point of the gaze vector; this point is also the point the virtual camera faces during normalization. To reduce the influence of illumination differences, each input image is processed with adaptive histogram equalization. To facilitate comparison with other state-of-the-art gaze estimation methods, we perform leave-one-person-out cross-validation over the participants in the same way.
The EyeDiap dataset is a video dataset of 16 participants, covering various illuminations, scenes and head postures. We select one image per 15 frames from each video clip and filter out frames that satisfy any of the following conditions: (1) the participant does not look at the screen; (2) the annotations are not provided correctly; (3) the gaze angle violates the physical constraints (elevation angle ϕ ≤ 40°, azimuth angle θ ≤ 30°).
Similar to MPIIGaze, the EyeDiap data also need to be normalized first. We use the midpoint of the two iris centers provided in the dataset as the origin of the gaze vector. We apply adaptive histogram equalization to reduce illumination changes. The gaze targets in this dataset are divided into two categories: screen targets and floating targets. To facilitate comparison, we use only screen targets for evaluation and divide 14 participants into four groups for leave-one-group-out cross-validation.
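The EyeDiap frame selection described above can be sketched as follows. The `Frame` record and its field names are illustrative, not the dataset's actual annotation format.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    looks_at_screen: bool
    has_annotation: bool
    elevation_deg: float   # phi
    azimuth_deg: float     # theta

def select_frames(frames, step=15, max_elev=40.0, max_azim=30.0):
    """Sample one frame every `step` frames, then drop frames that fail
    the screen-target, annotation, or gaze-angle constraints."""
    sampled = frames[::step]
    return [f for f in sampled
            if f.looks_at_screen and f.has_annotation
            and abs(f.elevation_deg) <= max_elev
            and abs(f.azimuth_deg) <= max_azim]
```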

B. CROSS PERSON/GROUP EVALUATION
The proposed method is compared with state-of-the-art 3D gaze estimation methods on the MPIIGaze and EyeDiap datasets. Tables 1 and 2 show the comparison results on the two datasets, respectively. According to these results, our method achieves superior performance on both MPIIGaze and EyeDiap. Fig. 6 and Fig. 7 show some prediction results of our method on the MPIIGaze and EyeDiap datasets, where the green and red lines represent the predicted and ground-truth gaze vectors respectively. Our method is robust and can maintain high prediction accuracy under various illumination differences and large head postures.

C. NETWORK EVALUATION
To verify the role of each module in the network, the network is divided into the monocular module, face module, face + monocular module, binocular module and face + binocular module for evaluation. Fig. 8(a) shows the evaluation results of each module on the MPIIGaze dataset. It can be seen from the figure that the contributions to the final estimation accuracy, in increasing order, are monocular, face, face + monocular, binocular, and face + binocular.
The contribution of a single face branch to the final estimation results is greater than that of a single eye branch. However, the binocular module is more suitable for the regression + weighted learning strategy, and the face + binocular module, which adds the face module, achieves the best final accuracy. Similarly, Fig. 8(b) shows the evaluation results of each module on the EyeDiap dataset. The face branch plays a better role in gaze prediction than the eye branch, but the face + binocular module performs better than the binocular module and the face + monocular module, and is more conducive to the regression + weighted network.

D. RESOLUTION EVALUATION
A gaze estimation method is often required to maintain high accuracy over a range of distances. Although our data are normalized before model training to reduce the resolution differences caused by images captured at different distances, the loss of some useful information cannot be avoided. This may degrade our prediction results.
Therefore, we need to evaluate the influence of images with different resolutions on our method. To simulate this setting, images of 224 × 224 are downscaled to 168 × 168 and 112 × 112 respectively. To facilitate comparison, the final input size must be consistent across resolutions, so the two types of images are upscaled back to 224 × 224. We conduct experiments on the MPIIGaze and EyeDiap datasets, and the results are shown in Fig. 9. Our method maintains good accuracy even when the distance is twice the original distance.
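For integer scale factors, the 2× case of this resolution experiment can be simulated with naive nearest-neighbor resampling, as sketched below. The interpolation method is our assumption (the paper does not specify it), and the 168 × 168 case would need a fractional-factor resampler instead.

```python
import numpy as np

def simulate_distance(img, factor):
    """Simulate a longer capture distance: downscale by an integer factor
    (e.g. 224 -> 112), then upscale back to the original size with
    nearest-neighbor repetition."""
    small = img[::factor, ::factor]
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
```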

E. WEIGHTED NET EVALUATION
The proposed weighted network is the key component of the proposed method. In this section, we evaluate the contribution of adding the weighted network to the regression network. We compare the experimental results before and after adding the weighted network; the comparison is conducted on the MPIIGaze dataset. We perform leave-one-person-out cross-validation for the regression network (R-net) and the adaptive weight adjustment network (RW-net) respectively. Table 3 shows the gaze estimation results for all 15 subjects in the MPIIGaze dataset: the gaze vector error of each subject from the starting points of the left eye, right eye and face to the target point under the RW-net and the R-net. After adding the weight adjustment network, the prediction results generally improve significantly. However, a few of them do not improve; Figure 10 shows some of these cases. Through comparative analysis of the images, we can see that the negative impact of the weight adjustment network is mainly caused by illumination. Due to illumination and other factors, the overall quality of the captured images is low, and the weighted network cannot effectively evaluate the weights of the regional image features. The RW-net then tends to use the average error of the three regions to compute the loss rather than the weighted error, which conflicts with the idea of selecting the region with the smallest error for training, and thus negatively affects the final accuracy.

IV. CONCLUSION
This paper has proposed an adaptive weighted 3D gaze estimation method based on deep learning. Such a method needs to maintain accuracy over a range of distances; we have evaluated the proposed model at different resolutions, and the results show that the proposed network is robust to images of different resolutions. To improve the prediction accuracy, we have proposed a weighted network to adjust the regression network. Based on the concept of assigning larger weights to the regions with smaller errors, the weight adjustment network can adapt the learning strategy well. Compared with existing state-of-the-art gaze estimation methods, our method significantly improves the accuracy. However, experimental comparison shows that the weighted network does not train well under some lighting conditions. Future work will consider how to make the weighted network more effective and will consider using more advanced network structures to further improve performance.