An integrated neural network model for eye-tracking during human-computer interaction

Improving the efficiency of human-computer interaction is one of the critical goals of intelligent aircraft cockpit research. Gaze interaction control can greatly reduce the manual operations required of operators and raise the level of intelligence of human-computer interaction. Eye-tracking is the basis of gaze interaction, so its performance directly affects the outcome of gaze interaction. This paper presents an eye-tracking method suitable for human-computer interaction in an aircraft cockpit, which estimates the gaze position of operators on multiple screens based on face images. We use a multi-camera system to capture facial images, so that operators are not limited by the angle of head rotation. To improve the accuracy of gaze estimation, we have constructed a hybrid network. One branch uses the transformer framework to extract the global features of the face images; the other branch uses a convolutional neural network structure to extract the local features of the face images. Finally, the extracted features of the two branches are fused for eye-tracking. The experimental results show that the proposed method not only solves the problem of limited head movement for operators but also improves the accuracy of gaze estimation. In addition, our method has a capture rate of more than 80% for targets of different sizes, which is better than the other compared models.


Introduction
Human-computer interaction is the way that people exchange information with a system. The system can be any of a wide variety of machines, computer systems, and software [1]. Early human-computer interaction was mediated by machine language, and information was exchanged by manually inputting machine language instructions. With the development of computer and communication technology, there are more and more ways of human-computer interaction, including speech recognition, gesture recognition, and eye-tracking [2][3][4]. Human-computer interaction methods based on eye-tracking are widely used in various fields because of their real-time performance and flexibility [5][6][7][8][9]. Eye-tracking estimates the gaze point or gaze direction of the eye by tracking the movement of the eye. In human-computer interaction, gaze control is a flexible method to enable communication with computers [10].
Eye-tracking has always been a hot topic in machine vision technology [11]. Gaze-tracking methods fall into two main categories: model-based methods and appearance-based methods [12][13][14].
Model-based methods generally use special equipment to collect images, detect eye features by image analysis, and then use these features to build models to estimate gaze. In the model-based approach, the popular gaze features include the pupil, iris, canthus and corneal reflection points. The specific applications of these features include using the radius and center of the pupil to estimate gaze through a geometric model [15,16] and using corneal reflection points to estimate the gaze [17,18]. The pupil-canthus method is used to estimate the fixation point of users [19,20]. Model-based methods must ensure the quality of the acquired image to obtain an accurate gaze estimation, because the accuracy of gaze estimation is affected by image resolution, noise and illumination conditions. Therefore, to get an accurate and reliable gaze estimation model, the hardware must be equipped with high-quality cameras and special devices such as narrow-angle lenses and external lighting to extract sufficiently accurate and detailed edges or feature points. In the wild, however, because of the influence of head pose or lighting conditions, model-based methods yield a high error rate [21]. In addition, it is necessary to analyze prior knowledge of the eye model to establish a good line-of-sight estimation model, and establishing a good model based on prior knowledge is a challenging task [22].
In contrast to the model-based approach, the appearance-based approach directly estimates gaze by analyzing eye images. The specific process is as follows. First, labeled face or eye images of the tested person are collected. Then, training samples are selected as the input image data, and the relationship between the appearance of the human eye and the fixation direction or fixation point is fitted from the training samples. Finally, a test image sample is input to determine the gaze direction or fixation point of the corresponding area. This method uses a large amount of statistical data to learn invariance to appearance differences [23]. It does not require the manual design of features, as it automatically extracts image features from the data, so it has good robustness.
Deep learning has aroused increasing research interest in recent years [24,25]. With the continuous development of deep learning theory, appearance-based gaze estimation methods have been increasingly widely used [26][27][28][29]. Among the numerous appearance-based approaches, deep learning networks, especially convolutional neural networks (CNNs), exhibit good performance and, to a certain extent, improve the accuracy of gaze estimation. Most of these studies used a front-facing camera to take an image of a human eye or face. To get a complete picture of a face or human eye, one must limit the movement of the head and narrow the field of vision. However, this approach is inapplicable to aircraft cockpit scenarios with multiple screens. This is because, during a flight, the objects that need to be viewed are not concentrated on one screen but spread out across multiple screens. Therefore, to ensure that flight personnel are not constrained by the rotation angle of the head during gaze interaction, this paper proposes an eye-tracking method that uses multiple cameras to record images. This method ensures that a complete frontal face image can be collected when the pilot turns their head to look at any target on the screens. Then, a CNN and transformer hybrid network model is applied to detect the fixation position of flight personnel in the process of human-computer interaction, using the frontal face image corresponding to each screen as input.

Related work
With the rapid progress of artificial intelligence technology, traditional human-computer interaction cannot adapt to the multimodal human-machine intelligent environment for the efficient transmission of information. Therefore, it is of great significance to study how to realize intelligent human-computer interaction. Eye-tracking provides a feasible solution for intelligent interaction. For example, Zhang et al. [30] proposed a multi-device gaze estimation algorithm based on a CNN for specific users. In this algorithm, cameras are installed on five devices, such as mobile phones, tablet computers, and smart TVs, to collect a dataset of face images of the user interacting with each device. When training the CNN, it uses the encoder of a specific device and the shared feature extraction layer to process the image, and the decoder of each device gives the gaze estimation. Li et al. [31] designed an eye-tracking method for gaze control of surgical robots. In this approach, the direction or area of movement of the surgical robot is decided by the user's point of gaze. They used images collected by a single camera to train a CNN to get the user's gaze position. Finally, the user can control the surgical robot to move in nine directions according to the eye gaze information. Lorenz and Thomas [32] developed an eye-tracking system for detecting human interaction intentions. It uses two cascaded convolutional networks to extract face features and estimate the head pose to determine the eye fixation direction. Robots can then judge human intentions based on the line of sight. Kim et al. [33] developed an interactive system that can control devices through the user's gaze and simple gestures. The system's gaze estimation module uses a video stream recorded by a camera. It detects the user's facial image in the video to get feature vectors, including the head pose. Then, these feature vectors are fed into the CNN to train the user's gaze estimation model.
Luo et al. [34] developed a human-computer interaction control system for wheelchairs using eye-movement tracking and blink detection. It first extracts the pupil feature of the eye through binarization of the human eye image and then obtains the movement trajectory of the eye. Then, the eye movement tracker locates the eye's gaze direction. At the same time, a convolutional neural network detects the open and closed states of the eyes to judge whether the user blinks. Finally, the system controls the electric wheelchair according to the user's gaze direction or blinks.
All of the above gaze-based interaction methods show good performance for specific applications. However, deflection of the user's head tends to decrease the accuracy of gaze estimation because the image collected by only one camera cannot contain full-face or complete eye information. Therefore, we have constructed a multi-camera system to study gaze interaction without restricting head movement. In this paper, we design a hybrid network of a CNN and a transformer for gaze estimation to improve the reliability of gaze interaction, addressing the problem of eye-tracking during visual target acquisition by flight personnel in the cockpit scene.

Introduction of the method of this paper
The steps of this approach are shown in Figure 1. As the subject looks at the target on one of the three screens, the three cameras capture the initial set of images. Each group of images contains one frontal and two side images of the subject. First, the image taken by the frontal camera is selected from the initial set of images, and the facial landmarks of the subject are detected using the facial feature point localization network. According to the facial landmarks, the face ROI is set and the frontal face image is obtained. Then, the face image is input into the hybrid network of the CNN and transformer, which predicts the number of the screen watched by the pilot in the cockpit and the pilot's fixation position.

Image processing
The preprocessor picks the image of the frontal camera from the images of the three cameras and uses a facial feature point localization network [35] to extract the face image. This facial feature point localization network is based on the hourglass network [36] architecture used for human posture estimation. It replaces the original bottleneck block of the hourglass network with layered, parallel, and multi-scale blocks [37] and then carries out landmark localization of the face. The corresponding ROI is obtained from the face contour landmarks, and the face image is cropped out. The size of all cropped face images is 224×224. Figure 1 shows an example of the result of processing a set of images.
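The cropping step described above can be illustrated with a minimal sketch. It assumes the landmarks are already available as (x, y) pixel coordinates; the landmark network itself is not reproduced. The function name `crop_face_roi`, the `margin` parameter and the nearest-neighbour resize are illustrative assumptions, since the paper does not state how the ROI is padded or resampled to 224×224.

```python
import numpy as np


def crop_face_roi(image, landmarks, out_size=224, margin=0.1):
    """Crop the bounding box of the face-contour landmarks and resize it.

    image:     (H, W) or (H, W, C) array.
    landmarks: (K, 2) array of (x, y) pixel coordinates (assumed input).
    margin:    hypothetical padding around the box, as a fraction of its size.
    """
    h, w = image.shape[:2]
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)

    # Pad the box slightly and clamp it to the image borders.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    x0 = int(max(np.floor(x_min - pad_x), 0))
    y0 = int(max(np.floor(y_min - pad_y), 0))
    x1 = int(min(np.ceil(x_max + pad_x), w - 1))
    y1 = int(min(np.ceil(y_max + pad_y), h - 1))
    roi = image[y0:y1 + 1, x0:x1 + 1]

    # Nearest-neighbour resize to out_size x out_size (assumption: the
    # interpolation used in the paper is not specified).
    rows = np.arange(out_size) * roi.shape[0] // out_size
    cols = np.arange(out_size) * roi.shape[1] // out_size
    return roi[rows][:, cols]


# Synthetic example: a grayscale frame and four made-up contour points.
frame = np.random.default_rng(0).integers(0, 256, size=(480, 640), dtype=np.uint8)
points = np.array([[200, 100], [400, 100], [210, 380], [390, 390]], dtype=float)
face = crop_face_roi(frame, points)
print(face.shape)  # (224, 224)
```

In practice a library resize with bilinear interpolation would probably give smoother crops; the index-based version is used here only to keep the sketch dependency-free beyond NumPy.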

A network model for eye tracking
To realize eye-tracking in the aircraft cockpit scene, we design an eye-tracking model based on a deep learning network. The model comprises a vision transformer (VIT), a feature pyramid network (FPN) and fully connected layers. Figure 2 shows the model framework. The loss function of the model adopts the Minkowski distance, which is defined as
$$d(x, y) = \left( \sum_{i=1}^{n} \left| x(i) - y(i) \right|^{p} \right)^{1/p} \tag{1}$$
where $x, y \in \mathbb{R}^{n}$, $x = (x(1), x(2), \ldots, x(n))$, $y = (y(1), y(2), \ldots, y(n))$, and $p$ is a variable parameter. The formula shows that the Minkowski distance is a very flexible distance metric: $p$ can be varied to find the most suitable distance metric for a practical application. After several experimental trials, the value of $p$ in this paper was set to 4.
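As a minimal sketch of the loss described above, the Minkowski distance of order p between a predicted and a true vector can be written in a few lines of Python. The function names and the averaging over a batch are assumptions made for illustration; the paper does not state how per-sample distances are aggregated.

```python
def minkowski_distance(x, y, p=4.0):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def minkowski_loss(predictions, targets, p=4.0):
    """Mean Minkowski distance over a batch (the averaging is an assumption)."""
    distances = [minkowski_distance(pr, tg, p) for pr, tg in zip(predictions, targets)]
    return sum(distances) / len(distances)


# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
pred = (3.0, 4.0)
true = (0.0, 0.0)
print(minkowski_distance(pred, true, p=1))  # 7.0
print(minkowski_distance(pred, true, p=2))  # 5.0
print(minkowski_distance(pred, true, p=4))  # about 4.28
```

Larger values of p weight the largest coordinate error more heavily, which is one reason the order is worth tuning experimentally.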

Vision transformer
A transformer is a network model that uses the self-attention mechanism to extract intrinsic features [38]. Because of the transformer's advanced performance in natural language processing, Dosovitskiy et al. [39] attempted to use a standard transformer for image classification and called the network a vision transformer. VIT introduces the concept of an image patch to transform the image into sequence data that the transformer structure can process. Since the input to the standard transformer must be a one-dimensional token embedding sequence, VIT first segments the image into fixed-size patches and generates a linear embedding sequence of these patches. Then the sequence can be used as the input to the transformer. This process is as follows.
Assume an image $F \in \mathbb{R}^{H \times W \times C}$, where (H, W) is the resolution of the image and C is the number of channels. F is divided into N flattened 2D patches $F_p \in \mathbb{R}^{N \times (P^{2} \cdot C)}$, where $N = HW/P^{2}$ and (P, P) is the resolution of each patch. We map each patch into a D-dimensional embedding vector via a learnable projection matrix E and prepend $x_{class}$ to the sequence of D-dimensional embedding vectors. $x_{class}$ is also a D-dimensional learnable embedding vector, which can better represent global information. After that, we add the position embedding $E_{pos}$, which indicates the location of each patch. We get the following patch embeddings:
$$z_0 = [x_{class};\ F_p^{1}E;\ F_p^{2}E;\ \ldots;\ F_p^{N}E] + E_{pos} \tag{2}$$
The patch embeddings are input to the encoder of VIT and are processed sequentially by LayerNorm (LN), the multi-head self-attention mechanism (MSA) and the multilayer perceptron (MLP). The processing equations are (3) and (4), where $l$ indexes the encoder layers:
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1} \tag{3}$$
$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l \tag{4}$$
LayerNorm is applied before the multi-head attention module and the multilayer perceptron module, and a residual connection is applied after each of these two modules.
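The patch-embedding step described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the patch size of 16, the embedding width of 64 and the single grayscale channel are assumed values, and the random matrices merely stand in for parameters that would be learned during training.

```python
import numpy as np


def patch_embed(image, patch_size, proj, cls_token, pos_embed):
    """Turn an (H, W, C) image into an (N + 1, D) token sequence."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"

    # Split into a grid of P x P patches, then flatten each patch to P*P*C values.
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    patches = grid.reshape(-1, p * p * c)             # (N, P^2 * C)

    tokens = patches @ proj                           # (N, D) linear projection
    tokens = np.vstack([cls_token[None, :], tokens])  # prepend the class token
    return tokens + pos_embed                         # add position embeddings


rng = np.random.default_rng(0)
H = W = 224          # face-crop size stated in the paper
C, P, D = 1, 16, 64  # channels, patch size, embedding width: assumed values
N = (H * W) // (P * P)

image = rng.random((H, W, C))
E = rng.normal(scale=0.02, size=(P * P * C, D))  # stands in for the learned projection
cls = np.zeros(D)                                # stands in for the class token
pos = rng.normal(scale=0.02, size=(N + 1, D))    # stands in for the position embedding

tokens = patch_embed(image, P, E, cls, pos)
print(tokens.shape)  # (197, 64)
```

With a 224×224 input and 16×16 patches there are 196 patch tokens plus one class token, which is why the sequence length is 197.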

Feature pyramid network
The FPN is a CNN for detecting multi-scale targets [40]. The FPN combines the fine-grained spatial information of shallow feature maps with the semantic information of deep feature maps. It dramatically improves the performance of target detection. The core structure of FPNs contains bottom-up pathways and top-down pathways.
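As a rough illustration of how a top-down pathway combines deep and shallow feature maps, the sketch below merges four bottom-up maps into four pyramid maps. It follows the general FPN recipe (a 1×1 lateral convolution plus 2× upsampling and addition) and omits the 3×3 smoothing convolution of the original FPN. The channel counts, random weights and helper names are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def lateral(x, weight):
    """1x1 convolution, i.e. a per-pixel linear map over channels."""
    return np.einsum("oc,chw->ohw", weight, x)


def fpn_top_down(features, lateral_weights):
    """Merge bottom-up maps [C2, ..., C5] into pyramid maps [P2, ..., P5]."""
    # Start from the coarsest map, then walk towards the finest one.
    pyramid = [lateral(features[-1], lateral_weights[-1])]
    for feat, weight in zip(features[-2::-1], lateral_weights[-2::-1]):
        pyramid.append(lateral(feat, weight) + upsample2x(pyramid[-1]))
    return pyramid[::-1]


rng = np.random.default_rng(0)
channels = [8, 16, 32, 64]  # illustrative; a residual network would use far more
sizes = [56, 28, 14, 7]     # strides 4, 8, 16, 32 for a 224 x 224 input
out_channels = 4
feats = [rng.random((c, s, s)) for c, s in zip(channels, sizes)]
weights = [rng.normal(size=(out_channels, c)) for c in channels]

for name, level in zip(["P2", "P3", "P4", "P5"], fpn_top_down(feats, weights)):
    print(name, level.shape)
```

Every pyramid level ends up with the same number of channels, so later layers can treat the levels uniformly while each level keeps its own spatial resolution.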
The bottom-up pathway is the forward process of the CNN. In the forward process, the size of the feature map changes after passing through some layers, while it does not change when passing through others. The layers that do not change the size of the feature map are grouped into one stage, and the feature extracted at each level is the output of the last layer of each stage, thus forming a feature pyramid. Specifically, the pathway outputs the features of the last residual structure in each of the five stages of the residual neural network. Then, the feature map is up-sampled by the top-down pathway so that the up-sampled feature map has the same size as the feature map of the next layer. The feature maps generated by the bottom-up pathway are C1, C2, C3, C4, and C5 in Figure 2. The feature maps generated by the top-down pathway are P2, P3, P4, and P5 in Figure 2.
Figure 3 shows the experimental environment of the flight simulation platform. The flight simulation platform comprises a six-axis full-motion platform, three displays, flight joysticks, data measurement instruments, and a mainframe. This study builds a system for capturing targets with gaze during human-computer interaction based on a simulated flight platform. Figure 4 shows the structure of the system. The system comprises a head motion sensor, three industrial digital cameras, eight infrared light sources, and visual target calibration software. The head motion sensor measures the subject's head posture data, and the three industrial black-and-white digital cameras acquire frontal and side images of the subjects. The infrared light sources ensure that the captured images are not affected by external ambient lighting. The function of the visual target calibration software is to record where the target appears during the simulated flight.

Experimental process
To collect head and eye movement data during human-computer interaction, 12 graduate students with normal vision, aged 21-25 years, were recruited as subjects. None of the subjects had a history of neurological or psychiatric disorders, and all signed an informed consent form before the experiment. In addition, this study passed the review of the ethics committee of the unit.
The equipment was calibrated before the experiment. Each subject completed 10 sets of experiments, each lasting 30 minutes. Figure 5 shows the experimental process. First, the subject adjusted their sitting posture and put on the head motion sensor. Then, the visual target calibration software was opened, and a red circle randomly popped up on the display screens of the simulated flight platform every 10 seconds. The red circle was the target that the subject needed to capture. When it appeared, the subject looked at the center of the red circle and pressed the space key to indicate that the target had been captured. At that moment, the three cameras took images of the subject capturing the target. The head motion sensor also saved the subject's head posture data, and the visual target calibration software recorded the coordinates of the center point of the red circle target. Throughout the experiment, the subject was free to rotate their head and capture targets anywhere on the three displays.

Analysis of results
The pre-processed face images captured by the frontal camera were input into the proposed eye-tracking model. The features of the face images were extracted using a VIT and an FPN, respectively, and these features were then fused through a fully connected layer. The final output was the screen number and the coordinates of the gaze point on the screen. Figure 6 shows the structure of the model. We used classification accuracy as the metric for screen number prediction and the angular error between the true and the predicted gaze positions as the metric for gaze estimation.
We randomly selected 5000 groups of images from the collected dataset for analysis. Each experiment ran for 500 epochs. The learning rate for the first 270 epochs was $10^{-3}$, while the learning rate for the last 230 epochs was $10^{-4}$. The batch size was 16. To divide the data into training and test sets, we used two schemes: simple cross-validation and 10-fold cross-validation. In the simple cross-validation method, the first 80% of the dataset was used as the training set and the remaining data as the test set. The 10-fold cross-validation divided the dataset into 10 parts, with nine parts used as the training set and one as the test set. The test results of both methods are shown in Table 1. Table 1 shows that, if the time consumed by model training is not considered, 10-fold cross-validation improves the gaze estimation accuracy and outperforms the simple cross-validation method. Therefore, we chose 10-fold cross-validation for grouping the dataset in this paper.
A single transformer network and a single FPN network were used as control groups for comparison with the proposed hybrid parallel network of a transformer and an FPN. The comparison results for the screen number prediction accuracy are shown in Figure 7(a). Figure 7(b) shows the comparison results for the angular error of the gaze.
In Figure 7(a), the classification accuracy of our proposed hybrid network is higher than that of the single networks. In Figure 7(b), the angular error of the proposed hybrid network is smaller than that of the single networks. Therefore, Figure 7 shows that our transformer and FPN hybrid parallel network outperforms the single networks.
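The training schedule and the two data-splitting schemes described above can be sketched in plain Python. The function names are hypothetical, and shuffling the samples before forming the folds is an assumption; the paper only states that the dataset is divided into 10 parts.

```python
import random


def learning_rate(epoch, switch_epoch=270, high=1e-3, low=1e-4):
    """Step schedule: `high` for the first 270 epochs, `low` afterwards."""
    return high if epoch < switch_epoch else low


def holdout_split(n_samples, train_fraction=0.8):
    """Simple validation: the first 80% for training, the rest for testing."""
    cut = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


train_idx, test_idx = holdout_split(5000)
print(len(train_idx), len(test_idx))  # 4000 1000

for fold_train, fold_test in k_fold_splits(5000):
    assert len(fold_train) == 4500 and len(fold_test) == 500

print(learning_rate(0), learning_rate(269), learning_rate(270))  # 0.001 0.001 0.0001
```

With 10 folds every sample is used for testing exactly once, at the cost of training the model ten times, which matches the trade-off against training time noted in the text.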
To further evaluate the performance of the gaze estimation model in this paper, the CANet model [41] and the MCSANet model [42] were also applied to the dataset of this paper. The prediction accuracy of the screen number and the error of gaze estimation for these three models are shown in Figure 8. The results in Figure 8 show that, compared with the CANet and the MCSANet, the proposed method obtained the highest accuracy of screen number prediction and the lowest angular error of gaze estimation, so it can better determine the point of gaze during human-computer interaction.
The purpose of eye-tracking in this study was to evaluate the effectiveness of subjects' target capture with the proposed model for a flight cockpit scenario with multiple screens. The evaluation metric for target capture is the percentage of red-circle targets captured. Since the target is a circle, we specify that the subject captures the target if the error of the gaze estimation is less than the radius of the circular target. Otherwise, the subject did not capture the target. Figure 9 shows the gaze interaction application scenario. The background of Figure 9 is the cockpit of an aircraft in a flight simulation game. The display in the picture is the virtual integrated control panel (ICP). The red circles of different sizes in the figure are targets, and each red circle represents a button in the ICP. Three of the red circle targets were selected and numbered I, II, and III. The radius of Target I was 20 pixels. It was set to represent the button for the mode selection function. The radius of Target II was 40 pixels. It was selected to represent the button that implements the communication control function. The radius of Target III was 60 pixels. It was set to represent the button that completes the message input. Target I, Target II, and Target III were used as the objects to be captured by the subjects.
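The capture criterion described above can be expressed as a short check. The sketch assumes that the gaze estimation error is measured in screen pixels, the same unit as the target radius. The gaze estimates are made-up numbers for illustration, and a single target centre is reused for all three radii only to keep the example small.

```python
import math


def is_captured(gaze_xy, target_xy, radius):
    """True if the estimated gaze point lies inside the circular target."""
    error = math.hypot(gaze_xy[0] - target_xy[0], gaze_xy[1] - target_xy[1])
    return error < radius


def capture_rate(gaze_points, target_xy, radius):
    """Percentage of trials in which the target was captured."""
    hits = sum(is_captured(g, target_xy, radius) for g in gaze_points)
    return 100.0 * hits / len(gaze_points)


# Made-up gaze estimates around a target centred at (500, 300) pixels.
target = (500, 300)
estimates = [(505, 310), (530, 300), (500, 355), (480, 250), (500, 300)]
for name, radius in [("Target I", 20), ("Target II", 40), ("Target III", 60)]:
    print(name, capture_rate(estimates, target, radius))
# Target I 40.0
# Target II 60.0
# Target III 100.0
```

The same set of estimates yields a higher capture rate as the radius grows, which is the size dependence reported for all models in the results.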
The results of using different models for target capture are shown in Figure 10. Figure 10 shows the capture rates of the five models for the three targets. The success rates of all five models tended to increase with increasing target size. Compared to the other models, the eye-tracking model constructed in this paper can capture all types of targets effectively, with capture rates above 80%. For Target I, i.e., the smallest target, the capture rate of MCSANet was only 17%. Moreover, only the CANet model and our model had a capture rate of more than 50%. In particular, our model had the highest success rate of 80% for capturing Target I. These results indicate that the model proposed in this paper has low eye-tracking errors and can obtain good results when capturing small targets. For Target III, the largest target, the FPN, CANet, and our method each had a success rate of over 60%. However, only the eye-tracking model we built had a Target III capture rate exceeding 90%, showing the best performance. In conclusion, the comparison results show that the eye-tracking model established in this paper is more stable across different targets and performs better than the other models.

Discussion and conclusion
Aiming at the target capture function in the human-computer interaction process in the flight cockpit scene, this paper presented a hybrid network combining a CNN and transformer for eye tracking. To improve the gaze estimation accuracy, cameras were installed on three display screens in the simulated flight cockpit to capture images containing the subjects' faces. First, all images captured by the frontal camera were selected and cropped to obtain the subject's face image. The advantage of using three cameras is that it removes the limitation of the subject's head rotation angle and expands the subject's field of view.
Then, inspired by previous studies using frontal face images for gaze estimation, we input the cropped frontal face images into the proposed eye-tracking model to predict the gaze position of the subjects. To test the performance of the model presented in this paper, we compared it with several other models. We concluded that the transformer and FPN hybrid parallel network could improve gaze estimation accuracy.
Finally, we applied both the present model and other models to the target capture task in a simulated flight cockpit scenario and found that the performance of our model was superior. The experiments and models designed in this study achieved good results on the target capture task for human-computer interaction and met the desired goals. However, some problems remain, such as the limited diversity of the subject population and the insufficient capability for real-time target acquisition. Subsequent research will focus on these two requirements: a broader tested population and real-time target capture. At the same time, we will continue to optimize the neural network model and reduce its complexity.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.